--- Log opened Tue Jul 30 00:00:49 2013 | ||
stekern | aargh, the bits in the IMMUPR register are the wrong way around in the arch spec! | 02:36 |
---|---|---|
stekern | DMMUPR i meant | 02:37 |
stekern | actually, it's in the translate register that they are the wrong way around... | 02:38 |
stekern | so, DMMUPR is completely useless now... | 02:41 |
stekern | well, perhaps not completely useless, they are easily swapped in hardware, and in software emulation of them I can just swap them in the definition | 03:22 |
stekern | ok, so Linux at least boots with the PTE proposed in the ml discussion | 06:58 |
jonibo | stekern: available? | 07:59 |
stekern | yes, but I'm leaving for lunch in a minute | 08:00 |
jonibo | ok | 08:00 |
jonibo | are you following the arch spec 8 + 11 bits indexing when walking the page tables | 08:01 |
jonibo | ? | 08:01 |
jonibo | this results in less than optimal memory usage, right? | 08:01 |
jonibo | if you've got time to discuss after lunch, ping me | 08:01 |
stekern | right now I'm basically doing exactly what's done in the tlb miss vectors | 08:01 |
jonibo | otherwise we can do this by email | 08:01 |
jonibo | ok, i'll take a look what's going on there | 08:02 |
jonibo | the arch spec for TLB/page table handling needs a bit of a cleanup | 08:02 |
stekern | "a bit" =P | 08:25 |
stekern | jonibo: if you mean using bits 31:24 from VPN as the offset relative to the base address and bits 23:13 from VPN as the offset from the pte table base with 8 + 11 bits indexing, then yes | 08:32 |
stekern | that's what is currently done in the kernel (and mor1kx) | 08:32 |
jonibo | ok, got it | 08:32 |
jonibo | what's with the PL1 bit? | 08:33 |
jonibo | level 1/2 indicator | 08:33 |
stekern | dunno | 08:33 |
jonibo | what i'd really like to do is be able to mix 1-level and 2-level page tables | 08:33 |
stekern | it's probably a mess of inbetween decisions how they should implement things | 08:33 |
stekern | couldn't that be done if the L flag would be used? | 08:34 |
jonibo | the kernel maps to 0xc0000000 + 16MB | 08:34 |
jonibo | so if we could set L to 1 for the PGD for address 0xc0000000 we have a huge page that maps linearly | 08:34 |
jonibo | and L sets to 0 for all other PGD entries | 08:34 |
jonibo | in SW it would be possible to dedicate a TLB entry to the kernel mapping so you never take a TLB exception for kernel pages... | 08:35 |
jonibo | ...but for the HW implementation, that doesn't seem possible | 08:35 |
jonibo | but if it's just 1 level then it's reasonably cheap anyway | 08:36 |
stekern | could it be done if support for CID existed? (if only for the MMU) | 08:37 |
jonibo | no, doesn't need CID | 08:37 |
jonibo | what's the point of the L bit? | 08:37 |
jonibo | the HW implementation has to have special knowledge of the two-level structure in order to handle 8/11 bit indexes properly anyway | 08:38 |
jonibo | (aside: the HW TLB mapper has to write out the A/D bits to the PTE's when replacing an entry...) | 08:39 |
jonibo | (that aside just so I don't forget to mention it later) | 08:39 |
stekern | my point with the CID was that, could you use that to lock down TLB's for the kernel. Assign one CID for the kernel | 08:41 |
jonibo | problem with CID is that the kernel often wants to access userspace context and if that's in a different CID then you're kind of hooped | 08:41 |
stekern | hmm, ok | 08:42 |
jonibo | CID might work for process context separation, but not kernel/userspace separations | 08:42 |
jonibo | and then there's the ATB that's not implemented anywhere... it allows 16MB pages, but why bother with that when 1-level page directories buy you that | 08:43 |
jonibo | how much can one change the arch spec here? | 08:43 |
jonibo | xTLBWyMRz regs have a PL1 bit.... I don't get what that's for | 08:44 |
jonibo | and all the match/translate regs use 22bit page frames... is that an error? you said the implementation uses the high 19 bits of these? why the high bits? | 08:46 |
stekern | me neither, but I didn't get your point with the "hw implementation has to have special knowledge of the two-level structure to handle 8/11 bit..." | 08:47 |
jonibo | it needs to know to index the first level by 8 bits and the next level by 11 bits | 08:47 |
jonibo | so it treats level 1 and level 2 differently | 08:48 |
stekern | yes, ok, I agree on that | 08:48 |
jonibo | anyway, the point of all this is that i'd like to pre-create the kernel page directory so that the HW walker can be used as soon as you enable MMU's | 08:49 |
jonibo | it can be done with 2 pages... the initial PGD (swapper_pg_dir) filled with a single entry pointing at the PMD for the kernel space (16 MB) | 08:50 |
jonibo | alternatively, if we can set L to 1 in the PGD and get a 24-bit "page" for the kernel, even better as then we don't need to create the PMD page at all | 08:50 |
jonibo | (but that last bit is not necessarily according to spec) | 08:51 |
jonibo | this would allow us to get rid of the boot_xtlb_miss handlers altogether, too, for the SW implementation | 08:52 |
stekern | I was just about to write, that if you assume that the hw page walker _has_ knowledge about the 8/11 split, then you'd be able to do that | 08:52 |
stekern | set the L to 1 in PGD and use that for a 24-bit page | 08:52 |
jonibo | that's what i'd like to do, yes | 08:52 |
jonibo | but then what's the point of the 16MB pages in the ATB? | 08:53 |
jonibo | so if we do that, the L doesn't really mean "last" (though it does), it really means 16MB page or 8kb page | 08:55 |
jonibo | i.e. normal page/huge page | 08:55 |
stekern | umm, but even if you have your page table set up like that, the tlb entries will still be for 8kb pages | 08:55 |
jonibo | right... of course... sorry | 08:56 |
jonibo | i think i was thinking there was equivalence in the pgd and pmd entries, but that's not the case | 08:57 |
jonibo | or is it... this is confusing... they are equivalent!?! | 09:01 |
stekern | (22bit page frames) I think that's an error, at least my implementation in mor1kx only uses bits 31:13 | 09:02 |
jonibo | ok, if or1200 does the same we should modify the arch spec... | 09:02 |
stekern | what does equivalent mean in that context? pmd is folded into pgd, so the pte lookup is (pgd_base + offset)->(pte_base + offset)->pte | 09:04 |
jonibo | contain same "data" | 09:04 |
jonibo | data format | 09:04 |
jonibo | anyway, forget that, I think I was confused | 09:04 |
jonibo | so the TLB entries will be for 8kb pages... can that be changed? | 09:05 |
jonibo | maybe that's what the PL1 bit is for? | 09:05 |
stekern | believe me, I've been (and still am) plenty confused while trying to grasp how the page table works ;) | 09:05 |
jonibo | for the match register, if "page level is 1" (PL1=1), then you've got a 16MB page | 09:06 |
stekern | but isn't that the whole point with the ATLBs? | 09:06 |
jonibo | i think the TLB and ATB design is convoluted | 09:06 |
jonibo | i'd say the TLB part of the spec contains enough to do huge pages | 09:07 |
jonibo | and the ATB is only needed for 32GB pages, but who on earth needs those???????? | 09:07 |
stekern | heh | 09:07 |
jonibo | if you need 32GB pages then you might as well turn off the MMU | 09:07 |
jonibo | i'd rip the ATB out of the spec completely | 09:08 |
jonibo | anyway, i'd change the name of PL1 to HP (huge page) | 09:08 |
jonibo | so if level-1 L=1, then PL1=1 (16MB page) otherwise you've got to hit level-2 and then set PL1=0 (8kb page)... | 09:10 |
stekern | I agree, it looks to me too that you could do huge pages fine with only the tlb | 09:10 |
stekern | I realised another problem with the arch spec pte too, it has no PRESENT/VALID bit... | 09:11 |
jonibo | given that L and PL1 exist, I think somebody has already thought a bit about this... the names suck but that can be changed | 09:11 |
stekern | the hw walker needs to know that | 09:12 |
jonibo | yeah, PRESENT/VALID are needed but can probably be squeezed into bits 12-10 somewhere, right? | 09:12 |
jonibo | (in the PTE) | 09:12 |
jonibo | but for that we need to make sure that everybody uses bits 31-13 for the page frame | 09:13 |
stekern | (PRESENT in 12-10) yes, that's not a problem, just that we need to add that to the arch spec | 09:14 |
jonibo | right... which is doable iff the or1200 also uses bits 31-13 for the frame... what if it uses bits 28-10? | 09:15 |
jonibo | i'm not sure we need a valid bit... | 09:15 |
jonibo | in the PTE, I mean... isn't that a TLB detail? | 09:16 |
stekern | what should the hw walker do when there's a pte without the present bit (if it doesn't know about it) | 09:17 |
stekern | as I have understood it, all other bits can basically be "random" when present is not set. | 09:18 |
jonibo | it needs to know about it... you're right, I think the rest is garbage to help the swapper find the page when it's not set | 09:19 |
stekern | anyways, I mentioned it before you logged on, I tested redoing the PTE layout as per my suggestion in the mail yesterday and rewrote the tlb miss handlers for it this morning | 09:21 |
stekern | it boots at least =P | 09:22 |
jonibo | very good | 09:22 |
stekern | I managed to shave of one of the register saves in the exception context switch while I rewrote the tlb miss handlers | 09:22 |
stekern | *shave off | 09:22 |
jonibo | also good! | 09:23 |
jonibo | which one, r12? | 09:23 |
stekern | no, the tlb miss handlers only save(d) r2,r3,r4,r5,r6 | 09:23 |
jonibo | oh yeah, right, that one's special | 09:24 |
stekern | by reorganizing slightly the r6 uses could be removed | 09:24 |
jonibo | cool | 09:24 |
jonibo | i think that stuff could be redone in C, really | 09:25 |
stekern | I'm inclined to agree | 09:25 |
jonibo | it's really just a matter of making sure the kernel pages are mapped into the TLB and then turn on the MMU again and run the C function | 09:26 |
jonibo | that's why i'd really like the 16MB kernel page... it's quick to map it if it's not there | 09:26 |
jonibo | of course, we might get into trouble if we end up overwriting the TLB entry that the kernel is using... that's really why a dedicated "kernel TLB entry" would be nice... can be done in SW if you dedice entry 0 in the TLB to the huge kernel page | 09:28 |
jonibo | dedicate | 09:28 |
stekern | (garbage to help the swapper) so is it cool to have the present bit wherever in the pte? | 09:31 |
jonibo | i was wondering the same thing... I don't know but I think it should be ok | 09:33 |
jonibo | other arches have it "anywhere" so it's certainly fine | 09:35 |
stekern | ok, good | 09:35 |
jonibo | are you doing the D and A bits write-back when swapping out a TLB entry? | 09:36 |
jonibo | those bits are supposed to be sync'ed to the PTE when the TLB entry goes away | 09:36 |
stekern | no, not currently | 09:36 |
stekern | shouldn't the tlb miss handler do that too then? | 09:36 |
stekern | and isn't the dirty bit supposed to be set when something writes to the page? | 09:37 |
jonibo | the dirty bit should be set when there's a write access to the page (by the CPU) | 09:38 |
jonibo | yes, the TLB miss handler should do that... | 09:38 |
stekern | (which is pretty much impossible to do in sw, unless the miss itself was a write access) | 09:38 |
stekern | but if you do that properly, you can just write-through the tlb, no? | 09:40 |
stekern | in hw I mean? | 09:40 |
stekern | -? | 09:40 |
jonibo | then the TLB would need to look exactly like the PTE...??? | 09:41 |
stekern | umm, yeah.... | 09:41 |
stekern | or just do rmw on the pte on the first access | 09:41 |
jonibo | i wonder if the kernel even uses those bits...??? | 09:42 |
jonibo | maybe it's moot | 09:42 |
jonibo | since it works today, it would seem so :) | 09:42 |
jonibo | will need to dig into that too | 09:42 |
stekern | well, isn't those just performance tweaks? | 09:42 |
stekern | aren't | 09:43 |
jonibo | not necessarily, but in this case, perhaps yes | 09:43 |
jonibo | how does the kernel know a page has been written to? | 09:43 |
jonibo | it'd have to set it RO, wait for the page fault, set the page RW and set the D bit | 09:43 |
jonibo | seems like a fair bit of work to do for the first page write | 09:44 |
jonibo | is PL1 not implemented on mor1kx/or1200? that must be a bug...??? | 09:45 |
jonibo | (my huge page bit) | 09:45 |
stekern | no, not implemented in either (or well, you can write the bit in mor1kx, but it's not interpreted by anything inside mor1kx) | 09:47 |
jonibo | interesting | 09:48 |
jonibo | do you agree that's a bug? | 09:48 |
stekern | I'm not sure what it should do, so I can't agree ;) | 09:49 |
jonibo | HEH | 09:49 |
jonibo | i was just going to check what or1ksim does | 09:49 |
jonibo | or1ksim seems to do nothing with it | 09:50 |
stekern | *but*, an interesting fact is that the defines for xMMUCR in spr_defs.h that were wrong mentioned "Level1 and Level2 page size" and "Vaddr and Paddr widths" | 09:52 |
jonibo | hmmm, interesting | 09:53 |
jonibo | "A PTE translates a virtual memory area into a physical memory area. How much | 09:59 |
jonibo | virtual memory is translated depends on which level the PTE resides. PTEs are either in | 09:59 |
jonibo | page directories with L bit zeroed or in page tables with L bit set. PTEs in page | 10:00 |
jonibo | directories point to next level page directory or to final page table that contains PTEs for | 10:00 |
jonibo | actual address translation." | 10:00 |
jonibo | that's from the arch spec | 10:00 |
jonibo | I think that somehow replaces the old DMMUCR stuff from spr-defs | 10:00 |
jonibo | it's now _defined_ to 8 + 11 bits | 10:00 |
jonibo | _and_ the L bit decides whether it's a huge page or not, _and_ the top-level page directory can contain mixed huge and normal pages | 10:01 |
jonibo | so the TLB needs to know this too, hence PL1 | 10:02 |
jonibo | PL1 isn't optional so that lack of it is a bug | 10:02 |
jonibo | QED | 10:02 |
jonibo | kthxbye :) | 10:02 |
stekern | :) | 10:02 |
jonibo | PTBP in DMMUCR is 22bits, again that's got to be wrong | 10:03 |
jonibo | it must be 19 bits, no? | 10:03 |
stekern | but in a software tlb reload environment, isn't that the job of the software to set that bit then? | 10:04 |
jonibo | yes, absolutely | 10:04 |
jonibo | if L = 1 at level 1 then you need to set that bit | 10:04 |
jonibo | i'm not saying that the SW implementation is correct... I'm just trying to figure out how we can unify the SW and HW implementations as best as possible and how to make it all sane with respect to the arch spec | 10:05 |
stekern | yeah I know | 10:05 |
jonibo | the SW implementation doesn't use PL1 at all either, but then again it's meaningless if the HW doesn't implement it | 10:05 |
jonibo | but how can we make this change backwards-compatible with the or1200 which I presume is abandoned? | 10:06 |
jonibo | kernel config option to special-case OR1200's "broken" behaviour | 10:07 |
jonibo | ? | 10:07 |
stekern | there is already that kind of kernel config | 10:07 |
jonibo | yeah | 10:07 |
jonibo | I suppose we could actually test for it by mapping a huge page and checking if an access to the second normal page within it generates an exception | 10:09 |
jonibo | (it shouldn't) | 10:09 |
jonibo | why do the protection registers (PPI index) map from 1-7, why not 0-7? | 10:12 |
stekern | (PTBP 22 bits) hmm, I'm not sure, if you have 8-bits from VPN, you'll need 22 bits "to fill the rest of the bits" | 10:13 |
stekern | but I guess that's always page aligned anyway | 10:13 |
jonibo | ? | 10:14 |
jonibo | the page tables/directories will always be complete pages (8kb), so using 19 bits is sufficient | 10:14 |
stekern | but is that OS independent? | 10:16 |
jonibo | isn't it? | 10:16 |
stekern | I'm the one asking | 10:16 |
jonibo | i'd say it is | 10:16 |
stekern | but that's, as you said, waste of memory | 10:17 |
jonibo | especially given that we enforce 8 + 11 bit indexing | 10:17 |
jonibo | no, not in this case | 10:17 |
jonibo | you can't realistically allocate less than a page at a time anyway | 10:18 |
stekern | hmm, yeah | 10:20 |
stekern | (PPI) oh... no... | 10:21 |
stekern | ok, so we have a "VALID/PRESENT" bit in the arch spec, PPI = 0... | 10:21 |
stekern | so the whole idea with X W U bits mapping nicely into PPI fails then | 10:24 |
jonibo | i don't see why... | 10:24 |
stekern | because X=0 & W=0 & U=0 was supposed to imply R | 10:25 |
jonibo | right... I see the problem | 10:25 |
jonibo | again, the arch spec needs to be sane there | 10:25 |
jonibo | we need to change it so that there are 8 sets of protection... what's the point of not using set 0 | 10:26 |
jonibo | PPI=0 must be a valid value | 10:26 |
jonibo | PPI=0 = invalid... hmmm | 10:27 |
jonibo | so it might be even trickier yet, or better... | 10:35 |
stekern | so to do that properly according to the arch spec, you'd have to map the protection into numbers, not bitmasks | 10:35 |
jonibo | the ITLB and DTLB uses separate protection regs | 10:35 |
stekern | yes | 10:35 |
jonibo | but they use the same page tables | 10:36 |
stekern | yes | 10:36 |
jonibo | so the meaning of the PPI field varies depending on whether the instruction tlb or DTLB are in play | 10:36 |
stekern | in essence, yes | 10:36 |
jonibo | so we really only have 2 bits relevant for DTLB... Writable/User. And two bits relevant for ITLB: Exec/User | 10:39 |
stekern | but you can of course look at it in the way of IMMUPR[PPI] | DMMUPR[PPI] | 10:39 |
jonibo | wait... for itlb it's just 1 option... user... it's always executable | 10:42 |
jonibo | no it's not | 10:42 |
stekern | well, in the current itlb miss handler it is... but that's wrong | 10:43 |
jonibo | for the itlb it's "invalid" if it's not executable | 10:43 |
jonibo | it makes no sense to load a page into the ITLB if it's not executable | 10:44 |
stekern | true | 10:47 |
stekern | are the PAGE_XXX defines used directly in generic kernel code? or are the pte_xxx accessor functions used for that? | 10:52 |
jonibo | pte_ accessors | 10:53 |
jonibo | I think | 10:54 |
stekern | I'm just wondering how big of a deal it would be to do the PPI field properly | 10:54 |
jonibo | how properly? | 10:55 |
jonibo | do you mean "skip the protection registers"? | 10:56 |
jonibo | that would probably be better | 10:56 |
jonibo | I think part of the question is "has anybody already implemented something around this before"... if not, we don't need to worry so much about compatibility | 10:57 |
jonibo | we could just introduce a new, parallel MMU system with its own registers, too and leave the old stuff as it is... just a thought | 10:58 |
jonibo | i need to get some lunch... back in a bit | 10:58 |
stekern | I meant, like PAGE_PRESENT=1, PAGE_READ=2, PAGE_WRITE=3 etc | 10:59 |
jonibo | what about this for PPI: bit 2 = writable, bit 1 = user, bit 0 = valid/readable/exec | 11:26 |
jonibo | for dtlb valid = readable | 11:26 |
jonibo | for itlb valid = executable | 11:26 |
jonibo | for dtlb valid = readable + maybe user + maybe writable (really only four cases); invalid = 0 | 11:28 |
jonibo | for itlb, valid = executable + maybe user (really only 2 cases), invalid = 0 | 11:28 |
jonibo | so this way we can map the two different uses onto one bitmask... I think it works | 11:29 |
stekern | hmm, yeah, maybe, it kind of feels right at least ;) | 11:33 |
stekern | the only thing that feels a bit off is that READ == EXECUTE, but perhaps that's not an issue in practice | 11:34 |
stekern | I'll play with that in the WIP pte rework I have in my working tree | 11:36 |
stekern | tonight or latest tomorrow morning | 11:36 |
jonibo | read == execute... hmm, yeah, that's kind of ugly | 11:36 |
jonibo | but we'd almost need separate page tables for data and instructions to get around that | 11:37 |
stekern | mmm | 11:41 |
jonibo | but really, I think the best solution would be to redefine the meaning of the PPI field to make it explicitly W/U/X and put the Valid bit elsewhere | 11:43 |
jonibo | then drop the protection regs altogether | 11:44 |
jonibo | the protection regs are overdimensioned anyway | 11:44 |
jonibo | IMMUPR provides 7 sets with 2 bits each! | 11:45 |
jonibo | DMMUPR provides 7 sets with 4 bits each, but many of those combinations really don't make sense | 11:45 |
jonibo | ok, how about this: | 11:58 |
jonibo | what if we made the page table indices 11 + 8 bits instead of 8 + 11... that would get our huge page size down to 2 MB which is more reasonable | 11:58 |
jonibo | (i'd prefer 1 MB, even, if possible) | 11:58 |
jonibo | unfortunately, we'd end up with more second level page tables and they'd not be as full... bad tradeoff | 11:59 |
jonibo | or: the top level page directory is sparsely populated (as it's just 8 bit indexed)... if the L bit is set, it could mean that the PTE is found between the 8 bit entries in the top-level directory and is 11 bit indexed... that PTE would point to a 21 bit huge page (2MB) | 12:04 |
jonibo | that's reasonably elegant | 12:05 |
jonibo | stekern: thanks for the discussion, I tried to summarize the above in an email and sent it to the list(s)... let's continue the discussion there in case it leads to something useful/concrete | 12:56 |
stekern | jonibo: agreed, and thanks for your input too, it helped me a lot! | 13:24 |
stekern | and I'm leaning towards the W/U/X + PRESENT bit, I've got a feeling that !(PPI) is going to be messy to have as PRESENT, but I still want to investigate that a bit more | 13:26 |
--- Log closed Wed Jul 31 00:00:50 2013 |
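
The 8 + 11 bit walk discussed in this log (bits 31:24 of the virtual address index the PGD, bits 23:13 index the PTE table, 8 KB pages) can be sketched roughly as below, including the idea of an L bit in a PGD entry marking a 16 MB linear mapping. This is only an illustration, not the mor1kx or kernel code: the flag positions `PTE_PRESENT` and `PTE_L`, the `walk` function, and the dictionary standing in for physical memory are all assumptions made for the example.

```python
PAGE_SHIFT = 13                    # 8 KB pages: VA bits 12:0 are the page offset
L2_BITS = 11                       # VA bits 23:13 index the PTE table
L1_SHIFT = PAGE_SHIFT + L2_BITS    # VA bits 31:24 index the PGD (8 bits)
FRAME_MASK = 0xFFFFE000            # page frame in bits 31:13
HUGE_MASK = 0xFF000000             # 16 MB frame in bits 31:24

# Hypothetical flag positions, chosen only for this sketch.
PTE_PRESENT = 1 << 0
PTE_L = 1 << 1


class PageFault(Exception):
    """Raised when an entry on the walk is not present."""


def walk(mem, pgd_base, vaddr):
    """Translate vaddr; return (paddr, is_huge).

    mem is a dict mapping physical word addresses to 32-bit entries.
    is_huge is what a TLB refill would record if it tracked page size.
    """
    l1_index = (vaddr >> L1_SHIFT) & 0xFF
    pgd_entry = mem.get(pgd_base + 4 * l1_index, 0)
    if not pgd_entry & PTE_PRESENT:
        raise PageFault(hex(vaddr))
    if pgd_entry & PTE_L:
        # L set at level 1: the entry maps a whole 16 MB region linearly.
        return (pgd_entry & HUGE_MASK) | (vaddr & 0x00FFFFFF), True

    table_base = pgd_entry & FRAME_MASK
    l2_index = (vaddr >> PAGE_SHIFT) & 0x7FF
    pte = mem.get(table_base + 4 * l2_index, 0)
    if not pte & PTE_PRESENT:
        raise PageFault(hex(vaddr))
    return (pte & FRAME_MASK) | (vaddr & 0x1FFF), False
```

With a single PGD entry at index 0xc0 that has both flags set and frame 0, `walk(mem, pgd_base, 0xc0123456)` resolves without touching a second-level table, which is the "pre-created kernel page directory" case raised around 08:49.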
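
The PPI bitmask floated at 11:26 (bit 2 = writable, bit 1 = user, bit 0 = valid, where valid means readable on the DTLB side and executable on the ITLB side) could be checked roughly like this. The function and constant names are made up for the sketch, and treating supervisor access as always allowed on a valid page is an assumption the log does not spell out.

```python
# Proposed PPI encoding from the discussion; names are hypothetical.
PPI_VALID = 1 << 0   # DTLB: readable, ITLB: executable
PPI_USER = 1 << 1    # accessible from user mode
PPI_WRITE = 1 << 2   # writable (only meaningful for the DTLB)


def dtlb_allows(ppi, is_write, is_user):
    """Data access check: PPI == 0 (or valid bit clear) means no access."""
    if not ppi & PPI_VALID:
        return False
    if is_user and not ppi & PPI_USER:
        return False
    if is_write and not ppi & PPI_WRITE:
        return False
    return True


def itlb_allows(ppi, is_user):
    """Instruction fetch check: the write bit is ignored."""
    if not ppi & PPI_VALID:
        return False
    if is_user and not ppi & PPI_USER:
        return False
    return True
```

The sketch also makes the READ == EXECUTE concern visible: any PPI value that lets `dtlb_allows` succeed for a read also lets `itlb_allows` succeed for a fetch at the same privilege level, so a data-only page cannot be expressed with this encoding.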