--- Log opened Fri Aug 02 00:00:53 2013 | ||
olofk | I finally pushed some wishbone interconnect utilities to orpsocv3. With test benches!!! :) | 00:14 |
---|---|---|
stekern | olofk: nice! | 05:39 |
stekern | implementing PL1 isn't as easy as it first seemed | 07:05 |
stekern | at least not in hw | 07:09 |
stekern | since you need to know if the pl1 bit is set before you've read the pl1 bit | 07:10 |
jonibo | stekern: i'm not seeing the problem there... why do you need to know if it's set? | 07:23 |
stekern | because you need to know which bits in the address that is being accessed you should use as index to the tlbs, don't you? | 07:25 |
stekern | there are of course ways to solve that: | 07:28 |
jonibo | well, really you'd have to index the high 8 bits, check PL1 and then index 19 bits | 07:29 |
stekern | 1) assume PL1 is set and reread if not (*very* bad idea) | 07:29 |
jonibo | it's all parallelizable, right? | 07:29 |
jonibo | you'd search 8 bits and 19 bits at the same time and use the 8 bit value if PL1 is set | 07:29 |
stekern | 2) speculatively read with both indexes and let the ones with PL1 set overrule | 07:30 |
jonibo | yes, 2) is what I tried to say above | 07:30 |
stekern | but it's messy, since the second port of the RAM is used for SPR accesses | 07:30 |
jonibo | right, I see | 07:31 |
stekern | 3) cache which tlb's have PL1 set | 07:31 |
jonibo | heh, 3) amounts to having ATB's | 07:32 |
stekern | I think 2 is still the right option, and it's mostly messy for the itlb, since the dtlb spr accesses are exclusive to memory accesses | 07:32 |
stekern | (3 = ATB) I know, I wonder if they are a product of a similar conversation as we are having now =P | 07:33 |
jonibo | yeah, that thought crossed my mind, too | 07:35 |
stekern | it turned out to be simpler to just use the X/U/W bits straight in hw, rather than use a lookup table | 07:38 |
jonibo | good to hear... it seemed to me that that should be the case | 07:39 |
jonibo | but the PL1 thing is tricky | 07:39 |
jonibo | and it gets us 16MB pages that are marginally useful (really a bit too big) | 07:39 |
jonibo | how is the TLB implemented... memory? | 07:41 |
jonibo | block ram? | 07:42 |
stekern | yes | 07:43 |
stekern | https://github.com/openrisc/mor1kx/blob/master/rtl/verilog/mor1kx_immu.v#L320 | 07:44 |
stekern | two of them actually, one for match regs and one for trans regs | 07:45 |
stekern | well, in mor1kx at least | 07:45 |
stekern | iirc they are combined to one in or1200 | 07:46 |
jonibo | hmm... i'm not sure I understand how this works | 07:53 |
jonibo | does it _iterate_ over the match registers, or does it try to match against all 128 in parallel? | 07:54 |
stekern | no, it reads the match register with the index and checks if it matches the current address | 07:55 |
jonibo | ok... | 07:56 |
jonibo | that makes sense | 07:56 |
jonibo | that's what the kernel does, too | 07:56 |
jonibo | the TLB index for the 8-bit indexed virtual address is always 0, right? | 07:57 |
stekern | if you'd have multiple way tlbs you'd have to iterate over the ways (read all of them in parallel) | 07:57 |
jonibo | yup | 07:58 |
jonibo | ((VIRTADDRESS&0xff000000)>>13)mod128 = 0 | 08:00 |
jonibo | so you can only ever have PL1 set at index 0 | 08:00 |
stekern | not sure what you meant by that, but the max indexing from the 19 bit is virt_addr[19:13] | 08:00 |
stekern | so, yes | 08:01 |
stekern | (I think) | 08:01 |
jonibo | yeah, that's right | 08:01 |
stekern | so 3) perhaps isn't such a bad idea after all | 08:01 |
stekern | and doesn't need to amount to ATB, right? | 08:01 |
jonibo | ...yes... | 08:01 |
stekern | or am I missthinking this? | 08:02 |
jonibo | but, it's kind of hokey | 08:02 |
jonibo | there can only ever be 1 huge page in the TLB | 08:02 |
jonibo | if you use _only_ huge pages you end up never using 127 of the 128 slots in the TLB | 08:02 |
jonibo | it's not strictly necessary to "pre-index" the TLB as we do... it's just a simplification | 08:03 |
jonibo | really you'd want an LRU over the 128 slots | 08:03 |
stekern | umm, isn't the LRU for determining which way should go? | 08:03 |
jonibo | you'd want to say "TLB, cache this virt to phys mapping for me with these flags" and the TLB caches the entry into a slot for you | 08:04 |
jonibo | the way LRU is a second LRU | 08:04 |
jonibo | i'm saying we shouldn't have 128 TR and MATCH regs... we should have _one_ and the CPU should figure out where to cache the entry based on an LRU over the 128 available slots | 08:04 |
jonibo | does this even make sense...??? | 08:05 |
jonibo | why would you need ways in that case? | 08:05 |
jonibo | as it stands now, if I copy data from address 0xXXX02000 to address 0xXXX12000, I get a TLB miss for every read/write | 08:07 |
jonibo | umm, these addresses aren't quite right, but hopefully you get my point | 08:08 |
stekern | makes sense, but sounds awfully hard to implement an LRU over 128 entries | 08:08 |
stekern | yeah, I get your point | 08:08 |
jonibo | LRU over 128 entries... yeah, sounds "impossible" | 08:09 |
stekern | so using ways with a more modest amount of entries to LRU over is a trade off | 08:09 |
jonibo | yes | 08:10 |
jonibo | 64 entries w/ 2 ways is already a win and quite easy to implement | 08:10 |
jonibo | 4 ways is trickier, I think, as the LRU is a bit complicated | 08:10 |
stekern | yeah, and 128x2 isn't any harder (I'm running mor1kx with 128 entries) | 08:11 |
jonibo | just looking quickly at x86 for reference, they have effectively an ATB | 08:11 |
jonibo | 14 entries, fully associative for 2/4MB pages | 08:11 |
jonibo | and the main TLB is for 4kb pages | 08:12 |
stekern | yeah, that does solve the problem with "only 1 huge page" | 08:12 |
jonibo | again, it would be nice if the hardware picked the way to cache an entry into for you | 08:13 |
jonibo | based on the LRU bits | 08:13 |
stekern | you mean "it will be nice when" ;) | 08:13 |
jonibo | :) | 08:13 |
jonibo | the arch spec kind of mandates that you pick it yourself | 08:14 |
jonibo | xTLBWyMRz | 08:15 |
stekern | regarding 4-way LRU, Stefan Wallentowitz shared a neat trick for that: http://lists.openrisc.net/pipermail/openrisc/2013-July/001786.html | 08:16 |
jonibo | cool... let's doable then | 08:17 |
jonibo | looks | 08:18 |
stekern | "picking it yourself" <- not sure I understood what you meant by that | 08:19 |
jonibo | you need to specify the way to fill by writing to the appropriate MR/TR registers | 08:19 |
jonibo | there's a unique register for each way | 08:19 |
stekern | ah, ok, I get what you mean. You are thinking in the case of software reload | 08:20 |
jonibo | yes | 08:20 |
jonibo | I still think a single MR/TR pair is sufficient... the HW selects the way for you based on LRU bits and the HW selects the TLB slot to fill based on the VIRT page frame in the MR register | 08:21 |
stekern | yes, that's kind of silly... | 08:21 |
jonibo | you'd write the MR register, and when you write the TR register it triggers a TLB cache entry fill based on the contents of the two registers | 08:22 |
jonibo | if you _really_ need to control it yourself, you could have a third register with 7 bits for slot and 1/2 bits for way... but seems like overkill | 08:23 |
jonibo | today we have 2 x slots x ways TLB registers | 08:23 |
jonibo | with a 4-way TLB, we'd probably find that kernel pages wouldn't be bumped out of the TLB very often which would make it less interesting to pursue things like using a dedicated TLB entry for a huge kernel page that never gets flushed | 08:25 |
jonibo | so what's the way forward then...? multiple ways partially solves the problem solved by huge pages (not completely, but it makes things better). might be better to pursue that than PL1/ATB...? | 08:40 |
jonibo | the PL1 bit is starting to seem like utter nonsense that should be removed from the spec | 08:41 |
jonibo | as things stand now, it could only ever be set on "set 0" | 08:41 |
stekern | thinking about that a bit more, I'm not sure that's true, why can't it be set on other sets? | 08:48 |
jonibo | ((VIRTADDRESS&0xff000000)>>13)mod128 = 0 | 08:49 |
stekern | but that's for the 8kb page case, no? | 08:49 |
stekern | for pl1=1, you would use virt_addr[28:21] to index the tlbs | 08:50 |
stekern | (assuming 128 sets) | 08:50 |
jonibo | umm... | 08:51 |
stekern | for pl1=0 you would use virt_addr[19:13] | 08:51 |
jonibo | ok, sure, if you do it that way then I guess you'd get 128 huge pages too | 08:51 |
jonibo | if you do it this way then you don't even need to store the bottom 7 bits of the page frame because you get it from the TLB index automatically, right? | 08:54 |
jonibo | for pl=1 you'd use bits virt_addr[30:24] to find the TLB set and you'd only need to match the high bit in the entry's VPN field, so the VPN field could be 1 bit there | 08:58 |
jonibo | for pl=0, you'd only have to match the 19-7=12 high bits of the VPN field | 08:59 |
jonibo | the VPN field only needs to be 12 bits, really | 08:59 |
jonibo | of course, this depends on how many sets the TLB has | 09:02 |
jonibo | for 64 sets you'd need 13 bits | 09:02 |
juliusb | olofk: nice one with the orconfig.org! | 09:04 |
stekern | yes, virt_addr[30:24], where did I get 28:21 from? | 09:34 |
stekern | and yeah, you could probably save a couple of bits by reusing the ones from vpn | 09:44 |
stekern | (otoh, atm mor1kx saves reserved bits to mem too, so you'd probably want to optimize those away first ;)) | 09:44 |
jonibo | VPN at 13 bits + PPN at 19 bits will all fit into 1 word | 09:53 |
jonibo | and then you could move the "flag/status/etc" bits to a separate area... does that buy you anything? | 09:53 |
jonibo | or if the VPN is 12 bits (which works for 128 entry TLB), then you can get the "PL1" flag in there as well | 09:54 |
stekern | if you're not going somewhere special with having words, block rams don't need to be built up of 32-bit words | 09:56 |
jonibo | ok, then there may be real saving to be had | 09:57 |
jonibo | i'll be back in 30 minutes | 09:57 |
stekern | but having them separated into translate and match helps with SPR/READ and write, so I'm inclined to keep that | 10:01 |
stekern | brushing away unused bits is just something I haven't gotten around to doing | 10:02 |
stekern | *SPR read/write | 10:02 |
stekern | well, write anyway, read isn't much of a problem | 10:05 |
juliusb | alright - it really is official - I'm going to talk at OHS2013 - http://2013.oshwa.org/schedule/ | 10:22 |
juliusb | (only for 6 minutes but it still counts!) | 10:23 |
olofk | juliusb: I see that you're on the "Democratizing Knowledge" track. Considering what elitist bastards we all are, I'm not sure they put your talk in the right place ;) | 14:31 |
juliusb | olofk: haha yes, very true, I'll be sure to ask the organisers to vet the audience beforehand and eject those who are not worthy | 17:39 |
olofk | juliusb: That's the spirit :) | 18:21 |
-!- Netsplit *.net <-> *.split quits: poke53282, olofk | 19:20 | |
-!- Netsplit over, joins: olofk | 19:23 | |
stekern | I think I've got muxing of spr rw and normal itlb accesses onto one port working now | 19:40 |
olofk | In the or1200-basic test, there is an access to memory address 0x12345678. Is that supposed to exist? Wouldn't that require an insane amount of memory to simulate? | 23:43 |
olofk | Or does it assume that the memory wraps? | 23:46 |
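
The set-index and tag-width arithmetic from the PL1 discussion (08:49 to 09:02) can be checked with a small sketch. It assumes a 32-bit virtual address, 8 KB pages for PL1=0 and 16 MB pages for PL1=1, as discussed in the log. The names `tlb_set_index` and `tag_bits` are made up for illustration and are not taken from mor1kx or or1200.

```python
PAGE_SHIFT_SMALL = 13  # 8 KB pages: page offset is virt_addr[12:0]
PAGE_SHIFT_HUGE = 24   # 16 MB pages (PL1 set): page offset is virt_addr[23:0]


def tlb_set_index(virt_addr: int, pl1: bool, num_sets: int = 128) -> int:
    """Return the TLB set selected by virt_addr for the given page size."""
    shift = PAGE_SHIFT_HUGE if pl1 else PAGE_SHIFT_SMALL
    return (virt_addr >> shift) % num_sets


def tag_bits(pl1: bool, num_sets: int = 128) -> int:
    """Number of VPN bits that still have to be stored and compared.

    The low VPN bits are implied by the set index, so only the
    remaining high bits are needed in the match entry.
    """
    shift = PAGE_SHIFT_HUGE if pl1 else PAGE_SHIFT_SMALL
    vpn_bits = 32 - shift
    index_bits = num_sets.bit_length() - 1  # log2; num_sets is a power of two
    return vpn_bits - index_bits


if __name__ == "__main__":
    # A 16 MB aligned address always lands in set 0 under 8 KB indexing,
    # which is the ((VIRTADDRESS&0xff000000)>>13)mod128 = 0 observation.
    print(tlb_set_index(0xAB000000, pl1=False))
    # Indexing huge pages with virt_addr[30:24] spreads them over the sets.
    print(tlb_set_index(0xAB000000, pl1=True))
    print(tag_bits(pl1=False), tag_bits(pl1=True), tag_bits(pl1=False, num_sets=64))
```

With 128 sets this gives a 12-bit tag for 8 KB pages and a 1-bit tag for 16 MB pages, and 13 bits for 8 KB pages with 64 sets, matching the numbers quoted in the conversation.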
--- Log closed Sat Aug 03 00:00:55 2013 |
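
The copy example from 08:07 (a TLB miss on every read/write when source and destination pages share a set) and the 64 sets x 2 ways suggestion from 08:10 can be illustrated with a toy model. This is only a sketch under stated assumptions: 8 KB pages, perfect LRU within a set, and hypothetical names (`SetAssocTLB`, `copy_misses`). It models behaviour, not the mor1kx hardware.

```python
from collections import OrderedDict

PAGE_SHIFT = 13  # 8 KB pages


class SetAssocTLB:
    """Toy TLB: num_sets sets, each holding up to `ways` VPNs in LRU order."""

    def __init__(self, num_sets: int, ways: int):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, virt_addr: int) -> bool:
        """Look up virt_addr; return True on a hit, refill on a miss."""
        vpn = virt_addr >> PAGE_SHIFT
        entries = self.sets[vpn % self.num_sets]
        if vpn in entries:
            entries.move_to_end(vpn)     # mark as most recently used
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used way
        entries[vpn] = True
        return False


def copy_misses(tlb: SetAssocTLB, src: int, dst: int, words: int = 16) -> int:
    """Simulate a word-by-word copy and return the number of TLB misses."""
    for i in range(words):
        tlb.access(src + 4 * i)  # read from the source page
        tlb.access(dst + 4 * i)  # write to the destination page
    return tlb.misses


if __name__ == "__main__":
    # Both pages map to the same set in either configuration.
    src, dst = 0x00002000, 0x00102000
    print(copy_misses(SetAssocTLB(num_sets=128, ways=1), src, dst))
    print(copy_misses(SetAssocTLB(num_sets=64, ways=2), src, dst))
```

In the single-way case the two pages keep evicting each other, so every access misses. With two ways only the first touch of each page misses, which is the win described for 64 entries with 2 ways.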