jonibo|laptop | hi stekern | 11:44 |
---|---|---|
stekern | hello | 11:44 |
jonibo|laptop | figured it might be as easy to discuss here as by mail | 11:44 |
jonibo|laptop | the cache invalidate register, as far as I can tell, only works at startup today because it's a 1-way cache | 11:45 |
stekern | nah, it works because it's not spec-compliant | 11:45 |
jonibo|laptop | how so? | 11:46 |
stekern | the spec says that _only_ the address that is written to CBIR should be invalidated, but or1200 wipes any tag matching | 11:47 |
jonibo|laptop | ok, that's broken | 11:47 |
jonibo|laptop | but it's a 1-way cache | 11:47 |
jonibo|laptop | so it's the same thing | 11:47 |
jonibo|laptop | it should match on the tag, find the way it corresponds to, and invalidate that cache-way only | 11:48 |
stekern | is it? I don't think so | 11:48 |
jonibo|laptop | but for a 1-way cache, _any_ tag will be the only way | 11:48 |
jonibo|laptop | or maybe not...??? let me think | 11:49 |
jonibo|laptop | actually, you're right... it's _really_ broken | 11:50 |
stekern | I don't think 1-way or multi-way makes any difference in this case | 11:50 |
jonibo|laptop | no, it should not invalidate the line if the tag doesn't match | 11:50 |
jonibo|laptop | so, you're right | 11:50 |
jonibo|laptop | anyway, it should still be invalidated at reset, automatically | 11:50 |
jonibo|laptop | otherwise we need to loop over the entire physical memory space | 11:50 |
stekern | but the problem is, to do it according to the spec, you'd have to loop through the whole memory to invalidate the whole cache | 11:51 |
jonibo|laptop | yeah, that's what you want to avoid doing | 11:51 |
jonibo|laptop | are you _for_ the invalidate at reset? i interpreted your mail as you being against it | 11:52 |
jonibo|laptop | automatic, I mean | 11:52 |
stekern | so an "invalidate entire cache" command would be needed for that | 11:52 |
jonibo|laptop | nah, that's not needed... you don't want to bother software with this at all | 11:52 |
jonibo|laptop | it should just be done automatically at reset | 11:53 |
jonibo|laptop | who wants a cache with unknown state at startup anyway? | 11:53 |
jonibo|laptop | you want the whole thing invalid | 11:53 |
stekern | I'm for it in theory, but it'll bloat the hardware | 11:53 |
jonibo|laptop | yeah, I guess so | 11:54 |
jonibo|laptop | today we use the fact that the or1200 is broken w.r.t. BIR to get a quick invalidate at startup | 11:55 |
jonibo|laptop | but it will be brutal if we need to do every tag individually | 11:56 |
stekern | how do other architectures handle the dilemma? | 11:56 |
jonibo|laptop | not sure... i'm pretty sure most caches come up invalidated at reset, though | 11:57 |
stekern | I know lm32 invalidates on startup, and it doesn't have a fine-grained invalidate/flush. If you invalidate/flush, the whole cache goes | 11:57 |
jonibo|laptop | that's no good | 11:58 |
stekern | no | 11:58 |
jonibo|laptop | for DMA you want to be able to flush just the line you've modified | 11:58 |
stekern | but I wonder, is it so bad that you might get some collateral damage when you invalidate? (i.e., the way or1200 works) | 11:59 |
jonibo|laptop | not sure... i've got no numbers | 11:59 |
jonibo|laptop | but maybe | 12:00 |
jonibo|laptop | it's not a performance win in any case | 12:00 |
jonibo|laptop | for a multi-way cache, it's worse | 12:00 |
jonibo|laptop | anyway, i've got to run... i'm for hardware invalidate and spec-compliant BIR | 12:03 |
jonibo|laptop | i'll leave it at that | 12:03 |
jonibo|laptop | even if the hardware invalidate is implemented as an instruction loop in ROM that just iterates over all memory and invalidates the cache for it | 12:04 |
jonibo|laptop | (the entire memory space, that would be... :) ) | 12:04 |
juliusb_ | so this whole thing is solved by my proposal here: http://opencores.org/or1k/Architecture_Specification#Cache_Block_Invalidate_Behaviour_Clarification | 12:57 |
juliusb_ | basically we should just observe the set number of the address written to the BIR | 12:57 |
juliusb_ | and then invalidate that block | 12:57 |
juliusb_ | all done, simple, compatible with what we have now, and makes it simple for software to loop through and invalidate each set | 12:58 |
juliusb_ | regarding multiway - I've been asking this question for over a year now ;) | 12:58 |
juliusb_ | i say just invalidate all ways, simplest thing | 12:58 |
juliusb_ | as there's no way-specific block invalidate reg | 12:58 |
juliusb_ | but.... the way you guys just discussed is more sensible - only invalidate if the address is in cache | 12:59 |
stekern | yes, and jonibo has a point, you'll have the performance penalty | 13:00 |
juliusb_ | but my way is less of a burden on hardware implementation (i'm not a fan of putting in heaps of logic just for single use at reset) | 13:00 |
juliusb_ | performance penalty? | 13:00 |
juliusb_ | regarding all of these changes to OR1K - I'm more inclined to go with something which, despite maybe not being 100% the best approach, gives us maximum backward compatibility with minimum amount of work to adapt existing software and models to the new spec | 13:02 |
stekern | yes, since you've got the collateral damage of addresses you did not mean to invalidate | 13:02 |
juliusb_ | so, in this case, defining the behaviour of the cache BIR means we don't have to change anything in OR1200 or software | 13:02 |
juliusb_ | (but we're clarifying the behaviour for future developers and users) | 13:03 |
stekern | I wonder how much logic is actually needed to do the invalidate on reset, should only be a counter basically (and some control logic for the state in the fsm) | 13:03 |
stekern | is it only during reset that you'd want to invalidate the whole cache? | 13:05 |
juliusb_ | You'd also want to have it capable of being run by poking SPR bits too | 13:13 |
juliusb_ | but really, I'm against putting in this sort of stuff - I say, for the simplicity of the implementation, we should leave this to software | 13:14 |
juliusb_ | we're going to need it for all the memories, and that'll add up | 13:14 |
juliusb_ | as for overall transistor count for the invalidation, though - I'd like to know which is smaller: the 8 instructions it takes in software to do it (8*32-bits = 256 FFs, essentially) or the hardware (probably a counter as wide as the number of addresses we need to clear and some muxing) | 13:16 |
juliusb_ | clearly the reset-by-FSM thing will be more power efficient and be quicker, but i'm still not sold on moving chunks of on-shot initialisation stuff into HW | 13:17 |
juliusb_ | one-shot | 13:17 |
stekern | I agree on that, on many FPGA targets you could parameterize it away though | 13:21 |
stekern | (if it is really only needed on reset) | 13:22 |
jonibo|laptop | juliusb_: I don't care much for your BIR clarification... I can accept that the or1200 is broken and does it that way, but let's not generalize that error | 15:26 |
jonibo|laptop | the reset case is special, let's ignore that for now | 15:26 |
jonibo|laptop | but at runtime you want a sane invalidation behaviour | 15:26 |
jonibo|laptop | if the line's not in cache, it's a no-op | 15:26 |
jonibo|laptop | and for multi-way, it's inelegant to trample over cache lines that may be in use by other processes | 15:27 |
jonibo|laptop | it's a conundrum, I know, and the or1200 implementation is fine as long as it's documented... but for the next generation we might be able to come up with something more elegant... just not sure what that should be at this point | 15:27 |
juliusb_ | well, i say my described behaviour should be fine - it's more of a hardware-centric view, I'll accept that (basically use that BIR as a line invalidate interface) | 16:14 |
jonibo|laptop | it's fine for the reset case | 16:15 |
jonibo|laptop | it's less nice for regular operation | 16:15 |
juliusb_ | so remove any idea of it being "intelligent" i guess | 16:15 |
jonibo|laptop | like I said, the or1200 does it that way... that's an implementation detail | 16:16 |
jonibo|laptop | that's fine | 16:16 |
jonibo|laptop | i don't like the idea of generalizing it, though | 16:16 |
juliusb_ | yes, but it was done that way for a reason | 16:16 |
jonibo|laptop | i understand that... it's less than optimal | 16:17 |
juliusb_ | and that reason is to avoid reset logic, and probably to get around a sloppily defined cache system | 16:17 |
juliusb_ | or rather, work with a sloppily defined cache system | 16:17 |
jonibo|laptop | it's not that sloppily defined... | 16:17 |
jonibo|laptop | in fact, it's pretty well-defined in the spec | 16:18 |
jonibo|laptop | the only problem is the reset case | 16:18 |
stekern | i was just about to say that | 16:18 |
jonibo|laptop | as it stands now SW is required to do a _long_ loop to invalidate... that's fine per se | 16:18 |
jonibo|laptop | it just makes for a long startup time, but it's a one-time cost | 16:18 |
jonibo|laptop | and the or1200 solves that for the time being with a "less than optimal" solution... but it works | 16:19 |
jonibo|laptop | but I don't care to see that encoded in the spec | 16:19 |
jonibo|laptop | because I hope that someday somebody will come along and implement this properly... and then the spec shouldn't stand in their way | 16:19 |
juliusb_ | hmm, no, there's issues with what happens when the EA written into BIR isn't in the cache (it's not clear in the spec) | 16:19 |
jonibo|laptop | what? it's a no-op | 16:20 |
juliusb_ | spec says that EA is "EA that targets byte inside cache block" | 16:20 |
jonibo|laptop | isn't that obvious | 16:20 |
juliusb_ | no it isn't | 16:20 |
juliusb_ | because that just says targets byte inside cache block | 16:20 |
jonibo|laptop | ok... it seems obvious to me | 16:20 |
juliusb_ | that says nothing about matching EA to the tag address and invalidating only in that case | 16:20 |
juliusb_ | that definition says to me it does an address mapping of the EA to the appropriate bytes in the cache | 16:20 |
jonibo|laptop | yeah, but think about it... what's the point of an invalidate? either the line is in cache and you want a fetch next time it's accessed, or it's not in cache in which case you get that anyway | 16:21 |
juliusb_ | yes, true, but you still might have a case where you want to entirely clear the cache for a context switch or something | 16:21 |
juliusb_ | which is basically the reset case | 16:21 |
jonibo|laptop | no, never... | 16:22 |
jonibo|laptop | the cache is physically tagged | 16:22 |
jonibo|laptop | you never clear the cache | 16:22 |
jonibo|laptop | the MMU makes sure that processes can't access others' cached data | 16:22 |
juliusb_ | OK | 16:22 |
juliusb_ | yes of course, sounds good | 16:22 |
juliusb_ | so, as always, basically I'm arguing for the thing which is the simplest to implement in HW :) | 16:23 |
jonibo|laptop | i know exactly where you're coming from though... I had this conversation with myself last year! | 16:23 |
juliusb_ | current system is | 16:23 |
jonibo|laptop | yeah, and I'm arguing for a "correct" spec and "cutting corners in implementations is fine as long as you respect the spec" | 16:23 |
jonibo|laptop | ...which is what we have with the or1200 | 16:23 |
jonibo|laptop | almost | 16:23 |
juliusb_ | so i'm arguing to adapt the spec to do what OR1200 does now. To do it the way it should be done would require 1) some reset logic or some clear-all-cache-block-tags feature, and 2) something to read and compare the block tag when BIR is written to, to determine if it should be done or not | 16:23 |
jonibo|laptop | actually, not "almost"... it does respect the spec | 16:24 |
jonibo|laptop | yeah, if it were optimal, you'd have that | 16:24 |
juliusb_ | and then there's the issue of multi-way | 16:24 |
jonibo|laptop | but the or1200 cuts corners on 2) and invalidates every time... that's fine, it's just less than optimal | 16:24 |
juliusb_ | which I've, again, gone with the simplest, quickest, dirtiest way of handling it | 16:24 |
juliusb_ | :) | 16:24 |
jonibo|laptop | and your case 1) is a sw problem | 16:25 |
jonibo|laptop | we don't even have multi-way | 16:25 |
jonibo|laptop | ...in implementation, I mean | 16:25 |
juliusb_ | I think stekern has it working somewhere | 16:25 |
jonibo|laptop | ok | 16:25 |
juliusb_ | i'm quite close to being able to publish my new CPU - got word that a release has been drafted and i've just got to go through the process of showing it to people that can OK it | 16:26 |
juliusb_ | so i'd hope within a week or two | 16:26 |
juliusb_ | :) | 16:26 |
jonibo|laptop | yay! | 16:26 |
juliusb_ | ... but that's an aside, but I think stekern was playing with multi-way cache in that | 16:26 |
juliusb_ | i really need to run | 16:26 |
stekern | yes, that's correct, 2-way is (optionally) available there | 16:27 |
juliusb_ | :) | 16:27 |
jonibo|laptop | ok... hope you see my point, though | 16:27 |
jonibo|laptop | it's an issue for the spec | 16:27 |
juliusb_ | but, quick and dirty multi-way invalidate works well, too, and I imagine would be very simple to implement | 16:27 |
jonibo|laptop | it's _not_ an issue for the spec :) | 16:28 |
jonibo|laptop | it's for the implementation documentation | 16:28 |
juliusb_ | well, I want the implementations and the spec to be in harmony | 16:28 |
jonibo|laptop | no... | 16:28 |
juliusb_ | so we have to change one or the other | 16:28 |
jonibo|laptop | no, the implementation is just sub-optimal... but it's still correct | 16:29 |
jonibo|laptop | i don't see an issue here | 16:29 |
juliusb_ | no but it's not clear from spec how you clear it at reset | 16:29 |
stekern | What does the "Missing cache block in the local processor does not cause any action" mean? | 16:29 |
juliusb_ | or it's not clear what happens for multi-way | 16:29 |
jonibo|laptop | on the or1200 we can cheat because the BIR isn't so clever... on another implementation we can't cheat | 16:30 |
juliusb_ | stekern: mmm, yes, perhaps that's the sentence saying it should not do anything if the line isn't there | 16:30 |
jonibo|laptop | stekern: it means "no-op" | 16:30 |
juliusb_ | mmm ok | 16:30 |
juliusb_ | I'm wrong! | 16:30 |
juliusb_ | :) | 16:30 |
juliusb_ | in that case the OR1200 is wrong | 16:30 |
jonibo|laptop | not "wrong", "just-enlightened" :) | 16:30 |
juliusb_ | because it doesn't do nothing, it does invalidate the block | 16:30 |
juliusb_ | so in this case, one or the other must change | 16:31 |
juliusb_ | and im really late!! | 16:31 |
juliusb_ | bbl | 16:31 |
jonibo|laptop | ok, but I don't agree it has to change | 16:31 |
juliusb_ | (cache invalidate discussions are surprisingly exciting) | 16:31 |
jonibo|laptop | it's a nice "cheat" | 16:31 |
jonibo|laptop | :) | 16:31 |
stekern | well, tbh, my cache implementation cheats too ;) | 16:31 |
jonibo|laptop | it just causes a performance degradation | 16:31 |
jonibo|laptop | cheats are fine as long as they are fundamentally correct... | 16:31 |
jonibo|laptop | poor performance is another issue altogether | 16:32 |
jonibo|laptop | stekern: how does your implementation cheat? | 16:32 |
stekern | it does the invalidation the or1200 way | 16:32 |
jonibo|laptop | right... which is a performance hit, but nothing else | 16:33 |
stekern | yes, it's not gonna break software that expects correct behaviour | 16:33 |
jonibo|laptop | how? | 16:33 |
jonibo|laptop | I don't see why that would _break_ anything | 16:33 |
stekern | that's why it's an "OK" cheat | 16:33 |
jonibo|laptop | oh sorry, misread you | 16:33 |
stekern | :) | 16:33 |
jonibo|laptop | yeah, I agree | 16:33 |
jonibo|laptop | exactly, it's an implementation detail... those are fine... these get documented in the release notes and then you're done with it | 16:34 |
jonibo|laptop | but let's not update the arch spec to conform to the implementation just because somebody decided to cut that particular corner | 16:35 |
jonibo|laptop | as for the cache invalidation at reset... I'd say we just defer that discussion until we have an implementation that actually has a BIR that considers the address tag in question | 16:37 |
jonibo|laptop | ...it's moot until then | 16:37 |
-!- Netsplit *.net <-> *.split quits: jonibo | 23:54 |
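
For readers following the log, here is a minimal Python sketch of the two invalidate behaviours argued over above: the tag-matching one jonibo reads out of the spec (a miss is a no-op) and the index-only one that the or1200 is described as using (juliusb_'s "invalidate all ways" variant for multi-way). It is a toy software model, not OR1200 RTL. The names `Cache`, `invalidate_by_tag`, `invalidate_by_index` and the sizes `LINE_BYTES` and `NUM_SETS` are all assumptions made for illustration.

```python
from dataclasses import dataclass

LINE_BYTES = 16   # assumed cache block size in bytes (illustrative)
NUM_SETS = 512    # assumed number of sets (illustrative)


@dataclass
class Line:
    valid: bool = False
    tag: int = 0


class Cache:
    """Toy physically tagged cache that tracks only valid bits and tags."""

    def __init__(self, ways=1):
        self.sets = [[Line() for _ in range(ways)] for _ in range(NUM_SETS)]

    @staticmethod
    def split(addr):
        """Return (set index, tag) for a physical address."""
        block = addr // LINE_BYTES
        return block % NUM_SETS, block // NUM_SETS

    def fill(self, addr, way=0):
        index, tag = self.split(addr)
        self.sets[index][way] = Line(True, tag)

    def is_cached(self, addr):
        index, tag = self.split(addr)
        return any(l.valid and l.tag == tag for l in self.sets[index])

    def invalidate_by_tag(self, addr):
        """Tag-matching behaviour: a miss does nothing."""
        index, tag = self.split(addr)
        for line in self.sets[index]:
            if line.valid and line.tag == tag:
                line.valid = False

    def invalidate_by_index(self, addr):
        """Index-only behaviour: ignore the tag, clear every way in the set."""
        index, _ = self.split(addr)
        for line in self.sets[index]:
            line.valid = False


def invalidate_all_by_index(cache):
    """Startup loop under index-only behaviour: one write per set."""
    writes = 0
    for index in range(NUM_SETS):
        cache.invalidate_by_index(index * LINE_BYTES)
        writes += 1
    return writes


if __name__ == "__main__":
    cache = Cache(ways=2)
    a = 0x1000
    b = a + NUM_SETS * LINE_BYTES   # maps to the same set with a different tag
    cache.fill(a, way=0)
    cache.fill(b, way=1)
    cache.invalidate_by_tag(a)      # b survives
    print(cache.is_cached(a), cache.is_cached(b))
    cache.fill(a, way=0)
    cache.invalidate_by_index(a)    # b is collateral damage
    print(cache.is_cached(a), cache.is_cached(b))
```

Under these assumed sizes the index-only startup loop needs 512 writes. Clearing the same cache through a strictly tag-matching BIR would mean touching every block address that could be cached, which for an assumed 32-bit physical address space and 16-byte blocks is 2^32 / 16 = 2^28 writes. That gap is the "long loop" trade-off discussed in the log.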