IRC logs for #openrisc Thursday, 2016-08-18

--- Log opened Thu Aug 18 00:00:30 2016
kc5tjaDepends on the longest gate propagation delay from input to output.00:28
kc5tjaOops, wrong channel.00:29
kc5tjaAlthough, that does apply to ZipCPU's question concerning why division (or multiplication for that matter) would slow the clock down on a CPU.00:30
wallentoZipCPU: I think 100-125MHz is probably a good target on common FPGAs01:38
wallentoCurrently I am running on 80.3 MHz01:38
ZipCPUstekern, wallento: Thank you!09:34
wallentoprobably you can drive it a lot higher09:35
ZipCPUWell ... that was part of my question.  Has anyone driven it much higher?10:12
ZipCPU(Question addressed to wallento)10:13
wallentonot to my knowledge10:13
ZipCPUGot it.  Thank you!10:13
ZipCPUDo you know who did the dhrystone benchmark work?10:13
wallentono, I don't10:18
ZipCPUwallento: Thanks.11:48
ZipCPUkc5tja: I got that much, but the question really comes down to how do you separate logic across clocks?11:52
ZipCPUDo you increase the number of clocks a piece of logic takes, or slow down the clock to handle the extra logic?11:52
manaarwhen I am loading LINUX in the de0_nano board via telnet I am facing core dumped error given as -  Info : JTAG tap: or1k.cpu tap/device found: 0x020f30dd (mfg: 0x06e, part: 0x20f3, ver: 0x0) Error: or1k_assert_reset: implement me Error: or1k_deassert_reset: implement me openocd: driver.c:191: interface_jtag_add_dr_scan: Assertion `field == out_fields + scan->num_fields' failed. Aborted (core dumped)13:18
manaarplease help me to sort out this problem..13:19
kc5tjaZipCPU: The former approach implies deeper pipelines, while the latter implies wider fan-outs on gates (which increases loads).  Either way, you're increasing latency.13:48
kc5tjaYou can recover throughput in both designs: in the pipeline case, obviously, streamline your code so that more instructions sit between division and the use of the results it produces.13:49
ZipCPUCertainly!  But the deeper pipeline might mean that nothing else gets delayed, whereas slowing the clock down will delay *all* of the logic running off of that clock.13:49
kc5tjaIn the latter approach, make it a separate functional unit, and talk to it via queues and use an asynchronous handshake to indicate when the division is done.13:49
kc5tjaThe deeper pipeline also means greater impacts on mispredicted branches (if you predict them at all).13:50
kc5tjaThere's tradeoffs no matter which direction you choose; but, the right approach will work better for a certain class of applications.13:50
kc5tja(Heh, tautology of the month right there.)13:51
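The trade-off discussed above (more clocks per operation versus a slower clock) can be put into numbers. The following is a minimal sketch, not anything from the channel: the delay values are made-up assumptions, and register setup and clock-to-output overhead are ignored.

```python
def fmax_mhz(stage_delays_ns):
    """Maximum clock frequency, limited by the slowest register-to-register stage."""
    return 1000.0 / max(stage_delays_ns)


def latency_ns(stage_delays_ns):
    """Total latency: every stage costs one full clock period."""
    return len(stage_delays_ns) * max(stage_delays_ns)


# Hypothetical numbers: a divider with 24 ns of combinational logic.
single_cycle = [24.0]           # slow the clock down to fit the divider
pipelined = [8.0, 8.0, 8.0]     # split the divider across three clocks

for name, stages in (("single-cycle", single_cycle), ("pipelined", pipelined)):
    print(name, round(fmax_mhz(stages), 2), "MHz,", latency_ns(stages), "ns per divide")
```

With these assumed delays the divide takes 24 ns either way, but only the slow-clock design forces every other piece of logic on that clock down to the slower rate.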
ZipCPUYeah ... you were up late last night ;)14:03
ZipCPU<Rant>Remember how I messed up the timing miserably on the DDR3 memory controller so bad I'm having to start from scratch?  Since starting from scratch, I just realized I'm working with a DDR3-1333 (9-9-9) memory rather than the DDR3-1600 (11-11-11) memory that I thought I was working with. Time to redesign and start over *again*.  </Rant>  Sigh.14:11
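For readers comparing the two speed grades mentioned above: the first number in each triple is the CAS latency in clock cycles, and the clock runs at half the data rate. A small sketch of the conversion to nanoseconds follows; the function names are illustrative, and the nominal DDR3-1333 data rate is taken as 4000/3 MT/s.

```python
def tck_ns(data_rate_mtps):
    """Clock period in ns. DDR moves two transfers per clock, so f_clk = rate / 2."""
    return 2000.0 / data_rate_mtps


def cas_latency_ns(data_rate_mtps, cl_cycles):
    """CAS latency expressed in nanoseconds instead of clock cycles."""
    return cl_cycles * tck_ns(data_rate_mtps)


print(cas_latency_ns(4000.0 / 3.0, 9))   # DDR3-1333, CL 9
print(cas_latency_ns(1600.0, 11))        # DDR3-1600, CL 11
```

The latencies in nanoseconds come out close to each other; what changes for a controller is the clock period and therefore every timing parameter counted in cycles.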
jeremybennettZipCPU: Good luck with your Dhrystone work14:22
ZipCPUThanks jeremybennett.  I really appreciate your answers to my questions today!14:22
ZipCPUjeremybennett: I just started reading about superoptimization yesterday, while looking up your website.  It looks like a wonderfully fun project, and I look forward to hearing how well you do!14:23
jeremybennettZipCPU: It's all open. We're particularly keen to get others to pick up the work. How about a superoptimizer for OpenRISC?14:24
ZipCPUSo ... how hard is it to make it work for newer CPUs?14:24
ZipCPUi.e.: CPUs that are not in the current set.  Extending GCC is certainly doable, how difficult would it be to extend a superoptimizer?14:26
kc5tjaZipCPU: :(  I feel for you.  I just wish I could help, but I'm waaaay behind in my own project at the moment.14:32
kc5tjaI wrote a tool (in Shen, because learning new languages is [almost] always fun) that I hope will let me realize my goals faster.14:32
kc5tjaIt translates a table-oriented state machine description mapping inputs to outputs into corresponding Verilog code.14:33
kc5tjaAvoids case statements (thus avoiding priority encoder logic), and allows me to write "partial outputs" without fear of introducing latches (which consume LCs behind my back and w/out my knowledge).14:33
ZipCPUkc5tja: I thought I might get a touch of compassion from you.  :)   Thanks for understanding.14:34
kc5tjaIt also lets me design multi-hot outputs, which I'm hoping will let me minimize the size of the logic even further.14:34
ZipCPU"multi-hot outputs"?14:34
ZipCPUI'm all about minimizing logic size ... but ... can you explain please?14:35
kc5tjaNormally, for a given state S and inputs I, you have one and only one "next state" Sn and a set of outputs to drive stuff.  That's "single-hot" outputs.14:35
kc5tjaMulti-hot outputs means a *number* of different input states can trigger a *number* of sets of outputs.14:36
kc5tjaSo, for example...14:36
kc5tjaMost RISC-V instructions in my CPU take 4 cycles to execute.  So, depending on the current instruction register value, and the current T-counter value, I trigger things like "load register address from Rs1", "load register address from Rs2", "set ALU function code to blah", etc.14:37
kc5tjaHowever, at the same time, since none of this uses the bus, I can also pre-fetch the next value of the instruction register.14:37
kc5tjaIn effect, pipelining, but it's all controlled from a single batch of logic.14:37
kc5tjaSo, OP rd, rs1, imm12 or OP rd, rs1, rs2 (where OP is any ALU operation) can drive the ALU control buses and such with one batch of logic, and drive instruction fetch with another batch of logic.14:39
kc5tjaHowever, Lx and Sx (loads and stores, where x determines operand size), can't do this, so I cannot just generalize "instruction fetches always happen during T0-T3."14:39
ZipCPUHow is this different from traditional HDL programming?  From Mealy/Moore state machines, etc.14:39
kc5tjaIt *IS* state machine design.14:40
kc5tjaThe convenience is how I express it.14:40
kc5tjaLike I said, no more case statements.  They're outta here.14:40
ZipCPUSo it's all expressed as ... memories?14:40
kc5tjaThey take up *so* much room in the source text (>100 LOC) that it's nearly impossible to keep it all straight in my head.14:40
kc5tjaIt's expressed as a truth table.14:41
kc5tjaBut, more symbolic.14:41
kc5tjaSo, e.g.,14:41
kc5tja[on [[T 3'b000] ~irq ~reset valid] regadr_ir1 [nextT 3'b001]]14:42
kc5tja(excuse the S-expression syntax)14:42
kc5tjaThis is saying "when T = 3'b000 and not IRQ and not RESET and instruction valid, then set regadr to ir1 (which is a subfield of the instruction register), and nextT to 3'b001."14:42
kc5tjaI don't have to worry about setting unused outputs to 0 all the time.14:42
kc5tjaI don't have to worry about unfulfilled case statements generating latches without my knowledge.14:43
kc5tjaElsewhere, I can have instruction fetch logic fire on the same inputs.14:43
kc5tja[on [[T 3'b000] valid] adr_pc vpa size_2]14:43
kc5tja[on [[T 3'b001] ~ack] adr_pc vpa size_2 [nextT 3'b001]]14:44
kc5tja[on [[T 3'b001] ack] adr_pc vpa size_2 [nextT 3'b010]]14:44
kc5tja..etc..14:44
kc5tjaI can focus on one functional aspect at a time.14:44
kc5tjaAnd the expression is concise, yet expressive.14:44
kc5tjaIt's just a way of representing AND/OR logic in a better format that I hope will make things easier on me.14:45
kc5tjaBecause right now, Verilog is doing a piss-poor job of letting me progress beyond a certain size of logic complexity.14:45
kc5tjaI hope to have a real-world and (hopefully) working example come tomorrow.14:46
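To make the table-to-Verilog idea above concrete, here is a rough sketch of what such a translation could look like. It is not kc5tja's Shen tool and its output format is a guess: the rules from the log are hand-transcribed into Python tuples rather than parsed from S-expressions, and each output is emitted as an OR of AND terms in a continuous assignment, so an output that no rule mentions in a given state is simply 0.

```python
def cond_expr(cond):
    """("T", "3'b000") becomes an equality test; "~irq" or "valid" pass through."""
    if isinstance(cond, tuple):
        name, value = cond
        return "(" + name + " == " + value + ")"
    return cond


def translate(rules):
    """Collect one AND term per rule for every output it drives, then OR them."""
    terms = {}  # output name -> list of Verilog expressions
    for conds, outputs in rules:
        fire = "(" + " & ".join(cond_expr(c) for c in conds) + ")"
        for out in outputs:
            if isinstance(out, tuple):
                # Multi-bit output: replicate the fire bit and mask the value.
                name, value = out
                width = value.split("'")[0]
                expr = "({" + width + "{" + fire + "}} & " + value + ")"
            else:
                name, expr = out, fire
            terms.setdefault(name, []).append(expr)
    return ["assign " + n + " = " + " | ".join(e) + ";" for n, e in terms.items()]


# The four rules quoted in the log, transcribed by hand.
rules = [
    ([("T", "3'b000"), "~irq", "~reset", "valid"], ["regadr_ir1", ("nextT", "3'b001")]),
    ([("T", "3'b000"), "valid"], ["adr_pc", "vpa", "size_2"]),
    ([("T", "3'b001"), "~ack"], ["adr_pc", "vpa", "size_2", ("nextT", "3'b001")]),
    ([("T", "3'b001"), "ack"], ["adr_pc", "vpa", "size_2", ("nextT", "3'b010")]),
]

lines = translate(rules)
print("\n".join(lines))
```

Because every output is a plain assign, there is no incomplete case statement for a synthesizer to turn into a latch, which matches the motivation given above.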
ZipCPUkc5tja: When I compare how you are building your CPU, to how I have built CPUs, I sort of need to ask: wouldn't it be a lot easier to handle logic complexity by splitting logic among clocks?14:47
kc5tjaBasically, going to rewrite the instruction decoder for the S64X7 using this as a proof-of-concept demonstration.14:47
kc5tjaI don't understand what you're saying.14:47
kc5tjaI take four clock cycles to execute the average CPU instruction.14:47
ZipCPUThat certainly sounds reasonable, I'm doing five cycles and one of those cycles is due to CPU specific logic.14:48
kc5tjaThe problem is scoping the complexity of the logic.14:48
kc5tjaRISC has a *LOT* of moving parts, which isn't often discussed.14:48
ZipCPUI'm just remembering your experience with the RISC-V CPU instruction decoder, and the massive number of combinatorial logic statements that appeared to require.14:48
kc5tjaYes; there are 24 classes of instructions that need to be decoded all total.14:49
kc5tjaWithin each class, there are edge cases which trigger illegal instruction traps.14:49
ZipCPU24 *classes* of instructions??  Ouch!  I'm working with 26 instructions!14:49
kc5tjaE.g., shifting left by 33 bits on a 32-bit RISC-V architecture (or 65 bits on a 64-bit) will trigger an illegal instruction trap.14:49
kc5tjaWell,14:50
kc5tjaloads are one class.14:50
kc5tjastores are another.14:50
kc5tjaOP and OP-imm are two more classes.14:50
kc5tjaIf you're implementing a 64-bit CPU like I am, you'll need OP32 and OP-imm-32 classes, so there's two more.14:51
kc5tjaThe SYSTEM class of instructions is what provides things like system-call, return-from-trap-handler instructions, read-modify-write CSR registers, etc.14:51
kc5tjaMemory synchronization is another class.  (These, thankfully, are just NOPs in my design.)14:52
kc5tjaAUIPC, LUI, JAL, and JALR are four more classes.14:52
ZipCPULet's see, I have a couple classes: ALU, memory, divide, FPU, and other.  Anything that writes back gets mapped to one of the big four.  (Memory synchronization is part of the "other")14:52
kc5tjaSo there's, what, 12 classes right there.14:52
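The twelve classes named so far correspond to major opcodes in bits 6:0 of a 32-bit RISC-V instruction. A minimal sketch of that first decode step follows; the table only covers the classes listed in this conversation (branches and others are left out on purpose), and the names classify and CLASSES are illustrative, not from any real decoder.

```python
# Major opcodes (instruction bits 6:0) for the classes named in the log.
CLASSES = {
    0b0000011: "LOAD",
    0b0100011: "STORE",
    0b0110011: "OP",
    0b0010011: "OP-IMM",
    0b0111011: "OP-32",
    0b0011011: "OP-IMM-32",
    0b1110011: "SYSTEM",
    0b0001111: "MISC-MEM",   # memory synchronization (FENCE)
    0b0010111: "AUIPC",
    0b0110111: "LUI",
    0b1101111: "JAL",
    0b1100111: "JALR",
}


def classify(inst):
    """Return the class of a 32-bit instruction word, or "ILLEGAL" if not in the table."""
    return CLASSES.get(inst & 0x7F, "ILLEGAL")


print(classify(0x00500093))  # addi x1, x0, 5
print(classify(0x00000000))  # all zeros
```

A real decoder would go on to check funct3, funct7 and the other fields within each class, which is where the edge cases that trap come from.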
kc5tjaA single 'class' of instructions can have typically up to 8 instructions, though sometimes it can have 10.14:53
ZipCPUI suppose this sort of makes sense: there's load word, load byte, load halfword, in a "load" class ...14:54
kc5tjaStill, that mass of combinatorial logic you saw was all because I needed to decode whether or not I had a valid instruction.14:54
ZipCPU(It was ugly ...)14:54
kc5tjaBecause RISC-V is pretty strict about trapping on illegal instruction forms.14:54
kc5tjaThe idea is that RISC-V is exceptionally well suited for virtualization, so anything which is not explicitly defined by an implementation or standard has to trap for virtualization purposes.14:55
ZipCPUWell ... after having a CPU go wild that didn't trap on illegal instructions ... and worse, it went wild through peripheral memory ...14:55
ZipCPU(my flash doesn't work the same anymore ...)14:55
ZipCPUI'm a strong supporter of illegal instruction detection ... now.14:55
kc5tjaThe one thing I *hate* about RISC-V though is how many different immediate operand forms it has.14:55
kc5tjaI refuse to believe that saving a transistor or two in an ASIC is worth the confounding what-the-fsckery that is having 4 different forms of immediates.14:56
kc5tja20-bit and 12-bit, OK, I get it.  6-bit too for shifts.14:56
kc5tjaBut, 21-bit (with bit 0 forced zero to fit in a 20-bit space), 20-bit, 13-bit (ditto), 12-bit, 6-bit, and 5-bit immediates, not all of which are contiguously arranged, is just asking for complexity.14:57
ZipCPULet's see: there's R, I, S, and U types ... right?14:57
kc5tjaRISC-V defines R, I, S, SB, U, and UJ instruction forms.  However, again, there're those stupid edge cases to worry about, because (just one example) FENCE instructions use a specialized imm12 encoding.14:58
kc5tjaSome FPU instructions also define an R3 form (4-operand instead of 3-operand instruction).14:59
kc5tjaThe number of immediates goes up by FOUR if you consider the compressed instruction format as well.14:59
kc5tjaSo it's all over the map, and it makes software emulation needlessly slow.14:59
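The scattering of immediate bits complained about above is easiest to see in code. This sketch decodes the I, S, SB, U and UJ immediates of the base 32-bit encoding; the helper names are illustrative, and the shift-amount fields and the compressed formats are not covered.

```python
def bits(x, hi, lo):
    """Extract bits hi..lo (inclusive) of x."""
    return (x >> lo) & ((1 << (hi - lo + 1)) - 1)


def sext(value, width):
    """Sign-extend a width-bit value to a Python int."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def imm_i(inst):
    return sext(bits(inst, 31, 20), 12)


def imm_s(inst):
    return sext(bits(inst, 31, 25) << 5 | bits(inst, 11, 7), 12)


def imm_sb(inst):
    # 13-bit branch offset, bit 0 forced to zero, pieces out of order.
    raw = (bits(inst, 31, 31) << 12 | bits(inst, 7, 7) << 11
           | bits(inst, 30, 25) << 5 | bits(inst, 11, 8) << 1)
    return sext(raw, 13)


def imm_u(inst):
    return sext(inst & 0xFFFFF000, 32)


def imm_uj(inst):
    # 21-bit jump offset, bit 0 forced to zero, pieces out of order.
    raw = (bits(inst, 31, 31) << 20 | bits(inst, 19, 12) << 12
           | bits(inst, 20, 20) << 11 | bits(inst, 30, 21) << 1)
    return sext(raw, 21)


print(imm_i(0xFFF00093))   # addi x1, x0, -1
print(imm_sb(0xFE000EE3))  # beq x0, x0, -4
```

In hardware these are just wires, but a software emulator has to do the shifting and masking for every form, which is the slowdown mentioned above.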
ZipCPUHave you started back on your implementation of RISC-V yet, or are you still working on your 64-bit container CPU?15:01
kc5tjaS64X7 is essentially done, but it's still too big.  6400-ish LUTs (give or take about 200 LUTs.  Haven't compiled in a while.)15:01
kc5tjaPart of the reason for my writing this tool is to see how effective it is at reducing the size of the S64X7 (if at all), OR, failing that, to see if I can resume work on Polaris CPU and actually get to a point of completion.15:05
ZipCPUAre those 3-LUTs?15:06
kc5tja3-input look-up-tables.15:07
kc5tja(usually a 4LUT with a wasted input)15:07
ZipCPUOkay, that makes a lot of sense.  Consider me cheering you on from the side lines.  Although ... I've got to run now for lunch.  Back in a bit.15:08
kc5tjaDitto; lunchtime here too.15:08
kc5tjaback16:53
ZipCPULikewise, but I'm trying to pretend I'm working now ;)16:53
ZipCPUSometimes I wish I could share all the markups I've made to the DDR3 specification.  It makes it much easier to read, and I'd love to distribute it with my DDR3 controller, but JEDEC is controlling the specification to such an extent that it cannot be freely exchanged.21:37
--- Log closed Fri Aug 19 00:00:32 2016
