--- Log opened Thu Aug 18 00:00:30 2016
kc5tja | Depends on the longest gate propagation delay from input to output. | 00:28 |
---|---|---|
kc5tja | Oops, wrong channel. | 00:29 |
kc5tja | Although, that does apply to ZipCPU's question concerning why division (or multiplication for that matter) would slow the clock down on a CPU. | 00:30 |
wallento | ZipCPU: I think 100-125MHz is probably a good target on common FPGAs | 01:38 |
wallento | Currently I am running on 80.3 MHz | 01:38 |
ZipCPU | stekern, wallento: Thank you! | 09:34 |
wallento | probably you can drive it a lot higher | 09:35 |
ZipCPU | Well ... that was part of my question. Has anyone driven it much higher? | 10:12 |
ZipCPU | (Question addressed to wallento) | 10:13 |
wallento | not to my knowledge | 10:13 |
ZipCPU | Got it. Thank you! | 10:13 |
ZipCPU | Do you know who did the dhrystone benchmark work? | 10:13 |
wallento | no, I don't | 10:18 |
ZipCPU | wallento: Thanks. | 11:48 |
ZipCPU | kc5tja: I got that much, but the question really comes down to how do you separate logic across clocks? | 11:52 |
ZipCPU | Do you increase the number of clocks a piece of logic takes, or slow down the clock to handle the extra logic? | 11:52 |
manaar | When loading Linux onto the de0_nano board via telnet, I get a core dump. The output is: Info : JTAG tap: or1k.cpu tap/device found: 0x020f30dd (mfg: 0x06e, part: 0x20f3, ver: 0x0) Error: or1k_assert_reset: implement me Error: or1k_deassert_reset: implement me openocd: driver.c:191: interface_jtag_add_dr_scan: Assertion `field == out_fields + scan->num_fields' failed. Aborted (core dumped) | 13:18 |
manaar | please help me to sort out this problem.. | 13:19 |
kc5tja | ZipCPU: The former approach implies deeper pipelines, while the latter implies wider fan-outs on gates (which increases loads). Either way, you're increasing latency. | 13:48 |
kc5tja | You can recover throughput in both designs: in the pipeline case, obviously, streamline your code so that more instructions sit between division and the use of the results it produces. | 13:49 |
ZipCPU | Certainly! But the deeper pipeline might mean that nothing else gets delayed, whereas slowing the clock down will delay *all* of the logic running off of that clock. | 13:49 |
kc5tja | In the latter approach, make it a separate functional unit, and talk to it via queues and use an asynchronous handshake to indicate when the division is done. | 13:49 |
kc5tja | The deeper pipeline also means greater impacts on mispredicted branches (if you predict them at all). | 13:50 |
kc5tja | There's tradeoffs no matter which direction you choose; but, the right approach will work better for a certain class of applications. | 13:50 |
kc5tja | (Heh, tautology of the month right there.) | 13:51 |
ZipCPU | Yeah ... you were up late last night ;) | 14:03 |
ZipCPU | <Rant>Remember how I messed up the timing miserably on the DDR3 memory controller so bad I'm having to start from scratch? Since starting from scratch, I just realized I'm working with a DDR3-1333 (9-9-9) memory rather than the DDR3-1600 (11-11-11) memory that I thought I was working with. Time to redesign and start over *again*. </Rant> Sigh. | 14:11 |
jeremybennett | ZipCPU: Good luck with your Dhrystone work | 14:22 |
ZipCPU | Thanks jeremybennett. I really appreciate your answers to my questions today! | 14:22 |
ZipCPU | jeremybennett: I just started reading about superoptimization yesterday, while looking up your website. It looks like a wonderfully fun project, and I look forward to hearing how well you do! | 14:23 |
jeremybennett | ZipCPU: It's all open. We're particularly keen to get others to pick up the work. How about a superoptimizer for OpenRISC? | 14:24 |
ZipCPU | So ... how hard is it to make it work for newer CPUs? | 14:24 |
ZipCPU | i.e.: CPUs that are not in the current set. Extending GCC is certainly doable, how difficult would it be to extend a superoptimizer? | 14:26 |
kc5tja | ZipCPU: :( I feel for you. I just wish I could help, but I'm waaaay behind in my own project at the moment. | 14:32 |
kc5tja | I wrote a tool (in Shen, because learning new languages is [almost] always fun) that I hope will let me realize my goals faster. | 14:32 |
kc5tja | It translates a table-oriented state machine description mapping inputs to outputs into corresponding Verilog code. | 14:33 |
kc5tja | Avoids case statements (thus avoiding priority encoder logic), and allows me to write "partial outputs" without fear of introducing latches (which consume LCs behind my back and w/out my knowledge). | 14:33 |
ZipCPU | kc5tja: I thought I might get a touch of compassion from you. :) Thanks for understanding. | 14:34 |
kc5tja | It also lets me design multi-hot outputs, which I'm hoping will let me minimize the size of the logic even further. | 14:34 |
ZipCPU | "multi-hot outputs"? | 14:34 |
ZipCPU | I'm all about minimizing logic size ... but ... can you explain please? | 14:35 |
kc5tja | Normally, for a given state S and inputs I, you have one and only one "next state" Sn and a set of outputs to drive stuff. That's "single-hot" outputs. | 14:35 |
kc5tja | Multi-hot outputs means a *number* of different input states can trigger a *number* of sets of outputs. | 14:36 |
kc5tja | So, for example... | 14:36 |
kc5tja | Most RISC-V instructions in my CPU take 4 cycles to execute. So, depending on the current instruction register value, and the current T-counter value, I trigger things like "load register address from Rs1", "load register address from Rs2", "set ALU function code to blah", etc. | 14:37 |
kc5tja | However, at the same time, since none of this uses the bus, I can also pre-fetch the next value of the instruction register. | 14:37 |
kc5tja | In effect, pipelining, but it's all controlled from a single batch of logic. | 14:37 |
kc5tja | So, OP rd, rs1, imm12 or OP rd, rs1, rs2 (where OP is any ALU operation) can drive the ALU control buses and such with one batch of logic, and drive instruction fetch with another batch of logic. | 14:39 |
kc5tja | However, Lx and Sx (loads and stores, where x determines operand size), can't do this, so I cannot just generalize "instruction fetches always happen during T0-T3." | 14:39 |
ZipCPU | How is this different from traditional HDL programming? From Mealy/Moore state machines, etc. | 14:39 |
kc5tja | It *IS* state machine design. | 14:40 |
kc5tja | The convenience is how I express it. | 14:40 |
kc5tja | Like I said, no more case statements. They're outta here. | 14:40 |
ZipCPU | So it's all expressed as ... memories? | 14:40 |
kc5tja | They take up *so* much room in the source text (>100 LOC) that it's nearly impossible to keep it all straight in my head. | 14:40 |
kc5tja | It's expressed as a truth table. | 14:41 |
kc5tja | But, more symbolic. | 14:41 |
kc5tja | So, e.g., | 14:41 |
kc5tja | [on [[T 3'b000] ~irq ~reset valid] regadr_ir1 [nextT 3'b001]] | 14:42 |
kc5tja | (excuse the S-expression syntax) | 14:42 |
kc5tja | This is saying "when T = 3'b000 and not IRQ and not RESET and instruction valid, then set regadr to ir1 (which is a subfield of the instruction register), and nextT to 3'b001." | 14:42 |
kc5tja | I don't have to worry about setting unused outputs to 0 all the time. | 14:42 |
kc5tja | I don't have to worry about unfulfilled case statements generating latches without my knowledge. | 14:43 |
kc5tja | Elsewhere, I can have instruction fetch logic fire on the same inputs. | 14:43 |
kc5tja | [on [[T 3'b000] valid] adr_pc vpa size_2] | 14:43 |
kc5tja | [on [[T 3'b001] ~ack] adr_pc vpa size_2 [nextT 3'b001]] | 14:44 |
kc5tja | [on [[T 3'b001] ack] adr_pc vpa size_2 [nextT 3'b010]] | 14:44 |
kc5tja | ..etc.. | 14:44 |
kc5tja | I can focus on one functional aspect at a time. | 14:44 |
kc5tja | And the expression is concise, yet expressive. | 14:44 |
kc5tja | It's just a way of representing AND/OR logic in a better format that I hope will make things easier on me. | 14:45 |
kc5tja | Because right now, Verilog is doing a piss-poor job of letting me progress beyond a certain size of logic complexity. | 14:45 |
kc5tja | I hope to have a real-world and (hopefully) working example come tomorrow. | 14:46 |
ZipCPU | kc5tja: When I compare how you are building your CPU, to how I have built CPUs, I sort of need to ask: wouldn't it be a lot easier to handle logic complexity by splitting logic among clocks? | 14:47 |
kc5tja | Basically, going to rewrite the instruction decoder for the S64X7 using this as a proof-of-concept demonstration. | 14:47 |
kc5tja | I don't understand what you're saying. | 14:47 |
kc5tja | I take four clock cycles to execute the average CPU instruction. | 14:47 |
ZipCPU | That certainly sounds reasonable, I'm doing five cycles and one of those cycles is due to CPU specific logic. | 14:48 |
kc5tja | The problem is scoping the complexity of the logic. | 14:48 |
kc5tja | RISC has a *LOT* of moving parts, which isn't often discussed. | 14:48 |
ZipCPU | I'm just remembering your experience with the RISC-V CPU instruction decoder, and the massive number of combinatorial logic statements that appeared to require. | 14:48 |
kc5tja | Yes; there are 24 classes of instructions that need to be decoded all total. | 14:49 |
kc5tja | Within each class, there are edge cases which trigger illegal instruction traps. | 14:49 |
ZipCPU | 24 *classes* of instructions?? Ouch! I'm working with 26 instructions! | 14:49 |
kc5tja | E.g., shifting left by 33 bits on a 32-bit RISC-V architecture (or 65 bits on a 64-bit) will trigger an illegal instruction trap. | 14:49 |
kc5tja | Well, | 14:50 |
kc5tja | loads are one class. | 14:50 |
kc5tja | stores are another. | 14:50 |
kc5tja | OP and OP-imm are two more classes. | 14:50 |
kc5tja | If you're implementing a 64-bit CPU like I am, you'll need OP32 and OP-imm-32 classes, so there's two more. | 14:51 |
kc5tja | The SYSTEM class of instructions is what provides things like system-call, return-from-trap-handler instructions, read-modify-write CSR registers, etc. | 14:51 |
kc5tja | Memory synchronization is another class. (These, thankfully, are just NOPs in my design.) | 14:52 |
kc5tja | AUIPC, LUI, JAL, and JALR are four more classes. | 14:52 |
ZipCPU | Let's see, I have a couple classes: ALU, memory, divide, FPU, and other. Anything that writes back gets mapped to one of the big four. (Memory synchronization is part of the "other") | 14:52 |
kc5tja | So there's, what, 12 classes right there. | 14:52 |
kc5tja | A single 'class' of instructions can have typically up to 8 instructions, though sometimes it can have 10. | 14:53 |
ZipCPU | I suppose this sort of makes sense: there's load word, load byte, load halfword, in a "load" class ... | 14:54 |
kc5tja | Still, that mass of combinatorial logic you saw was all because I needed to decode whether or not I had a valid instruction. | 14:54 |
ZipCPU | (It was ugly ...) | 14:54 |
kc5tja | Because RISC-V is pretty strict about trapping on illegal instruction forms. | 14:54 |
kc5tja | The idea is that RISC-V is exceptionally well suited for virtualization, so anything which is not explicitly defined by an implementation or standard has to trap for virtualization purposes. | 14:55 |
ZipCPU | Well ... after having a CPU go wild that didn't trap on illegal instructions ... and worse, it went wild through peripheral memory ... | 14:55 |
ZipCPU | (my flash doesn't work the same anymore ...) | 14:55 |
ZipCPU | I'm a strong supporter of illegal instruction detection ... now. | 14:55 |
kc5tja | The one thing I *hate* about RISC-V though is how many different immediate operand forms it has. | 14:55 |
kc5tja | I refuse to believe that saving a transistor or two in an ASIC is worth the confounding what-the-fsckery that is having 4 different forms of immediates. | 14:56 |
kc5tja | 20-bit and 12-bit, OK, I get it. 6-bit too for shifts. | 14:56 |
kc5tja | But, 21-bit (with bit 0 forced zero to fit in a 20-bit space), 20-bit, 13-bit (ditto), 12-bit, 6-bit, and 5-bit immediates, not all of which are contiguously arranged, is just asking for complexity. | 14:57 |
ZipCPU | Let's see: there's R, I, S, and U types ... right? | 14:57 |
kc5tja | RISC-V defines R, I, S, SB, U, and UJ instruction forms. However, again, there're those stupid edge cases to worry about, because (just one example) FENCE instructions use a specialized imm12 encoding. | 14:58 |
kc5tja | Some FPU instructions also define an R3 form (4-operand instead of 3-operand instruction). | 14:59 |
kc5tja | The number of immediates goes up by FOUR if you consider the compressed instruction format as well. | 14:59 |
kc5tja | So it's all over the map, and it makes software emulation needlessly slow. | 14:59 |
ZipCPU | Have you started back on your implementation of RISC-V yet, or are you still working on your 64-bit container CPU? | 15:01 |
kc5tja | S64X7 is essentially done, but it's still too big. 6400-ish LUTs (give or take about 200 LUTs. Haven't compiled in a while.) | 15:01 |
kc5tja | Part of the reason for my writing this tool is to see how effective it is at reducing the size of the S64X7 (if at all), OR, failing that, to see if I can resume work on Polaris CPU and actually get to a point of completion. | 15:05 |
ZipCPU | Are those 3-LUTs? | 15:06 |
kc5tja | 3-input look-up-tables. | 15:07 |
kc5tja | (usually a 4LUT with a wasted input) | 15:07 |
ZipCPU | Okay, that makes a lot of sense. Consider me cheering you on from the sidelines. Although ... I've got to run now for lunch. Back in a bit. | 15:08 |
kc5tja | Ditto; lunchtime here too. | 15:08 |
kc5tja | back | 16:53 |
ZipCPU | Likewise, but I'm trying to pretend I'm working now ;) | 16:53 |
ZipCPU | Sometimes I wish I could share all the markups I've made to the DDR3 specification. It makes it much easier to read, and I'd love to distribute it with my DDR3 controller, but JEDEC is controlling the specification to such an extent that it cannot be freely exchanged. | 21:37 |
--- Log closed Fri Aug 19 00:00:32 2016 |
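
The rule format kc5tja shows at 14:42-14:44 can be illustrated with a short Python sketch. This is not kc5tja's Shen tool, whose internals are not shown in the log. It assumes that each rule becomes one AND term and that each output becomes the OR of the rules that name it. The `RULES` encoding, the `ruleN` wire names, and the `to_verilog` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical encoding of the four rules quoted in the log.
# A condition is ("T", "3'b000") for an equality test, "valid" for an
# active-high input, or "~irq" for an inverted input.
# An output is a bare name (1-bit strobe) or (name, value) for a bus.
RULES = [
    ([("T", "3'b000"), "~irq", "~reset", "valid"],
     ["regadr_ir1", ("nextT", "3'b001")]),
    ([("T", "3'b000"), "valid"],
     ["adr_pc", "vpa", "size_2"]),
    ([("T", "3'b001"), "~ack"],
     ["adr_pc", "vpa", "size_2", ("nextT", "3'b001")]),
    ([("T", "3'b001"), "ack"],
     ["adr_pc", "vpa", "size_2", ("nextT", "3'b010")]),
]


def term(cond):
    """Render one input condition as a Verilog expression."""
    if isinstance(cond, tuple):
        return "(" + cond[0] + " == " + cond[1] + ")"
    return cond


def to_verilog(rules):
    """Emit one AND term per rule, then OR the terms into each output."""
    lines = []
    strobes = defaultdict(list)  # output name -> rule wires that assert it
    buses = defaultdict(list)    # bus name -> (rule wire, value) pairs
    for i, (conds, outs) in enumerate(rules):
        wire = "rule" + str(i)
        product = " & ".join(term(c) for c in conds)
        lines.append("wire " + wire + " = " + product + ";")
        for out in outs:
            if isinstance(out, tuple):
                buses[out[0]].append((wire, out[1]))
            else:
                strobes[out].append(wire)
    for name, wires in strobes.items():
        lines.append("assign " + name + " = " + " | ".join(wires) + ";")
    for name, pairs in buses.items():
        width = pairs[0][1].split("'")[0]  # "3'b001" -> "3"
        # Replicate the 1-bit rule wire across the bus and mask the value.
        terms = ["({" + width + "{" + w + "}} & " + v + ")" for w, v in pairs]
        lines.append("assign " + name + " = " + " | ".join(terms) + ";")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_verilog(RULES))
```

In this sketch every output is a plain OR of rule terms, so it is 0 whenever no rule fires and no latch can be inferred. Two rules with overlapping inputs (rule0 and rule1 here) fire together, which matches the multi-hot behaviour described at 14:36. One limitation: if two rules drive `nextT` with different values in the same cycle, the values are silently ORed, so a real tool would presumably want to check for that.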
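
The scattered immediate layouts kc5tja complains about at 14:55-14:58 can be made concrete with a small Python sketch for the 32-bit base encodings. The bit positions follow the RISC-V base ISA (I, S, SB, U, and UJ forms; later revisions of the specification call SB and UJ "B" and "J"). The helper names `bits`, `sext`, and `imm_*` are made up for this example, and the shift-amount, FENCE, and compressed forms mentioned in the log are not covered.

```python
def bits(word, hi, lo):
    """Extract word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sext(value, width):
    """Sign-extend a width-bit value to a Python int."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def imm_i(w):
    # imm[11:0] = inst[31:20]
    return sext(bits(w, 31, 20), 12)


def imm_s(w):
    # imm[11:5] = inst[31:25], imm[4:0] = inst[11:7]
    return sext(bits(w, 31, 25) << 5 | bits(w, 11, 7), 12)


def imm_sb(w):
    # imm[12] = inst[31], imm[11] = inst[7],
    # imm[10:5] = inst[30:25], imm[4:1] = inst[11:8], imm[0] = 0
    return sext(bits(w, 31, 31) << 12 | bits(w, 7, 7) << 11
                | bits(w, 30, 25) << 5 | bits(w, 11, 8) << 1, 13)


def imm_u(w):
    # imm[31:12] = inst[31:12], low 12 bits are zero
    return sext(bits(w, 31, 12) << 12, 32)


def imm_uj(w):
    # imm[20] = inst[31], imm[19:12] = inst[19:12],
    # imm[11] = inst[20], imm[10:1] = inst[30:21], imm[0] = 0
    return sext(bits(w, 31, 31) << 20 | bits(w, 19, 12) << 12
                | bits(w, 20, 20) << 11 | bits(w, 30, 21) << 1, 21)


if __name__ == "__main__":
    print(imm_i(0xFFF00093))   # addi x1, x0, -1
    print(imm_s(0x00112423))   # sw x1, 8(x2)
    print(imm_sb(0xFE000EE3))  # beq x0, x0, -4
    print(hex(imm_u(0x123450B7)))  # lui x1, 0x12345
    print(imm_uj(0xFFDFF06F))  # jal x0, -4
```

The SB and UJ decoders each stitch four non-contiguous fields together, which is the extra shuffling a software emulator has to do for every branch and jump.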