--- Log opened Thu Jul 14 00:00:38 2016 | ||
stekern | ZipCPU|Laptop: stalling and bypass logic is definitely the hardest part | 01:23 |
---|---|---|
stekern | one trick to make it slightly less complicated is to avoid stalling wherever possible ;) | 01:24 |
wallento | hey everyone. I created a simple module that I strangely didn't find anywhere else: https://github.com/wallento/uartdpi | 02:43 |
ZipCPU | ... and here I just wrote a UART simulator that uses TCP/IP. You include the simulator in your Verilog build, connect the RX and TX lines, run the simulator, and you can connect to and interact with the simulator's UART via a telnet session. | 07:01 |
ZipCPU | stekern: That was my original plan. It included branch delay slots. Then I discovered that I needed to support interrupts, memory mapped peripherals with varying response times, caches that may or may not have the data of interest in them, ... | 07:08 |
ZipCPU | stekern: and that one (two?) special purpose register I have, where only some bits change and others do "special things". (It only gets updated *after* writeback, and because of the special bits, you can't grab it during writeback). | 07:10 |
kc5tja | ZipCPU: If it helps re: how to handle stalls in a pipeline, | 14:02 |
kc5tja | ZipCPU: I borrow the "stall_i" signal concept from Wishbone. When asserted, it forces that stage's registers to a no-operation condition. | 14:03 |
kc5tja | Thus, subsequent stages of the pipeline still operate, but do so using a deliberately injected no-op instruction (or condition). | 14:04 |
kc5tja | With regards to *stalling* proper, I build each pipeline stage using a hardwired microsequencer. Some microsequencers happen to have only one state, | 14:04 |
kc5tja | which degenerates to a "regular" pipe stage. | 14:04 |
ZipCPU | kc5tja: Thanks for the comment and thoughts. I'm actually doing something quite similar: I have _ce, _valid, and _stall lines for each stage. The logic is supposed to be as simple as ... | 14:05 |
ZipCPU | (okay ... tell me more about this microsequencer ... ?) | 14:05 |
kc5tja | But, where true stalls are required, or where I need multiple cycles (e.g., storing a 64-bit quantity over a 16-bit bus), the microsequencer is modular and isolated, yet fully cooperative with other stages. | 14:05 |
ZipCPU | Logic is supposed to be as simple as stage_n_stall = (stage_n_valid)&&(stage_n+1_stall), and stage_n_ce = (stage_n-1_valid)&&(~stage_n_stall), but the corner cases keep getting me. | 14:06 |
kc5tja | I got the idea from this site: http://www.pagetable.com/?p=39 | 14:06 |
ZipCPU | (Looking up the page now) | 14:06 |
kc5tja | Long and short of it, my attempt at building an RV64I RISC-V core using a "normal" pipeline quickly exceeded my FPGA's capacity. | 14:07 |
ZipCPU | (Sounds familiar ...) | 14:07 |
kc5tja | So I'm making a 6502-like "pipelined" processor, where instruction fetch happens concurrently with execution, but both take 4 clock cycles. | 14:08 |
kc5tja | Yeah, iCE40s are cheap, but small. :) | 14:08 |
kc5tja | And I'm so used to using the Nexys2 FPGA (3S1000E part), that I never had a feel for "big" versus "small" designs until now. | 14:08 |
ZipCPU | (Do iCE40s still use 3-input LUTs?) Back to your comment, though: how long is your pipeline in total? | 14:08 |
kc5tja | 4-input, I believe. | 14:08 |
kc5tja | Well, I started with a 5-stage, but now it's just 2 stages (fetch and execute). | 14:09 |
kc5tja | I can get away with 2 because I can squeeze operand-fetch and write-back into otherwise unused clock cycles taken up by instruction fetch. | 14:09 |
ZipCPU | Really? I'm working on a 5-stage pipeline (still): Fetch, decode, add immediate, do-alu-operation, and write-back. | 14:10 |
kc5tja | It has its own costs though. | 14:10 |
kc5tja | Yeah, mine was fetch, decode-and-fetch-registers, execute, memory, and write-back. | 14:10 |
ZipCPU | Oh, and I'm also trying to run an IPC of 1 ... the pipeline stages are all designed so that they take no more than one clock. | 14:10 |
kc5tja | Basically, the same as MIPS. | 14:10 |
ZipCPU | Ok, that makes sense. I started this journey with little knowledge of MIPS, so my stages are a touch different. | 14:11 |
kc5tja | That's what I wanted, but I ended up needing 1800 4LUTs just for the ALU. That's half of my FPGA alone. :) | 14:11 |
ZipCPU | OUCH! | 14:11 |
ZipCPU | Is this all 64-bit math? Is that why your number is so large? | 14:11 |
kc5tja | Yep, a 64-bit data path. | 14:12 |
kc5tja | Not sure; I'm guessing it's possible that Yosys just isn't as mature a layout optimizer as a commercial tool. | 14:12 |
ZipCPU | Do you have multiple sized operations? Some on 64-bits, some on 32-bits, some on 16, and some on 8? | 14:12 |
kc5tja | Nope; only 64-bit. | 14:12 |
ZipCPU | Ok, sorry on the comment there, I'm used to about 2k (or less) 6-LUTs for the entire CPU. 2k LUTs on a 4-LUT system may make more sense. | 14:13 |
kc5tja | PicoRV32 takes around that much, and it too uses microsequencing I believe. | 14:13 |
kc5tja | ISTR it was around 1900 4LUTs for a fully spec'ed out 32-bit core. | 14:14 |
ZipCPU | "microsequencing" ... is this just another term for "microcode"? Like a CPU within a CPU to execute the instructions received? | 14:14 |
kc5tja | Microcode is a form of microsequencing, but the reverse isn't always the case. | 14:16 |
kc5tja | For example, the 68000 has microcode inside, but the 6502 does not. It relies on a hardwired state machine instead. | 14:16 |
kc5tja | I'm relying on a hardwired state machine. It was the only feasible way I could get the microarchitectural parallelism I needed to keep most instructions confined to four cycles. :) | 14:17 |
ZipCPU | Ok ... but you're still "sequencing" single operations across multiple clocks, right? | 14:18 |
kc5tja | Yes, that's correct. | 14:18 |
ZipCPU | So ... each pipeline stage takes multiple clocks before moving to the next stage? | 14:18 |
kc5tja | I use the term "microsequencing" because if I change from using a PLA-style decoder to microcode, I don't have to update a bunch of website copy. :) | 14:18 |
kc5tja | Right. I only have a 16-bit wide bus, and each bus cycle takes 2 clock cycles. So a full 32-bit opcode takes 4 clock cycles to acquire. | 14:19 |
kc5tja | That gives the execute stage enough time to fetch two registers (2 cycles) execute and write-back (3rd cycle), and if necessary, decide to take or skip a branch (4th cycle). | 14:19 |
ZipCPU | The 16-bit wide bus, is that your peripheral/memory bus whereby you interact with things outside of your CPU? | 14:20 |
kc5tja | Although, due to the way the pipeline works, all conditional and unconditional branches end up taking 8 cycles anyway. Bleh. | 14:20 |
kc5tja | Yes. | 14:20 |
kc5tja | I have an externalized Wishbone bus which I defined here: https://hackaday.io/project/11928-backbone-bus | 14:20 |
ZipCPU | Have you thought at all about running a "bus" internal to your core (between pipeline stages) that runs 64 bits wide? Or perhaps it just doesn't make sense? | 14:21 |
kc5tja | The idea is to build my home-made computer using an inexpensive (relatively) backplane, with even cheaper FPGA boards serving as I/O, CPU, and memory controllers. | 14:21 |
kc5tja | That's what I did originally, but I can't fit everything on the FPGA fabric if I do that. | 14:21 |
kc5tja | When I started my Kestrel-3 project, I wanted to continue to use my XC3S1000E FPGA on the Nexys-2, but (long story) I had to move to the iCE40 and Yosys toolchain. | 14:22 |
ZipCPU | This is absolutely fascinating, if you don't mind my saying so. | 14:22 |
kc5tja | I'd love to chat more, but I have a lunch meeting to attend, and I'm afraid I'm already a bit late. | 14:22 |
ZipCPU | Do you have a web page describing your kestrel-3 project? | 14:23 |
ZipCPU | Sure ... when you have time we can discuss more. Thanks for the comments, though! | 14:23 |
kc5tja | I have several, it's a bit schizophrenic at the moment. http://kestrelcomputer.github.io/kestrel is the main project page. https://leanpub.com/k3ug is the user's guide, and https://hackaday.io/project/10035-kestrel-computer-project is the "work in progress, me rambling about stuff" project page. | 14:24 |
ZipCPU | Cool! | 14:24 |
ZipCPU | kc5tja: For when you come back, that was a neat web page. Thanks for offering it to me to read. | 14:32 |
ZipCPU | (That refers to the 6502 web page, the only one I've managed to read top to bottom so far.) | 14:33 |
ZipCPU | I seem to have two pipeline issues with my current implementation, more with a new implementation I am trying to build. | 14:46 |
ZipCPU | Issue #1 has to do with the memory and ALU stages being parallel, the memory mapped I/O, and making sure that peripherals don't get a "read" instruction. Imagine, if you will, a conditional branch followed by a read. The branch hits writeback at the same time the read hits the memory unit. | 14:47 |
ZipCPU | Issue #2 has to do with the "special" condition codes register. This holds the 4 condition codes, the processor sleep flag, the stepping flag, error conditions, how the CPU was built, and more. All in 32 bits. This register is changed during writeback, but not available to be read until a clock later. | 14:48 |
ZipCPU | Condition codes may still be used within any given clock cycle--they are fed back directly from the ALU into the ALU--so using the condition codes isn't as difficult as reading the condition code register. | 14:49 |
ZipCPU | The new implementation is worse. The newer implementation sports a 9-stage pipeline, where 4 stages may have values in flight. The stage prior will need to stall if any of the values "in flight" refers to one of the values it needs. | 14:50 |
ZipCPU | I have yet to figure that problem out. | 14:50 |
kc5tja | I'm back | 21:17 |
stekern | ZipCPU: bypass logic and pipeline bubbles are what you need to solve the last problem | 22:03 |
stekern | e.g. the best part of this file handles bypass logic: https://github.com/openrisc/mor1kx/blob/master/rtl/verilog/mor1kx_rf_cappuccino.v | 22:06 |
stekern | and this: https://github.com/skristiansson/eco32f/blob/master/rtl/verilog/eco32f_registerfile.v | 22:07 |
--- Log closed Fri Jul 15 00:00:20 2016 |
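
The stall/clock-enable handshake ZipCPU gives at 14:06 (`stage_n_stall = stage_n_valid && stage_n+1_stall`, `stage_n_ce = stage_n-1_valid && !stage_n_stall`) can be sketched as a small cycle-level model. This is only an illustration in Python, not Verilog and not either participant's actual design: the function name `step` and the list-based stage registers are assumptions, and the corner cases mentioned in the log (interrupts, branches, multi-cycle memory) are left out.

```python
def step(valid, data, in_valid, in_data, out_stall):
    """Advance an n-stage pipeline by one clock (hypothetical model).

    valid[i] / data[i] are the registers of stage i.
    in_valid / in_data come from upstream of stage 0.
    out_stall is the stall input from downstream of the last stage.
    Returns (new_valid, new_data, stall); stall[0] tells upstream to hold.
    """
    n = len(valid)

    # stage_n_stall = stage_n_valid && stage_n+1_stall, resolved back to front
    stall = [False] * (n + 1)
    stall[n] = out_stall
    for i in range(n - 1, -1, -1):
        stall[i] = valid[i] and stall[i + 1]

    new_valid, new_data = list(valid), list(data)
    for i in range(n):
        prev_valid = in_valid if i == 0 else valid[i - 1]
        prev_data = in_data if i == 0 else data[i - 1]

        # stage_n_ce = stage_n-1_valid && !stage_n_stall
        ce = prev_valid and not stall[i]
        if ce:
            new_valid[i], new_data[i] = True, prev_data
        elif not stall[i]:
            # Old contents moved on (or the stage was empty) and nothing
            # new arrived, so the stage now holds a bubble.
            new_valid[i], new_data[i] = False, None
        # Otherwise the stage is stalled and keeps its contents.

    return new_valid, new_data, stall[:n]
```

One property of this formulation is that an empty stage never stalls, so a bubble in the middle of the pipeline gets filled from behind even while the last stage is held.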
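
stekern's suggestion at 22:03 (bypass logic plus pipeline bubbles for values still in flight) might look roughly like the operand-read decision below. It is a hedged sketch only: `read_operand` and the `(dest, value)` representation of in-flight results are hypothetical names chosen for illustration, and the code is not taken from mor1kx or eco32f.

```python
def read_operand(reg, regfile, in_flight):
    """Decide where an operand comes from (hypothetical helper).

    in_flight lists (dest_register, value) pairs for results that have not
    yet been written back, newest first. A value of None means the result
    has not been computed yet.
    """
    for dest, value in in_flight:
        if dest == reg:
            if value is None:
                # The newest producer is not done: insert a bubble and retry.
                return ("stall", None)
            # Forward the newest in-flight result instead of the stale
            # register file contents.
            return ("bypass", value)
    # No in-flight instruction writes this register.
    return ("regfile", regfile[reg])
```

Scanning newest first matters: if two in-flight instructions write the same register, only the most recent one is the correct source.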