IRC logs for #openrisc Thursday, 2016-07-14

--- Log opened Thu Jul 14 00:00:38 2016
stekernZipCPU|Laptop: stalling and bypass logic is definitely the hardest part01:23
stekernone trick to make it slightly less complicated is to avoid stalling wherever possible ;)01:24
wallentohey everyone. I created a simple module that I strangely didn't find anywhere else:
ZipCPU... and here I just wrote a UART simulator that uses TCP/IP.  You include the simulator in your Verilog build, connect the RX and TX lines, run the simulator, and you can connect to and interact with the simulator's UART via a telnet session.07:01
ZipCPUstekern: That was my original plan.  It included branch delay slots.  Then I discovered that I needed to support interrupts, memory mapped peripherals with varying response times, caches that may or may not have the data of interest in them, ...07:08
ZipCPUstekern: and that one (two?) special purpose register I have with only some bits that change, others that do "special things".   (It only gets updated *after* writeback, and because of the special bits, you can't grab it during writeback).07:10
kc5tjaZipCPU: If it helps re: how to handle stalls in a pipeline,14:02
kc5tjaZipCPU: I borrow the "stall_i" signal concept from Wishbone.  When asserted, it forces that stage's registers to a no-operation condition.14:03
kc5tjaThus, subsequent stages of the pipeline still operate, but do so using a deliberately injected no-op instruction (or condition).14:04
kc5tjaWith regards to *stalling* proper, I build each pipeline stage using a hardwired microsequencer.  Some microsequencers happen to have only one state,14:04
kc5tjawhich degenerates to a "regular" pipe stage.14:04
ZipCPUkc5tja: Thanks for the comment and thoughts.  I'm actually doing something quite similar: I have _ce, _valid, and _stall lines for each stage.  The logic is supposed to be as simple as ...14:05
ZipCPU(okay ... tell me more about this microsequencer ... ?)14:05
kc5tjaBut, where true stalls are required, or where I need multiple cycles (e.g., storing a 64-bit quantity over a 16-bit bus), the microsequencer is modular and isolated, yet fully cooperative with other stages.14:05
ZipCPULogic is supposed to be as simple as stage_n_stall = (stage_n_valid)&&(stage_n+1_stall), and stage_n_ce = (stage_n-1_valid)&&(~stage_n_stall), but the corner cases keep getting me.14:06
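The two equations above can be tried out in software. What follows is a minimal sketch, not ZipCPU's actual RTL: the function name `handshake`, the list-based signals, and the assumption that stage 0 always has an instruction waiting upstream are all made up for illustration. It only shows how a stall ripples backwards through valid stages and how an empty stage (a bubble) absorbs it; it does not address the corner cases mentioned.

```python
def handshake(valid, downstream_stall):
    """Compute per-stage stall and clock-enable (ce) signals.

    valid[i] is True when stage i currently holds a valid instruction.
    downstream_stall is the stall seen by the last stage (assumed input,
    e.g. a busy bus). Stage 0's upstream is assumed always valid.
    """
    n = len(valid)
    stall = [False] * n
    ce = [False] * n
    # stage_n_stall = stage_n_valid && stage_n+1_stall
    # Stalls propagate backwards, so resolve from the last stage first.
    for i in reversed(range(n)):
        next_stall = downstream_stall if i == n - 1 else stall[i + 1]
        stall[i] = valid[i] and next_stall
    # stage_n_ce = stage_n-1_valid && !stage_n_stall
    for i in range(n):
        prev_valid = True if i == 0 else valid[i - 1]
        ce[i] = prev_valid and not stall[i]
    return stall, ce


if __name__ == "__main__":
    # Stage 2 is empty, so the stall at stage 3 stops there.
    print(handshake([True, True, False, True], downstream_stall=True))
```

With every stage valid and the last stage stalled, all stages stall and no clock enable fires; with a bubble in the middle, the stages behind the bubble keep moving.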
kc5tjaI got the idea from this site:
ZipCPU(Looking up the page now)14:06
kc5tjaLong and short of it, my attempt at building an RV64I RISC-V core using a "normal" pipeline quickly exceeded my FPGA's capacity.14:07
ZipCPU(Sounds familiar ...)14:07
kc5tjaSo I'm making a 6502-like "pipelined" processor, where instruction fetch happens concurrently with execution, but both take 4 clock cycles.14:08
kc5tjaYeah, iCE40s are cheap, but small.  :)14:08
kc5tjaAnd I'm so used to using the Nexys2 FPGA (3S1000E part), that I never had a feel for "big" versus "small" designs until now.14:08
ZipCPU(Do iCE40's still use 3-input LUTs?)  Back to your comment, though: how long is your pipeline in total?14:08
kc5tja4-input, I believe.14:08
kc5tjaWell, I started with a 5-stage, but now it's just 2-stages (fetch and execute).14:09
kc5tjaI can get away with 2 because I can squeeze operand-fetch and write-back into otherwise unused clock cycles taken up by instruction fetch.14:09
ZipCPUReally?  I'm working on a 5-stage pipeline (still): Fetch, decode, add immediate, do-alu-operation, and write-back.14:10
kc5tjaIt has its own costs though.14:10
kc5tjaYeah, mine was fetch, decode-and-fetch-registers, execute, memory, and write-back.14:10
ZipCPUOh, and I'm also trying to run an IPC of 1 ... the pipeline stages are all designed so that they take no more than one clock.14:10
kc5tjaBasically, the same as MIPS.14:10
ZipCPUOk, that makes sense.  I started this journey with little knowledge of MIPS, so my stages are a touch different.14:11
kc5tjaThat's what I wanted, but I ended up needing 1800 4LUTs just for the ALU.  That's half of my FPGA alone.  :)14:11
ZipCPUIs this all 64-bit math?  Is that why your number is so large?14:11
kc5tjaYep, a 64-bit data path.14:12
kc5tjaNot sure; I'm guessing it's possible that Yosys just isn't as mature a layout optimizer as a commercial tool.14:12
ZipCPUDo you have multiple sized operations?  Some on 64-bits, some on 32-bits, some on 16, and some on 8?14:12
kc5tjaNope; only 64-bit.14:12
ZipCPUOk, sorry on the comment there, I'm used to about 2k (or less) 6-LUTs for the entire CPU.  2k LUTs on a 4-LUT system may make more sense.14:13
kc5tjaPicoRV32 takes around that much, and it too uses microsequencing I believe.14:13
kc5tjaISTR it was around 1900 4LUTs for a fully spec'ed out 32-bit core.14:14
ZipCPU"microsequencing" ... is this just another term for "microcode"?  Like a CPU within a CPU to execute the instructions received?14:14
kc5tjaMicrocode is a form of microsequencing, but the reverse isn't always the case.14:16
kc5tjaFor example, the 68000 has microcode inside, but the 6502 does not.  It relies on a hardwired state machine instead.14:16
kc5tjaI'm relying on a hardwired state machine.  It was the only feasible way I could get the microarchitectural parallelism I needed to keep most instructions confined to four cycles.  :)14:17
ZipCPUOk ... but you're still "sequencing" single operations across multiple clocks, right?14:18
kc5tjaYes, that's correct.14:18
ZipCPUSo ... each pipeline stage takes multiple clocks before moving to the next stage?14:18
kc5tjaI use the term "microsequencing" because if I change from using a PLA-style decoder to microcode, I don't have to update a bunch of website copy.  :)14:18
kc5tjaRight.  I only have a 16-bit wide bus, and each bus cycle takes 2 clock cycles.  So a full 32-bit opcode takes 4 clock cycles to acquire.14:19
kc5tjaThat gives the execute stage enough time to fetch two registers (2 cycles) execute and write-back (3rd cycle), and if necessary, decide to take or skip a branch (4th cycle).14:19
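The cycle budget described above is simple arithmetic, sketched here for reference. The function name `transfer_clocks` and the `EXECUTE_SCHEDULE` table are hypothetical; the numbers (16-bit bus, 2 clocks per bus cycle, 32-bit opcode) are the ones quoted in the conversation.

```python
def transfer_clocks(data_bits, bus_bits=16, clocks_per_bus_cycle=2):
    """Clock cycles needed to move data_bits over a narrower bus."""
    bus_cycles = -(-data_bits // bus_bits)  # ceiling division
    return bus_cycles * clocks_per_bus_cycle


# What the execute stage does while the next opcode is being fetched,
# per the description above (clock number -> activity).
EXECUTE_SCHEDULE = {
    1: "fetch first register",
    2: "fetch second register",
    3: "execute and write back",
    4: "take or skip branch (if needed)",
}

if __name__ == "__main__":
    print(transfer_clocks(32))  # 32-bit opcode fetch
    print(transfer_clocks(64))  # 64-bit store over the same bus
```

Under these assumptions the opcode fetch and the execute schedule both come to four clocks, which is why the two can overlap.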
ZipCPUThe 16-bit wide bus, is that your peripheral/memory bus whereby you interact with things outside of your CPU?14:20
kc5tjaAlthough, due to the way the pipeline works, all conditional and unconditional branches end up taking 8 cycles anyway.  Bleh.14:20
kc5tjaI have an externalized Wishbone bus which I defined here:
ZipCPUHave you thought at all about running a "bus" internal to your core (between pipeline stages) that runs 64 bits wide?  Or perhaps it just doesn't make sense?14:21
kc5tjaThe idea is to build my home-made computer using an inexpensive (relatively) backplane, with even cheaper FPGA boards serving as I/O, CPU, and memory controllers.14:21
kc5tjaThat's what I did originally, but I can't fit everything on the FPGA fabric if I do that.14:21
kc5tjaWhen I started my Kestrel-3 project, I wanted to continue to use my XC3S1000E FPGA on the Nexys-2, but (long story) I had to move to the iCE40 and Yosys toolchain.14:22
ZipCPUThis is absolutely fascinating, if you don't mind my saying so.14:22
kc5tjaI'd love to chat more, but I have a lunch meeting to attend, and I'm afraid I'm already a bit late.14:22
ZipCPUDo you have a web page describing your kestrel-3 project?14:23
ZipCPUSure ... when you have time we can discuss more.  Thanks for the comments, though!14:23
kc5tjaI have several, it's a bit schizophrenic at the moment. is the main project page. is the user's guide, and is the "work in progress, me rambling about stuff" project page.14:24
ZipCPUkc5tja: For when you come back, that was a neat web page.  Thanks for offering it to me to read.14:32
ZipCPU(That refers to the 6502 web page--the only one I've yet managed to read top to bottom.)14:33
ZipCPUI seem to have two pipeline issues with my current implementation, and more with a new implementation I am trying to build.14:46
ZipCPUIssue #1 has to do with the memory and ALU stages being parallel, the memory mapped I/O, and making sure that peripherals don't get a "read" instruction.  Imagine, if you will, a conditional branch followed by a read.  The branch hits writeback at the same time the read hits the memory unit.14:47
ZipCPUIssue #2 has to do with the "special" condition codes register.  This holds the 4 condition codes, the processor sleep flag, the stepping flag, error conditions, how the CPU was built, and more.  All in 32-bits.  This register is changed during writeback, but not available to be read until a clock later.14:48
ZipCPUCondition codes may still be used within any given clock cycle--they are fed back directly from the ALU into the ALU--so using the condition codes isn't as difficult as reading the condition code register.14:49
ZipCPUThe new implementation is worse.  The newer implementation sports a 9-stage pipeline, where 4 stages may have values in flight.  The stage prior will need to stall if any of the values "in flight" refers to one of the values it needs.14:50
ZipCPUI have yet to figure that problem out.14:50
kc5tjaI'm back21:17
stekernZipCPU: bypass logic and pipeline bubbles are what you need to solve the last problem22:03
stekerne.g. the best part of this file handles bypass logic:
stekernand this:
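As a rough illustration of the bypass-plus-bubble idea (not the code in the files linked above), here is a minimal sketch. The function `resolve_operand`, the `in_flight` list layout, and the use of `None` for a result that has not been computed yet are all assumptions made for this example.

```python
def resolve_operand(reg, regfile, in_flight):
    """Decide where a source operand comes from.

    in_flight lists (dest_reg, value) pairs for instructions that have
    not yet written back, youngest first. value is None when the result
    is not yet computed. Returns (value, stall).
    """
    for dest, value in in_flight:
        if dest == reg:
            if value is None:
                # Result not ready: stall and let a bubble move ahead.
                return None, True
            # Bypass: forward the newest in-flight result.
            return value, False
    # No in-flight writer: the register file copy is current.
    return regfile[reg], False


if __name__ == "__main__":
    regfile = {1: 10, 2: 20, 3: 30}
    in_flight = [(2, None), (1, 99)]
    print(resolve_operand(1, regfile, in_flight))  # forwarded
    print(resolve_operand(2, regfile, in_flight))  # must stall
    print(resolve_operand(3, regfile, in_flight))  # from register file
```

Searching youngest first matters: if two in-flight instructions write the same register, the reader must see the most recent one.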
--- Log closed Fri Jul 15 00:00:20 2016