IRC logs for #openrisc Thursday, 2016-07-14

--- Log opened Thu Jul 14 00:00:38 2016
stekern	ZipCPU|Laptop: stalling and bypass logic is definitely the hardest part	01:23
stekern	one trick to make it slightly less complicated is to avoid stalling wherever possible ;)	01:24
wallento	hey everyone. I created a simple module that I strangely didn't find anywhere else: https://github.com/wallento/uartdpi	02:43
ZipCPU	... and here I just wrote a UART simulator that uses TCP/IP. You include the simulator in your Verilog build, connect the RX and TX lines, run the simulator, and you can connect to and interact with the simulator's UART via a telnet session.	07:01
ZipCPU	stekern: That was my original plan. It included branch delay slots. Then I discovered that I needed to support interrupts, memory mapped peripherals with varying response times, caches that may or may not have the data of interest in them, ...	07:08
ZipCPU	stekern: and that one (two?) special purpose register I have with only some bits that change, others that do "special things". (It only gets updated after writeback, and because of the special bits, you can't grab it during writeback).	07:10
kc5tja	ZipCPU: If it helps re: how to handle stalls in a pipeline,	14:02
kc5tja	ZipCPU: I borrow the "stall_i" signal concept from Wishbone. When asserted, it forces that stage's registers to a no-operation condition.	14:03
kc5tja	Thus, subsequent stages of the pipeline still operate, but do so using a deliberately injected no-op instruction (or condition).	14:04
kc5tja	With regards to stalling proper, I build each pipeline stage using a hardwired microsequencer. Some microsequencers happen to have only one state,	14:04
kc5tja	which degenerates to a "regular" pipe stage.	14:04
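A minimal Python sketch of the bubble-injection idea kc5tja describes, not his actual RTL. It assumes one reading of "stall_i": the stalled stage and everything upstream hold their contents, while the register just downstream receives an injected no-op. The names `step`, `regs`, and `stalled_stage` are hypothetical.

```python
NOP = "nop"

def step(regs, fetched, stalled_stage=None):
    """Advance a pipeline model by one clock.

    regs[i] is the instruction currently held in stage i (0 = first stage).
    If stalled_stage is k, stages 0..k hold their contents, stage k+1
    receives an injected NOP, and later stages advance normally.
    """
    nxt = []
    for i in range(len(regs)):
        if stalled_stage is not None and i <= stalled_stage:
            nxt.append(regs[i])        # hold: this stage cannot advance
        elif stalled_stage is not None and i == stalled_stage + 1:
            nxt.append(NOP)            # bubble injected behind the stall
        elif i == 0:
            nxt.append(fetched)        # normal fetch into the first stage
        else:
            nxt.append(regs[i - 1])    # normal advance from upstream
    return nxt
```

With `regs = ["i3", "i2", "i1"]`, a stall in stage 0 keeps "i3" in place while stage 1 gets the NOP and "i2" still moves on to stage 2.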
ZipCPU	kc5tja: Thanks for the comment and thoughts. I'm actually doing something quite similar: I have _ce, _valid, and _stall lines for each stage. The logic is supposed to be as simple as ...	14:05
ZipCPU	(okay ... tell me more about this microsequencer ... ?)	14:05
kc5tja	But, where true stalls are required, or where I need multiple cycles (e.g., storing a 64-bit quantity over a 16-bit bus), the microsequencer is modular and isolated, yet fully cooperative with other stages.	14:05
ZipCPU	Logic is supposed to be as simple as stage_n_stall = (stage_n_valid)&&(stage_n+1_stall), and stage_n_ce = (stage_n-1_valid)&&(~stage_n_stall), but the corner cases keep getting me.	14:06
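The two equations ZipCPU gives can be evaluated for a whole pipeline in one pass, working from the last stage back to the first. This is only a sketch of those two equations as written; the function name `handshake` and the boundary signals `input_valid` and `sink_stall` are assumptions, and none of the corner cases he mentions are modelled.

```python
def handshake(valid, input_valid, sink_stall):
    """Evaluate per-stage stall and clock-enable signals for one cycle.

    valid[i]    : stage i currently holds a valid instruction
    input_valid : something is available to enter stage 0 (assumed signal)
    sink_stall  : whatever follows the last stage refuses data (assumed signal)

    stall[i] = valid[i] and stall[i+1]
    ce[i]    = valid[i-1] and not stall[i]
    """
    n = len(valid)
    stall = [False] * n
    # Stalls propagate backwards, so resolve the last stage first.
    for i in reversed(range(n)):
        downstream = sink_stall if i == n - 1 else stall[i + 1]
        stall[i] = valid[i] and downstream
    ce = [
        (input_valid if i == 0 else valid[i - 1]) and not stall[i]
        for i in range(n)
    ]
    return stall, ce
```

Note how an empty (not valid) stage absorbs a stall: with `valid = [True, False, True]` and a stalled sink, only the last stage stalls and the first two stages still advance.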
kc5tja	I got the idea from this site: http://www.pagetable.com/?p=39	14:06
ZipCPU	(Looking up the page now)	14:06
kc5tja	Long and short of it, my attempt at building an RV64I RISC-V core using a "normal" pipeline quickly exceeded my FPGA's capacity.	14:07
ZipCPU	(Sounds familiar ...)	14:07
kc5tja	So I'm making a 6502-like "pipelined" processor, where instruction fetch happens concurrently with execution, but both take 4 clock cycles.	14:08
kc5tja	Yeah, iCE40s are cheap, but small. :)	14:08
kc5tja	And I'm so used to using the Nexys2 FPGA (3S1000E part), that I never had a feel for "big" versus "small" designs until now.	14:08
ZipCPU	(Do iCE40's still use 3-input LUTs?) Back to your comment, though: how long is your pipeline in total?	14:08
kc5tja	4-input, I believe.	14:08
kc5tja	Well, I started with a 5-stage, but now it's just 2-stages (fetch and execute).	14:09
kc5tja	I can get away with 2 because I can squeeze operand-fetch and write-back into otherwise unused clock cycles taken up by instruction fetch.	14:09
ZipCPU	Really? I'm working on a 5-stage pipeline (still): Fetch, decode, add immediate, do-alu-operation, and write-back.	14:10
kc5tja	It has its own costs though.	14:10
kc5tja	Yeah, mine was fetch, decode-and-fetch-registers, execute, memory, and write-back.	14:10
ZipCPU	Oh, and I'm also trying to run an IPC of 1 ... the pipeline stages are all designed so that they take no more than one clock.	14:10
kc5tja	Basically, the same as MIPS.	14:10
ZipCPU	Ok, that makes sense. I started this journey with little knowledge of MIPS, so my stages are a touch different.	14:11
kc5tja	That's what I wanted, but I ended up needing 1800 4LUTs just for the ALU. That's half of my FPGA alone. :)	14:11
ZipCPU	OUCH!	14:11
ZipCPU	Is this all 64-bit math? Is that why your number is so large?	14:11
kc5tja	Yep, a 64-bit data path.	14:12
kc5tja	Not sure; I'm guessing it's possible that Yosys just isn't as mature a layout optimizer as a commercial tool.	14:12
ZipCPU	Do you have multiple sized operations? Some on 64-bits, some on 32-bits, some on 16, and some on 8?	14:12
kc5tja	Nope; only 64-bit.	14:12
ZipCPU	Ok, sorry on the comment there, I'm used to about 2k (or less) 6-LUTs for the entire CPU. 2k LUTs on a 4-LUT system may make more sense.	14:13
kc5tja	PicoRV32 takes around that much, and it too uses microsequencing I believe.	14:13
kc5tja	ISTR it was around 1900 4LUTs for a fully spec'ed out 32-bit core.	14:14
ZipCPU	"microsequencing" ... is this just another term for "microcode"? Like a CPU within a CPU to execute the instructions received?	14:14
kc5tja	Microcode is a form of microsequencing, but the reverse isn't always the case.	14:16
kc5tja	For example, the 68000 has microcode inside, but the 6502 does not. It relies on a hardwired state machine instead.	14:16
kc5tja	I'm relying on a hardwired state machine. It was the only feasible way I could get the microarchitectural parallelism I needed to keep most instructions confined to four cycles. :)	14:17
ZipCPU	Ok ... but you're still "sequencing" single operations across multiple clocks, right?	14:18
kc5tja	Yes, that's correct.	14:18
ZipCPU	So ... each pipeline stage takes multiple clocks before moving to the next stage?	14:18
kc5tja	I use the term "microsequencing" because if I change from using a PLA-style decoder to microcode, I don't have to update a bunch of website copy. :)	14:18
kc5tja	Right. I only have a 16-bit wide bus, and each bus cycle takes 2 clock cycles. So a full 32-bit opcode takes 4 clock cycles to acquire.	14:19
kc5tja	That gives the execute stage enough time to fetch two registers (2 cycles) execute and write-back (3rd cycle), and if necessary, decide to take or skip a branch (4th cycle).	14:19
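The cycle counts above follow from the bus parameters kc5tja states (16-bit bus, 2 clocks per bus cycle). A small Python sketch of that arithmetic; the helper name `clocks_for_transfer` and the labels in `EXECUTE_SCHEDULE` are hypothetical, and the 64-bit figure is derived on the assumption that a 64-bit store uses the same bus timing.

```python
def clocks_for_transfer(bits, bus_width_bits=16, clocks_per_bus_cycle=2):
    """Clock cycles needed to move `bits` over the bus, rounding up to
    whole bus cycles."""
    bus_cycles = -(-bits // bus_width_bits)  # ceiling division
    return bus_cycles * clocks_per_bus_cycle

# What the execute stage does while the next 32-bit opcode is fetched,
# one entry per clock, as described in the log.
EXECUTE_SCHEDULE = (
    "read first register",
    "read second register",
    "execute and write back",
    "take or skip branch (if needed)",
)

opcode_clocks = clocks_for_transfer(32)   # 2 bus cycles * 2 clocks
store64_clocks = clocks_for_transfer(64)  # 4 bus cycles * 2 clocks
```

The execute schedule has exactly as many slots as the opcode fetch takes clocks, which is what lets the two stages run in lockstep.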
ZipCPU	The 16-bit wide bus, is that your peripheral/memory bus whereby you interact with things outside of your CPU?	14:20
kc5tja	Although, due to the way the pipeline works, all conditional and unconditional branches end up taking 8 cycles anyway. Bleh.	14:20
kc5tja	Yes.	14:20
kc5tja	I have an externalized Wishbone bus which I defined here: https://hackaday.io/project/11928-backbone-bus	14:20
ZipCPU	Have you thought at all about running a "bus" internal to your core (between pipeline stages) that runs 64 bits wide? Or perhaps it just doesn't make sense?	14:21
kc5tja	The idea is to build my home-made computer using an inexpensive (relatively) backplane, with even cheaper FPGA boards serving as I/O, CPU, and memory controllers.	14:21
kc5tja	That's what I did originally, but I can't fit everything on the FPGA fabric if I do that.	14:21
kc5tja	When I started my Kestrel-3 project, I wanted to continue to use my XC3S1000E FPGA on the Nexys-2, but (long story) I had to move to the iCE40 and Yosys toolchain.	14:22
ZipCPU	This is absolutely fascinating, if you don't mind my saying so.	14:22
kc5tja	I'd love to chat more, but I have a lunch meeting to attend, and I'm afraid I'm already a bit late.	14:22
ZipCPU	Do you have a web page describing your kestrel-3 project?	14:23
ZipCPU	Sure ... when you have time we can discuss more. Thanks for the comments, though!	14:23
kc5tja	I have several, it's a bit schizophrenic at the moment. http://kestrelcomputer.github.io/kestrel is the main project page. https://leanpub.com/k3ug is the user's guide, and https://hackaday.io/project/10035-kestrel-computer-project is the "work in progress, me rambling about stuff" project page.	14:24
ZipCPU	Cool!	14:24
ZipCPU	kc5tja: For when you come back, that was a neat web page. Thanks for offering it to me to read.	14:32
ZipCPU	(That refers to the 6502 web page--the only one I've managed to read top to bottom so far.)	14:33
ZipCPU	I seem to have two pipeline issues with my current implementation, more with a new implementation I am trying to build.	14:46
ZipCPU	Issue #1 has to do with the memory and ALU stages being parallel, the memory mapped I/O, and making sure that peripherals don't get a "read" instruction. Imagine, if you will, a conditional branch followed by a read. The branch hits writeback at the same time the read hits the memory unit.	14:47
ZipCPU	Issue #2 has to do with the "special" condition codes register. This holds the 4 condition codes, the processor sleep flag, the stepping flag, error conditions, how the CPU was built, and more. All in 32-bits. This register is changed during writeback, but not available to be read until a clock later.	14:48
ZipCPU	Condition codes may still be used within any given clock cycle--they are fed back directly from the ALU into the ALU--so using the condition codes isn't as difficult as reading the condition code register.	14:49
ZipCPU	The new implementation is worse. The newer implementation sports a 9-stage pipeline, where 4 stages may have values in flight. The stage prior will need to stall if any of the values "in flight" refers to one of the values it needs.	14:50
ZipCPU	I have yet to figure that problem out.	14:50
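The stall condition ZipCPU describes for the newer pipeline can be written as a simple check of the needed source registers against the destinations of in-flight instructions. This is a hedged sketch of that condition only, not ZipCPU's implementation; `must_stall`, `source_regs`, and `in_flight_dests` are hypothetical names, and `None` is assumed to mark an in-flight slot that writes no register.

```python
def must_stall(source_regs, in_flight_dests):
    """Return True if any register the issuing stage needs is still
    being produced by an instruction in flight.

    source_regs     : register numbers the next instruction reads
    in_flight_dests : destination register of each in-flight slot,
                      or None if that slot is empty / writes nothing
    """
    pending = {d for d in in_flight_dests if d is not None}
    return any(r in pending for r in source_regs)
```

For example, with four in-flight slots writing `[None, 2, None, 5]`, an instruction reading registers 1 and 2 must wait, while one reading registers 1 and 3 may proceed.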
kc5tja	I'm back	21:17
stekern	ZipCPU: bypass logic and pipeline bubbles are what you need to solve the last problem	22:03
stekern	e.g. the best part of this file handles bypass logic: https://github.com/openrisc/mor1kx/blob/master/rtl/verilog/mor1kx_rf_cappuccino.v	22:06
stekern	and this: https://github.com/skristiansson/eco32f/blob/master/rtl/verilog/eco32f_registerfile.v	22:07
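A generic sketch of the bypass-plus-bubble approach stekern suggests, written in Python for illustration. It is not taken from the mor1kx or eco32f sources linked above; `read_operand` and its arguments are hypothetical, and in-flight results are assumed to be listed newest first.

```python
def read_operand(reg, regfile, in_flight):
    """Fetch an operand, forwarding from in-flight results where possible.

    regfile   : mapping of register number -> committed value
    in_flight : list of (dest, value) pairs, newest instruction first;
                value is None if the result has not been computed yet

    Returns (ready, value). ready is False when the newest producer of
    `reg` has no result yet, so the reader must insert a bubble.
    """
    for dest, value in in_flight:
        if dest == reg:
            if value is None:
                return (False, None)   # result not ready: stall
            return (True, value)       # bypass the newest result
    return (True, regfile[reg])        # no producer in flight
```

Searching newest first matters: if two in-flight instructions write the same register, the reader must see the later one, not the older value still waiting for writeback.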
--- Log closed Fri Jul 15 00:00:20 2016
