IRC logs for #openrisc Thursday, 2016-08-18

--- Log opened Thu Aug 18 00:00:30 2016
kc5tja	Depends on the longest gate propagation delay from input to output.	00:28
kc5tja	Oops, wrong channel.	00:29
kc5tja	Although, that does apply to ZipCPU's question concerning why division (or multiplication for that matter) would slow the clock down on a CPU.	00:30
wallento	ZipCPU: I think 100-125MHz is probably a good target on common FPGAs	01:38
wallento	Currently I am running on 80.3 MHz	01:38
ZipCPU	stekern, wallento: Thank you!	09:34
wallento	probably you can drive it a lot higher	09:35
ZipCPU	Well ... that was part of my question. Has anyone driven it much higher?	10:12
ZipCPU	(Question addressed to wallento)	10:13
wallento	not to my knowledge	10:13
ZipCPU	Got it. Thank you!	10:13
ZipCPU	Do you know who did the dhrystone benchmark work?	10:13
wallento	no, I don't	10:18
ZipCPU	wallento: Thanks.	11:48
ZipCPU	kc5tja: I got that much, but the question really comes down to how do you separate logic across clocks?	11:52
ZipCPU	Do you increase the number of clocks a piece of logic takes, or slow down the clock to handle the extra logic?	11:52
manaar	when I am loading LINUX in the de0_nano board via telnet I am facing core dumped error given as - Info : JTAG tap: or1k.cpu tap/device found: 0x020f30dd (mfg: 0x06e, part: 0x20f3, ver: 0x0) Error: or1k_assert_reset: implement me Error: or1k_deassert_reset: implement me openocd: driver.c:191: interface_jtag_add_dr_scan: Assertion `field == out_fields + scan->num_fields' failed. Aborted (core dumped)	13:18
manaar	please help me to sort out this problem..	13:19
kc5tja	ZipCPU: The former approach implies deeper pipelines, while the latter implies wider fan-outs on gates (which increases loads). Either way, you're increasing latency.	13:48
kc5tja	You can recover throughput in both designs: in the pipeline case, obviously, streamline your code so that more instructions sit between division and the use of the results it produces.	13:49
ZipCPU	Certainly! But the deeper pipeline might mean that nothing else gets delayed, whereas slowing the clock down will delay all of the logic running off of that clock.	13:49
kc5tja	In the latter approach, make it a separate functional unit, and talk to it via queues and use an asynchronous handshake to indicate when the division is done.	13:49
kc5tja	The deeper pipeline also means greater impacts on mispredicted branches (if you predict them at all).	13:50
kc5tja	There are tradeoffs no matter which direction you choose; but, the right approach will work better for a certain class of applications.	13:50
kc5tja	(Heh, tautology of the month right there.)	13:51
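The tradeoff discussed above (stretch the clock to fit the divider, or keep a fast clock and let division take several cycles) can be put into rough numbers. This is a minimal sketch, not anything either speaker ran: the function names, the 10 ns and 40 ns delays, and the instruction mix are all made-up assumptions for illustration, and it ignores pipeline stalls and branch penalties.

```python
def time_slow_clock(n_instr, t_div_ns):
    """Total run time if the clock period is stretched to fit the divider.

    Every instruction, divide or not, pays the long period (assumed model).
    """
    return n_instr * t_div_ns


def time_multicycle(n_instr, n_div, t_fast_ns, t_div_ns):
    """Total run time if the clock stays fast and a divide takes extra cycles."""
    # Ceiling division: cycles needed to cover the divider's delay.
    cycles_per_div = -(-t_div_ns // t_fast_ns)
    other = (n_instr - n_div) * t_fast_ns
    divides = n_div * cycles_per_div * t_fast_ns
    return other + divides


if __name__ == "__main__":
    # Hypothetical numbers: 100 instructions, 5 of them divides.
    print(time_slow_clock(100, 40))         # everything runs at the slow period
    print(time_multicycle(100, 5, 10, 40))  # only the divides pay extra
```

Under these assumed numbers only the divides pay the extra latency in the multi-cycle case, which is the point ZipCPU makes at 13:49.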
ZipCPU	Yeah ... you were up late last night ;)	14:03
ZipCPU	<Rant>Remember how I messed up the timing miserably on the DDR3 memory controller so bad I'm having to start from scratch? Since starting from scratch, I just realized I'm working with a DDR3-1333 (9-9-9) memory rather than the DDR3-1600 (11-11-11) memory that I thought I was working with. Time to redesign and start over again. </Rant> Sigh.	14:11
jeremybennett	ZipCPU: Good luck with your Dhrystone work	14:22
ZipCPU	Thanks jeremybennett. I really appreciate your answers to my questions today!	14:22
ZipCPU	jeremybennett: I just started reading about superoptimization yesterday, while looking up your website. It looks like a wonderfully fun project, and I look forward to hearing how well you do!	14:23
jeremybennett	ZipCPU: It's all open. We're particularly keen to get others to pick up the work. How about a superoptimizer for OpenRISC?	14:24
ZipCPU	So ... how hard is it to make it work for newer CPUs?	14:24
ZipCPU	i.e.: CPUs that are not in the current set. Extending GCC is certainly doable, how difficult would it be to extend a superoptimizer?	14:26
kc5tja	ZipCPU: :( I feel for you. I just wish I could help, but I'm waaaay behind in my own project at the moment.	14:32
kc5tja	I wrote a tool (in Shen, because learning new languages is [almost] always fun) that I hope will let me realize my goals faster.	14:32
kc5tja	It translates a table-oriented state machine description mapping inputs to outputs into corresponding Verilog code.	14:33
kc5tja	Avoids case statements (thus avoiding priority encoder logic), and allows me to write "partial outputs" without fear of introducing latches (which consume LCs behind my back and w/out my knowledge).	14:33
ZipCPU	kc5tja: I thought I might get a touch of compassion from you. :) Thanks for understanding.	14:34
kc5tja	It also lets me design multi-hot outputs, which I'm hoping will let me minimize the size of the logic even further.	14:34
ZipCPU	"multi-hot outputs"?	14:34
ZipCPU	I'm all about minimizing logic size ... but ... can you explain please?	14:35
kc5tja	Normally, for a given state S and inputs I, you have one and only one "next state" Sn and a set of outputs to drive stuff. That's "single-hot" outputs.	14:35
kc5tja	Multi-hot outputs means a number of different input states can trigger a number of sets of outputs.	14:36
kc5tja	So, for example...	14:36
kc5tja	Most RISC-V instructions in my CPU take 4 cycles to execute. So, depending on the current instruction register value, and the current T-counter value, I trigger things like "load register address from Rs1", "load register address from Rs2", "set ALU function code to blah", etc.	14:37
kc5tja	However, at the same time, since none of this uses the bus, I can also pre-fetch the next value of the instruction register.	14:37
kc5tja	In effect, pipelining, but it's all controlled from a single batch of logic.	14:37
kc5tja	So, OP rd, rs1, imm12 or OP rd, rs1, rs2 (where OP is any ALU operation) can drive the ALU control buses and such with one batch of logic, and drive instruction fetch with another batch of logic.	14:39
kc5tja	However, Lx and Sx (loads and stores, where x determines operand size), can't do this, so I cannot just generalize "instruction fetches always happen during T0-T3."	14:39
ZipCPU	How is this different from traditional HDL programming? From Mealy/Moore state machines, etc.	14:39
kc5tja	It IS state machine design.	14:40
kc5tja	The convenience is how I express it.	14:40
kc5tja	Like I said, no more case statements. They're outta here.	14:40
ZipCPU	So it's all expressed as ... memories?	14:40
kc5tja	They take up so much room in the source text (>100 LOC) that it's nearly impossible to keep it all straight in my head.	14:40
kc5tja	It's expressed as a truth table.	14:41
kc5tja	But, more symbolic.	14:41
kc5tja	So, e.g.,	14:41
kc5tja	[on [[T 3'b000] ~irq ~reset valid] regadr_ir1 [nextT 3'b001]]	14:42
kc5tja	(excuse the S-expression syntax)	14:42
kc5tja	This is saying "when T = 3'b000 and not IRQ and not RESET and instruction valid, then set regadr to ir1 (which is a subfield of the instruction register), and nextT to 3'b001."	14:42
kc5tja	I don't have to worry about setting unused outputs to 0 all the time.	14:42
kc5tja	I don't have to worry about unfulfilled case statements generating latches without my knowledge.	14:43
kc5tja	Elsewhere, I can have instruction fetch logic fire on the same inputs.	14:43
kc5tja	[on [[T 3'b000] valid] adr_pc vpa size_2]	14:43
kc5tja	[on [[T 3'b001] ~ack] adr_pc vpa size_2 [nextT 3'b001]]	14:44
kc5tja	[on [[T 3'b001] ack] adr_pc vpa size_2 [nextT 3'b010]]	14:44
kc5tja	..etc..	14:44
kc5tja	I can focus on one functional aspect at a time.	14:44
kc5tja	And the expression is concise, yet expressive.	14:44
kc5tja	It's just a way of representing AND/OR logic in a better format that I hope will make things easier on me.	14:45
kc5tja	Because right now, Verilog is doing a piss-poor job of letting me progress beyond a certain size of logic complexity.	14:45
kc5tja	I hope to have a real-world and (hopefully) working example come tomorrow.	14:46
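The rule table kc5tja describes can be read as plain AND/OR logic: each rule is an AND of its conditions, and each output is the OR of every rule that asserts it. The sketch below is only a guess at the idea, written in Python rather than Shen. It is not kc5tja's tool; the tuple representation of rules, the wire names `r0`, `r1`, ... and the function names are all hypothetical.

```python
def width_of(literal):
    """Bit width of a sized Verilog literal, e.g. "3'b001" -> 3."""
    return int(literal.split("'")[0])


def term(cond):
    """One AND term: ("T", "3'b000") becomes a compare; "~irq" passes through."""
    if isinstance(cond, tuple):
        name, value = cond
        return "(" + name + " == " + value + ")"
    return cond


def to_verilog(rules):
    """Translate (conditions, outputs) rules into AND/OR continuous assigns."""
    lines = []
    flags = {}   # 1-bit output name -> rule wires that assert it
    fields = {}  # multi-bit output name -> (rule wire, literal) pairs
    for i, (conds, outs) in enumerate(rules):
        wire = "r" + str(i)
        lines.append("wire " + wire + " = " + " & ".join(term(c) for c in conds) + ";")
        for out in outs:
            if isinstance(out, tuple):
                fields.setdefault(out[0], []).append((wire, out[1]))
            else:
                flags.setdefault(out, []).append(wire)
    for name, wires in flags.items():
        lines.append("assign " + name + " = " + " | ".join(wires) + ";")
    for name, pairs in fields.items():
        # Replicate the rule wire across the field width, then mask the value.
        terms = ["({" + str(width_of(v)) + "{" + w + "}} & " + v + ")" for w, v in pairs]
        lines.append("assign " + name + " = " + " | ".join(terms) + ";")
    return "\n".join(lines)


# The four rules quoted in the log, re-expressed as Python tuples.
RULES = [
    ([("T", "3'b000"), "~irq", "~reset", "valid"], ["regadr_ir1", ("nextT", "3'b001")]),
    ([("T", "3'b000"), "valid"], ["adr_pc", "vpa", "size_2"]),
    ([("T", "3'b001"), "~ack"], ["adr_pc", "vpa", "size_2", ("nextT", "3'b001")]),
    ([("T", "3'b001"), "ack"], ["adr_pc", "vpa", "size_2", ("nextT", "3'b010")]),
]

if __name__ == "__main__":
    print(to_verilog(RULES))
```

Because every output is an OR of rule wires, an output no rule mentions simply never appears, and an output no active rule asserts is 0. That would match the "no latches, no unused outputs to zero" property described above, though the real tool may achieve it differently.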
ZipCPU	kc5tja: When I compare how you are building your CPU, to how I have built CPUs, I sort of need to ask: wouldn't it be a lot easier to handle logic complexity by splitting logic among clocks?	14:47
kc5tja	Basically, going to rewrite the instruction decoder for the S64X7 using this as a proof-of-concept demonstration.	14:47
kc5tja	I don't understand what you're saying.	14:47
kc5tja	I take four clock cycles to execute the average CPU instruction.	14:47
ZipCPU	That certainly sounds reasonable, I'm doing five cycles and one of those cycles is due to CPU specific logic.	14:48
kc5tja	The problem is scoping the complexity of the logic.	14:48
kc5tja	RISC has a LOT of moving parts, which isn't often discussed.	14:48
ZipCPU	I'm just remembering your experience with the RISC-V CPU instruction decoder, and the massive number of combinatorial logic statements that appeared to require.	14:48
kc5tja	Yes; there are 24 classes of instructions that need to be decoded all total.	14:49
kc5tja	Within each class, there are edge cases which trigger illegal instruction traps.	14:49
ZipCPU	24 classes of instructions?? Ouch! I'm working with 26 instructions!	14:49
kc5tja	E.g., shifting left by 33 bits on a 32-bit RISC-V architecture (or 65 bits on a 64-bit) will trigger an illegal instruction trap.	14:49
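The shift example can be made concrete. In the base encoding, a shift-immediate carries its shift amount in the bits starting at bit 20 (5 bits on RV32, 6 bits on RV64), and the bits above it must match a fixed pattern, so "shift by 33" on RV32 spills into a bit that has to be zero. The following is a sketch under that reading of the encoding; the function names and the `encode_shift_imm` helper are hypothetical and exist only for this illustration.

```python
OP_IMM = 0b0010011  # major opcode shared by the register-immediate ALU ops


def encode_shift_imm(funct3, rd, rs1, shamt, arith=False):
    """Hypothetical helper: pack a shift-immediate without range-checking shamt."""
    word = (shamt << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | OP_IMM
    if arith:
        word |= 1 << 30  # bit 30 selects the arithmetic right shift
    return word & 0xFFFFFFFF


def shift_imm_is_legal(instr, xlen=32):
    """True if the bits above the shift amount hold a permitted pattern."""
    if instr & 0x7F != OP_IMM:
        raise ValueError("not an OP-IMM instruction")
    funct3 = (instr >> 12) & 0x7
    if funct3 not in (0b001, 0b101):
        raise ValueError("not a shift-immediate")
    shamt_bits = 5 if xlen == 32 else 6
    upper = (instr & 0xFFFFFFFF) >> (20 + shamt_bits)
    arith_flag = 1 << (30 - (20 + shamt_bits))  # where bit 30 lands in `upper`
    # Left shifts allow only all-zero upper bits; right shifts also allow bit 30.
    allowed = {0} if funct3 == 0b001 else {0, arith_flag}
    return upper in allowed


if __name__ == "__main__":
    slli_33 = encode_shift_imm(0b001, rd=1, rs1=1, shamt=33)
    print(shift_imm_is_legal(slli_33, xlen=32))  # shift amount overflows 5 bits
    print(shift_imm_is_legal(slli_33, xlen=64))  # fits in 6 bits
```

A decoder that wants to trap on these cases needs a check of this kind in every shift path, which is one source of the edge-case logic mentioned above.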
kc5tja	Well,	14:50
kc5tja	loads are one class.	14:50
kc5tja	stores are another.	14:50
kc5tja	OP and OP-imm are two more classes.	14:50
kc5tja	If you're implementing a 64-bit CPU like I am, you'll need OP32 and OP-imm-32 classes, so there's two more.	14:51
kc5tja	The SYSTEM class of instructions is what provides things like system-call, return-from-trap-handler instructions, read-modify-write CSR registers, etc.	14:51
kc5tja	Memory synchronization is another class. (These, thankfully, are just NOPs in my design.)	14:52
kc5tja	AUIPC, LUI, JAL, and JALR are four more classes.	14:52
ZipCPU	Let's see, I have a couple classes: ALU, memory, divide, FPU, and other. Anything that writes back gets mapped to one of the big four. (Memory synchronization is part of the "other")	14:52
kc5tja	So there's, what, 12 classes right there.	14:52
kc5tja	A single 'class' of instructions can have typically up to 8 instructions, though sometimes it can have 10.	14:53
ZipCPU	I suppose this sort of makes sense: there's load word, load byte, load halfword, in a "load" class ...	14:54
kc5tja	Still, that mass of combinatorial logic you saw was all because I needed to decode whether or not I had a valid instruction.	14:54
ZipCPU	(It was ugly ...)	14:54
kc5tja	Because RISC-V is pretty strict about trapping on illegal instruction forms.	14:54
kc5tja	The idea is that RISC-V is exceptionally well suited for virtualization, so anything which is not explicitly defined by an implementation or standard has to trap for virtualization purposes.	14:55
ZipCPU	Well ... after having a CPU go wild that didn't trap on illegal instructions ... and worse, it went wild through peripheral memory ...	14:55
ZipCPU	(my flash doesn't work the same anymore ...)	14:55
ZipCPU	I'm a strong supporter of illegal instruction detection ... now.	14:55
kc5tja	The one thing I hate about RISC-V though is how many different immediate operand forms it has.	14:55
kc5tja	I refuse to believe that saving a transistor or two in an ASIC is worth the confounding what-the-fsckery that is having 4 different forms of immediates.	14:56
kc5tja	20-bit and 12-bit, OK, I get it. 6-bit too for shifts.	14:56
kc5tja	But, 21-bit (with bit 0 forced zero to fit in a 20-bit space), 20-bit, 13-bit (ditto), 12-bit, 6-bit, and 5-bit immediates, not all of which are contiguously arranged, is just asking for complexity.	14:57
ZipCPU	Let's see: there's R, I, S, and U types ... right?	14:57
kc5tja	RISC-V defines R, I, S, SB, U, and UJ instruction forms. However, again, there're those stupid edge cases to worry about, because (just one example) FENCE instructions use a specialized imm12 encoding.	14:58
kc5tja	Some FPU instructions also define an R3 form (4-operand instead of 3-operand instruction).	14:59
kc5tja	The number of immediates goes up by FOUR if you consider the compressed instruction format as well.	14:59
kc5tja	So it's all over the map, and it makes software emulation needlessly slow.	14:59
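To show why the scattered immediates are a chore for a software emulator, here is a sketch of extracting the immediate for the I, S, SB, U, and UJ forms of the base 32-bit encoding. The helper names are made up for this note, and compressed instructions and the FENCE special case mentioned above are left out.

```python
def bits(word, hi, lo):
    """Extract word[hi:lo] (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def imm_i(w):
    return sign_extend(bits(w, 31, 20), 12)


def imm_s(w):
    return sign_extend(bits(w, 31, 25) << 5 | bits(w, 11, 7), 12)


def imm_sb(w):
    # 13-bit branch offset; bit 0 is implicitly zero.
    value = (bits(w, 31, 31) << 12 | bits(w, 7, 7) << 11
             | bits(w, 30, 25) << 5 | bits(w, 11, 8) << 1)
    return sign_extend(value, 13)


def imm_u(w):
    return sign_extend(bits(w, 31, 12) << 12, 32)


def imm_uj(w):
    # 21-bit jump offset; bit 0 is implicitly zero.
    value = (bits(w, 31, 31) << 20 | bits(w, 19, 12) << 12
             | bits(w, 20, 20) << 11 | bits(w, 30, 21) << 1)
    return sign_extend(value, 21)


if __name__ == "__main__":
    print(imm_i(0xFFF00093))   # addi x1, x0, -1
    print(imm_sb(0xFE000EE3))  # beq x0, x0, -4
    print(imm_uj(0xFFDFF06F))  # jal x0, -4
```

Each form needs its own shuffle of shifts and masks before sign extension, which is cheap as wiring in hardware but costs real instructions in an emulator.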
ZipCPU	Have you started back on your implementation of RISC-V yet, or are you still working on your 64-bit container CPU?	15:01
kc5tja	S64X7 is essentially done, but it's still too big. 6400-ish LUTs (give or take about 200 LUTs. Haven't compiled in a while.)	15:01
kc5tja	Part of the reason for my writing this tool is to see how effective it is at reducing the size of the S64X7 (if at all), OR, failing that, to see if I can resume work on Polaris CPU and actually get to a point of completion.	15:05
ZipCPU	Are those 3-LUTs?	15:06
kc5tja	3-input look-up-tables.	15:07
kc5tja	(usually a 4LUT with a wasted input)	15:07
ZipCPU	Okay, that makes a lot of sense. Consider me cheering you on from the side lines. Although ... I've got to run now for lunch. Back in a bit.	15:08
kc5tja	Ditto; lunchtime here too.	15:08
kc5tja	back	16:53
ZipCPU	Likewise, but I'm trying to pretend I'm working now ;)	16:53
ZipCPU	Sometimes I wish I could share all the markups I've made to the DDR3 specification. It makes it much easier to read, and I'd love to distribute it with my DDR3 controller, but JEDEC is controlling the specification to such an extent that it cannot be freely exchanged.	21:37
--- Log closed Fri Aug 19 00:00:32 2016
