--- Log opened Sat Jul 16 00:00:41 2016 | ||
kc5tja | ZipCPU: I think I fixed the ISE problem. Looks like the distribution was missing a libQt_Network.so dependency. | 00:29 |
---|---|---|
kc5tja | Crap! It still doesn't launch from the GUI though. :( | 00:31 |
kc5tja | This is really quite frustrating. | 00:32 |
kc5tja | AH HA!! I had to source a shell script before launching the ISE editor. Would have been nice if Xilinx had told me this before!! | 00:36 |
ZipCPU|Laptop | kc5tja: Okay, that one I could've told you. I've got a script I use to start up ISE and that's most (all) of the script: run the shell script, then ISE. | 08:13 |
ZipCPU | Okay, so this is crazy--I've debugged computers for more than 30 years, and I've never seen this bug pattern before: | 08:58 |
ZipCPU | 1) Sometimes the function prolog allocates space on the stack, sometimes it doesn't. | 08:58 |
ZipCPU | 2) A small irrelevant change to the program can keep this from happening. (A heisenbug!) | 08:59 |
ZipCPU | 3) If you load a program filled with nothing but NOOPs, the problem is guaranteed. (as long as you don't make the irrelevant change ...) | 08:59 |
ZipCPU | At this point, I think I have a cache bug--where the NOOP (the last program) is getting run rather than the new one. | 08:59 |
ZipCPU | I've just never seen this pattern before. | 08:59 |
Laksen | Anyone here know of any nice pretty generic instruction fetchers written in verilog? | 16:12 |
ZipCPU|Laptop | Laksen: I just finished a lot of work on a fairly generic instruction fetcher, written in Verilog. | 16:37 |
ZipCPU|Laptop | I'm not sure if it is "nice" and "pretty" enough for you, but it is (currently) fully functional. | 16:38 |
ZipCPU|Laptop | Barring the last few modifications, 1) it can run with a 200MHz clock on an Artix-7, 2) it combines the cache with the instruction fetch, and so 3) it can support early branching with only a single stall cycle when jumping to somewhere in the cache. | 16:40 |
Laksen | Functional is a lot better than not functional :) | 16:40 |
Laksen | Can I have a look? | 16:40 |
ZipCPU|Laptop | You can find it in the xula25soc project on OpenCores, I just checked my work in. The particular fetch you are looking for can be found in trunk/rtl/pfcache.v | 16:40 |
ZipCPU|Laptop | Oops ... better make that trunk/rtl/cpu/pfcache.v. | 16:41 |
ZipCPU|Laptop | There's a less traditional pre-fetch cache in there as well, representing my first attempt at building such. That one is called 'pipefetch.v'. | 16:41 |
ZipCPU|Laptop | Pipefetch works by trying to maintain a window in memory around the program counter. Jumps outside the window reset the window starting from the new location. | 16:42 |
ZipCPU|Laptop | Needless to say, I abandoned pipefetch for the better performance of pfcache, still ... it's a unique approach. | 16:43 |
Laksen | I'll have a look. I was thinking of a similar approach, but I just want to get the stuff running for now :) | 16:44 |
ZipCPU|Laptop | Are you using a wishbone bus? | 16:45 |
Laksen | No, AXI | 16:45 |
ZipCPU|Laptop | Well that will be one difference. | 16:45 |
ZipCPU|Laptop | Another may have to do with instruction width. This prefetch cache was designed for 32-bit instructions. | 16:45 |
Laksen | A small adapter should be fine. I don't care too much about the latency | 16:46 |
Laksen | It's for a bog standard RISC-V so 32-bit is perfect | 16:46 |
ZipCPU|Laptop | You ... <GASP> ... don't care about <gulp> latency? ;) This whole approach was built to cut my latency down. <Grin> | 16:46 |
Laksen | At some point it becomes a concern, but this is just a fun vacation project to investigate extreme pipelining :P | 16:47 |
ZipCPU|Laptop | Really? Sounds cool! ... how extreme are we talking about? | 16:48 |
Laksen | Got my ALU ready which can almost run at >500 MHz on an Artix 7 | 16:48 |
Laksen | 64 bit | 16:48 |
ZipCPU|Laptop | Gosh, it took me a bit to get my 32-bit ALU able to run at 200MHz on an Artix-7--and you are headed for 500MHz?? | 16:49 |
Laksen | It synthesizes at 480 MHz where all the IO are tied directly to IOB's (giving an extra 0.8 ns delay) | 16:50 |
Laksen | 8 pipeline cycles though... so bad code will not be fast at all | 16:50 |
Laksen | 200 MHz, is that a single cycle pipeline? | 16:51 |
ZipCPU|Laptop | Laksen: Sorry to run off so quickly and unannounced--the dogs blessed the floor, and the basement staircase started flooding, and ... | 17:17 |
ZipCPU|Laptop | Life is now good again. | 17:17 |
ZipCPU|Laptop | 200MHz is not a single cycle pipeline. 200MHz was going to be a 9-stage pipeline. How you get up from that speed to 400+MHz I don't know. | 17:18 |
ZipCPU|Laptop | This is my first attempt at a "high speed" FPGA design, so ... I'm learning a lot in the process about what high speed requires. | 17:18 |
Laksen | Ah okay | 17:19 |
Laksen | The ALU alone in my design is 8 stages. So in the end it'll probably be 8+fetch+decode+opfetch+mem(n) | 17:20 |
ZipCPU|Laptop | Okay, so I'm two stages for the ALU, unless the instruction requires a multiply--that will take longer. | 17:21 |
Laksen | Each ALU stage does a 9-bit add and a single shift. Besides that, I've spread out all the different logic operations over the different ALU stages | 17:21 |
ZipCPU|Laptop | How are you handling pipeline conflict detection? | 17:21 |
ZipCPU|Laptop | Sorry, "pipeline hazard" detection--just remembered the proper term. | 17:21 |
Laksen | I keep a tally by ORing one-hots of all output registers in flight. Any that conflict will stall the pipeline. So nothing fancy | 17:22 |
Laksen | Simple forwarding for the end of the ALU stage | 17:22 |
ZipCPU|Laptop | What if two instructions both use the same register as an output, but no inputs use that register? | 17:23 |
Laksen | No problem in that case | 17:23 |
Laksen | Oh wait. That's actually a problem I don't handle | 17:23 |
Laksen | Thanks for asking :P | 17:23 |
ZipCPU|Laptop | Sure! That's one of the approaches I have been considering, and the problem I mentioned is one I'm ... struggling with. | 17:25 |
Laksen | I've been dreaming many years of solving this problem programmatically | 17:26 |
ZipCPU|Laptop | You mean in software?? As in, in the compiler? | 17:26 |
Laksen | Doing dynamic compilation of a binary into RTL, specifically for processing pipelines | 17:27 |
ZipCPU|Laptop | By "dynamic compilation", are you referring to instruction reordering inside the CPU? | 17:28 |
Laksen | Basically write a program in a highlevel language that describes all paths through a CPU, and then execute that program symbolically | 17:28 |
Laksen | Where you create a bunch of mappings between registers and IO, memory and register ports | 17:28 |
ZipCPU|Laptop | I'm not sure I follow ... | 17:29 |
ZipCPU|Laptop | Is there a paper describing your approach? | 17:29 |
Laksen | Let me find an example | 17:29 |
Laksen | No | 17:29 |
Laksen | It's a novel methodology but I worked with this a lot on my master's thesis, just in the wrong direction :) | 17:30 |
ZipCPU|Laptop | Are you working from within Academia? | 17:31 |
Laksen | Not any longer | 17:31 |
Laksen | This is just sparetime work :) | 17:31 |
Laksen | Here's an example: http://pastebin.com/4mg5jBRt | 17:31 |
Laksen | It might help the understanding that this is a basic RISC-V emulator | 17:32 |
Laksen | The language it's written in doesn't matter. In fact this is written for a Pascal compiler that compiles to RISC-V | 17:33 |
Laksen | But that doesn't matter | 17:33 |
Laksen | All that matters is that it's symbolically executed | 17:33 |
Laksen | The code in the bottom is the initialization. It starts up a clocked task that's assumed to run once per clock | 17:33 |
Laksen | And finish at some point | 17:33 |
Laksen | Memories(2D) and registers(1D) are created before that | 17:34 |
Laksen | Memories and registers can be accessed by reads or writes | 17:34 |
Laksen | At a low level in the symbolic execution those are performed by system calls, so they are easy to figure out | 17:35 |
Laksen | Conditional branches are used to propagate information about when those are performed | 17:36 |
ZipCPU|Laptop | Okay, so ... if this is a basic emulator, ... why would you need a Verilog prefetch? | 17:36 |
ZipCPU|Laptop | (Just curious ...) | 17:36 |
Laksen | So for example register storages have an attached condition based on the path through the program that store took. | 17:36 |
Laksen | Ohh. This is an entirely different project :P | 17:36 |
Laksen | Sorry, just spilling my brain here :P | 17:37 |
ZipCPU|Laptop | Oh ... Ok. You had me confused. | 17:37 |
ZipCPU|Laptop | Something about a "RISC-V emulator" and "> 400 MHz" just ... didn't quite add up. ;) | 17:38 |
Laksen | Well I get too enthusiastic about dynamic recompilation and automatic pipeline construction sometimes :| | 17:38 |
Laksen | But the pipeline is real though, very simple :) http://pastebin.com/6K8761tu | 17:39 |
Laksen | Don't know yet what the registerfile accesses will be, but I think it can run far above 500 MHz if those don't slow it down | 17:40 |
ZipCPU|Laptop | On an FPGA, or in dedicated (ASIC) hardware? | 17:41 |
Laksen | Aiming for Artix 7 | 17:41 |
ZipCPU|Laptop | Will you publish your results anywhere? | 17:41 |
kc5tja | Meanwhile, I'm having an impossible condition: a boolean expression where all inputs are well defined, yet Verilog insists the result is 'x'. >:( | 17:42 |
Laksen | Sure | 17:42 |
ZipCPU|Laptop | I'd love to read about it. | 17:42 |
ZipCPU|Laptop | Hello, kc5tja, welcome back. | 17:42 |
ZipCPU|Laptop | kc5tja: Have you tried running your code through Verilator? | 17:42 |
Laksen | Or XST. IVerilog and Yosys both accepted my old code, but the Xilinx synthesizer threw a synthesis time error | 17:43 |
kc5tja | No, largely because Verilator confuses me to no end. | 17:43 |
ZipCPU|Laptop | To Verilate, just do "verilator -cc toplevelverilog.v". | 17:44 |
ZipCPU|Laptop | I'm not necessarily going to recommend going farther than that, but Verilator does include some tremendous code checking capabilities that have found bugs ISE and Vivado have let slip. | 17:45 |
Laksen | ZipCPU|Laptop, by the way, which WB interface is your pfcache using? | 17:45 |
Laksen | B3/B4 pipeline/no pipeline? | 17:46 |
ZipCPU|Laptop | B4, pipelined. | 17:46 |
ZipCPU|Laptop | You gotta do pipelined--that way you get one access per clock. Otherwise, you've crippled your bus. | 17:46 |
ZipCPU|Laptop | Just ... let the user beware ... you can't cross devices. | 17:46 |
Laksen | I agree, but I got to say I like the crispiness of AXI a lot more | 17:47 |
Laksen | There are too many loose ends in Wishbone :/ | 17:48 |
ZipCPU|Laptop | I haven't used AXI that much. How is it better (worse)? | 17:48 |
Laksen | In AXI it's always pipelined | 17:48 |
Laksen | The transactions are so easy to understand, because it's all built on handshaking on 5 channels | 17:48 |
Laksen | Bursts are optional, but are handled precisely the same. Transactions are layered on top | 17:49 |
kc5tja | Verilator won't even compile my code; I'm apparently much too modern for it at Verilog 1995. | 17:49 |
ZipCPU|Laptop | kc5tja: Not likely. You might wish to take a closer look at what it complains about. | 17:49 |
Laksen | Can you pastebin the problematic code? | 17:49 |
ZipCPU|Laptop | I'd love to take a look myself. | 17:50 |
kc5tja | It tells me quite explicitly that Verilog 1995 keyword is not supported. :) | 17:54 |
kc5tja | In this case, wait(). | 17:54 |
kc5tja | https://gist.github.com/sam-falvo/71139ddfc4e9b80c47e3fcce18e1f500 | 17:56 |
Laksen | Why not just do a @(posedge clk_o); @(negedge clk_o); ? | 17:58 |
Laksen | Never heard about the wait keyword before | 17:58 |
ZipCPU|Laptop | Is it synthesizable Verilog? | 17:59 |
Laksen | No | 17:59 |
Laksen | Or maybe the problem is that x is non-zero | 17:59 |
kc5tja | Now Verilator tells me unexpected @. | 17:59 |
Laksen | So the condition will always be true after startup | 17:59 |
kc5tja | x means 'undefined' or 'unknown.' | 18:00 |
Laksen | @(posedge clk_o); should be a perfectly valid statement | 18:00 |
kc5tja | Which is hogwash, since *all* of the term's inputs are well defined. | 18:00 |
kc5tja | Nope. Verilator doesn't like it. | 18:00 |
kc5tja | No change in behavior in iverilog. | 18:01 |
ZipCPU|Laptop | "always @(posedge clk_o) story_o <= story;" is what you want. | 18:01 |
Laksen | Not really | 18:01 |
ZipCPU|Laptop | No? | 18:01 |
Laksen | It should work just fine as is | 18:01 |
Laksen | I use that stuff all the time | 18:01 |
kc5tja | I was hoping to avoid this, but I think I need to throw this into Xilinx ISE to see what it thinks, and let me run a simulation there. | 18:02 |
Laksen | Ah | 18:03 |
Laksen | You have a bunch of errors on lines 55-60 | 18:03 |
Laksen | Iverilog complains about those | 18:04 |
ZipCPU|Laptop | Some parentheses would fix those easily. | 18:04 |
kc5tja | My version of iverilog does not. | 18:04 |
Laksen | No | 18:04 |
Laksen | state_o doesn't exist in the file | 18:04 |
Laksen | Implicit declaration | 18:05 |
kc5tja | What options do you provide to make iverilog detect these errors? Mine literally is silent about them. | 18:05 |
Laksen | I use a compiled version from the source repository | 18:05 |
kc5tja | I'm at 0.9.7 | 18:06 |
Laksen | I'm at 11.0 (devel) | 18:06 |
Laksen | I can't remember why I needed the upgrade, but it's way better | 18:07 |
Laksen | Supports SystemVerilog 2012 even | 18:07 |
kc5tja | Thank you! | 18:07 |
Laksen | Oh right. It was because it had support for the $fatal function | 18:07 |
kc5tja | I passed (on a whim) -Wall and it found the defect. | 18:07 |
Laksen | Very nice for makefile testbenches :) | 18:07 |
kc5tja | OK, I got basic instruction fetching implemented. | 20:39 |
kc5tja | Next step, illegal instruction trap. | 20:39 |
kc5tja | Took longer than I expected; but, it at least is working and my basic design is known to not be fantasy. | 20:40 |
kc5tja | That was easier than I'd ever expected. | 21:12 |
kc5tja | Well, that's quite frustrating. | 21:55 |
kc5tja | iverilog needs qualification for a module's ports (e.g., input foo; wire foo;), while Xilinx will treat this as an error. | 21:55 |
ZipCPU | olofk: If you are interested in a TCP version of a simulated UART, my code is posted in OpenCores, xula25soc, trunk/bench/cpp. You'll want the two files, uartsim.cpp and uartsim.h. | 22:45 |
ZipCPU | They'll take as inputs the UART transmit from the FPGA, and send the results to a TCP port (if anyone's connected to it). Characters sent on that port to the simulator will be turned into UART wires on the receive, and so it works. | 22:46 |
ZipCPU | The only minor difficulty might be the form of the setup word--telling it the baud rate, number of bits per symbol, parity information, etc. | 22:46 |
--- Log closed Sun Jul 17 00:00:42 2016 |
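
The hazard tally Laksen describes at 17:22 (ORing one-hot masks of every destination register in flight) and the write-after-write gap ZipCPU points out at 17:23 can be modelled in a few lines. This is only a sketch in Python rather than Verilog, assuming a 32-entry register file; `one_hot`, `in_flight_mask`, `must_stall`, and the `check_waw` switch are hypothetical names, not taken from either design. Stalling on a matching destination is one conservative way to close the gap, not necessarily what either author ended up doing.

```python
NUM_REGS = 32  # assumption: a 32-entry integer register file


def one_hot(reg):
    """Return a mask with only bit `reg` set."""
    if not 0 <= reg < NUM_REGS:
        raise ValueError("register index out of range")
    return 1 << reg


def in_flight_mask(regs):
    """OR together the one-hot masks of the given registers."""
    mask = 0
    for reg in regs:
        mask |= one_hot(reg)
    return mask


def must_stall(pending_dests, srcs, dest=None, check_waw=True):
    """Decide whether the issuing instruction has to wait.

    pending_dests: destination registers of instructions still in the pipeline.
    srcs:          source registers of the instruction about to issue.
    dest:          its destination register, if it has one.
    check_waw:     also stall when `dest` is already pending (write-after-write).
    """
    wanted = in_flight_mask(srcs)
    if check_waw and dest is not None:
        wanted |= one_hot(dest)
    return (in_flight_mask(pending_dests) & wanted) != 0


if __name__ == "__main__":
    pending = [5, 7]
    print(must_stall(pending, srcs=[1, 2], dest=3))                   # no conflict
    print(must_stall(pending, srcs=[5, 2], dest=3))                   # source is pending
    print(must_stall(pending, srcs=[1, 2], dest=7, check_waw=False))  # the missed case
    print(must_stall(pending, srcs=[1, 2], dest=7))                   # caught with the extra check
```

With `check_waw=False` the model reproduces the source-only tally, which lets two writers to the same register overlap; that only matters when results can retire out of order, for example behind a variable-latency memory stage.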
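
Laksen's 8-stage, 64-bit ALU (16:48 to 17:21) does a 9-bit add per stage. One reading, which the log does not confirm, is that each stage adds an 8-bit slice plus the incoming carry, giving a 9-bit partial sum whose top bit is the carry into the next stage. The Python sketch below models that arithmetic only; the names `SLICE`, `STAGES`, and `sliced_add` are hypothetical, and the loop walks one operation through the stages in sequence, whereas real hardware would have a different operation in every stage on each clock.

```python
WIDTH = 64                   # operand width from the log
SLICE = 8                    # assumption: 8 data bits per stage
STAGES = WIDTH // SLICE      # 8 stages, matching the stated ALU depth
SLICE_MASK = (1 << SLICE) - 1


def sliced_add(a, b):
    """Add two 64-bit values one slice per stage, lowest slice first.

    Returns (sum modulo 2**64, final carry out).
    """
    carry = 0
    result = 0
    for stage in range(STAGES):
        shift = stage * SLICE
        # 9-bit partial sum: 8 sum bits plus the carry out of this slice
        partial = ((a >> shift) & SLICE_MASK) + ((b >> shift) & SLICE_MASK) + carry
        result |= (partial & SLICE_MASK) << shift
        carry = partial >> SLICE
    return result, carry
```

Keeping each stage to a short carry chain is what lets the clock rate climb; the cost is the 8-cycle latency Laksen mentions, so dependent instructions back to back will stall.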