--- Log opened Thu Sep 22 00:00:22 2016 |
olofk | Hoolootwo: Yeah, I've been bitten by that a few times. It's really annoying. We should fix it | 02:10 |
Hoolootwo | at least for now, could something be thrown in the readme? | 02:11 |
olofk | Hoolootwo: Will do. I forgot about it as I generally don't use the xterm at all, but connect via telnet instead | 02:11 |
Hoolootwo | ah okay, I will probably end up doing that too eventually | 02:12 |
olofk | I found it works a bit better than xterm | 02:12 |
olofk | The issue there is that if you boot linux, it will just stop at the prompt without any indication that it's waiting for a telnet connection | 02:13 |
olofk | I guess that or1ksim in general could be a bit more descriptive :) | 02:13 |
olofk | We should also make 32MB RAM the default, so you don't strictly need to use a config file | 02:14 |
olofk | You know what, I'll file a few bugs so I don't forget | 02:14 |
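(For context, a hedged sketch of the kind of or1ksim config file being discussed, following the sim.cfg section syntax; the base addresses, IRQ, and TCP port here are only examples, not values taken from the conversation:)

```
/* RAM sized to 32MB so no custom size is needed */
section memory
  type = unknown
  name = "RAM"
  baseaddr = 0x00000000
  size = 0x02000000          /* 32MB */
end

/* UART exposed over TCP; connect with e.g. "telnet localhost 10084"
   instead of using the xterm channel */
section uart16550
  enabled = 1
  baseaddr = 0x90000000
  irq = 2
  channel = "tcp:10084"
end
```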
olofk | ZipCPU|Laptop: I think a potential improvement could be to store the provider info outside of the core file, as a separate file. This has crossed my mind a few times, but there are of course some drawbacks to this approach too | 02:24 |
olofk | So for now, the general workflow I use myself is to store a .core file in the repo without a provider section. This is always up-to-date with the head of the repo | 02:25 |
olofk | For orpsoc-cores I prefer to store only proper releases and point to a specific version, tag, or commit in the provider section | 02:26 |
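(As an aside, a minimal sketch of what such a .core file might look like in the CAPI=1 format used by orpsoc-cores at the time; the core, user, and repo names are placeholders, and the in-repo copy would simply leave out the [provider] section so it always tracks the head of the repo:)

```ini
CAPI=1
[main]
name = mycore-1.0

[verilog]
src_files = rtl/verilog/mycore.v

[provider]
name    = github
user    = someuser
repo    = mycore
version = v1.0
```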
olofk | And regarding MicroBlaze. I noticed the same thing a few years ago. It's an interesting architecture, and I'm not convinced it's all bad | 02:27 |
olofk | It allows you to have a high bandwidth connection to your RAM that doesn't have to wait for slow transfers on the peripheral bus | 02:28 |
Hoolootwo | from what I have seen, for applications where you really need I/O, you do DMA on the MicroBlaze | 02:28 |
olofk | There are however some complications. I worked with a dual-core setup that had one part of the RAM shared, which meant it couldn't be cached. That was a bit tricky to get right, since I had to feed a segment of the peripheral bus to the RAM | 02:29 |
olofk | Hoolootwo: Yes. If you're transferring large amounts of data | 02:30 |
olofk | But many of the I/O transfers are just small and slow accesses, like talking to an SPI controller or UART | 02:31 |
olofk | And having separate buses will avoid having the CPU wait for the bus to be free when it wants to talk to the RAM | 02:31 |
shorne | stekern: I am looking at jonas's change to sys_rt_sigreturn. I understand that he made the change to switch the return path from the normal syscall return path to the exception return path. But it's not that big of a difference, as the syscall return path checks for pending work and then jumps to the exception return path if there is any. | 03:43 |
shorne | So other than some restored registers it's not too much different. Do you have any idea what jonas meant when he said he reworked that patch? | 03:44 |
shorne | I didn't see any rework in my rebase. If you don't know I'll just send him a mail | 03:44 |
stekern | afair, that is the reworked patch, and he did get feedback from sebastian macke about it. But, I might be remembering wrong | 03:45 |
stekern | I'm pretty sure I would have picked up the latest version if there was a more recent one though | 03:46 |
shorne | yeah, I can't find anything in any history; also the comment says "comment from the original patch" | 03:46 |
shorne | Do you know if it was discussed on the linux kernel mailing list before, or just on the openrisc list? | 03:46 |
shorne | since the openrisc archive seems gone now | 03:47 |
stekern | just the openrisc list | 03:47 |
shorne | I see, ok, I might just have to shoot jonas a mail. I read through everything and it seems ok | 03:47 |
stekern | I can try to forward the messages to you, I have them in my own archive | 03:48 |
shorne | that would be great if you can | 03:48 |
stekern | done | 03:50 |
shorne | Hmm, so it seems the first patch still returned via the syscall path, and the second returned via the exception path | 03:58 |
shorne | but then Sebastian says strace is really broken, and jonas says he will look again | 03:58 |
shorne | so there might be a 3rd patch | 03:58 |
olofk | Jonas hasn't been active for about four years, so don't get your hopes up that he did another one | 04:01 |
shorne | yeah, I am kind of thinking that | 04:01 |
shorne | well, I guess I have to try this patch with strace and see if it breaks | 04:02 |
stekern | shorne: I strongly remember that there was no follow-up after that last mail, sebastian might remember if he did some more testing of it | 04:08 |
stekern | poke53281 <- sebastian | 04:08 |
shorne | stekern: thanks, it's good to know you remember no updates. Interestingly it seems that thread was just between you, Jonas, and Sebastian | 04:10 |
shorne | looks like it was off the mailing list (judging by the forwarded headers) | 04:10 |
stekern | that might very well be the case | 04:10 |
shorne | wayback machine only has lists.openrisc.net till 2012 | 04:11 |
shorne | it doesn't have the mails though, I'll test it | 04:14 |
shorne | https://www.mail-archive.com/linux@lists.openrisc.net/index.html#00432 | 04:17 |
shorne | I found this | 04:17 |
shorne | good | 04:17 |
shorne | those mails were definitely not on the list. anyway, I've got work to do | 04:24 |
shorne | (not those patches) | 04:24 |
ZipCPU|Laptop | olofk: I can understand Xilinx's purpose in having four busses. 1. Separate instruction and data busses avoid a bottleneck, 2. Caching a bus that can be cached is an advantage, 3. Having a wide bus speeds things up with memory--especially since the default DDR3 width is 128 bits. | 06:20 |
ZipCPU|Laptop | What gets me is that none of these busses are truly pipelined. | 06:21 |
ZipCPU|Laptop | They are heavy, feature-laden beasts that are (in terms of performance) inherently slow. | 06:21 |
wallento | about that core file provider, there is still the plan to bring up an api.librecores.org that provides all cores as .core files or others | 06:24 |
olofk | ZipCPU|Laptop: AXI4 is definitely pipelined | 06:26 |
ZipCPU|Laptop | Not the way Xilinx implemented it for their MicroBlaze. | 06:28 |
ZipCPU|Laptop | According to the docs, they only allow one request in flight at a time--even though the bus allows more. | 06:28 |
ZipCPU|Laptop | For the peripheral bus, that's one 32-bit word request. | 06:28 |
ZipCPU|Laptop | For the memory/cachable busses, that's one 128-bit request that may be pipelined if the bus is smaller. | 06:29 |
olofk | ah ok | 06:29 |
olofk | The problem with pipelined accesses is that you lose exact exceptions | 06:29 |
ZipCPU|Laptop | Not necessarily. I was reading through the LM32 wishbone spec yesterday, and they handled that this way: | 06:30 |
ZipCPU|Laptop | Every STB pulse gets either an ACK, ERR, or RTY signal in return. | 06:30 |
ZipCPU|Laptop | So, as long as the ERR signal doesn't come before your ACK signal (which could happen if you cross devices ...) your exceptions remain exact. | 06:31 |
olofk | But that is without pipelining | 06:32 |
olofk | You have to wait for the ack, err or rty to come back before sending another one | 06:33 |
olofk | This is basically what all non-pipelined wb masters do | 06:33 |
ZipCPU|Laptop | Why wait? The alternative is that you are prepared to roll back several operations. You need to maintain that information anyway, in order to know which register to place the result into for a read request. | 06:35 |
olofk | Yes. Rolling back is another option, but it's also more complex | 06:35 |
olofk | But I doubt that lm32 implements wb4 pipelined mode | 06:36 |
olofk | uBlaze sends a burst request to fill a cache line, when needed. It's the same thing we do with mor1kx | 06:39 |
olofk | I don't see how pipelining would help here | 06:40 |
olofk | (Can't believe I'm defending uBlaze after all the bad things I have said about it) :) | 06:41 |
ZipCPU|Laptop | So here's a question there: is the flash on the cache line, or just the DDR3 memory? | 07:02 |
ZipCPU|Laptop | QSPI flash can get a *big* benefit from pipelining. | 07:03 |
ZipCPU|Laptop | olofk: Regarding rollback ... for loads, I don't retire instructions until the memory operation is complete. There's nothing that needs to be rolled back as a result. Writes can be (but aren't yet) done the same way. | 07:04 |
ZipCPU|Laptop | Nothing truly then needs to be rolled back. | 07:05 |
olofk | Not sure how they do QSPI Flash. It's likely not pipelined, so either they read from the peripheral bus, or they DMA to the memory | 07:05 |
olofk | Do you have a cache? | 07:05 |
olofk | Without a cache, pipelined accesses are definitely a benefit | 07:08 |
shorne | olofk: any word from opencores.org? | 07:13 |
ZipCPU|Laptop | olofk: Currently, I have an instruction cache but no data cache. I also have no MMU, so concurrency is not a big issue for me (yet). I intend to fix/change both of these, but that hasn't happened yet. | 07:18 |
olofk | shorne: Not from what I have heard | 07:20 |
kc5tja | WithOUT a cache, pipelining is a benefit? Unless you're streaming data, that has been the case in my models. Software reads and writes to random locations in memory more frequently than not, so you have to incur the 70ns (for DRAM) penalty with sufficient frequency that SDRAM protocol overhead actually makes it a slower choice than asynchronous 70ns DRAM. | 10:47 |
kc5tja | s/has been the case/has not been the case/ | 10:47 |
kc5tja | QSPI is not pipelined; it is, however, a burst transfer device. | 10:48 |
kc5tja | Its protocol is a lot like SDRAM's protocol, only with more clock cycles. | 10:49 |
kc5tja | You send it the read command (which includes the address) in about 6 cycles or so, then you wait some more cycles (with no transfers) while the device accesses the flash contents, and then you start streaming data back. | 10:50 |
kc5tja | If a particular device does support some kind of pipelining, it's at most six cycles (or however many it takes to receive a command), which is likely to be but a tiny fraction of your burst length. | 10:51 |
Hoolootwo | I can't seem to get to opencores.org from any of my various locations, is it down for anyone else? | 14:44 |
Hoolootwo | dns seems to resolve, but no http gets through | 14:44 |
wallento | it's dead | 14:59 |
Hoolootwo | how dead? | 15:00 |
Hoolootwo | gone forever, or should it be back relatively soon? | 15:01 |
mafm | it's not dead, it's pining for the fjords | 15:01 |
* mafm wearing a Cleese t-shirt right now, conveniently | 15:02 | |
Hoolootwo | mafm++ | 15:03 |
ZipCPU|Laptop | kc5tja: I intend to discuss the benefit of pipelining without a cache at ORCONF. Indeed, part of my presentation will show Dhrystone measures with and without pipelining. | 15:30 |
ZipCPU|Laptop | As for QSPI, if every access requires the six clocks for address plus two dummy clocks before data shows up, then you've just made my point. | 15:31 |
ZipCPU|Laptop | A first access in a group requires 8 clocks to start, then can produce one 32-bit data value every 8 clocks. | 15:31 |
ZipCPU|Laptop | If you string your operations together, sequentially, then you can read from the flash in 8+8N clocks. | 15:32 |
ZipCPU|Laptop | This is one form of "pipelining". | 15:32 |
ZipCPU|Laptop | As for SDRAM, the DDR3 SDRAM I'm working with will have an access time of (roughly) 9+N clocks. | 15:34 |
ZipCPU|Laptop | Pipelining lets you exploit the N instead of requiring that N be 1 every time. | 15:34 |
ZipCPU|Laptop | BTW: N is the number of 128-bit words you wish to read (or write--you just can't switch mid transfer without stalling) | 15:35 |
ZipCPU|Laptop | One more comment on the Dhrystone measure: that is with and without pipelining on the *data* channel. The *instruction* channel is both pipelined and cached as soon as the cache in the CPU is enabled, and hence the CPU is pipelined. (The option connects the two within the ZipCPU.) | 15:37 |
ZipCPU|Laptop | Indeed, I get a rough 50% improvement in my Dhrystone score by implementing pipelining ... even without a data cache. | 15:37 |
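(Purely illustrative arithmetic using the 8+8N figure quoted above, comparing accesses strung together against issuing each 32-bit word as its own 8+8-clock transaction:)

```
separate transactions : N * (8 + 8) clocks    e.g. N = 64 -> 1024 clocks
strung together       : 8 + 8 * N   clocks    e.g. N = 64 ->  520 clocks (roughly 2x)
```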
-!- Dan__ is now known as Guest63860 | 15:42 | |
-!- Guest63860 is now known as ZipCPU | 15:42 | |
kc5tja | ZipCPU|Laptop: Nice. I wish I could attend. :( | 15:50 |
kc5tja | re: 8+8N clocks -- that's not pipelining. That's bursting. Pipelining is when you can *start* transaction N+1 *before* transaction N completes. | 15:51 |
kc5tja | Otherwise, RS-232 communications is highly pipelined transmission of data. ;-) | 15:52 |
ZipCPU | Perhaps I'm not defining pipelining the same way. Hmm ... here's my definition: the controller never relinquishes control of the bus and would, barring stalls from the WB slave, issue one request per clock and then wait for one ack per request. | 15:56 |
ZipCPU | With that definition, RS-232 is not "pipelined" unless the wishbone master holds onto the bus and the RS-232 device stalls everything while waiting for its next byte. | 15:56 |
ZipCPU | Similarly, a more traditional RS-232 device would maintain a FIFO buffer, which could easily be set up for M+N clocks, where M is any bus propagation time and N is the number of transfers necessary to either fill the buffer or finish the message. | 15:57 |
kc5tja | Clocking doesn't really enter into any definition I've seen (only used as examples). | 16:02 |
kc5tja | For example, a CPU with a pipeline can still take 5 cycles to execute an instruction, but if it has a 5-deep pipeline, it can "appear" to have an instruction latency of 1 cycle. | 16:02 |
kc5tja | That's because it's busy processing 5 instructions at any given time (bubbles notwithstanding). | 16:02 |
kc5tja | But, a pipeline doesn't always have to be clock synchronous. The 80386 actually had a limited pipeline which allowed up to three instructions to execute at once, but the minimum latency was 2 cycles. | 16:03 |
kc5tja | Another example more relevant to Flash SQPI devices is the RapidIO interconnect. | 16:05 |
kc5tja | In as little as four clocks (but can be more depending on the kind of packet and how wide your interconnect is), you can kick off any bus transaction you like. | 16:05 |
kc5tja | In a couple of other clocks, you'll get back an acknowledgement that the packet was received by the network. | 16:06 |
kc5tja | However, that doesn't mean your transaction has completed. It just means that the interconnect is now free for another if you support that. | 16:06 |
kc5tja | The receipt of the "ok I'm done" for your transaction might come hundreds of clocks later. In the meantime, you can "queue up" a number of other transactions, some of which might even complete out of order (!!). | 16:07 |
ZipCPU | I still think "pipeline" is appropriate. This for two reasons: 1) the WB spec calls this type of access "pipelined", and 2) the bus (not necessarily the peripheral) is acting in a pipelined fashion -- even by your definition. | 16:07 |
kc5tja | However, what you _cannot_ do is break up an individual transaction's burst of data. | 16:07 |
kc5tja | I cannot agree with that definition. | 16:08 |
ZipCPU | Consider a bus with multiple stages within it. If there's one request in each stage, you then have a pipeline. | 16:09 |
kc5tja | (1) WB treats bus transactions as single-beat things, even in a pipelined implementation. The "pipeline" depth in the controller _must_ match the interconnect's register depth (plus the pipeline depth in the peripheral), or it will fall out of synchronization. | 16:09 |
ZipCPU | If the peripheral at the far end only accepts one access every x clocks, that doesn't negate the fact that the bus itself was pipelined | 16:10 |
kc5tja | Yes, because the other x-1 clocks are impossible to use for a transaction. | 16:10 |
ZipCPU | Now wait a second here ... if a CPU has five pipeline stages, and whenever you perform a multiply the multiply stage takes 8 clocks, that doesn't mean the CPU isn't pipelined. | 16:12 |
kc5tja | That's not what I said. | 16:12 |
kc5tja | What makes it pipelined is the fact that you can have up to 5 instructions in flight at the same time. Key word: SAME time. | 16:13 |
ZipCPU | Then ... what have I missed? You offered a CPU as an example of what defined a pipeline, and I'm pointing out that any pipeline can stall. | 16:13 |
ZipCPU | Ok, but I can still have five bus transactions in flight at the SAME time, even if the peripheral at the end stalls the bus. | 16:13 |
kc5tja | Simply stuffing a stream of data down a pipe doesn't make it pipelined. What makes it pipelined is the _capability_ for that pipe to have _multiple_ and _independent_ transactions in flight at once. | 16:14 |
kc5tja | For example, if your CPU can request to read the next instruction from program space _before_ the currently executing instruction completes a data fetch from data space, then you have a pipelined bus. | 16:15 |
kc5tja | This is why I say flash QPI devices are bursted, not pipelined. You cannot start read #2 until read #1 has completely finished. | 16:16 |
ZipCPU | So ... if a CPU issues a write command to address 0, 1, 2, and 3, before being stalled by the peripheral that needs to wait 14 clocks before the first request completes, and 8 clocks for every request thereafter ... that's not a pipeline? | 16:16 |
ZipCPU | And then once that first request completes, the CPU issues a command to write address 4 -- even before 1, 2, and 3 have completed ... that's not a pipeline? | 16:17 |
kc5tja | Nope. That's just burst-mode with lots of wait-states. | 16:17 |
kc5tja | Wait, you just said that write to 3 stalls the CPU. | 16:17 |
ZipCPU | Yes. The 3rd write stalled the CPU, not the first two. | 16:18 |
kc5tja | I need to see a timing diagram, because this is too confusing to disentangle on IRC alone. | 16:18 |
ZipCPU | Do you have WB B4 spec available to you? | 16:19 |
kc5tja | Yes. It's on my desktop. | 16:19 |
ZipCPU | Okay, let's compare illustration 3-10 on page 49 with ... | 16:20 |
ZipCPU | 3-11 on page 51. | 16:20 |
ZipCPU | 3-11 is what I'm calling "pipelined" | 16:20 |
kc5tja | OK, that is pipelined by virtue of the fact that the address bus, WE, and other control signals change value every (non-stalled) cycle. That is to say, EVERY cycle is potentially a unique read or write transaction. | 16:23 |
ZipCPU | Yes! | 16:23 |
kc5tja | What raised my objection is (let me type it out) | 16:23 |
kc5tja | Flash SQPI devices cannot support that mode of operation. At all. | 16:24 |
kc5tja | What they DO have, is a set of clock cycles where you send an address and a WE bit, | 16:24 |
kc5tja | followed by some access time latency, | 16:24 |
kc5tja | followed by one or more cycles of contiguously addressed data. | 16:24 |
ZipCPU | (Let me know when you are done ...) | 16:25 |
kc5tja | (heh, sorry, had an interruption at the door) | 16:27 |
kc5tja | But, all the while, it's one transaction. | 16:27 |
kc5tja | These 23 clocks (or whatever) all correspond to a _single_ WB bus cycle. | 16:27 |
kc5tja | I guess the spec calls them block transactions instead of burst transactions. | 16:28 |
kc5tja | Still got Motorola terminology in my brain. | 16:28 |
kc5tja | Does that make sense? | 16:28 |
ZipCPU | I think so ... perhaps our confusion is in the difference between the device itself and the controller. | 16:29 |
kc5tja | Now, can you USE pipelining for this operation? Absolutely. And I honestly would probably prefer it over block transactions because it seems to give more control over timing. | 16:29 |
kc5tja | Might be. Different things have different pipelines, which is semantically equivalent to different clock domains. | 16:30 |
kc5tja | Either way, no matter which terminology you use, I only ask that you be consistent with it. :) | 16:30 |
ZipCPU | So, the multiple QSPI controllers I've written have all been both internally pipelined (especially this last one), and used the pipeline bus mode. | 16:31 |
ZipCPU | I can understand why you might say, though, that the interface itself is not pipelined. | 16:31 |
* kc5tja nods | 16:32 | |
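(To make the distinction concrete, here is a minimal, hypothetical sketch of a Wishbone B4 pipelined read master in the illustration 3-11 sense; it is not taken from the ZipCPU, mor1kx, or LM32. A new address is presented on every un-stalled cycle while a counter tracks requests still waiting for their ACK, so several transactions are in flight at once:)

```verilog
module wb_pipelined_reader #(parameter AW = 32, DW = 32) (
    input  wire           clk, rst,
    input  wire           start,          // begin a burst of reads
    input  wire [AW-1:0]  start_addr,
    input  wire [7:0]     count,          // number of reads to issue
    output reg            wb_cyc, wb_stb,
    output reg  [AW-1:0]  wb_adr,
    input  wire           wb_stall,       // slave back-pressure (B4 pipelined)
    input  wire           wb_ack,
    input  wire [DW-1:0]  wb_dat_i
);
    reg [7:0]    to_issue;     // requests not yet presented
    reg [8:0]    outstanding;  // presented but not yet ACKed
    reg [DW-1:0] rd_data;      // last word returned (would feed the CPU/cache)

    wire issue  = wb_cyc && wb_stb && !wb_stall;  // request accepted this cycle
    wire retire = wb_cyc && wb_ack;               // response returned this cycle

    always @(posedge clk) begin
        if (rst) begin
            wb_cyc <= 1'b0; wb_stb <= 1'b0; to_issue <= 0; outstanding <= 0;
        end else begin
            if (start && !wb_cyc) begin
                wb_cyc   <= 1'b1;
                wb_stb   <= (count != 0);
                wb_adr   <= start_addr;
                to_issue <= count;
            end
            if (issue) begin
                // present the next address without waiting for the previous ACK
                wb_adr   <= wb_adr + (DW/8);
                to_issue <= to_issue - 1'b1;
                if (to_issue == 1)
                    wb_stb <= 1'b0;               // last request just went out
            end
            if (retire)
                rd_data <= wb_dat_i;
            outstanding <= outstanding + issue - retire;
            // drop CYC once everything issued has been acknowledged
            if (wb_cyc && !wb_stb && (outstanding - retire) == 0)
                wb_cyc <= 1'b0;
        end
    end
endmodule
```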
shorne | wallento: FYI, I am building musl with a host gcc version of 6.11. It seems that gcc-6 cannot build gcc-5 due to this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69959 | 17:36 |
-!- Netsplit *.net <-> *.split quits: hammond, simoncoo1, andrzejr, eliask, ssvb, jeremybennett, wallento, Hoolootwo, SMDwrk, nurelin, (+7 more, use /NETSPLIT to show all of them) | 17:37 | |
shorne | wallento ... just got split | 17:37 |
shorne | great | 17:37 |
-!- Netsplit over, joins: eliask, simoncoo1, kc5tja, rokka | 17:38 | |
shorne | It seems we need to bump up to gcc 5.4.0 (which I did here) https://github.com/stffrdhrn/or1k-gcc/tree/musl-5.4.0 | 17:39 |
shorne | I just did a merge from the gcc 5.4.0 release into or1k to create or1k-5.4.0, and then rebased musl-5.3.0 onto or1k-5.4.0 to create musl-5.4.0 | 17:41 |
shorne | wallento: to repeat, I did the bump to gcc 5.4.0 here: https://github.com/stffrdhrn/or1k-gcc/tree/musl-5.4.0 | 17:41 |
shorne | it was very smooth | 17:41 |
shorne | no conflicts | 17:41 |
shorne | now my musl build is running again | 17:42 |
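(A rough sketch of that branch workflow; the branch names come from the repo linked above, but the upstream tag name and the exact git invocations are assumptions:)

```sh
# in a clone of the or1k-gcc repo
git checkout -b or1k-5.4.0 or1k
git merge gcc-5_4_0-release          # upstream GCC 5.4.0 release tag (name assumed)

git checkout -b musl-5.4.0 musl-5.3.0
git rebase or1k-5.4.0                # replay the musl patches onto the new or1k base
```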
shorne | Hoolootwo: what do you need from opencores? openrisc-related data is here: http://openrisc.io/, and the opencores repos are here: http://freecores.github.io/ | 18:46 |
Hoolootwo | shorne, I was just looking at the topic, and I keep finding broken opencores links on google about openrisc stuff | 23:09 |
--- Log closed Fri Sep 23 00:00:24 2016 |
Generated by irclog2html.py 2.15.2 by Marius Gedminas - find it at mg.pov.lt!