| --- Log opened Thu May 15 00:00:49 2014 | ||
|---|---|---|
| poke53281 | Yes, it's linked with libc as far as I know. | 00:02 |
| dalias | _franck_, fyi, i have a pretty generic high-performance "C" implementation of memcpy that does the word shuffling for alignment, as part of musl libc | 00:46 | 
| dalias | it's modeled after the way you'd do it in asm for most risc isa's, and is competitive with the asm implementations for most (iirc it's something like 80-90% of the speed of android's memcpy.s on arm) | 00:47 | 
| pgavin | dalias: I suppose it could be compiled down and tweaked by hand | 01:07 | 
| dalias | yes | 01:09 | 
| dalias | i suspect it's closer to optimal on other archs | 01:11 | 
| dalias | iirc it was the compiler's failure to generate huge ldm/stm's that kept it from being optimal on arm | 01:12 | 
| dalias | amusingly it's pathologically bad on microblaze in the default gcc config | 01:12 | 
| dalias | because minimal microblaze lacks a barrel shifter and gcc defaults to not generate shift instructions except <<1 and >>1 | 01:13 | 
| dalias | so it generates unrolled >>1 and <<1 loops for the 8-, 16-, and 24- shifts :-p | 01:13 | 
| pgavin | lol | 01:29 | 
| pgavin | yeah, I figure the worst thing there will be in the gcc output is some stack references that can be removed | 01:29 | 
| pgavin | gcc does pretty well | 01:30 | 
| pgavin | still some problems I'm trying to figure out :) | 01:30 | 
| stekern | _franck_: Linux doesn't use libgcc for memcpy in the generic case: http://lxr.free-electrons.com/source/lib/string.c#L589 | 03:29 | 
| stekern | even though that one isn't efficient, as dalias said, you can make a pretty efficient one in C | 03:29 |
| stekern | but, in the kernel, there are copy_to/from_user, that have the exception table entries in them, so they kind of have to be in asm | 03:32 | 
| stekern | and if you have written an efficient copy_to/from_user (we haven't), then you basically have an efficient memcpy too | 03:33 | 
| stekern | (for reference, here's our copy_to/from_user: http://lxr.free-electrons.com/source/arch/openrisc/lib/string.S#L36) | 03:34 | 
| pgavin | stekern: this is strange... I've removed the entire pipeline description from the or1k.md file and the code performs better | 04:10 | 
| pgavin | every time I add more info to the description it gets worse | 04:10 | 
| pgavin | I'm tempted to commit a patch that removes the description that's there | 04:12 |
| stekern | funny... | 04:24 | 
| stekern | maybe our pipeline implementation in mor1kx is all backwards ;) | 04:25 | 
| pgavin | yeah, I don't know | 04:25 | 
| pgavin | it's annoying | 04:25 | 
| pgavin | so for or1k-headers, I'm going to write an xml file that describes the sprs, and a python script to create a header from it | 04:26 |
| stekern | do you have an idea what changes with the pipeline description that makes it worse? | 04:27 | 
| pgavin | then if other formats are needed (verilog) it's easy to port | 04:27 | 
| pgavin | the pipeline description seems to make the code bigger, for one | 04:27 | 
| pgavin | that's probably the biggest thing | 04:27 | 
| pgavin | but it even executes fewer instructions in or1ksim | 04:28 | 
| pgavin | not many fewer, but some | 04:28 | 
| stekern | I think that's the first good use of xml I've ever heard ;) | 04:28 |
| pgavin | lol | 04:28 | 
| pgavin | I have a similar thing for another cpu I'm working on | 04:28 | 
| pgavin | part of my dissertation :) | 04:28 | 
| stekern | usually when somebody mentions xml, I go 'ick', but for this... I think that makes sense ;) | 04:29 | 
| pgavin | yeah, it worked out well for that one | 04:29 | 
| pgavin | so I figured it would be ok here too :) | 04:29 | 
| stekern | hmm, so have you tried increasing cache sizes? | 04:31 | 
| stekern | what kind of benchmarks do you run? | 04:31 | 
| pgavin | I just have 2 really | 04:33 | 
| pgavin | I got frustrated after trying two of them | 04:33 | 
| pgavin | maybe I should do more | 04:34 | 
| pgavin | one is qsort on an array of strings and one is crc | 04:34 | 
| pgavin | I figured both should be easy to optimize | 04:34 | 
| pgavin | I didn't increase the cache size | 04:34 | 
| pgavin | but I don't think that should matter | 04:34 | 
| pgavin | the datasets are bigger than the cache could ever be in any case | 04:35 | 
| stekern | ok, but that should be unrelated to icache | 04:35 | 
| pgavin | true | 04:36 | 
| pgavin | both algorithms should fit in the cache | 04:36 | 
| stekern | I'm just thinking if it's *just* related to code size growth | 04:36 | 
| pgavin | I don't think so | 04:36 | 
| stekern | or is the actual code larger *and* slower | 04:36 | 
| pgavin | well I mean I don't think the code is growing to where it's bigger than the cache | 04:36 | 
| stekern | ok | 04:37 | 
| pgavin | it's only growing by a few hundred bytes in the worst case | 04:37 | 
| stekern | larger code of course means slower code, even if it fits in cache | 04:37 | 
| pgavin | yes | 04:37 | 
| pgavin | well maybe 200 bytes is enough | 04:37 | 
| pgavin | any way to enable statistics on the simulator? other than cycles I mean | 04:38 | 
| stekern | that's a lot if the original code was 10 bytes ;) | 04:38 | 
| pgavin | indeed :) | 04:38 | 
| pgavin | but I diffed the assembly output | 04:38 | 
| stekern | I guess it wasn't though | 04:38 | 
| pgavin | and it didn't look unreasonable | 04:38 | 
| pgavin | even the simplest description seems to make things worse though | 04:39 | 
| stekern | interesting | 04:39 | 
| stekern | I've been running coremark to benchmark mor1kx a lot, but I've been as lazy as you when it comes to diversity in the benchmarks | 04:40 | 
| pgavin | well it seems to me that my optimization shouldn't make something as common as qsort perform worse | 04:41 | 
| pgavin | and if it does, I don't want to commit it | 04:41 | 
| stekern | coremark's data sets and code fit in around 8KB of cache | 04:41 |
| stekern | yeah, I agree | 04:42 | 
| pgavin | is coremark freely available? | 04:42 | 
| pgavin | ah, eembc | 04:43 | 
| pgavin | have to pay | 04:43 | 
| stekern | yes, you need to download the 'core' code from eembc | 04:43 | 
| stekern | +but | 04:43 | 
| stekern | afaik, it's free of charge to download | 04:43 | 
| pgavin | ah, ok | 04:43 | 
| pgavin | cool | 04:43 | 
| stekern | at least, I haven't paid them anything ;) | 04:43 |
| pgavin | I'll use this then | 04:43 | 
| stekern | let me give you the or1k port | 04:44 | 
| pgavin | ok | 04:44 | 
| stekern | http://oompa.chokladfabriken.org/openrisc/or1k.tar.gz | 04:46 | 
| pgavin | thx | 04:46 | 
| stekern | what does that "pipeline" description actually model? | 04:53 | 
| stekern | there's that comment above them that says: "I think this is all incorrect for the OR1K. The latency says when the result will be ready, not how long the pipeline takes to execute." | 04:54 | 
| stekern | and why is the "alu_unit" assigned 2? | 04:56 | 
| pgavin | the latency is the default number of cycles for a true dependency on the result | 04:56 | 
| pgavin | not sure | 04:56 | 
| pgavin | I tried setting it to 1 | 04:56 | 
| pgavin | don't recall if it helped now | 04:56 | 
| pgavin | I think it did better | 04:56 | 
| pgavin | but removing the whole description was the best | 04:57 | 
| stekern | what do other archs do with that? | 04:57 |
| stekern | I'm all for removing it, if we should have one, it should be implementation specific ones, not arch specific | 04:58 | 
| stekern | especially if removing it makes better results ;) | 04:58 | 
| pgavin | the mips arch does similar stuff | 05:02 | 
| pgavin | I took a look at it | 05:02 | 
| pgavin | I thought it was a straightforward thing to do lol | 05:02 | 
| pgavin | these are the docs for it: https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html | 05:03 | 
| pgavin | yeah, I agree that it's ok for the default to assume 1 cycle/instruction and essentially optimize for size | 05:05 | 
| pgavin | but would be nice to teach it a bit about the implementations we have | 05:05 | 
| stekern | yeah, it kind of sounds crazy that removing it would *increase* performance... | 05:05 | 
| pgavin | yeah, that's what I thought lol | 05:05 | 
| stekern | unless some of our instructions are marked wrongly | 05:07 | 
| stekern | but I guess your disasm comparisons should have revealed that | 05:07 | 
| pgavin | the changes to the generated code were more significant the more complex my description was | 05:09 | 
| pgavin | which I suppose makes sense | 05:09 | 
| pgavin | in some cases it was hard to figure out what it was doing | 05:09 | 
| pgavin | so I tried making tiny changes, and they looked ok on paper I guess | 05:09 | 
| pgavin | but the result was slower | 05:09 | 
| stekern | weird... | 05:10 | 
| pgavin | maybe I should try some smaller benchmarks | 05:11 | 
| pgavin | like, a few nested loops or something | 05:12 | 
| stekern | what does the *3 in this do? or1k_alu*3 | 05:18 | 
| pgavin | it means it spends 3 cycles in the alu | 05:19 | 
| pgavin | if you have unit1+unit2 it means it spends a cycle in both of those units | 05:19 | 
| pgavin | and unit1,unit2 means a cycle in unit1, then a cycle in unit2 | 05:19 | 
| stekern | ah, ok | 05:19 | 
| pgavin | and "+" binds tighter than "," | 05:20 |
| pgavin | and parens are allowed | 05:20 | 
| stekern | so that *3 is a lie, at least for mor1kx cappuccino | 05:20 | 
| pgavin | yes | 05:21 | 
| pgavin | the current description is definitely wrong imo | 05:21 | 
| pgavin | which is why I guess it helped when I removed it | 05:21 | 
| stekern | I think they roughly describe or1200 with a serial multiplier... | 05:22 |
| pgavin | ok | 05:22 | 
| stekern | if not removing them, at least making them match something a "normal" pipeline would do is probably better | 05:22 | 
| pgavin | does the or1200 need 2 cycles for alu ops? | 05:22 | 
| pgavin | yeah, I tried that | 05:23 | 
| pgavin | it was worse | 05:23 | 
| pgavin | and I added some bypasses, which didn't seem to help | 05:23 | 
| stekern | worse than what's there now, or worse than removing it altogether (or both)? | 05:23 |
| pgavin | worse than both | 05:23 | 
| stekern | haha, wut??? | 05:23 | 
| pgavin | yeah, adding a description of what you'd think a standard or1k pipe would do made it worse | 05:24 | 
| pgavin | maybe my description was just bad | 05:24 | 
| pgavin | I don't know | 05:24 | 
| stekern | sounds crazy... | 05:25 | 
| stekern | jeremypbennett: I think we could need some expert input on this, do you think you could poke Joern if he'd have a split second to spare? | 05:25 | 
| pgavin | this is more-or-less what I tried: http://pastie.org/private/bygoojtfhlksdpsty0egka | 05:31 | 
| pgavin | which was worse | 05:31 | 
| pgavin | oh, I tried it without the bypass, too | 05:32 | 
| stekern | yeah, I figured you had | 05:34 | 
| pgavin | I tried it with some adjustments to the load/store stuff too | 05:34 | 
| pgavin | none of the versions I tried actually helped | 05:34 | 
| stekern | I at least would have thought that what you had in your paste would have been better... | 05:36 | 
| pgavin | based on the description in gccint.info, yes | 05:36 | 
| pgavin | and I had some other stuff for branches | 05:36 | 
| pgavin | I added a "flag" unit that was written by setflag, and read by branches | 05:36 | 
| pgavin | that's how I made it split the two up | 05:36 | 
| stekern | I reckon you tried mucking about with the mul_class value as well? | 05:37 | 
| pgavin | yes | 05:37 | 
| pgavin | I tried 1 cycle and 2 cycles | 05:37 | 
| pgavin | my attempts weren't very scientific I'll admit | 05:37 | 
| stekern | to be really correct, the div and mul should be divided in two separate definitions | 05:37 |
| pgavin | I didn't keep good track of what effect each change had :) | 05:38 | 
| pgavin | yes, that's true | 05:38 | 
| stekern | and in the mor1kx case, the div is 32 cycles | 05:38 | 
| pgavin | I don't think the code I ran had divide in it | 05:38 |
| pgavin | just mul | 05:38 | 
| stekern | ok, and in either case, the div instructions are marked as mul anyway, so it's not a change from the original desc | 05:39 | 
| stekern | what would be interesting to do would be to check if it's some particular change that makes it worse. But it's still just weird that it's worse.. | 05:43 | 
| pgavin | messing with the load/store had the biggest effect | 05:54 | 
| stekern | hmm, ok | 06:02 | 
| stekern | maybe the mor1kx is a bit uncommon, with the storebuffer being large and dumb | 06:03 | 
| stekern | ideally, with that setup you want loads being moved away as far as possible from stores | 06:04 | 
| stekern | and long sequences of stores are virtually of no cost | 06:04 | 
| stekern | no *extra* cost, I mean | 06:04 | 
| stekern | and by far away, I mean far away *after* the stores | 06:05 | 
| pgavin | hmm | 06:07 | 
| pgavin | good point | 06:07 | 
| olofk | pgavin: Regarding statistics in the simulator, I have been kickin' around an idea of dumping CPU state (GPRs, PC, etc) in an Sqlite db via VPI. Should be pretty easy, but haven't had time to try it out | 06:46 | 
| olofk | The thought came up when I tried to boot Linux in Icarus and got several GB of plaintext logs after a few hours | 06:47 |
| olofk | Having it in sqlite or something similar would enable a lot more post processing analysis as well | 06:48 | 
| olofk | It shouldn't be harder than creating a VPI wrapper around some sqlite functions, bake it into a core and hook it up in *or1k*-monitor instead of the $write and $display statements | 06:49 | 
| olofk | </brain dump> | 06:50 | 
| LoneTech | olofk: might be better to use $write into a pipe; VPI itself is very inefficient in some simulators | 07:01 | 
| olofk | LoneTech: really? Hmm.. I didn't know that. Maybe I should stop trying to accelerate too much stuff with VPI then | 07:03 | 
| LoneTech | it could be another architectural fault, of course, but I remember the vpi debug interface for orpsoc eating all the cpu it could | 07:04 | 
| olofk | LoneTech: That could be that it was looping waiting for data | 07:05 | 
| olofk | Not sure, but I also remember seeing that | 07:05 | 
| olofk | stekern: I closed bug #25 | 07:09 | 
| stekern | thanks | 07:11 | 
| olofk | LoneTech: Btw, I think that at least Icarus implements the verilog syscalls in a VPI module, so the overhead would be similar | 07:17 | 
| LoneTech | bbl, going to office | 07:23 | 
| stekern | https://ssl.serverraum.org/lists-archive/devel/2014-May/003745.html | 08:28 | 
| LoneTech | olofk: I think we can pretty safely reassign the blame for cpu usage on the amazing code structure of the orpsoc vpi debug thingy. it does things like emulating a blocking read in get_rsp_char | 12:14 | 
| LoneTech | though the way to feed external events into vpi isn't very neat either (looks like it's sim-driven polling or OS signals) | 12:19 | 
| LoneTech | hm. I can reduce the PLT scaling to two words per entry, at the extra cost of one jump per PLT call. very dependent on cache and call pattern if that's a gain | 15:24 | 
| LoneTech | I'm leaning towards no. the extra jump causes another delay slot, which can't do much useful, and might cost us an icache collision we don't need. | 15:29 |
| _franck_ | (memcpy) should we have an arch specific memcpy then ? looks like most (all ?) of the Linux archs that have a memcpy do it in assembler | 16:31 |
| _franck_ | dalias: (memcpy) good to know. If I had to make an asm memcpy, I would compile it down and tweak it by hand like pgavin said | 16:35 |
| _franck_ | well, memcpy compiled with -O3 gives me 387 lines of assembly code... I hope there is some room for optimization | 16:44 |
| blueCmd | stekern: where is my daily commit!? | 17:50 | 
| blueCmd | olofk: I glanced over the bugs you listed (or rather what is under binutils in bugzilla) and I couldn't find any where it was 100% obvious that they are fixed | 17:58 |
| stekern | blueCmd: you poor thing, I didn't feed you one all day yesterday! | 18:24 |
| stekern | hopefully I'll get one together tonight | 18:25 | 
| stekern | I'm still poking at the spinlocks and I've decided on ticket spinlocks, right now I'm in the middle of stealing code from the arm port ;) | 18:26 | 
| blueCmd | yes, the arm port is good to steal from | 18:29 | 
| stekern | I usually like to glance at the arc port | 18:34 | 
| stekern | arm is often too obfuscated by all the different cpu options | 18:35 | 
| stekern | ...but I'm a slow stealer, since I'm not fluent in arm asm :( | 19:08 | 
| stekern | hmm, who can come up with the most optimal way to do: if ((var >> 16) == (var & 0xffff)) | 19:17 | 
| stekern | in or1k asm | 19:17 | 
| stekern | let's assume var is in r3 | 19:19 | 
| pgavin | l.srli r1,r3,16; l.andi r2,r3,0xffff; l.sfeq r1,r2 | 19:19 | 
| stekern | l.rori r4, r3, 16 | 19:19 | 
| stekern | l.sfeq r4,r3 | 19:20 | 
| pgavin | true | 19:20 | 
| stekern | but I guess we can't count on l.rori | 19:20 | 
| stekern | so, yours is probably the best we can do with class 1 insn | 19:21 | 
| _franck_ | I don't understand something in the calling of "void *memcpy(void *restrict dest, const void *restrict src, size_t n)" | 19:42 | 
| _franck_ | function parameters are r3, r4 and r5 right ? | 19:42 | 
| stekern | mmm | 19:42 | 
| _franck_ | it's not, but as I'm reading the ABI (barely for the first time) it says it should be | 19:44 |
| stekern | what is not? | 19:44 | 
| _franck_ | because when I'm reading the asm code, I understand that r6 is tested right away and it's dest (or src, didn't look further) | 19:45 |
| _franck_ | http://pastie.org/private/b10eogsdkimjnoxfaq7kw | 19:46 | 
| stekern | r6 = src & 0x3 | 19:47 | 
| olofk | stekern: Nice to hear about MiSoC | 19:47 | 
| _franck_ | I should start learning asm ;) How does juliusb say it? | 19:47 |
| * _franck_ slaps his fronthead | 19:48 | |
| stekern | I think it's forehead, but close enough ;) | 19:48 | 
| _franck_ | I wasn't sure but you got it :) | 19:49 | 
| olofk | Now I know what it feels to be slashdotted. It must just have been pure luck that Google's blog servers could handle all the nearly 3000 visitors I had yesterday | 19:53 | 
| olofk | I tried to test the code in bug #62. N00b question of today: Why do I get "Error: unresolved expression that must be resolved" | 20:47 | 
| olofk | Or is that the bug? :) | 20:51 | 
| juliusb | I can feel another forehead slap coming up... but I'll ask anyway | 21:16 | 
| juliusb | where is the memory map defined for the systems in fusesoc? | 21:16 | 
| juliusb | ah ok, not so obvious (so not worthy of a slap this time), in <system>/data/wb_intercon.conf | 21:19 |
| juliusb | So, any chance I could get an indication of how mature the de0_nano board in orpsoc-cores is? | 21:20 | 
| juliusb | ... and which software runs on it? u-boot? | 22:45 | 
| juliusb | rather, I know u-boot probably can run on it, but is there stuff which would be easy to roll out in a workshop which *just works* on that de0_nano build in orpsoc-cores? | 22:47 | 
| --- Log closed Fri May 16 00:00:50 2014 | ||
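
The "word shuffling for alignment" that dalias mentions at 00:46, and the 8-, 16- and 24-bit shifts from 01:13, can be illustrated roughly as follows. This is only a sketch under stated assumptions, not musl's actual memcpy: the names `copy_shifted_words` and `host_is_little_endian` are hypothetical, the caller is assumed to have already handled the unaligned head and tail bytes, and the source array is assumed to hold one word more than is copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 on big-endian. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/*
 * Hypothetical sketch of the inner loop of a word-shuffling copy.
 * Copies nwords 32-bit words that start byte_off (1..3) bytes into the
 * aligned word array src. It reads src[0] .. src[nwords] (one extra word)
 * and performs only aligned word loads and stores.
 */
static void copy_shifted_words(uint32_t *dst, const uint32_t *src,
                               unsigned byte_off, size_t nwords)
{
    assert(byte_off >= 1 && byte_off <= 3);

    const unsigned lo = 8u * byte_off;  /* 8, 16 or 24 */
    const unsigned hi = 32u - lo;       /* 24, 16 or 8 */
    const int le = host_is_little_endian();
    uint32_t w = src[0];                /* word carried between iterations */

    for (size_t i = 0; i < nwords; i++) {
        uint32_t x = src[i + 1];
        /* Merge the tail of the previous word with the head of the next. */
        dst[i] = le ? (w >> lo) | (x << hi)
                    : (w << lo) | (x >> hi);
        w = x;
    }
}
```

On a target without a barrel shifter, each of those multi-bit shifts is what would expand into the unrolled single-bit shift sequences described at 01:13.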
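
The two or1k sequences proposed at 19:19 for `if ((var >> 16) == (var & 0xffff))` can be checked against each other in C. This is a minimal sketch: the function names are hypothetical, and `rotr32` merely stands in for what `l.rori` would compute on implementations that provide it.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-mask form, mirroring l.srli / l.andi / l.sfeq. */
static int halves_equal_shift_mask(uint32_t var)
{
    return (var >> 16) == (var & 0xffffu);
}

/* Rotate right by n bits; valid for 0 < n < 32. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x >> n) | (x << (32u - n));
}

/* Rotate form, mirroring l.rori / l.sfeq: swapping the two halfwords
 * leaves the value unchanged exactly when the halves are equal. */
static int halves_equal_rotate(uint32_t var)
{
    return rotr32(var, 16) == var;
}
```

Both functions return 1 for a value such as `0x12341234` and 0 for `0x12345678`; the rotate form saves one instruction but, as noted in the log, relies on `l.rori` being available.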