IRC logs for #openrisc Thursday, 2014-05-15

--- Log opened Thu May 15 00:00:49 2014
00:02 <poke53281> Yes, it's linked with libc as far as I know.
00:46 <dalias> _franck_, fyi, i have a pretty generic high-performance "C" implementation of memcpy that does the word shuffling for alignment, as part of musl libc
00:47 <dalias> it's modeled after the way you'd do it in asm for most risc isa's, and is competitive with the asm implementations for most (iirc it's something like 80-90% of the speed of android's memcpy.s on arm)
01:07 <pgavin> dalias: I suppose it could be compiled down and tweaked by hand
01:11 <dalias> i suspect it's closer to optimal on other archs
01:12 <dalias> iirc it was the compiler's failure to generate huge ldm/stm's that kept it from being optimal on arm
01:12 <dalias> amusingly it's pathologically bad on microblaze in the default gcc config
01:13 <dalias> because minimal microblaze lacks a barrel shifter and gcc defaults to not generating shift instructions except <<1 and >>1
01:13 <dalias> so it generates unrolled >>1 and <<1 loops for the 8-, 16-, and 24-bit shifts :-p
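The word-shuffling alignment trick dalias describes can be sketched as a small Python model (this is an illustration of the idea, not musl's actual code; it shows the little-endian shift directions, whereas or1k is big-endian and the shifts swap):

```python
MASK32 = 0xFFFFFFFF

def load_word(buf, addr):
    """Aligned little-endian 32-bit load; addr must be word aligned."""
    assert addr % 4 == 0
    return int.from_bytes(buf[addr:addr + 4], "little")

def shuffle_copy(src, off, n):
    """Copy n bytes starting at src[off], but only ever doing *aligned*
    word loads from src, combining adjacent words with shifts -- the
    same trick an asm memcpy uses on a RISC ISA without unaligned loads."""
    out = bytearray()
    k = off % 4                    # source misalignment in bytes
    addr = off - k                 # aligned address of the first word
    lo = load_word(src, addr)
    addr += 4
    while n >= 4:
        hi = load_word(src, addr)  # read one aligned word ahead
        addr += 4
        if k:
            # low bytes come from 'lo', high bytes from the next word 'hi'
            word = ((lo >> (8 * k)) | (hi << (32 - 8 * k))) & MASK32
        else:
            word = lo
        out += word.to_bytes(4, "little")
        lo = hi
        n -= 4
    # tail: remaining 0-3 bytes copied bytewise
    out += src[off + len(out):off + len(out) + n]
    return bytes(out)
```

A real memcpy also word-aligns the destination first with a bytewise head loop; this sketch only shows the shift-combine inner loop, which is the part that needs a barrel shifter.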
01:29 <pgavin> yeah, I figure the worst thing there will be in the gcc output is some stack references that can be removed
01:30 <pgavin> gcc does pretty well
01:30 <pgavin> still some problems I'm trying to figure out :)
<stekern> _franck_: Linux doesn't use libgcc for memcpy in the generic case:
03:29 <stekern> even though that isn't, as dalias said, you can make a pretty efficient one in C
03:32 <stekern> but, in the kernel, there are copy_to/from_user, which have the exception table entries in them, so they kind of have to be in asm
03:33 <stekern> and if you have written an efficient copy_to/from_user (we haven't), then you basically have an efficient memcpy too
<stekern> (for reference, here's our copy_to/from_user:
04:10 <pgavin> stekern: this is strange... I've removed the entire pipeline description from the file and the code performs better
04:10 <pgavin> every time I add more info to the description it gets worse
04:12 <pgavin> I'm tempted to commit a patch that removes the description that's there
04:25 <stekern> maybe our pipeline implementation in mor1kx is all backwards ;)
04:25 <pgavin> yeah, I don't know
04:25 <pgavin> it's annoying
04:26 <pgavin> so for or1k-headers, I'm going to write an xml file that describes the sprs, and a python script to create a header from it
04:27 <stekern> do you have an idea what changes with the pipeline description that makes it worse?
04:27 <pgavin> then if other formats are needed (verilog) it's easy to port
04:27 <pgavin> the pipeline description seems to make the code bigger, for one
04:27 <pgavin> that's probably the biggest thing
04:28 <pgavin> but it even executes fewer instructions in or1ksim
04:28 <pgavin> not many fewer, but some
04:28 <stekern> I think that's the first good use of xml I've ever heard ;)
04:28 <pgavin> I have a similar thing for another cpu I'm working on
04:28 <pgavin> part of my dissertation :)
04:29 <stekern> usually when somebody mentions xml, I go 'ick', but for this... I think that makes sense ;)
04:29 <pgavin> yeah, it worked out well for that one
04:29 <pgavin> so I figured it would be ok here too :)
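The XML-to-header generator pgavin describes might look roughly like this. The XML schema, element names, and script structure here are invented for illustration (the real or1k-headers format may well differ); the SPR numbering, (group << 11) | index, is from the OpenRISC architecture spec:

```python
#!/usr/bin/env python3
"""Hypothetical sketch of an SPR-description generator: a small XML
file describing SPRs, turned into a C header. Schema is made up."""
import xml.etree.ElementTree as ET

SPR_XML = """
<sprs>
  <spr name="VR"  group="0" index="0"/>
  <spr name="UPR" group="0" index="1"/>
  <spr name="SR"  group="0" index="17"/>
</sprs>
"""

def gen_header(xml_text):
    root = ET.fromstring(xml_text)
    lines = ["/* Generated - do not edit */"]
    for spr in root.iter("spr"):
        group = int(spr.get("group"))
        index = int(spr.get("index"))
        # OR1K SPR numbers put the group in bits 15-11, index in 10-0
        lines.append("#define SPR_%s 0x%04x"
                     % (spr.get("name"), (group << 11) | index))
    return "\n".join(lines)

print(gen_header(SPR_XML))
```

Emitting a Verilog `localparam` list instead would just be another formatting function over the same parsed tree, which is the portability argument made above.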
04:31 <stekern> hmm, so have you tried increasing cache sizes?
04:31 <stekern> what kind of benchmarks do you run?
04:33 <pgavin> I just have 2 really
04:33 <pgavin> I got frustrated after trying two of them
04:34 <pgavin> maybe I should do more
04:34 <pgavin> one is qsort on an array of strings and one is crc
04:34 <pgavin> I figured both should be easy to optimize
04:34 <pgavin> I didn't increase the cache size
04:34 <pgavin> but I don't think that should matter
04:35 <pgavin> the datasets are bigger than the cache could ever be in any case
04:35 <stekern> ok, but that should be unrelated to icache
04:36 <pgavin> both algorithms should fit in the cache
04:36 <stekern> I'm just thinking if it's *just* related to code size growth
04:36 <pgavin> I don't think so
04:36 <stekern> or is the actual code larger *and* slower
04:36 <pgavin> well I mean I don't think the code is growing to where it's bigger than the cache
04:37 <pgavin> it's only growing by a few hundred bytes in the worst case
04:37 <stekern> larger code of course means slower code, even if it fits in cache
04:37 <pgavin> well maybe 200 bytes is enough
04:38 <pgavin> any way to enable statistics on the simulator? other than cycles I mean
04:38 <stekern> that's a lot if the original code was 10 bytes ;)
04:38 <pgavin> indeed :)
04:38 <pgavin> but I diffed the assembly output
04:38 <stekern> I guess it wasn't though
04:38 <pgavin> and it didn't look unreasonable
04:39 <pgavin> even the simplest description seems to make things worse though
04:40 <stekern> I've been running coremark to benchmark mor1kx a lot, but I've been as lazy as you when it comes to diversity in the benchmarks
04:41 <pgavin> well it seems to me that my optimization shouldn't make something as common as qsort perform worse
04:41 <pgavin> and if it does, I don't want to commit it
04:41 <stekern> coremark's data sets and code fit in around 8KB of cache
04:42 <stekern> yeah, I agree
04:42 <pgavin> is coremark freely available?
04:43 <pgavin> ah, eembc
04:43 <pgavin> have to pay
04:43 <stekern> yes, you need to download the 'core' code from eembc
04:43 <stekern> afaik, it's free of charge to download
04:43 <pgavin> ah, ok
04:43 <stekern> at least, I haven't paid them anything ;)
04:43 <pgavin> I'll use this then
04:44 <stekern> let me give you the or1k port
04:53 <stekern> what does that "pipeline" description actually model?
04:54 <stekern> there's that comment above them that says: "I think this is all incorrect for the OR1K. The latency says when the result will be ready, not how long the pipeline takes to execute."
04:56 <stekern> and why is the "alu_unit" assigned 2?
04:56 <pgavin> the latency is the default number of cycles for a true dependency on the result
04:56 <pgavin> not sure
04:56 <pgavin> I tried setting it to 1
04:56 <pgavin> don't recall if it helped now
04:56 <pgavin> I think it did better
04:57 <pgavin> but removing the whole description was the best
04:57 <stekern> what do other archs do with that?
04:58 <stekern> I'm all for removing it; if we should have one, it should be implementation specific ones, not arch specific
04:58 <stekern> especially if removing it gives better results ;)
05:02 <pgavin> the mips arch does similar stuff
05:02 <pgavin> I took a look at it
05:02 <pgavin> I thought it was a straightforward thing to do lol
<pgavin> these are the docs for it:
05:05 <pgavin> yeah, I agree that it's ok for the default to assume 1 cycle/instruction and essentially optimize for size
05:05 <pgavin> but it would be nice to teach it a bit about the implementations we have
05:05 <stekern> yeah, it kind of sounds crazy that removing it would *increase* performance...
05:05 <pgavin> yeah, that's what I thought lol
05:07 <stekern> unless some of our instructions are marked wrongly
05:07 <stekern> but I guess your disasm comparisons should have revealed that
05:09 <pgavin> the changes to the generated code were more significant the more complex my description was
05:09 <pgavin> which I suppose makes sense
05:09 <pgavin> in some cases it was hard to figure out what it was doing
05:09 <pgavin> so I tried making tiny changes, and they looked ok on paper I guess
05:09 <pgavin> but the result was slower
05:11 <pgavin> maybe I should try some smaller benchmarks
05:12 <pgavin> like, a few nested loops or something
05:18 <stekern> what does the *3 in this do? or1k_alu*3
05:19 <pgavin> it means it spends 3 cycles in the alu
05:19 <pgavin> if you have unit1+unit2 it means it spends a cycle in both of those units
05:19 <pgavin> and unit1,unit2 means a cycle in unit1, then a cycle in unit2
05:19 <stekern> ah, ok
05:20 <pgavin> and "," binds tighter than "+"
05:20 <pgavin> and parens are allowed
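For readers unfamiliar with the syntax being discussed: GCC pipeline descriptions are written with `define_insn_reservation` in the machine description. A hypothetical sketch (automaton, unit, and reservation names invented here for illustration; this is not the actual or1k.md):

```lisp
;; Hypothetical sketch of the DFA reservation syntax discussed above.
(define_automaton "sk_pipe")
(define_cpu_unit "sk_alu,sk_lsu" "sk_pipe")

;; ALU op: result ready after 1 cycle, occupies the ALU for 1 cycle
(define_insn_reservation "sk_alu_op" 1
  (eq_attr "type" "alu")
  "sk_alu")

;; load: a cycle in the ALU (address gen), then two cycles in the LSU
;; -- "," sequences cycles, "*2" repeats a unit
(define_insn_reservation "sk_load" 3
  (eq_attr "type" "load")
  "sk_alu,sk_lsu*2")

;; an insn that needs both units in the same cycle uses "+"
(define_insn_reservation "sk_two_units" 1
  (eq_attr "type" "multi")
  "sk_alu+sk_lsu")
```

The first number is the latency the scheduler assumes before a dependent instruction can use the result; the reservation string only describes unit occupancy.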
05:20 <stekern> so that *3 is a lie, at least for mor1kx cappuccino
05:21 <pgavin> the current description is definitely wrong imo
05:21 <pgavin> which is why I guess it helped when I removed it
05:22 <stekern> I think they roughly describe or1200 with a serial multiplier...
05:22 <stekern> if not removing them, at least making them match something a "normal" pipeline would do is probably better
05:22 <pgavin> does the or1200 need 2 cycles for alu ops?
05:23 <pgavin> yeah, I tried that
05:23 <pgavin> it was worse
05:23 <pgavin> and I added some bypasses, which didn't seem to help
05:23 <stekern> worse than what's there now, or worse than removing it altogether (or both)?
05:23 <pgavin> worse than both
05:23 <stekern> haha, wut???
05:24 <pgavin> yeah, adding a description of what you'd think a standard or1k pipe would do made it worse
05:24 <pgavin> maybe my description was just bad
05:24 <pgavin> I don't know
05:25 <stekern> sounds crazy...
05:25 <stekern> jeremypbennett: I think we could use some expert input on this, do you think you could poke Joern if he'd have a split second to spare?
<pgavin> this is more-or-less what I tried:
05:31 <pgavin> which was worse
05:32 <pgavin> oh, I tried it without the bypass, too
05:34 <stekern> yeah, I figured you had
05:34 <pgavin> I tried it with some adjustments to the load/store stuff too
05:34 <pgavin> none of the versions I tried actually helped
05:36 <stekern> I at least would have thought that what you had in your paste would have been better...
05:36 <pgavin> based on the description in, yes
05:36 <pgavin> and I had some other stuff for branches
05:36 <pgavin> I added a "flag" unit that was written by setflag, and read by branches
05:36 <pgavin> that's how I made it split the two up
05:37 <stekern> I reckon you tried mucking about with the mul_class value as well?
05:37 <pgavin> I tried 1 cycle and 2 cycles
05:37 <pgavin> my attempts weren't very scientific, I'll admit
05:37 <stekern> to be really correct, the div and mul should be divided into two separate definitions
05:38 <pgavin> I didn't keep good track of what effect each change had :)
05:38 <pgavin> yes, that's true
05:38 <stekern> and in the mor1kx case, the div is 32 cycles
05:38 <pgavin> I don't think the code I ran had divide in it
05:38 <pgavin> just mul
05:39 <stekern> ok, and in either case, the div instructions are marked as mul anyway, so it's not a change from the original desc
05:43 <stekern> what would be interesting to do would be to check if it's some particular change that makes it worse. But it's still just weird that it's worse..
05:54 <pgavin> messing with the load/store had the biggest effect
06:02 <stekern> hmm, ok
06:03 <stekern> maybe the mor1kx is a bit uncommon, with the store buffer being large and dumb
06:04 <stekern> ideally, with that setup you want loads being moved as far away as possible from stores
06:04 <stekern> and long sequences of stores are virtually of no cost
06:04 <stekern> no *extra* cost, I mean
06:05 <stekern> and by far away, I mean far away *after* the stores
06:07 <pgavin> good point
06:46 <olofk> pgavin: Regarding statistics in the simulator, I have been kickin' around an idea of dumping CPU state (GPRs, PC, etc) in an Sqlite db via VPI. Should be pretty easy, but haven't had time to try it out
06:47 <olofk> The thought came up when I tried to boot Linux in Icarus and got several GB of plaintext logs after a few hours
06:48 <olofk> Having it in sqlite or something similar would enable a lot more post processing analysis as well
06:49 <olofk> It shouldn't be harder than creating a VPI wrapper around some sqlite functions, bake it into a core and hook it up in *or1k*-monitor instead of the $write and $display statements
06:50 <olofk> </brain dump>
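The analysis-side payoff olofk is after could look something like this. The table schema and the trace rows here are invented for illustration; a real dump would come from the simulator's VPI side:

```python
import sqlite3

# Hypothetical schema: one row of CPU state per retired instruction,
# as a VPI trace module might dump it. Everything here is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trace (cycle INTEGER, pc INTEGER, insn INTEGER)")

# Fake a few trace rows: 32 cycles looping over 8 PCs
rows = [(c, 0x100 + 4 * (c % 8), 0x15000000) for c in range(32)]
conn.executemany("INSERT INTO trace VALUES (?, ?, ?)", rows)

# The advantage over plaintext logs: ad-hoc queries, e.g. hottest PCs
hot = conn.execute(
    "SELECT pc, COUNT(*) AS hits FROM trace "
    "GROUP BY pc ORDER BY hits DESC LIMIT 3").fetchall()
for pc, hits in hot:
    print("pc=0x%08x hits=%d" % (pc, hits))
```

The same query style extends to register-write histories, branch statistics, or diffing two runs, none of which is practical against gigabytes of `$display` output.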
07:01 <LoneTech> olofk: might be better to use $write into a pipe; VPI itself is very inefficient in some simulators
07:03 <olofk> LoneTech: really? Hmm.. I didn't know that. Maybe I should stop trying to accelerate too much stuff with VPI then
07:04 <LoneTech> it could be another architectural fault, of course, but I remember the vpi debug interface for orpsoc eating all the cpu it could
07:05 <olofk> LoneTech: It could be that it was looping waiting for data
07:05 <olofk> Not sure, but I also remember seeing that
07:09 <olofk> stekern: I closed bug #25
07:17 <olofk> LoneTech: Btw, I think that at least Icarus implements the verilog syscalls in a VPI module, so the overhead would be similar
07:23 <LoneTech> bbl, going to office
12:14 <LoneTech> olofk: I think we can pretty safely put the blame for cpu usage on the amazing code structure of the orpsoc vpi debug thingy. it does things like emulating a blocking read in get_rsp_char
12:19 <LoneTech> though the way to feed external events into vpi isn't very neat either (looks like it's sim-driven polling or OS signals)
15:24 <LoneTech> hm. I can reduce the PLT scaling to two words per entry, at the extra cost of one jump per PLT call. very dependent on cache and call pattern whether that's a gain
15:29 <LoneTech> I'm leaning towards no. the extra jump causes another delay slot, which can't do much useful, and might cost us an icache collision we don't need.
16:31 <_franck_> (memcpy) should we have an arch specific memcpy then? looks like most (all?) of the Linux archs that have a memcpy do it in assembler
16:35 <_franck_> dalias: (memcpy) good to know. If I had to make an asm memcpy, I would compile it down and tweak it by hand like pgavin said
16:44 <_franck_> well, memcpy compiled with -O3 gives me 387 lines of assembly code... I hope there is some room for optimization
17:50 <blueCmd> stekern: where is my daily commit!?
17:58 <blueCmd> olofk: I glanced over the bugs you listed (or rather what is under binutils in bugzilla) and I couldn't find any where it was 100% obvious that they are fixed
18:24 <stekern> blueCmd: you poor thing, I didn't feed you one all day yesterday!
18:25 <stekern> hopefully I'll get one together tonight
18:26 <stekern> I'm still poking at the spinlocks and I've decided on ticket spinlocks, right now I'm in the middle of stealing code from the arm port ;)
18:29 <blueCmd> yes, the arm port is good to steal from
18:34 <stekern> I usually like to glance at the arc port
18:35 <stekern> arm is often too obfuscated by all the different cpu options
19:08 <stekern> ...but I'm a slow stealer, since I'm not fluent in arm asm :(
19:17 <stekern> hmm, who can come up with the most optimal way to do: if ((var >> 16) == (var & 0xffff))
19:17 <stekern> in or1k asm
19:19 <stekern> let's assume var is in r3
19:19 <pgavin> l.srli r1,r3,16; l.andi r2,r3,0xffff; l.sfeq r1,r2
19:19 <stekern> l.rori r4, r3, 16
19:20 <stekern> l.sfeq r4,r3
19:20 <stekern> but I guess we can't count on l.rori
19:21 <stekern> so, yours is probably the best we can do with class 1 insns
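The equivalence behind stekern's l.rori trick is easy to sanity-check with a quick script, modeling both instruction sequences (this is a model, not or1k code): rotating a 32-bit value by 16 yields the original exactly when the two halves match.

```python
def ror32(x, n):
    """32-bit rotate right, as l.rori would compute it."""
    x &= 0xFFFFFFFF
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def halves_equal_shift(var):
    """pgavin's 3-insn version: shift, mask, compare."""
    return (var >> 16) == (var & 0xFFFF)

def halves_equal_ror(var):
    """stekern's 2-insn version: rotate by 16, compare with original."""
    return ror32(var, 16) == (var & 0xFFFFFFFF)

import random
random.seed(0)
samples = [0, 0xFFFFFFFF, 0x12341234, 0x12344321] + \
          [random.getrandbits(32) for _ in range(10000)]
assert all(halves_equal_shift(v) == halves_equal_ror(v) for v in samples)
print("rotate trick agrees on", len(samples), "samples")
```

The rotate version also saves a scratch register, which is why it is attractive when l.rori is available.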
19:42 <_franck_> I don't understand something in the calling of "void *memcpy(void *restrict dest, const void *restrict src, size_t n)"
19:42 <_franck_> function parameters are r3, r4 and r5, right?
19:44 <_franck_> it's not, but as I'm reading the ABI (barely for the first time) it says it should be
19:44 <stekern> what is not?
19:45 <_franck_> because when I'm reading the asm code, I understand that r6 is tested right away and it's dest (or src, didn't look further)
19:47 <stekern> r6 = src & 0x3
19:47 <olofk> stekern: Nice to hear about MiSoC
19:47 <_franck_> I should start learning asm ;) How does juliusb say?
19:48  * _franck_ slaps his fronthead
19:48 <stekern> I think it's forehead, but close enough ;)
19:49 <_franck_> I wasn't sure but you got it :)
19:53 <olofk> Now I know what it feels like to be slashdotted. It must just have been pure luck that Google's blog servers could handle all of the nearly 3000 visitors I had yesterday
20:47 <olofk> I tried to test the code in bug #62. N00b question of today: Why do I get "Error: unresolved expression that must be resolved"?
20:51 <olofk> Or is that the bug? :)
21:16 <juliusb> I can feel another forehead slap coming up... but I'll ask anyway
21:16 <juliusb> where is the memory map defined for the systems in fusesoc?
21:19 <juliusb> ah ok, not so obvious (so not worthy of a slap this time), in <system>/data/wb_intercon.conf
21:20 <juliusb> So, any chance I could get an indication of how mature the de0_nano board in orpsoc-cores is?
22:45 <juliusb> ... and which software runs on it? u-boot?
22:47 <juliusb> rather, I know u-boot can probably run on it, but is there stuff which would be easy to roll out in a workshop which *just works* on that de0_nano build in orpsoc-cores?
--- Log closed Fri May 16 00:00:50 2014

Generated by 2.15.2 by Marius Gedminas - find it at!