IRC logs for #openrisc Thursday, 2014-05-15

--- Log opened Thu May 15 00:00:49 2014
poke53281	Yes, it's linked with libc as far as I know.	00:02
dalias	_franck_, fyi, i have a pretty generic high-performance "C" implementation of memcpy that does the word shuffling for alignment, as part of musl libc	00:46
dalias	it's modeled after the way you'd do it in asm for most risc isa's, and is competitive with the asm implementations for most (iirc it's something like 80-90% of the speed of android's memcpy.s on arm)	00:47
pgavin	dalias: I suppose it could be compiled down and tweaked by hand	01:07
dalias	yes	01:09
dalias	i suspect it's closer to optimal on other archs	01:11
dalias	iirc it was the compiler's failure to generate huge ldm/stm's that kept it from being optimal on arm	01:12
dalias	amusingly it's pathologically bad on microblaze in the default gcc config	01:12
dalias	because minimal microblaze lacks a barrel shifter and gcc defaults to not generate shift instructions except <<1 and >>1	01:13
dalias	so it generates unrolled >>1 and <<1 loops for the 8-, 16-, and 24- shifts :-p	01:13
pgavin	lol	01:29
pgavin	yeah, I figure the worst thing there will be in the gcc output is some stack references that can be removed	01:29
pgavin	gcc does pretty well	01:30
pgavin	still some problems I'm trying to figure out :)	01:30
stekern	_franck_: Linux doesn't use libgcc for memcpy in the generic case: http://lxr.free-electrons.com/source/lib/string.c#L589	03:29
stekern	even though that one isn't efficient, as dalias said, you can make a pretty efficient one in C	03:29
stekern	but, in the kernel, there are copy_to/from_user, that have the exception table entries in them, so they kind of have to be in asm	03:32
stekern	and if you have written an efficient copy_to/from_user (we haven't), then you basically have an efficient memcpy too	03:33
stekern	(for reference, here's our copy_to/from_user: http://lxr.free-electrons.com/source/arch/openrisc/lib/string.S#L36)	03:34
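The word-shuffling approach dalias describes at the top of the log can be modelled in a few lines. What follows is a minimal Python sketch, not musl's code: it assumes a little-endian 32-bit machine (OpenRISC itself is big-endian, which would swap the shift directions), and the names `load_word`, `store_word` and `shuffled_copy` are made up for illustration. The bulk loop touches only aligned words and builds each output word from two neighbouring input words, using shifts of 8, 16 or 24 bits.

```python
WORD = 4            # bytes per machine word (assumed 32-bit)
MASK = 0xFFFFFFFF


def load_word(mem, addr):
    """Aligned little-endian word load from a bytearray 'memory'."""
    assert addr % WORD == 0
    return int.from_bytes(mem[addr:addr + WORD], "little")


def store_word(mem, addr, value):
    """Aligned little-endian word store; the value is truncated to 32 bits."""
    assert addr % WORD == 0
    mem[addr:addr + WORD] = (value & MASK).to_bytes(WORD, "little")


def shuffled_copy(mem, dst, src, n):
    """Copy n bytes from src to dst inside mem (regions must not overlap)."""
    # Head: byte copies until the destination is word aligned.
    while n and dst % WORD:
        mem[dst] = mem[src]
        dst, src, n = dst + 1, src + 1, n - 1

    skew = src % WORD
    if skew == 0:
        # Source and destination are mutually aligned: plain word copy.
        while n >= WORD:
            store_word(mem, dst, load_word(mem, src))
            dst, src, n = dst + WORD, src + WORD, n - WORD
    else:
        # Source is off by 1..3 bytes: combine two aligned words per store.
        shift = 8 * skew                 # 8, 16 or 24
        base = src - skew                # aligned address at or below src
        cur = load_word(mem, base)
        # Stop early so the look-ahead load never leaves the source range.
        while n >= 2 * WORD:
            base += WORD
            nxt = load_word(mem, base)
            store_word(mem, dst, (cur >> shift) | (nxt << (32 - shift)))
            cur = nxt
            dst, src, n = dst + WORD, src + WORD, n - WORD

    # Tail: whatever is left goes byte by byte.
    while n:
        mem[dst] = mem[src]
        dst, src, n = dst + 1, src + 1, n - 1
```

The three possible shift amounts are the ones that become unrolled single-bit shift loops on a target without a barrel shifter, as in the microblaze case mentioned earlier.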
pgavin	stekern: this is strange... I've removed the entire pipeline description from the or1k.md file and the code performs better	04:10
pgavin	every time I add more info to the description it gets worse	04:10
pgavin	I'm tempted to commit a patch that removes the description that's there	04:12
stekern	funny...	04:24
stekern	maybe our pipeline implementation in mor1kx is all backwards ;)	04:25
pgavin	yeah, I don't know	04:25
pgavin	it's annoying	04:25
pgavin	so for or1k-headers, I'm going to write an xml file that describes the SPRs, and a python script to create a header from it	04:26
stekern	do you have an idea what changes with the pipeline description that makes it worse?	04:27
pgavin	then if other formats are needed (verilog) it's easy to port	04:27
pgavin	the pipeline description seems to make the code bigger, for one	04:27
pgavin	that's probably the biggest thing	04:27
pgavin	but it even executes fewer instructions in or1ksim	04:28
pgavin	not many fewer, but some	04:28
stekern	I think that's the first good use of xml I've ever heard ;)	04:28
pgavin	lol	04:28
pgavin	I have a similar thing for another cpu I'm working on	04:28
pgavin	part of my dissertation :)	04:28
stekern	usually when somebody mentions xml, I go 'ick', but for this... I think that makes sense ;)	04:29
pgavin	yeah, it worked out well for that one	04:29
pgavin	so I figured it would be ok here too :)	04:29
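pgavin's plan of describing the SPRs in XML and generating a header with a Python script could look roughly like this. It is only a sketch under stated assumptions: the XML layout (`<sprs>`, `<group>`, `<spr>`), the `SPR_<GROUP>_<NAME>` macro naming and the include guard are hypothetical and not taken from or1k-headers. The address is assumed to be the group number shifted left by 11 bits, OR-ed with the register index, and the few registers listed are examples that should be checked against the architecture manual.

```python
import xml.etree.ElementTree as ET

# Hypothetical description file; the schema is invented for this sketch.
SPR_XML = """
<sprs>
  <group name="SYS" id="0">
    <spr name="VR" index="0"/>
    <spr name="UPR" index="1"/>
    <spr name="SR" index="17"/>
  </group>
  <group name="TT" id="10">
    <spr name="TTMR" index="0"/>
    <spr name="TTCR" index="1"/>
  </group>
</sprs>
"""


def generate_header(xml_text, guard="OR1K_SPRS_SKETCH_H"):
    """Turn the XML description into the text of a C header."""
    root = ET.fromstring(xml_text)
    lines = [f"#ifndef {guard}", f"#define {guard}", ""]
    for group in root.findall("group"):
        group_id = int(group.get("id"))
        for spr in group.findall("spr"):
            # Assumed encoding: group in the upper bits, index in the low 11.
            address = (group_id << 11) | int(spr.get("index"))
            name = f"SPR_{group.get('name')}_{spr.get('name')}"
            lines.append(f"#define {name} 0x{address:04x}")
    lines += ["", f"#endif /* {guard} */"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_header(SPR_XML), end="")
```

Because the XML is the single source of truth, a second emitter (for Verilog `localparam`s, say) would only need another loop over the same tree.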
stekern	hmm, so have you tried increasing cache sizes?	04:31
stekern	what kind of benchmarks do you run?	04:31
pgavin	I just have 2 really	04:33
pgavin	I got frustrated after trying two of them	04:33
pgavin	maybe I should do more	04:34
pgavin	one is qsort on an array of strings and one is crc	04:34
pgavin	I figured both should be easy to optimize	04:34
pgavin	I didn't increase the cache size	04:34
pgavin	but I don't think that should matter	04:34
pgavin	the datasets are bigger than the cache could ever be in any case	04:35
stekern	ok, but that should be unrelated to icache	04:35
pgavin	true	04:36
pgavin	both algorithms should fit in the cache	04:36
stekern	I'm just thinking if it's just related to code size growth	04:36
pgavin	I don't think so	04:36
stekern	or is the actual code larger and slower	04:36
pgavin	well I mean I don't think the code is growing to where it's bigger than the cache	04:36
stekern	ok	04:37
pgavin	it's only growing by a few hundred bytes in the worst case	04:37
stekern	larger code of course means slower code, even if it fits in cache	04:37
pgavin	yes	04:37
pgavin	well maybe 200 bytes is enough	04:37
pgavin	any way to enable statistics on the simulator? other than cycles I mean	04:38
stekern	that's a lot if the original code was 10 bytes ;)	04:38
pgavin	indeed :)	04:38
pgavin	but I diffed the assembly output	04:38
stekern	I guess it wasn't though	04:38
pgavin	and it didn't look unreasonable	04:38
pgavin	even the simplest description seems to make things worse though	04:39
stekern	interesting	04:39
stekern	I've been running coremark to benchmark mor1kx a lot, but I've been as lazy as you when it comes to diversity in the benchmarks	04:40
pgavin	well it seems to me that my optimization shouldn't make something as common as qsort perform worse	04:41
pgavin	and if it does, I don't want to commit it	04:41
stekern	coremark's data sets and code fit in around 8KB of cache	04:41
stekern	yeah, I agree	04:42
pgavin	is coremark freely available?	04:42
pgavin	ah, eembc	04:43
pgavin	have to pay	04:43
stekern	yes, you need to download the 'core' code from eembc	04:43
stekern	+but	04:43
stekern	afaik, it's free of charge to download	04:43
pgavin	ah, ok	04:43
pgavin	cool	04:43
stekern	at least, I haven't paid them anything ;)	04:43
pgavin	I'll use this then	04:43
stekern	let me give you the or1k port	04:44
pgavin	ok	04:44
stekern	http://oompa.chokladfabriken.org/openrisc/or1k.tar.gz	04:46
pgavin	thx	04:46
stekern	what does that "pipeline" description actually model?	04:53
stekern	there's that comment above them that says: "I think this is all incorrect for the OR1K. The latency says when the result will be ready, not how long the pipeline takes to execute."	04:54
stekern	and why is the "alu_unit" assigned 2?	04:56
pgavin	the latency is the default number of cycles for a true dependency on the result	04:56
pgavin	not sure	04:56
pgavin	I tried setting it to 1	04:56
pgavin	don't recall if it helped now	04:56
pgavin	I think it did better	04:56
pgavin	but removing the whole description was the best	04:57
stekern	what do other archs do with that?	04:57
stekern	I'm all for removing it, if we should have one, it should be implementation specific ones, not arch specific	04:58
stekern	especially if removing it makes better results ;)	04:58
pgavin	the mips arch does similar stuff	05:02
pgavin	I took a look at it	05:02
pgavin	I thought it was a straightforward thing to do lol	05:02
pgavin	these are the docs for it: https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html	05:03
pgavin	yeah, I agree that it's ok for the default to assume 1 cycle/instruction and essentially optimize for size	05:05
pgavin	but would be nice to teach it a bit about the implementations we have	05:05
stekern	yeah, it kind of sounds crazy that removing it would increase performance...	05:05
pgavin	yeah, that's what I thought lol	05:05
stekern	unless some of our instructions are marked wrongly	05:07
stekern	but I guess your disasm comparisons should have revealed that	05:07
pgavin	the changes to the generated code were more significant the more complex my description was	05:09
pgavin	which I suppose makes sense	05:09
pgavin	in some cases it was hard to figure out what it was doing	05:09
pgavin	so I tried making tiny changes, and they looked ok on paper I guess	05:09
pgavin	but the result was slower	05:09
stekern	weird...	05:10
pgavin	maybe I should try some smaller benchmarks	05:11
pgavin	like, a few nested loops or something	05:12
stekern	what does the 3 in this do? or1k_alu3	05:18
pgavin	it means it spends 3 cycles in the alu	05:19
pgavin	if you have unit1+unit2 it means it spends a cycle in both of those units	05:19
pgavin	and unit1,unit2 means a cycle in unit1, then a cycle in unit2	05:19
stekern	ah, ok	05:19
pgavin	and "," binds tighter than "+"	05:20
pgavin	and parens are allowed	05:20
stekern	so that *3 is a lie, at least for mor1kx cappuccino	05:20
pgavin	yes	05:21
pgavin	the current description is definitely wrong imo	05:21
pgavin	which is why I guess it helped when I removed it	05:21
stekern	I think they roughly describe or1200 with a serial multiplier...	05:22
pgavin	ok	05:22
stekern	if not removing them, at least making them match something a "normal" pipeline would do is probably better	05:22
pgavin	does the or1200 need 2 cycles for alu ops?	05:22
pgavin	yeah, I tried that	05:23
pgavin	it was worse	05:23
pgavin	and I added some bypasses, which didn't seem to help	05:23
stekern	worse than what's there now, or worse than removing it altogether (or both)?	05:23
pgavin	worse than both	05:23
stekern	haha, wut???	05:23
pgavin	yeah, adding a description of what you'd think a standard or1k pipe would do made it worse	05:24
pgavin	maybe my description was just bad	05:24
pgavin	I don't know	05:24
stekern	sounds crazy...	05:25
stekern	jeremypbennett: I think we could use some expert input on this, do you think you could poke Joern if he'd have a split second to spare?	05:25
pgavin	this is more-or-less what I tried: http://pastie.org/private/bygoojtfhlksdpsty0egka	05:31
pgavin	which was worse	05:31
pgavin	oh, I tried it without the bypass, too	05:32
stekern	yeah, I figured you had	05:34
pgavin	I tried it with some adjustments to the load/store stuff too	05:34
pgavin	none of the versions I tried actually helped	05:34
stekern	I at least would have thought that what you had in your paste would have been better...	05:36
pgavin	based on the description in gccint.info, yes	05:36
pgavin	and I had some other stuff for branches	05:36
pgavin	I added a "flag" unit that was written by setflag, and read by branches	05:36
pgavin	that's how I made it split the two up	05:36
stekern	I reckon you tried mucking about with the mul_class value as well?	05:37
pgavin	yes	05:37
pgavin	I tried 1 cycle and 2 cycles	05:37
pgavin	my attempts weren't very scientific I'll admit	05:37
stekern	to be really correct, the div and mul should be divided in two separate definitions	05:37
pgavin	I didn't keep good track of what effect each change had :)	05:38
pgavin	yes, that's true	05:38
stekern	and in the mor1kx case, the div is 32 cycles	05:38
pgavin	I don't think the code I ran had divide in it	05:38
pgavin	just mul	05:38
stekern	ok, and in either case, the div instructions are marked as mul anyway, so it's not a change from the original desc	05:39
stekern	what would be interesting to do would be to check if it's some particular change that makes it worse. But it's still just weird that it's worse..	05:43
pgavin	messing with the load/store had the biggest effect	05:54
stekern	hmm, ok	06:02
stekern	maybe the mor1kx is a bit uncommon, with the storebuffer being large and dumb	06:03
stekern	ideally, with that setup you want loads being moved away as far as possible from stores	06:04
stekern	and long sequences of stores are virtually of no cost	06:04
stekern	no extra cost, I mean	06:04
stekern	and by far away, I mean far away after the stores	06:05
pgavin	hmm	06:07
pgavin	good point	06:07
olofk	pgavin: Regarding statistics in the simulator, I have been kickin' around an idea of dumping CPU state (GPRs, PC, etc) in an Sqlite db via VPI. Should be pretty easy, but haven't had time to try it out	06:46
olofk	The thought came up when I tried to boot Linux in Icarus and got several GB of plaintext logs after a few hours	06:47
olofk	Having it in sqlite or something similar would enable a lot more post processing analysis as well	06:48
olofk	It shouldn't be harder than creating a VPI wrapper around some sqlite functions, bake it into a core and hook it up in or1k-monitor instead of the $write and $display statements	06:49
olofk	</brain dump>	06:50
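The post-processing side of olofk's idea can be sketched without any VPI code. The example below uses Python's standard `sqlite3` module; the table layout and the function names `open_trace_db`, `record_step` and `hottest_pcs` are assumptions made for illustration, and the part that would actually feed rows from the simulator is left out.

```python
import sqlite3


def open_trace_db(path=":memory:"):
    """Create a trace database with one row per executed instruction."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE trace (step INTEGER PRIMARY KEY, pc INTEGER NOT NULL)"
    )
    db.execute(
        "CREATE TABLE gpr_write ("
        "step INTEGER NOT NULL, reg INTEGER NOT NULL, value INTEGER NOT NULL)"
    )
    return db


def record_step(db, step, pc, writes=()):
    """Store the PC for one step plus any (register, value) writes."""
    db.execute("INSERT INTO trace VALUES (?, ?)", (step, pc))
    db.executemany(
        "INSERT INTO gpr_write VALUES (?, ?, ?)",
        [(step, reg, value) for reg, value in writes],
    )


def hottest_pcs(db, limit=3):
    """Return (pc, hit count) pairs for the most frequently executed PCs."""
    query = (
        "SELECT pc, COUNT(*) AS hits FROM trace "
        "GROUP BY pc ORDER BY hits DESC, pc ASC LIMIT ?"
    )
    return db.execute(query, (limit,)).fetchall()
```

With the state in tables, questions such as "which addresses were executed most often" become a single query instead of a pass over gigabytes of text.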
LoneTech	olofk: might be better to use $write into a pipe; VPI itself is very inefficient in some simulators	07:01
olofk	LoneTech: really? Hmm.. I didn't know that. Maybe I should stop trying to accelerate too much stuff with VPI then	07:03
LoneTech	it could be another architectural fault, of course, but I remember the vpi debug interface for orpsoc eating all the cpu it could	07:04
olofk	LoneTech: That could be that it was looping waiting for data	07:05
olofk	Not sure, but I also remember seeing that	07:05
olofk	stekern: I closed bug #25	07:09
stekern	thanks	07:11
olofk	LoneTech: Btw, I think that at least Icarus implements the verilog syscalls in a VPI module, so the overhead would be similar	07:17
LoneTech	bbl, going to office	07:23
stekern	https://ssl.serverraum.org/lists-archive/devel/2014-May/003745.html	08:28
LoneTech	olofk: I think we can pretty safely reassign the blame for cpu usage on the amazing code structure of the orpsoc vpi debug thingy. it does things like emulating a blocking read in get_rsp_char	12:14
LoneTech	though the way to feed external events into vpi isn't very neat either (looks like it's sim-driven polling or OS signals)	12:19
LoneTech	hm. I can reduce the PLT scaling to two words per entry, at the extra cost of one jump per PLT call. very dependent on cache and call pattern if that's a gain	15:24
LoneTech	I'm leaning towards no. the extra jump causes another delay slot, which can't do much useful, and might cost us an icache collision we don't need.	15:29
_franck_	(memcpy) should we have an arch specific memcpy then ? looks like most (all ?) of the Linux archs that have a memcpy do it in assembler	16:31
_franck_	dalias: (memcpy) good to know. If I had to make an asm memcpy, I would compile it down and tweak it by hand like pgavin said	16:35
_franck_	well, memcpy compiled with -O3 gives me 387 lines of assembly code... I hope there is some room for optimization	16:44
blueCmd	stekern: where is my daily commit!?	17:50
blueCmd	olofk: I glanced over the bugs you listed (or rather what is under binutils in bugzilla) and I couldn't find any that were 100% obvious that they are fixed	17:58
stekern	blueCmd: you poor thing, I didn't feed you one all day yesterday!	18:24
stekern	hopefully I'll get one together tonight	18:25
stekern	I'm still poking at the spinlocks and I've decided on ticket spinlocks, right now I'm in the middle of stealing code from the arm port ;)	18:26
blueCmd	yes, the arm port is good to steal from	18:29
stekern	I usually like to glance at the arc port	18:34
stekern	arm is often too obfuscated by all the different cpu options	18:35
stekern	...but I'm a slow stealer, since I'm not fluent in arm asm :(	19:08
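For readers unfamiliar with the ticket spinlocks stekern mentions: each waiter takes a ticket number and spins until the "now serving" counter reaches it, which gives first-come, first-served ordering. The following is a user-space Python model of the idea, not kernel code. The class and method names are illustrative, and a `threading.Lock` stands in for the atomic fetch-and-add that a real port would get from the architecture.

```python
import threading
import time


class TicketLock:
    """Toy ticket lock: waiters are served strictly in ticket order."""

    def __init__(self):
        self._next = 0      # next ticket to hand out
        self._owner = 0     # ticket currently allowed into the lock
        self._atomic = threading.Lock()  # emulates an atomic fetch-and-add

    def acquire(self):
        with self._atomic:
            ticket = self._next
            self._next += 1
        # Spin until it is this ticket's turn; sleep(0) yields to other threads.
        while self._owner != ticket:
            time.sleep(0)
        return ticket

    def release(self):
        # Only the current holder runs this, so a plain increment suffices here.
        self._owner += 1


def hammer(lock, shared, rounds):
    """Increment a shared counter 'rounds' times under the lock."""
    for _ in range(rounds):
        lock.acquire()
        shared["count"] += 1
        lock.release()
```

The lock is free exactly when the two counters are equal, and every acquire/release pair advances both by one.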
stekern	hmm, who can come up with the most optimal way to do: if ((var >> 16) == (var & 0xffff))	19:17
stekern	in or1k asm	19:17
stekern	let's assume var is in r3	19:19
pgavin	l.srli r1,r3,16; l.andi r2,r3,0xffff; l.sfeq r1,r2	19:19
stekern	l.rori r4, r3, 16	19:19
stekern	l.sfeq r4,r3	19:20
pgavin	true	19:20
stekern	but I guess we can't count on l.rori	19:20
stekern	so, yours is probably the best we can do with class 1 insn	19:21
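The two instruction sequences above test the same condition: rotating a 32-bit word by 16 swaps its halves, so the word is unchanged exactly when the halves are equal. A small Python check of that equivalence, with illustrative function names and `var` treated as an unsigned 32-bit value:

```python
MASK32 = 0xFFFFFFFF


def halves_equal(var):
    """Shift-and-mask form: (var >> 16) == (var & 0xffff)."""
    var &= MASK32
    return (var >> 16) == (var & 0xFFFF)


def ror32(value, amount):
    """Rotate a 32-bit value right by 'amount' bits."""
    value &= MASK32
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & MASK32


def halves_equal_ror(var):
    """Rotate form: a word equals its 16-bit rotation iff both halves match."""
    return ror32(var, 16) == (var & MASK32)
```

The rotate form needs one instruction fewer, but as noted it depends on `l.rori` being implemented.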
_franck_	I don't understand something in the calling of "void *memcpy(void *restrict dest, const void *restrict src, size_t n)"	19:42
_franck_	function parameters are r3, r4 and r5 right ?	19:42
stekern	mmm	19:42
_franck_	it's not but as I'm reading the ABI (barely for the first time) it says it should	19:44
stekern	what is not?	19:44
_franck_	because when I'm reading the asm code, I understand that r6 is tested right away and it's dest (or src, didn't look further)	19:45
_franck_	http://pastie.org/private/b10eogsdkimjnoxfaq7kw	19:46
stekern	r6 = src & 0x3	19:47
olofk	stekern: Nice to hear about MiSoC	19:47
_franck_	I should start learning asm ;) How does juliusb say ?	19:47
* _franck_ slaps his fronthead		19:48
stekern	I think it's forehead, but close enough ;)	19:48
_franck_	I wasn't sure but you got it :)	19:49
olofk	Now I know what it feels like to be slashdotted. It must just have been pure luck that Google's blog servers could handle all the nearly 3000 visitors I had yesterday	19:53
olofk	I tried to test the code in bug #62. N00b question of today: Why do I get "Error: unresolved expression that must be resolved"	20:47
olofk	Or is that the bug? :)	20:51
juliusb	I can feel another forehead slap coming up... but I'll ask anyway	21:16
juliusb	where is the memory map defined for the systems in fusesoc?	21:16
juliusb	ah ok, not so obvious (so not worthy of a slap this time), in <system>/data/wb_intercon.conf	21:19
juliusb	So, any chance I could get an indication of how mature the de0_nano board in orpsoc-cores is?	21:20
juliusb	... and which software runs on it? u-boot?	22:45
juliusb	rather, I know u-boot probably can run on it, but is there stuff which would be easy to roll out in a workshop which just works on that de0_nano build in orpsoc-cores?	22:47
--- Log closed Fri May 16 00:00:50 2014
