IRC logs for #openrisc Monday, 2014-11-03

--- Log opened Mon Nov 03 00:00:06 2014
poke53282	stekern: Your starfield works	02:59
poke53282	I don't have a palette based framebuffer, but you can see that it works.	03:07
poke53282	What I don't like is, that you can't see the parallelized cores with this demo.	03:07
poke53282	Stekern: I have added a mandelbrot calculator. It is slow. But it shows the parallelization very nicely.	04:16
poke53282	just start "mandelpar"	04:16
poke53282	and take this link: "s-macke.github.io/jor1k/index.html?cpu=smp&n=8"	04:22
stekern	poke53282: nice, do you have sources to it to share?	06:00
poke53282	http://pastie.org/9692650	06:05
poke53282	nothing special.	06:05
poke53282	Just a little bit searching and a few changes.	06:05
poke53282	speed is not important here. I wanted to have a long calculation in which every core has something to do.	06:06
stekern	yeah, I mostly just want to try to run it on real hw	06:10
stekern	I never looked at any openmp code before	06:12
poke53282	openmp is simple.	06:21
poke53282	with one pragma you can parallelize loops for example.	06:22
poke53282	I have parallelized most of my simulations with openmp. And often only with a few extra lines of openmp code.	06:23
poke53282	In principle only one line is important	06:24
poke53282	#pragma omp parallel for schedule(guided)	06:24
poke53282	the schedule is usually not necessary. But it gives this typical pattern on the screen.	06:24
poke53282	#pragma omp parallel	06:25
poke53282	is enough. Also for your starfield.	06:25
poke53282	for very complicated parallelization patterns like in the kernel it is not good.	06:26
stekern	yeah, it looked pretty straight forward from what I read on wikipedia	06:29
stekern	your mandelbrot needs libgomp, which I don't have though ;)	06:30
poke53282	Yes. remove --disable-libgomp	06:30
poke53282	from the gcc configuration	06:31
stekern	from?	06:31
poke53282	jor1k.com/packages/gcc.tar.bz2 in around one minute available	06:32
poke53282	now	06:33
stekern	ah, well I'm alredy building it here ;)	06:34
stekern	poke53282: hmm, mandelpar crashed with a bus error here	06:55
poke53282	mmap and pointer is a short int?	06:55
poke53282	no char	06:55
poke53282	But probably this is not the reason.	06:55
poke53282	this would lead to a segmentation fault I guess.	06:56
stekern	http://pastie.org/9692715	06:57
stekern	it's only finding 1 core too	06:57
poke53282	cat /sys/devices/system/cpu/online	06:58
poke53282	what is the output?	06:58
poke53282	the output makes no sense. If he is finding only one core the thread id should be always zero and not two.	06:59
stekern	0-3	06:59
stekern	do you have a compiled version of mandelpar I can try	06:59
poke53282	yes, in the emulator copy /usr/bin/mandelpar to ~	07:00
poke53282	then click on this little box under the terminal and you get a .tar file	07:00
stekern	unrelated, but why are the "SJK DEBUG" printouts related to the timer showing c=0?	07:02
poke53282	ahh, because the retiurn value of this timer is always zero. But writing to ttcr is ignored for the smp kernel.	07:03
poke53282	So ttcr is the same for each core.	07:03
poke53282	in my implementation.	07:03
poke53282	And it is increased like a global timer.	07:04
stekern	I see	07:06
poke53282	It's also a hack :)	07:06
poke53282	Neither better nor worse than yours.	07:06
poke53282	I ignore your hack and make my own. :)	07:07
stekern	yours is slightly cleaner, but would have been hard to make that work in hw	07:08
stekern	same thing happens with your mandelpar	07:09
poke53282	and if you exchange libgomp?	07:10
stekern	the "Using 1 cores" is in jor1k too	07:10
poke53282	Hmm, I remember, that this was correct at some point.	07:11
poke53282	But I have never really checked	07:11
poke53282	later	07:11
poke53282	yes, you are right. It shows always zero.	07:14
stekern	maybe it's just because I have 640x480-8	07:14
poke53282	omp_get_num_threads: Size of the active team	07:16
poke53282	I execute this function outside of the loop	07:16
poke53282	I should have used omp_get_max_threads	07:16
poke53282	So, the output is correct.	07:17
poke53282	Just executed the wrong function.	07:17
poke53282	at the wrong place.	07:17
poke53282	What you should get is a segmentation fault and not a bus error?	07:17
poke53282	But yeah, this is what I said before. I use a 16 bit framebuffer and don't check in the program.	07:18
poke53282	But this error shouldn't happen in line 0	07:18
stekern	does it start from line 0	07:19
stekern	that's what it says	07:19
poke53282	yes, so it writes at the beginning of the framebuffer. This should be Ok	07:20
poke53282	I hope you removed the -mhard-float	07:20
poke53282	in the compilation command.	07:20
stekern	doh...	07:21
stekern	how did I not see that? ;)	07:21
poke53282	It's time that you include the fp instructions.	07:24
stekern	yeah, I should test the withfpu branch anyway	07:25
poke53282	It's the 20th anniversary of the fdiv bug. Would be nice to see a reimplementation of this bug.	07:27
poke53282	;)	07:28
poke53282	sleeping time	07:33
stekern	night	07:36
olofk	stekern, poke53282 : What have I said about talking when I'm not here? It's very rude to make me start my morning by spending half an hour reading through the backlog	08:00
olofk	oh... and awesome work btw :)	08:00
stekern	;)	08:07
stekern	I think I've figured out why my in-kernel emulation of the atomic ops fails too	08:08
stekern	it's those darn fix-ups	08:08
stekern	i.e. when an in-kernel l.lwa tries to access a userspace area and fails	08:10
wallento	I have put the landing page on the main wiki page at opencores.org: http://opencores.org/or1k/Main_Page	09:44
wallento	we should fill it up now	09:44
wallento	;)	09:44
maxpaln	Hi, a while ago I remember having a discussion about improving the wishbone BFM transactor testbench (the one that fires random read/write traffic at a wishbone interface)	12:00
maxpaln	I want to implement many of these changes as part of the improvements to the DDR3 memory interface	12:01
maxpaln	I just wanted to check that no one has already done this before I duplicate effort	12:01
maxpaln	or rather, I wanted to check whether anyone has improved the wishbone transactor recently (since this summer).	12:02
olofk	maxpaln: No work has been done on that, but I'm very interested in improvements as well since I'm also working on some memory controller improvements	12:10
olofk	Those might be of interest for you as well actually	12:10
maxpaln	oh really?	12:16
maxpaln	The improvements I am planning to the transactor are along the lines of making the transactions completely random	12:17
maxpaln	I am thinking a good approach is to break the tests into a range of addresses, say 1000 (or even 10 its up to the user really).	12:18
maxpaln	at the start of the test do a burst write across the full range of addresses - this will initialise the wishbone memory interface and a local buffer to the same values	12:18
maxpaln	then allow the testbench to generate random read and write bursts to the memory.	12:19
maxpaln	all writes will be mirrored in the local buffer so that any reads should produce a match	12:19
maxpaln	the number of transactions per test (i.e. the number of transactions carried out on the current range of addresses) is probably something the user will define	12:20
maxpaln	but after that everything will be random: whether the transaction is read or write, what burt type, what length etc.	12:20
maxpaln	I'll also implement stekern's suggestion of random number of wait states between transactions (between 0 and a few).	12:21
maxpaln	this should make the transactor pretty representative of a real world system I think	12:21
olofk	maxpaln: That sounds all good, and I would very much like to have that	12:21
maxpaln	well, hopefully I'll have something in a few days - I have some time between other stuff this week	12:22
olofk	That would be awesome	12:22
olofk	Have you looked at wb_sdram_ctrl?	12:22
maxpaln	erm, I think so - is that the one with the mulitple ports	12:23
olofk	Yep	12:23
maxpaln	yeah, that was a neat implementation - but there was something about it that made it tricky for me to reuse.	12:23
olofk	and it has a small cache for values that are read back from RAM	12:23
maxpaln	ah, yeah - I remnember, that bit was very nice	12:23
olofk	What I'm doing now is pulling out the arbiter and cache stuff to a separate component with a standardized interface to the mem controller	12:24
maxpaln	Oooh, nice	12:24
olofk	The interface will be two write channels (one for command and address, and one for data and mask), and one return channel with data and address	12:24
maxpaln	cool	12:25
olofk	It will also be possible to have different widths for wishbone and the mem controller	12:25
maxpaln	neat	12:25
olofk	But I can't give you an ETA on that, so you should probably keep your single wishbone port for now	12:26
olofk	But once it's ready it probably shouldn't be too hard to just hook up your ddr controller to this interface	12:26
maxpaln	cool - no rush, my controller is working well at the moment. I am trying to simplify it a little at the moment before adding support for arbitrary burst lengths on incrementing bursts	12:29
maxpaln	but i've broken something in HW that works in Sim - hence the rework of the transactor	12:30
maxpaln	I'm bored of the logic analyser for now !	12:32
olofk	I know the feeling :)	12:32
poke53282	olofk: Yes, we were busy this weekend	16:30
poke53282	Hmm, interesting. when the ompic times out, the destination cpu is either at pc=0x500 (tick timer) or in an raw_spin_lock_irqsave.	17:10
poke53282	Let me guess, where this spinlock is :)	17:11
poke53282	stekern: Ok the following. ompic tries to raise an irq on cpus, which are in the irqsave spinlock of this device.	17:14
poke53282	ooops, I guess being in the main loop of raw_spin_lock_irq with disabled interrupts is wrong.	19:09
olofk	ehm... has anyone heard of the SYMPL FP3250 GPGPU before?	19:11
olofk	Google doesn't give me any info	19:11
poke53282	nope	19:11
olofk	I can't really make out if this is a serious thing or not	19:13
poke53282	If google can't find it doesn't exists.	19:14
poke53282	stekern: What I know is, that all the cores, that don't react are stalled somewhere in raw_spin_lock_irq or in raw_spin_lock_saveirq. Everytime with disabled interrupts. The functions with the spinlocks are mainly ompic_raise_softirq, rcu_process_callbacks and schedule.	20:33
stekern	poke53282: ok, but the one that is holding the lock should be releasing it at some point	21:07
stekern	unless it tries to send a second IPI to the same core before the first one has got through, but IIRC that shouldn't be possible	21:08
stekern	I could be wrong though	21:10
stekern	the IPI stuff is here: http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/kernel/smp.c?h=smp	21:16
poke53282	I wish I would know what's going on.	21:29
poke53282	The statistics are clear. Everytime the cpu which is called hangs in a spinlock	21:29
stekern	but obviously that is what is happening for you? one cpu is trying to send a second IPI to a core that hasn't yet received the first one?	21:32
stekern	maybe that's expected... that's problematic then	21:32
poke53282	By the way. You implemented the emulation of swa and lwa in the kernel with the virtual address and not the physical.	21:32
stekern	yes	21:33
stekern	that was intended, is that a problem?	21:33
poke53282	No, not really. My hardware implementation uses the physical address.	21:34
poke53282	You told me so.	21:34
poke53282	Everytime you send via ompic an irq signal to a cpu which has not acknowledged the previous request, I check the pc of the destination cpu	21:35
stekern	right, so does mine... maybe that is a problem	21:35
poke53282	And everytime, the pc shows me, that the cpu hangs in an spinlock.	21:36
poke53282	stekern: Not sure, But I don't use the emulation. So I don't care at the moment. Just wanted it to mention. But this is not the problem here.	21:36
stekern	...but only a theoretical one I think	21:37
poke53282	Ok, one question about the spinlock here: http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/include/asm/spinlock.h?h=smp#n32	21:38
stekern	the emulation isn't perfect anyway, a normal store to the same address will not break the link	21:38
poke53282	lock->slock should increase by (1<<TICKET_SHIFT) everytime a process enters the routine.	21:39
stekern	right	21:39
stekern	(emulation) the important check is that the flag is still set, the address compare only serve as a slight sanity check	21:40
poke53282	And if both processes enter at the same time, the lwa and swa part should make sure, that it is not only increased one time.	21:41
stekern	right	21:41
poke53282	and this I can implement by only using one linked register for all cores.	21:43
stekern	is that a statement or a question? ;)	21:43
poke53282	question.	21:44
poke53282	my cores are still working sequential. Just switching now and then by a very complex ruling scheme.	21:45
stekern	how do you determine what core has done the link?	21:45
poke53282	I set the linked register to -1	21:46
poke53282	Because I don't support 4GB of RAM, this should be safe.	21:46
stekern	so you mean the address?	21:46
poke53282	yes	21:46
poke53282	I don't have a flag.	21:47
stekern	ok, so you've combined the address and the flag to one variable, that is -1 when the flag is cleared. That's fine.	21:47
poke53282	Yes	21:47
stekern	but still, how do you know which core the register belong to?	21:47
poke53282	I don't know at the moment.	21:48
stekern	and even if they are running sequentially, how do you link two cores at the same time?	21:48
stekern	(imagine two unrelated spinlocks)	21:48
poke53282	yes, two unrelated spinlocks would clear the register for the other spinlock.	21:49
poke53282	But usually this triple lwa,add,swa is executed at once for one core.	21:49
stekern	that was my next question, can the opposite happen?	21:50
poke53282	the trylock might be a problem here.	21:50
poke53282	The question is, if the flags or linked address registers are not connected, how can the lwa,add,swa triple work on multicore systems?	21:51
poke53282	if two cores enter at the same time.	21:52
stekern	well, they are connected in the sense that a store (from another core) to the linked address will break the link	21:53
poke53282	that means, formally if I have 16 cores, all 16 linked addresses must be checked for one swa?	21:54
stekern	so, if two cores enter the lwa,add,swa at exact the same time, the one that get first to the swa succeed, all other fail	21:54
stekern	in hardware, this is handled by the snooping. and the databus can only be held by one cpu at a time	21:55
stekern	so if all cpus assert the store at exactly the same time, the arbiter will decide who wins	21:55
stekern	yes, I think you have to do it like that	21:58
stekern	check all 16 linked addresses	21:58
poke53282	Ok, will try.	22:00
poke53282	But I hope, I have to do it only in the l.swa part. Not for l.sw as well	22:01
stekern	to make it faster, maybe splitting up the link flag into one variable that you can AND with and only actually check the addresses in a slow path	22:02
poke53282	yes, something like a linkflag_bitfield	22:02
stekern	maybe you can do without checking on normal stores, but consider how the spinlock unlock works: http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/include/asm/spinlock.h?h=smp#n83	22:04
poke53282	Implemented, but nothing changed.	22:04
stekern	if that happens when another core is at the l.add and the l.swa is actually performed, you will overwrite what the unlock did	22:05
poke53282	Yes, but the unlock doesn't use lwa and swa. And the check for this variable is done in a ordinary while loop. So, this is not a problem.	22:06
poke53282	?	22:06
stekern	lock->tickets.owner is the same as lock->slock & 0xffff	22:07
poke53282	But in one case, it is lock->slock and in the other one lockval.tickets.owner	22:07
poke53282	Hmmm	22:07
stekern	http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/include/asm/spinlock_types.h?h=smp#n11	22:07
poke53282	Ok, a union	22:08
poke53282	can we use lwa,addi,swa,bnf for lock->tickets.owner++?	22:09
stekern	I guess you could, but it shouldn't be needed ;)	22:11
stekern	lock->tickets.owner++ will be a l.lhz;l.sh	22:12
poke53282	Let's see. I prevent changing the core while the linked register flag is set.	22:13
stekern	so if the lock lwa,add,swa performs the swa between the lhz and sh, it doesn't matter, since it didn't change the lower 16-bit	22:13
poke53282	Ok	22:15
stekern	if you do that, you'll run into problems with the trylock	22:15
stekern	since that can set the flag without 'clearing' it (by the swa)	22:15
poke53282	Yes, I see	22:16
poke53282	This didn't solve the problem anyhow.	22:17
poke53282	If I change the core after every instruction it always hangs at some point.	22:18
poke53282	Yes, the cpu hangs in a raw_spin_lock	22:24
poke53282	Next time, can you invent atomic functions, which are easier to handle? ;)	22:25
poke53282	More emulator friendly.	22:25
stekern	a cmpxchg is probably a lot easier to implement in an emulator, yes ;)	22:36
poke53282	At least I make progress. When I switch the core after each instruction, they end up in the same spinlock forever.	22:47
--- Log closed Tue Nov 04 00:00:13 2014

Generated by irclog2html.py 2.15.2 by Marius Gedminas - find it at mg.pov.lt!