IRC logs for #openrisc Monday, 2014-11-03

--- Log opened Mon Nov 03 00:00:06 2014
poke53282stekern: Your starfield works02:59
poke53282I don't have a palette based framebuffer, but you can see that it works.03:07
poke53282What I don't like is, that you can't see the parallelized cores with this demo.03:07
poke53282Stekern: I have added a mandelbrot calculator. It is slow. But it shows the parallelization very nicely.04:16
poke53282just start "mandelpar"04:16
poke53282and take this link: ""04:22
stekernpoke53282: nice, do you have sources to it to share?06:00
poke53282nothing special.06:05
poke53282Just a little bit searching and a few changes.06:05
poke53282speed is not important here. I wanted to have a long calculation in which every core has something to do.06:06
stekernyeah, I mostly just want to try to run it on real hw06:10
stekernI never looked at any openmp code before06:12
poke53282openmp is simple.06:21
poke53282with one pragma you can parallelize loops for example.06:22
poke53282I have parallelized most of my simulations with openmp. And often only with a few extra lines of openmp code.06:23
poke53282In principle only one line is important06:24
poke53282#pragma omp parallel for schedule(guided)06:24
poke53282the schedule is usually not necessary. But it gives this typical pattern on the screen.06:24
poke53282#pragma omp parallel06:25
poke53282is enough. Also for your starfield.06:25
poke53282for very complicated parallelization patterns like in the kernel it is not good.06:26
stekernyeah, it looked pretty straight forward from what I read on wikipedia06:29
stekernyour mandelbrot needs libgomp, which I don't have though ;)06:30
poke53282Yes. remove --disable-libgomp06:30
poke53282from the gcc configuration06:31
stekernfrom?06:31  in around one minute available06:32
stekernah, well I'm alredy building it here ;)06:34
stekernpoke53282: hmm, mandelpar crashed with a bus error here06:55
poke53282mmap and pointer is a short int?06:55
poke53282no char06:55
poke53282But probably this is not the reason.06:55
poke53282this would lead to a segmentation fault I guess.06:56
stekernit's only finding 1 core too06:57
poke53282cat /sys/devices/system/cpu/online06:58
poke53282what is the output?06:58
poke53282the output makes no sense. If he is finding only one core the thread id should be always zero and not two.06:59
stekerndo you have a compiled version of mandelpar I can try06:59
poke53282yes, in the emulator copy /usr/bin/mandelpar to ~07:00
poke53282then click on this little box under the terminal and you get a .tar file07:00
stekernunrelated, but why are the "SJK DEBUG" printouts related to the timer showing c=0?07:02
poke53282ahh, because the retiurn value of this timer is always zero. But writing to ttcr is ignored for the smp kernel.07:03
poke53282So ttcr is the same for each core.07:03
poke53282in my implementation.07:03
poke53282And it is increased like a global timer.07:04
stekernI see07:06
poke53282It's also a hack :)07:06
poke53282Neither better nor worse than yours.07:06
poke53282I ignore your hack and make my own. :)07:07
stekernyours is slightly cleaner, but would have been hard to make that work in hw07:08
stekernsame thing happens with your mandelpar07:09
poke53282and if you exchange libgomp?07:10
stekernthe "Using 1 cores" is in jor1k too07:10
poke53282Hmm, I remember, that this was correct at some point.07:11
poke53282But I have never really checked07:11
poke53282yes, you are right. It shows always zero.07:14
stekernmaybe it's just because I have 640x480-807:14
poke53282omp_get_num_threads: Size of the active team07:16
poke53282I execute this function outside of the loop07:16
poke53282I should have used omp_get_max_threads07:16
poke53282So, the output is correct.07:17
poke53282Just executed the wrong function.07:17
poke53282at the wrong place.07:17
poke53282What you should get is a segmentation fault and not a bus error?07:17
poke53282But yeah, this is what I said before. I use a 16 bit framebuffer and don't check in the program.07:18
poke53282But this error shouldn't happen in line 007:18
stekerndoes it start from line 007:19
stekernthat's what it says07:19
poke53282yes, so it writes at the beginning of the framebuffer. This should be Ok07:20
poke53282I hope you removed the -mhard-float07:20
poke53282in the compilation command.07:20
stekernhow did I not see that? ;)07:21
poke53282It's time that you include the fp instructions.07:24
stekernyeah, I should test the withfpu branch anyway07:25
poke53282It's the 20th anniversary of the fdiv bug. Would be nice to see a reimplementation of this bug.07:27
poke53282sleeping time07:33
olofkstekern, poke53282 : What have I said about talking when I'm not here? It's very rude to make me start my morning by spending half an hour reading through the backlog08:00
olofkoh... and awesome work btw :)08:00
stekernI think I've figured out why my in-kernel emulation of the atomic ops fails too08:08
stekernit's those darn fix-ups08:08
stekerni.e. when an in-kernel l.lwa tries to access a userspace area and fails08:10
wallentoI have put the landing page on the main wiki page at
wallentowe should fill it up now09:44
maxpalnHi, a while ago I remember having a discussion about improving the wishbone BFM transactor testbench (the one that fires random read/write traffic at a wishbone interface)12:00
maxpalnI want to implement many of these changes as part of the improvements to the DDR3 memory interface12:01
maxpalnI just wanted to check that no one has already done this before I duplicate effort12:01
maxpalnor rather, I wanted to check whether anyone has improved the wishbone transactor recently (since this summer).12:02
olofkmaxpaln: No work has been done on that, but I'm very interested in improvements as well since I'm also working on some memory controller improvements12:10
olofkThose might be of interest for you as well actually12:10
maxpalnoh really?12:16
maxpalnThe improvements I am planning to the transactor are along the lines of making the transactions completely random12:17
maxpalnI am thinking a good approach is to break the tests into a range of addresses, say 1000 (or even 10 its up to the user really).12:18
maxpalnat the start of the test do a burst write across the full range of addresses - this will initialise the wishbone memory interface and a local buffer to the same values12:18
maxpalnthen allow the testbench to generate random read and write bursts to the memory.12:19
maxpalnall writes will be mirrored in the local buffer so that any reads should produce a match12:19
maxpalnthe number of transactions per test (i.e. the number of transactions carried out on the current range of addresses) is probably something the user will define12:20
maxpalnbut after that everything will be random: whether the transaction is read or write, what burt type, what length etc.12:20
maxpalnI'll also implement stekern's suggestion of random number of wait states between transactions (between 0 and a few).12:21
maxpalnthis should make the transactor pretty representative of a real world system *I think*12:21
olofkmaxpaln: That sounds all good, and I would very much like to have that12:21
maxpalnwell, hopefully I'll have something in a few days - I have some time between other stuff this week12:22
olofkThat would be awesome12:22
olofkHave you looked at wb_sdram_ctrl?12:22
maxpalnerm, I think so - is that the one with the mulitple ports12:23
maxpalnyeah, that was a neat implementation - but there was something about it that made it tricky for me to reuse.12:23
olofkand it has a small cache for values that are read back from RAM12:23
maxpalnah, yeah - I remnember, that bit was very nice12:23
olofkWhat I'm doing now is pulling out the arbiter and cache stuff to a separate component with a standardized interface to the mem controller12:24
maxpalnOooh, nice12:24
olofkThe interface will be two write channels (one for command and address, and one for data and mask), and one return channel with data and address12:24
olofkIt will also be possible to have different widths for wishbone and the mem controller12:25
olofkBut I can't give you an ETA on that, so you should probably keep your single wishbone port for now12:26
olofkBut once it's ready it probably shouldn't be too hard to just hook up your ddr controller to this interface12:26
maxpalncool - no rush, my controller is working well at the moment. I am trying to simplify it a little at the moment before adding support for arbitrary burst lengths on incrementing bursts12:29
maxpalnbut i've broken something in HW that works in Sim - hence the rework of the transactor12:30
maxpalnI'm bored of the logic analyser for now !12:32
olofkI know the feeling :)12:32
poke53282olofk: Yes, we were busy this weekend16:30
poke53282Hmm, interesting. when the ompic times out, the destination cpu is either at pc=0x500 (tick timer) or in an raw_spin_lock_irqsave.17:10
poke53282Let me guess, where this spinlock is :)17:11
poke53282stekern: Ok the following. ompic tries to raise an irq on cpus, which are in the irqsave spinlock of this device.17:14
poke53282ooops, I guess being in the main loop of raw_spin_lock_irq with disabled interrupts is wrong.19:09
olofkehm...  has anyone heard of the SYMPL FP3250 GPGPU before?19:11
olofkGoogle doesn't give me any info19:11
olofkI can't really make out if this is a serious thing or not19:13
poke53282If google can't find it doesn't exists.19:14
poke53282stekern: What I know is, that all the cores, that don't react are stalled somewhere in raw_spin_lock_irq or in raw_spin_lock_saveirq. Everytime with disabled interrupts. The functions with the spinlocks are mainly ompic_raise_softirq, rcu_process_callbacks and schedule.20:33
stekernpoke53282: ok, but the one that is holding the lock should be releasing it at some point21:07
stekernunless it tries to send a second IPI to the same core before the first one has got through, but IIRC that shouldn't be possible21:08
stekernI could be wrong though21:10
stekernthe IPI stuff is here:
poke53282I wish I would know what's going on.21:29
poke53282The statistics are clear. Everytime the cpu which is called hangs in a spinlock21:29
stekernbut obviously that is what is happening for you? one cpu is trying to send a second IPI to a core that hasn't yet received the first one?21:32
stekernmaybe that's expected... that's problematic then21:32
poke53282By the way. You implemented the emulation of swa and lwa in the kernel with the virtual address and not the physical.21:32
stekernthat was intended, is that a problem?21:33
poke53282No, not really. My hardware implementation uses the physical address.21:34
poke53282You told me so.21:34
poke53282Everytime you send via ompic an irq signal to a cpu which has not acknowledged the previous request, I check the pc of the destination cpu21:35
stekernright, so does mine... maybe that is a problem21:35
poke53282And everytime, the pc shows me, that the cpu hangs in an spinlock.21:36
poke53282stekern: Not sure, But I don't use the emulation. So I don't care at the moment. Just wanted it to mention. But this is not the problem here.21:36
stekern...but only a theoretical one I think21:37
poke53282Ok, one question about the spinlock here:
stekernthe emulation isn't perfect anyway, a normal store to the same address will not break the link21:38
poke53282lock->slock should increase by (1<<TICKET_SHIFT) everytime a process enters the routine.21:39
stekern(emulation) the important check is that the flag is still set, the address compare only serve as a slight sanity check21:40
poke53282And if both processes enter at the same time,  the lwa and swa part should make sure, that it is not only increased one time.21:41
poke53282and this I can implement by only using one linked register for all cores.21:43
stekernis that a statement or a question? ;)21:43
poke53282my cores are still working sequential. Just switching now and then by a very complex ruling scheme.21:45
stekernhow do you determine what core has done the link?21:45
poke53282I set the linked register to -121:46
poke53282Because I don't support 4GB of RAM, this should be safe.21:46
stekernso you mean the address?21:46
poke53282I don't have a flag.21:47
stekernok, so you've combined the address and the flag to one variable, that is -1 when the flag is cleared. That's fine.21:47
stekernbut still, how do you know which core the register belong to?21:47
poke53282I don't know at the moment.21:48
stekernand even if they are running sequentially, how do you link two cores at the same time?21:48
stekern(imagine two unrelated spinlocks)21:48
poke53282yes, two unrelated spinlocks would clear the register for the other spinlock.21:49
poke53282But usually this triple lwa,add,swa is executed at once for one core.21:49
stekernthat was my next question, can the opposite happen?21:50
poke53282the trylock might be a problem here.21:50
poke53282The question is, if the flags or linked address registers are not connected, how can the lwa,add,swa triple work on multicore systems?21:51
poke53282if two cores enter at the same time.21:52
stekernwell, they are connected in the sense that a store (from another core) to the linked address will break the link21:53
poke53282that means, formally if I have 16 cores, all 16 linked addresses must be checked for one swa?21:54
stekernso, if two cores enter the lwa,add,swa at exact the same time, the one that get first to the swa succeed, all other fail21:54
stekernin hardware, this is handled by the snooping. and the databus can only be held by one cpu at a time21:55
stekernso if all cpus assert the store at exactly the same time, the arbiter will decide who wins21:55
stekernyes, I think you have to do it like that21:58
stekerncheck all 16 linked addresses21:58
poke53282Ok, will try.22:00
poke53282But I hope, I have to do it only in the l.swa part. Not for l.sw as well22:01
stekernto make it faster, maybe splitting up the link flag into one variable that you can AND with and only actually check the addresses in a slow path22:02
poke53282yes, something like a linkflag_bitfield22:02
stekernmaybe you can do without checking on normal stores, but consider how the spinlock unlock works:
poke53282Implemented, but nothing changed.22:04
stekernif that happens when another core is at the l.add and the l.swa is actually performed, you will overwrite what the unlock did22:05
poke53282Yes, but the unlock doesn't use lwa and swa. And the check for this variable is done in a ordinary while loop. So, this is not a problem.22:06
stekernlock->tickets.owner is the same as lock->slock & 0xffff22:07
poke53282But in one case, it is lock->slock and in the other one
poke53282Ok, a union22:08
poke53282can we use lwa,addi,swa,bnf  for  lock->tickets.owner++?22:09
stekernI guess you could, but it shouldn't be needed ;)22:11
stekernlock->tickets.owner++ will be a l.lhz;l.sh22:12
poke53282Let's see. I prevent changing the core while the linked register flag is set.22:13
stekernso if the lock lwa,add,swa performs the swa between the lhz and sh, it doesn't matter, since it didn't change the lower 16-bit22:13
stekernif you do that, you'll run into problems with the trylock22:15
stekernsince that can set the flag without 'clearing' it (by the swa)22:15
poke53282Yes, I see22:16
poke53282This didn't solve the problem anyhow.22:17
poke53282If I change the core after every instruction it always hangs at some point.22:18
poke53282Yes, the cpu hangs in a raw_spin_lock22:24
poke53282Next time, can you invent atomic functions, which are easier to handle? ;)22:25
poke53282More emulator friendly.22:25
stekerna cmpxchg is probably a lot easier to implement in an emulator, yes ;)22:36
poke53282At least I make progress. When I switch the core after each instruction, they end up in the same spinlock forever.22:47
--- Log closed Tue Nov 04 00:00:13 2014

Generated by 2.15.2 by Marius Gedminas - find it at!