--- Log opened Mon Nov 03 00:00:06 2014 | ||
poke53282 | stekern: Your starfield works | 02:59 |
---|---|---|
poke53282 | I don't have a palette based framebuffer, but you can see that it works. | 03:07 |
poke53282 | What I don't like is that you can't see the parallelized cores in this demo. | 03:07 |
poke53282 | Stekern: I have added a mandelbrot calculator. It is slow. But it shows the parallelization very nicely. | 04:16 |
poke53282 | just start "mandelpar" | 04:16 |
poke53282 | and take this link: "s-macke.github.io/jor1k/index.html?cpu=smp&n=8" | 04:22 |
stekern | poke53282: nice, do you have sources to it to share? | 06:00 |
poke53282 | http://pastie.org/9692650 | 06:05 |
poke53282 | nothing special. | 06:05 |
poke53282 | Just a little bit searching and a few changes. | 06:05 |
poke53282 | speed is not important here. I wanted to have a long calculation in which every core has something to do. | 06:06 |
stekern | yeah, I mostly just want to try to run it on real hw | 06:10 |
stekern | I never looked at any openmp code before | 06:12 |
poke53282 | openmp is simple. | 06:21 |
poke53282 | with one pragma you can parallelize loops for example. | 06:22 |
poke53282 | I have parallelized most of my simulations with openmp. And often only with a few extra lines of openmp code. | 06:23 |
poke53282 | In principle only one line is important | 06:24 |
poke53282 | #pragma omp parallel for schedule(guided) | 06:24 |
poke53282 | the schedule is usually not necessary. But it gives this typical pattern on the screen. | 06:24 |
poke53282 | #pragma omp parallel for | 06:25 |
poke53282 | is enough. Also for your starfield. | 06:25 |
poke53282 | for very complicated parallelization patterns like in the kernel it is not good. | 06:26 |
stekern | yeah, it looked pretty straightforward from what I read on wikipedia | 06:29 |
stekern | your mandelbrot needs libgomp, which I don't have though ;) | 06:30 |
poke53282 | Yes. remove --disable-libgomp | 06:30 |
poke53282 | from the gcc configuration | 06:31 |
stekern | from? | 06:31 |
poke53282 | jor1k.com/packages/gcc.tar.bz2 in around one minute available | 06:32 |
poke53282 | now | 06:33 |
stekern | ah, well I'm already building it here ;) | 06:34 |
stekern | poke53282: hmm, mandelpar crashed with a bus error here | 06:55 |
poke53282 | mmap and pointer is a short int? | 06:55 |
poke53282 | no char | 06:55 |
poke53282 | But probably this is not the reason. | 06:55 |
poke53282 | this would lead to a segmentation fault I guess. | 06:56 |
stekern | http://pastie.org/9692715 | 06:57 |
stekern | it's only finding 1 core too | 06:57 |
poke53282 | cat /sys/devices/system/cpu/online | 06:58 |
poke53282 | what is the output? | 06:58 |
poke53282 | the output makes no sense. If it finds only one core, the thread id should always be zero and not two. | 06:59 |
stekern | 0-3 | 06:59 |
stekern | do you have a compiled version of mandelpar I can try | 06:59 |
poke53282 | yes, in the emulator copy /usr/bin/mandelpar to ~ | 07:00 |
poke53282 | then click on this little box under the terminal and you get a .tar file | 07:00 |
stekern | unrelated, but why are the "SJK DEBUG" printouts related to the timer showing c=0? | 07:02 |
poke53282 | ahh, because the return value of this timer is always zero. But writing to ttcr is ignored for the smp kernel. | 07:03 |
poke53282 | So ttcr is the same for each core. | 07:03 |
poke53282 | in my implementation. | 07:03 |
poke53282 | And it is increased like a global timer. | 07:04 |
stekern | I see | 07:06 |
poke53282 | It's also a hack :) | 07:06 |
poke53282 | Neither better nor worse than yours. | 07:06 |
poke53282 | I ignore your hack and make my own. :) | 07:07 |
stekern | yours is slightly cleaner, but would have been hard to make that work in hw | 07:08 |
stekern | same thing happens with your mandelpar | 07:09 |
poke53282 | and if you exchange libgomp? | 07:10 |
stekern | the "Using 1 cores" is in jor1k too | 07:10 |
poke53282 | Hmm, I remember, that this was correct at some point. | 07:11 |
poke53282 | But I have never really checked | 07:11 |
poke53282 | later | 07:11 |
poke53282 | yes, you are right. It shows always zero. | 07:14 |
stekern | maybe it's just because I have 640x480-8 | 07:14 |
poke53282 | omp_get_num_threads: Size of the active team | 07:16 |
poke53282 | I execute this function outside of the loop | 07:16 |
poke53282 | I should have used omp_get_max_threads | 07:16 |
poke53282 | So, the output is correct. | 07:17 |
poke53282 | Just executed the wrong function. | 07:17 |
poke53282 | at the wrong place. | 07:17 |
poke53282 | What you should get is a segmentation fault and not a bus error? | 07:17 |
poke53282 | But yeah, this is what I said before. I use a 16 bit framebuffer and don't check in the program. | 07:18 |
poke53282 | But this error shouldn't happen in line 0 | 07:18 |
stekern | does it start from line 0 | 07:19 |
stekern | that's what it says | 07:19 |
poke53282 | yes, so it writes at the beginning of the framebuffer. This should be Ok | 07:20 |
poke53282 | I hope you removed the -mhard-float | 07:20 |
poke53282 | in the compilation command. | 07:20 |
stekern | doh... | 07:21 |
stekern | how did I not see that? ;) | 07:21 |
poke53282 | It's time that you include the fp instructions. | 07:24 |
stekern | yeah, I should test the withfpu branch anyway | 07:25 |
poke53282 | It's the 20th anniversary of the fdiv bug. Would be nice to see a reimplementation of this bug. | 07:27 |
poke53282 | ;) | 07:28 |
poke53282 | sleeping time | 07:33 |
stekern | night | 07:36 |
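
A minimal C sketch of the two OpenMP points from the exchange above: the single `#pragma omp parallel for schedule(guided)` line that spreads rows over the cores, and the difference between `omp_get_num_threads` (size of the active team, which is 1 outside a parallel region) and `omp_get_max_threads`. This is not the mandelpar source from the pastie link; the names `mandel_iter`, `render` and `report_threads`, the image size and the coordinate window are assumptions made for illustration. The serial fallbacks are only there so the sketch also builds without `-fopenmp`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

#define WIDTH 64      /* hypothetical image size, not the real framebuffer */
#define HEIGHT 48
#define MAX_ITER 255

/* Iterations until the point c = (cr, ci) escapes, capped at MAX_ITER. */
static int mandel_iter(double cr, double ci)
{
    double zr = 0.0, zi = 0.0;
    int i;
    for (i = 0; i < MAX_ITER && zr * zr + zi * zi <= 4.0; i++) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    return i;
}

/* Fill iters[] row by row; owner[y] records which thread computed row y. */
static void render(int iters[HEIGHT][WIDTH], int owner[HEIGHT])
{
    assert(iters != NULL && owner != NULL);

    /* The one line that matters: rows are handed out to the team in
     * shrinking chunks (guided), which gives the banded pattern. */
    #pragma omp parallel for schedule(guided)
    for (int y = 0; y < HEIGHT; y++) {
        owner[y] = omp_get_thread_num();
        for (int x = 0; x < WIDTH; x++) {
            double cr = -2.0 + 3.0 * x / WIDTH;
            double ci = -1.0 + 2.0 * y / HEIGHT;
            iters[y][x] = mandel_iter(cr, ci);
        }
    }
}

/* Called from serial code, i.e. outside any parallel region. */
static void report_threads(void)
{
    /* The active team outside a parallel region has exactly one thread. */
    printf("omp_get_num_threads(): %d\n", omp_get_num_threads());
    /* Upper bound on the team size a following parallel region may get. */
    printf("omp_get_max_threads(): %d\n", omp_get_max_threads());
}
```

Built with `gcc -fopenmp`, coloring each row by `owner[y]` is one way to make the per-core work visible; `report_threads` shows why a "Using 1 cores" message can be correct output from the wrong function.
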
olofk | stekern, poke53282 : What have I said about talking when I'm not here? It's very rude to make me start my morning by spending half an hour reading through the backlog | 08:00 |
olofk | oh... and awesome work btw :) | 08:00 |
stekern | ;) | 08:07 |
stekern | I think I've figured out why my in-kernel emulation of the atomic ops fails too | 08:08 |
stekern | it's those darn fix-ups | 08:08 |
stekern | i.e. when an in-kernel l.lwa tries to access a userspace area and fails | 08:10 |
wallento | I have put the landing page on the main wiki page at opencores.org: http://opencores.org/or1k/Main_Page | 09:44 |
wallento | we should fill it up now | 09:44 |
wallento | ;) | 09:44 |
maxpaln | Hi, a while ago I remember having a discussion about improving the wishbone BFM transactor testbench (the one that fires random read/write traffic at a wishbone interface) | 12:00 |
maxpaln | I want to implement many of these changes as part of the improvements to the DDR3 memory interface | 12:01 |
maxpaln | I just wanted to check that no one has already done this before I duplicate effort | 12:01 |
maxpaln | or rather, I wanted to check whether anyone has improved the wishbone transactor recently (since this summer). | 12:02 |
olofk | maxpaln: No work has been done on that, but I'm very interested in improvements as well since I'm also working on some memory controller improvements | 12:10 |
olofk | Those might be of interest for you as well actually | 12:10 |
maxpaln | oh really? | 12:16 |
maxpaln | The improvements I am planning to the transactor are along the lines of making the transactions completely random | 12:17 |
maxpaln | I am thinking a good approach is to break the tests into a range of addresses, say 1000 (or even 10, it's up to the user really). | 12:18 |
maxpaln | at the start of the test do a burst write across the full range of addresses - this will initialise the wishbone memory interface and a local buffer to the same values | 12:18 |
maxpaln | then allow the testbench to generate random read and write bursts to the memory. | 12:19 |
maxpaln | all writes will be mirrored in the local buffer so that any reads should produce a match | 12:19 |
maxpaln | the number of transactions per test (i.e. the number of transactions carried out on the current range of addresses) is probably something the user will define | 12:20 |
maxpaln | but after that everything will be random: whether the transaction is read or write, what burst type, what length etc. | 12:20 |
maxpaln | I'll also implement stekern's suggestion of random number of wait states between transactions (between 0 and a few). | 12:21 |
maxpaln | this should make the transactor pretty representative of a real world system *I think* | 12:21 |
olofk | maxpaln: That sounds all good, and I would very much like to have that | 12:21 |
maxpaln | well, hopefully I'll have something in a few days - I have some time between other stuff this week | 12:22 |
olofk | That would be awesome | 12:22 |
olofk | Have you looked at wb_sdram_ctrl? | 12:22 |
maxpaln | erm, I think so - is that the one with the multiple ports? | 12:23 |
olofk | Yep | 12:23 |
maxpaln | yeah, that was a neat implementation - but there was something about it that made it tricky for me to reuse. | 12:23 |
olofk | and it has a small cache for values that are read back from RAM | 12:23 |
maxpaln | ah, yeah - I remember, that bit was very nice | 12:23 |
olofk | What I'm doing now is pulling out the arbiter and cache stuff to a separate component with a standardized interface to the mem controller | 12:24 |
maxpaln | Oooh, nice | 12:24 |
olofk | The interface will be two write channels (one for command and address, and one for data and mask), and one return channel with data and address | 12:24 |
maxpaln | cool | 12:25 |
olofk | It will also be possible to have different widths for wishbone and the mem controller | 12:25 |
maxpaln | neat | 12:25 |
olofk | But I can't give you an ETA on that, so you should probably keep your single wishbone port for now | 12:26 |
olofk | But once it's ready it probably shouldn't be too hard to just hook up your ddr controller to this interface | 12:26 |
maxpaln | cool - no rush, my controller is working well at the moment. I am trying to simplify it a little at the moment before adding support for arbitrary burst lengths on incrementing bursts | 12:29 |
maxpaln | but i've broken something in HW that works in Sim - hence the rework of the transactor | 12:30 |
maxpaln | I'm bored of the logic analyser for now ! | 12:32 |
olofk | I know the feeling :) | 12:32 |
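
The test strategy maxpaln describes (initialise the memory and a local buffer to the same values, fire random read and write bursts, mirror every write, compare every read) can be sketched as a plain software model. This is C rather than a Verilog BFM, and everything in it is an assumption for illustration: `dut_mem` stands in for the memory behind the wishbone interface, `MAX_BURST` and the 0..3 wait-state range are arbitrary, and only incrementing bursts are modelled, so the random burst type is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define RANGE 1000      /* words per test range (the "say 1000" above) */
#define MAX_BURST 16    /* hypothetical upper bound on burst length */

/* Stand-in for the memory behind the wishbone interface. */
static uint32_t dut_mem[RANGE];
/* Local mirror buffer that every write is copied into. */
static uint32_t mirror[RANGE];

static void dut_write(unsigned addr, uint32_t data)
{
    assert(addr < RANGE);
    dut_mem[addr] = data;
}

static uint32_t dut_read(unsigned addr)
{
    assert(addr < RANGE);
    return dut_mem[addr];
}

/* Runs n_transactions random bursts; returns the number of read mismatches. */
static unsigned run_random_test(unsigned seed, unsigned n_transactions)
{
    unsigned mismatches = 0;
    srand(seed);

    /* Initial burst write across the full range: memory and mirror agree. */
    for (unsigned a = 0; a < RANGE; a++) {
        uint32_t v = (uint32_t)rand();
        dut_write(a, v);
        mirror[a] = v;
    }

    for (unsigned t = 0; t < n_transactions; t++) {
        int is_write = rand() & 1;                        /* random direction */
        unsigned len = 1 + (unsigned)rand() % MAX_BURST;  /* random length */
        unsigned start = (unsigned)rand() % (RANGE - len + 1);
        unsigned wait_states = (unsigned)rand() % 4;      /* 0..3 idle cycles */
        (void)wait_states; /* only meaningful in a cycle-based testbench */

        for (unsigned i = 0; i < len; i++) {
            unsigned a = start + i;
            if (is_write) {
                uint32_t v = (uint32_t)rand();
                dut_write(a, v);
                mirror[a] = v;          /* mirror every write locally */
            } else if (dut_read(a) != mirror[a]) {
                mismatches++;           /* every read must match the mirror */
            }
        }
    }
    return mismatches;
}
```

Against this toy memory the mismatch count is always zero; the point of the structure is that a real interface under test replaces `dut_write` and `dut_read`, and any nonzero count flags a bug.
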
poke53282 | olofk: Yes, we were busy this weekend | 16:30 |
poke53282 | Hmm, interesting. when the ompic times out, the destination cpu is either at pc=0x500 (tick timer) or in a raw_spin_lock_irqsave. | 17:10 |
poke53282 | Let me guess, where this spinlock is :) | 17:11 |
poke53282 | stekern: Ok the following. ompic tries to raise an irq on cpus, which are in the irqsave spinlock of this device. | 17:14 |
poke53282 | ooops, I guess being in the main loop of raw_spin_lock_irq with disabled interrupts is wrong. | 19:09 |
olofk | ehm... has anyone heard of the SYMPL FP3250 GPGPU before? | 19:11 |
olofk | Google doesn't give me any info | 19:11 |
poke53282 | nope | 19:11 |
olofk | I can't really make out if this is a serious thing or not | 19:13 |
poke53282 | If Google can't find it, it doesn't exist. | 19:14 |
poke53282 | stekern: What I know is that all the cores that don't react are stalled somewhere in raw_spin_lock_irq or in raw_spin_lock_irqsave. Every time with disabled interrupts. The functions with the spinlocks are mainly ompic_raise_softirq, rcu_process_callbacks and schedule. | 20:33 |
stekern | poke53282: ok, but the one that is holding the lock should be releasing it at some point | 21:07 |
stekern | unless it tries to send a second IPI to the same core before the first one has got through, but IIRC that shouldn't be possible | 21:08 |
stekern | I could be wrong though | 21:10 |
stekern | the IPI stuff is here: http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/kernel/smp.c?h=smp | 21:16 |
poke53282 | I wish I knew what's going on. | 21:29 |
poke53282 | The statistics are clear. Every time, the cpu which is called hangs in a spinlock | 21:29 |
stekern | but obviously that is what is happening for you? one cpu is trying to send a second IPI to a core that hasn't yet received the first one? | 21:32 |
stekern | maybe that's expected... that's problematic then | 21:32 |
poke53282 | By the way. You implemented the emulation of swa and lwa in the kernel with the virtual address and not the physical. | 21:32 |
stekern | yes | 21:33 |
stekern | that was intended, is that a problem? | 21:33 |
poke53282 | No, not really. My hardware implementation uses the physical address. | 21:34 |
poke53282 | You told me so. | 21:34 |
poke53282 | Every time you send via ompic an irq signal to a cpu which has not acknowledged the previous request, I check the pc of the destination cpu | 21:35 |
stekern | right, so does mine... maybe that is a problem | 21:35 |
poke53282 | And every time, the pc shows me that the cpu hangs in a spinlock. | 21:36 |
poke53282 | stekern: Not sure, but I don't use the emulation. So I don't care at the moment. Just wanted to mention it. But this is not the problem here. | 21:36 |
stekern | ...but only a theoretical one I think | 21:37 |
poke53282 | Ok, one question about the spinlock here: http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/include/asm/spinlock.h?h=smp#n32 | 21:38 |
stekern | the emulation isn't perfect anyway, a normal store to the same address will not break the link | 21:38 |
poke53282 | lock->slock should increase by (1<<TICKET_SHIFT) every time a process enters the routine. | 21:39 |
stekern | right | 21:39 |
stekern | (emulation) the important check is that the flag is still set, the address compare only serves as a slight sanity check | 21:40 |
poke53282 | And if both processes enter at the same time, the lwa and swa part should make sure that it is not increased only once. | 21:41 |
stekern | right | 21:41 |
poke53282 | and this I can implement by only using one linked register for all cores. | 21:43 |
stekern | is that a statement or a question? ;) | 21:43 |
poke53282 | question. | 21:44 |
poke53282 | my cores are still working sequentially. Just switching now and then by a very complex ruling scheme. | 21:45 |
stekern | how do you determine what core has done the link? | 21:45 |
poke53282 | I set the linked register to -1 | 21:46 |
poke53282 | Because I don't support 4GB of RAM, this should be safe. | 21:46 |
stekern | so you mean the address? | 21:46 |
poke53282 | yes | 21:46 |
poke53282 | I don't have a flag. | 21:47 |
stekern | ok, so you've combined the address and the flag to one variable, that is -1 when the flag is cleared. That's fine. | 21:47 |
poke53282 | Yes | 21:47 |
stekern | but still, how do you know which core the register belongs to? | 21:47 |
poke53282 | I don't know at the moment. | 21:48 |
stekern | and even if they are running sequentially, how do you link two cores at the same time? | 21:48 |
stekern | (imagine two unrelated spinlocks) | 21:48 |
poke53282 | yes, two unrelated spinlocks would clear the register for the other spinlock. | 21:49 |
poke53282 | But usually this triple lwa,add,swa is executed at once for one core. | 21:49 |
stekern | that was my next question, can the opposite happen? | 21:50 |
poke53282 | the trylock might be a problem here. | 21:50 |
poke53282 | The question is, if the flags or linked address registers are not connected, how can the lwa,add,swa triple work on multicore systems? | 21:51 |
poke53282 | if two cores enter at the same time. | 21:52 |
stekern | well, they are connected in the sense that a store (from another core) to the linked address will break the link | 21:53 |
poke53282 | that means, formally if I have 16 cores, all 16 linked addresses must be checked for one swa? | 21:54 |
stekern | so, if two cores enter the lwa,add,swa at exactly the same time, the one that gets to the swa first succeeds, all others fail | 21:54 |
stekern | in hardware, this is handled by the snooping. and the databus can only be held by one cpu at a time | 21:55 |
stekern | so if all cpus assert the store at exactly the same time, the arbiter will decide who wins | 21:55 |
stekern | yes, I think you have to do it like that | 21:58 |
stekern | check all 16 linked addresses | 21:58 |
poke53282 | Ok, will try. | 22:00 |
poke53282 | But I hope, I have to do it only in the l.swa part. Not for l.sw as well | 22:01 |
stekern | to make it faster, maybe split the link flags out into one variable that you can AND with, and only actually check the addresses in a slow path | 22:02 |
poke53282 | yes, something like a linkflag_bitfield | 22:02 |
stekern | maybe you can do without checking on normal stores, but consider how the spinlock unlock works: http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/include/asm/spinlock.h?h=smp#n83 | 22:04 |
poke53282 | Implemented, but nothing changed. | 22:04 |
stekern | if that happens when another core is at the l.add and the l.swa is actually performed, you will overwrite what the unlock did | 22:05 |
poke53282 | Yes, but the unlock doesn't use lwa and swa. And the check for this variable is done in an ordinary while loop. So, this is not a problem. | 22:06 |
poke53282 | ? | 22:06 |
stekern | lock->tickets.owner is the same as lock->slock & 0xffff | 22:07 |
poke53282 | But in one case, it is lock->slock and in the other one lockval.tickets.owner | 22:07 |
poke53282 | Hmmm | 22:07 |
stekern | http://git.openrisc.net/cgit.cgi/stefan/linux/tree/arch/openrisc/include/asm/spinlock_types.h?h=smp#n11 | 22:07 |
poke53282 | Ok, a union | 22:08 |
poke53282 | can we use lwa,addi,swa,bnf for lock->tickets.owner++? | 22:09 |
stekern | I guess you could, but it shouldn't be needed ;) | 22:11 |
stekern | lock->tickets.owner++ will be a l.lhz;l.sh | 22:12 |
poke53282 | Let's see. I prevent changing the core while the linked register flag is set. | 22:13 |
stekern | so if the lock lwa,add,swa performs the swa between the lhz and sh, it doesn't matter, since it didn't change the lower 16-bit | 22:13 |
poke53282 | Ok | 22:15 |
stekern | if you do that, you'll run into problems with the trylock | 22:15 |
stekern | since that can set the flag without 'clearing' it (by the swa) | 22:15 |
poke53282 | Yes, I see | 22:16 |
poke53282 | This didn't solve the problem anyhow. | 22:17 |
poke53282 | If I change the core after every instruction it always hangs at some point. | 22:18 |
poke53282 | Yes, the cpu hangs in a raw_spin_lock | 22:24 |
poke53282 | Next time, can you invent atomic functions, which are easier to handle? ;) | 22:25 |
poke53282 | More emulator friendly. | 22:25 |
stekern | a cmpxchg is probably a lot easier to implement in an emulator, yes ;) | 22:36 |
poke53282 | At least I make progress. When I switch the core after each instruction, they end up in the same spinlock forever. | 22:47 |
--- Log closed Tue Nov 04 00:00:13 2014 |
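
A minimal C sketch of the per-core link tracking discussed above, for an emulator that runs its cores sequentially: one linked address per core (with -1 meaning "no link"), a bitfield as the fast path, and a store that breaks the links of the other cores to the same address. It is not jor1k's or the kernel's actual code. All names (`emu_lwa`, `emu_swa`, `emu_sw`, `take_ticket`, `link_mask`), the toy word-addressed `ram`, and the choice that a normal store breaks only other cores' links are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NCORES 16
#define RAM_WORDS 1024
#define TICKET_SHIFT 16
#define NO_LINK 0xffffffffu          /* -1: this core holds no link */

static uint32_t ram[RAM_WORDS];      /* toy word-addressed memory */
static uint32_t link_addr[NCORES];   /* linked address, one per core */
static uint32_t link_mask;           /* bit n set: core n holds a link */

static void emu_reset(void)
{
    for (int c = 0; c < NCORES; c++)
        link_addr[c] = NO_LINK;
    link_mask = 0;
}

/* A store by `self` to addr breaks every other core's link to addr. */
static void break_links(int self, uint32_t addr)
{
    uint32_t others = link_mask & ~(1u << self);
    if (others == 0)
        return;                      /* fast path: nobody else is linked */
    for (int c = 0; c < NCORES; c++) {
        if (((others >> c) & 1u) && link_addr[c] == addr) {
            link_addr[c] = NO_LINK;
            link_mask &= ~(1u << c);
        }
    }
}

/* l.lwa: load and remember the address as this core's link. */
static uint32_t emu_lwa(int core, uint32_t addr)
{
    assert(addr < RAM_WORDS);
    link_addr[core] = addr;
    link_mask |= 1u << core;
    return ram[addr];
}

/* l.swa: store only if this core's link to addr is intact; 1 on success. */
static int emu_swa(int core, uint32_t addr, uint32_t val)
{
    assert(addr < RAM_WORDS);
    int ok = ((link_mask >> core) & 1u) && link_addr[core] == addr;
    link_addr[core] = NO_LINK;       /* the link is consumed either way */
    link_mask &= ~(1u << core);
    if (!ok)
        return 0;
    ram[addr] = val;
    break_links(core, addr);
    return 1;
}

/* l.sw: a normal store also has to break other cores' links. */
static void emu_sw(int core, uint32_t addr, uint32_t val)
{
    assert(addr < RAM_WORDS);
    ram[addr] = val;
    break_links(core, addr);
}

/* The lwa,add,swa triple of the ticket lock: retry until the swa succeeds. */
static uint32_t take_ticket(int core, uint32_t lock_addr)
{
    uint32_t old;
    do {
        old = emu_lwa(core, lock_addr);
    } while (!emu_swa(core, lock_addr, old + (1u << TICKET_SHIFT)));
    return old >> TICKET_SHIFT;
}
```

With per-core links, two cores working on unrelated spinlocks no longer clear each other's link, while two cores racing on the same lock still see exactly one `emu_swa` succeed.
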