--- Log opened Wed Aug 20 00:00:15 2014 | ||
stekern | dalias: didn't you disguss that with poke53281 already? | 03:16 |
---|---|---|
stekern | *discuss | 03:17 |
stekern | I get: alignof.c:6:31: error: 'max_align_t' undeclared (first use in this function) | 03:18 |
poke53281 | you have to take the stddef from gcc. | 04:12 |
poke53281 | and in principle yes, we discussed this. | 04:12 |
poke53281 | the max_align_t was 4 and the sizeof I think was 8. It's somewhere in the logbook. | 04:13 |
poke53281 | sorry, in the chatlogs | 04:13 |
dalias | size is definitely not 8 | 04:45 |
dalias | the min possible size is 16 (8 bytes each for long long and long double) | 04:45 |
dalias | yeah i was just unclear on the alignment. i know the natural alignment of those types on or1k is 4, but it's also 4 on i386, whereas max_align_t gets alignment 8 on i386 | 04:46 |
dalias | because gcc has this stupid concept of separate in-struct/out-of-struct alignments and __alignas__ uses the "out of struct alignment" | 04:46 |
dalias | so i was trying to confirm whether the "out of struct alignment" on or1k is also 4 for these types | 04:47 |
dalias | and according to jor1k's gcc, it is | 04:47 |
dalias | anyway, in order to have the public headers not depend on stupid gcc implementation-internals that should never have leaked (the whole point of the "out of struct alignment" is that it should be 100% non-observable; observable alignments are supposed to use the "in struct" alignment) | 04:48 |
dalias | i think we just need per-arch definitions of max_align_t | 04:49 |
poke53281 | http://juliusbaxter.net/openrisc-irc/%23openrisc.2014-08-08.log.html 22:04 | 05:00 |
poke53281 | "alignof(max_align_t) => 4 sizeof(max_align_t) => 16" | 05:00 |
poke53281 | with musl-compiled gcc but stddef.h from gcc. | 05:01 |
poke53281 | The jor1k's gcc on the official site is compiled with uclibc. | 05:01 |
poke53281 | I don't think that matters anyhow | 05:03 |
poke53281 | dump question, how did you figure it out? The stddef.h is not included. :) | 05:05 |
poke53281 | Oh wait, there is one :) | 05:09 |
poke53281 | http://pastie.org/9487968 | 05:21 |
poke53281 | I wonder whether sizeof(max_align_t) has any reasonable application, dalias. | 05:40 |
stekern | if I read the stddef.h from gcc, it's only exposed for c11/c++11? | 06:16 |
stekern | +correctly | 06:16 |
stekern | it=max_align_t | 06:16 |
stekern | but yeah, 'in struct' alignment should be 8 for that | 06:17 |
stekern | err... no | 06:18 |
poke53281 | Take my pastie. That works | 06:28 |
poke53281 | It takes the important part from stddef.h | 06:28 |
poke53281 | otherwise you have to take g++. Indeed | 06:28 |
poke53281 | according to the specification, the alignment should be 4. | 06:29 |
poke53281 | The vector types don't matter here. | 06:29 |
poke53281 | because the vector types are no basic types. | 06:30 |
stekern | yes, but the 'in struct' alignment doesn't show from that, only the 'out of struct' (but as I understood it, that was what dalias was interested in) | 06:46 |
stekern | anyway.. I've been annoyed by the alignment rules we have before, why aren't our long long and double aligned to 8 and why don't we have stack alignment of 8? | 06:48 |
poke53281 | hehe, don't know. I am also annoyed by our calling conventions :) | 06:51 |
poke53281 | Probably the or1k ABI was not done by an expert group, but only of one person. | 06:52 |
stekern | you're speaking about the vararg calling convention? | 06:53 |
poke53281 | Yes | 06:53 |
stekern | it's a bit sucky, but really, it's just bad code that breaks with it | 06:54 |
poke53281 | well, at the moment it seems, that directfb is indeed the one and only program were this is important. | 06:54 |
poke53281 | Yes | 06:54 |
stekern | ...directfb should be fixed then | 06:54 |
stekern | regardless of what vararg calling convention we use ;) | 06:55 |
poke53281 | Yes, but at the beginning I was afraid, that a lot of programs use this conventions. Now, I seems, that it is only directfb. | 06:56 |
stekern | iirc dalias pointed out some case on x86(_64) where that kind of assumptions break too | 06:56 |
stekern | http://juliusbaxter.net/openrisc-irc/%23openrisc.2014-07-07.log.html#t03:53 | 06:59 |
poke53281 | that would make sense for a more complicated ABI. e. g. providing the floatiing points numbers directly in the SSE registers or so. But I have never read what AMD suggested as ABI. | 06:59 |
poke53281 | good to know. | 07:04 |
wallento | stekern: everything looks good, thanks! | 07:49 |
stekern | wallento: great! | 07:50 |
stekern | if I just can catch this bug I've been chasing the last week, I'd then like to start moving over multicore stuff into master | 07:51 |
wallento | is it in the snooping? | 07:51 |
stekern | I'm not sure where the bug actually is | 07:51 |
wallento | btw. any planning for multicore handson at ORConf? | 07:51 |
stekern | could even be a sw bug | 07:52 |
stekern | I'm hoping on at least being able to give a Linux SMP demo | 07:52 |
stekern | the problem is that one cpu takes a spinlock twice | 07:53 |
wallento | sh*t | 07:55 |
wallento | sounds awful to debug | 07:56 |
stekern | it's... interesting =) | 07:56 |
wallento | debugging is learning ;) | 07:57 |
stekern | I can catch the occasion where the spinlock is aqcuired for the second time, but I need to get to how it get from between the initial lock/unlock to that | 07:58 |
stekern | by my understanding, that shouldn't be able to happen under the circumstances that lock is held... but as you say, debugging is learning | 07:58 |
stekern | I've made some assumptions that wasn't true already on the way to where I am right now, so I might have made some more mis-assumptions =) | 07:59 |
maxpaln | well, panic() is called from within pcpu_dump_alloc_info() - which in itself is interesting because the C code for that routine doesn't include a call to panic() - pcpu_dump_alloc_info() is only called from within pcpu_setup_first_chunk(). I am guessing pcpu_setup_first_chunk() is expected behaviour - so whatever goes wrong must happen after the jump to this function. | 09:48 |
stekern | maxpaln: http://lxr.free-electrons.com/source/mm/percpu.c#L1141 | 09:50 |
stekern | BUG_ON => panic | 09:50 |
maxpaln | well, that is interesting - the version of percpu.c you pointed me at differs from the one in my kernel! | 09:51 |
stekern | what kernel version are you using? | 09:52 |
maxpaln | ah, no its not! | 09:52 |
maxpaln | but you have made the connection that BUG_ON equates to a jump to panic() - gotcha! | 09:52 |
maxpaln | I am a little confused though - it would seem that BUG_ON will always be called (unless ai->nr_groups == 0, which i am guessing it wouldn't be if everything were working well) which suggests BUG_ON only calls panic if the expression passed to it is true (or possibly false, but I'll assume true) | 09:55 |
maxpaln | ny idea where BUG_ON is defined? | 09:55 |
maxpaln | making more sense now - at least the BUG_ON bit | 09:59 |
ysionneau | to navigate in kernel sources lxr is very convenient : http://lxr.free-electrons.com/ident?i=BUG_ON | 10:00 |
ysionneau | http://lxr.free-electrons.com/source/include/asm-generic/bug.h#L48 < something like that | 10:01 |
maxpaln | ysionneau: you're right, that is much more convenient!! | 10:01 |
stekern | as you found out, BUG_ON means literally what it says, 'there is a bug on the condition within the ()' | 10:07 |
maxpaln | :-) yeah, sometimes SW really does do what it says on the tin! | 10:07 |
stekern | =P | 10:07 |
maxpaln | hmm, so panic gets called because gi->nr_units is not equal to upa. Since neither of these variables means much to me I am not sure I can infer much from that. ANyone got any clues? Otherwise, back to the HW... | 10:12 |
maxpaln | well, not so much equal as an even divisor... | 10:14 |
stekern | it's not an ideal place to get a crash... you could try to locate from the asm where those values are loaded and check those loads | 10:15 |
maxpaln | yeah, I was coming round to that approach... | 10:16 |
maxpaln | I do like the implication that there are some *ideal* places to get a crash! :-) | 10:16 |
stekern | well, a crash in a small function that is not inlined because of an access to a global variable, that's pretty close to ideal ;) | 10:18 |
maxpaln | :-) | 10:19 |
maxpaln | ok, I think I might be onto something | 10:46 |
maxpaln | making a few leaps of faith - I think I have identified the section of assembler that relates to that modulo expression BUG_ON(gi->nr_units % upa) | 10:48 |
maxpaln | c007e158:84 d6 00 20 l.lwz r6,0x20(r22) | 10:48 |
maxpaln | c007e15c:e0 a6 93 09 l.div r5,r6,r18 | 10:48 |
maxpaln | c007e160:e0 45 93 06 l.mul r2,r5,r18 | 10:48 |
maxpaln | c007e164:e0 c6 10 02 l.sub r6,r6,r2 | 10:48 |
maxpaln | c007e168:bc 26 00 00 l.sfnei r6,0x0 | 10:48 |
maxpaln | My spidey-sense immediately picked up the presence of the l.mul since I already have suspsicions about that | 10:49 |
stekern | mmm | 10:49 |
maxpaln | tracking through the analyser (and this bit is tricky because the register reads/writes are out of sequence) I can see that r5 and r18 are both 1 but r2 gets written with 0 | 10:49 |
maxpaln | I thought I had run a mult testbench though - maybe I should go back and check the results... | 10:50 |
stekern | at what point is r5 1? | 10:51 |
maxpaln | it gets written with 1 on the clock cycle after the l.div completes | 10:52 |
maxpaln | oh, hang-oin - this is interesting | 10:53 |
stekern | it should be 2 cycles | 10:53 |
maxpaln | yep, two clock cycles you are right | 10:54 |
maxpaln | so, this is interesting - | 10:54 |
maxpaln | at the start of the l.mul r5 is read | 10:54 |
maxpaln | and contains 0 | 10:54 |
maxpaln | then 1 clock cycle later the result of the l.div writes it with 1 | 10:55 |
stekern | yeah, but that's all expected (and accounted for) | 10:55 |
maxpaln | OK - so the l.mul operation on values of 1 and 0 really should return 0 | 10:56 |
stekern | you can't look at the register file RAM outputs, you have to look at the output to execute stage | 10:56 |
maxpaln | is that right? | 10:56 |
stekern | because the result from the div will be forwarded to the mul | 10:56 |
maxpaln | aha - ok, the output to the execute stage - let me see if I can find that. | 10:56 |
maxpaln | that would be better than trying to track the RAM outputs | 10:57 |
stekern | it's execute_rfa_o and execute_rfb_o | 10:57 |
stekern | what version/commit of mor1kx are you using? I've made pretty significant changes to this area lately | 10:58 |
stekern | (but those signals are still named the same, so that info holds | 11:00 |
stekern | ) | 11:00 |
maxpaln | hmmm, what's the best way to check - I checked it out a few weeks ago, July 29th | 11:00 |
ysionneau | cat .git/HEAD | 11:05 |
maxpaln | [root@localhost mor1kx]# cat .git/HEAD | 11:05 |
maxpaln | ref: refs/heads/master | 11:05 |
ysionneau | right :p | 11:06 |
ysionneau | git rev-parse HEAD | 11:06 |
ysionneau | sorry | 11:06 |
maxpaln | :-) - I looekd at it and thought 'that isn't what he meant' but since I don't really know my way around git I decided to offer it up anyway | 11:07 |
ysionneau | it's my fault, I should have thought about 'what if your on a branch tip' | 11:07 |
ysionneau | try git rev-parse HEAD | 11:08 |
stekern | ...or just git log | 11:08 |
ysionneau | yup | 11:09 |
maxpaln | hmmm, unusually my linux machine is doing something instead of responding to me.... | 11:09 |
maxpaln | [root@localhost mor1kx]# git log | 11:11 |
maxpaln | commit 17afdf2250c5dc4853b4032622a78c14b8dd3ec3 | 11:11 |
maxpaln | Author: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> | 11:11 |
maxpaln | Date: Tue Jul 22 16:27:42 2014 +0300 | 11:11 |
maxpaln | think it could be lunch time while I rebuild a new FPGA image | 11:12 |
maxpaln | back shortly... | 11:12 |
stekern | ok, that's before the rework | 11:15 |
stekern | so, I wouldn't expect you to find anything I might have broke then ;) | 11:15 |
maxpaln | :-) | 11:33 |
maxpaln | ok, back on it now | 11:48 |
maxpaln | so at the start of the l.mul instruction execute_rfa_o and execute_rfb_o are both 1 - which is consistent with my earelier conclusions | 11:48 |
maxpaln | and on the last clock cycle of the l.mul instruction, r2 gets written with 0 | 11:49 |
maxpaln | I am assuming the store to the destination register happens on this last clock cycle of the intruction | 11:50 |
maxpaln | oh, hang on | 11:50 |
maxpaln | misreading the lines | 11:50 |
maxpaln | r2 gets written two clock cycles after the and of l.mul - it still gets written with 0 though | 11:50 |
maxpaln | so, at this stage I am thinking this is incorrect and that the execution of the l.mul instruction is not right | 11:51 |
maxpaln | Before I dive into this, would you agree? | 11:51 |
stekern | yeah | 11:54 |
stekern | you could look at mul_result too | 11:55 |
maxpaln | coolio | 11:57 |
maxpaln | well, that was considerably more difficult than it should have been - mainly due to my own dipshittery | 14:01 |
maxpaln | but I can now see very clearly that mul_result is zero for all time | 14:01 |
maxpaln | I can see op_mul_i assert, the two operands set to 1 | 14:01 |
maxpaln | and the result stay as 0 | 14:02 |
maxpaln | so diving into how this has been implemented on our hardware I think the problem is that the format of teh RTL doesn't really map into our dedicated HW blocks very well | 14:02 |
maxpaln | it isn't clear to me why the result is always zero - I am sure there is a reason, but I guess the real question is what is the best way to resolve. | 14:03 |
maxpaln | The PIPELINED implementation doesn't really look to map that well either - the addition of the ctrl signal to gate the multiplication is unlikely to map into anything sensible in silicon. | 14:04 |
maxpaln | As a local fix I could easily add another implementation, something like LATTICE_MULT that simply instantiates a HW macro | 14:05 |
maxpaln | but its not that portable and not ideal for many reasons | 14:05 |
maxpaln | reviewing the synthesis and map logs to see if there are clues about what it is in the RTL that gets mapped incorrectly.... | 14:11 |
maxpaln | Hmmm, well at the very least this RTL looks to map inefficiently to our architecture in Synplify. A 32x32 mult should use two Multiplier slices - Synplify maps the mor1kx to 4. I suspect that somewhere in there is the root cause of my problems. | 14:34 |
maxpaln | So I think I might understand the problem - | 14:41 |
maxpaln | this line is, I think, aiming to keep the result of the MULT to the width of the operands (32 in this case, but it seems the code has at least provisioned for 64-bit): | 14:43 |
maxpaln | mul_result1 <= (mul_opa * mul_opb) & {OPTION_OPERAND_WIDTH{1'b1}}; | 14:43 |
maxpaln | I suspect that this is confusing Synplify since our multiplier allows the result to naturally grow to the full bit width of the MULT operation - 64-bit in this case (128-bit in 64-bit mode) | 14:44 |
maxpaln | By truncating the result to 32-bits using '& {OPTION_OPERAND_WIDTH{1'b1}', Synplify appears to be splitting the MULT into two operations that each produce a 32-bit result. | 14:46 |
maxpaln | This wouldn't necessarily explain the 'always zero' result but it would explain why Synplify instantiates 4 MULTs instead of 2. | 14:47 |
maxpaln | running an experiment now to test my hypothesis. | 14:47 |
stekern | if that '& {OPTION_OPERAND_WIDTH{1'b1}};' causes problems, I don't see a reason to keep it | 15:18 |
maxpaln | well, it isn't clear what the root cause is at the moment. Synplify is definitely the culprit but everything I try produces a different result to what I expect. I have discovered that the cause of the always zero result is the clock enables to the ALU section of our DSP block being tied to GND - never helpful!! But a simpler testcase based on your code didn't do the same thing! :-| | 15:20 |
stekern | fun ;) | 15:21 |
maxpaln | yep :-) | 15:28 |
maxpaln | yowser - first breakthrough | 15:52 |
maxpaln | took a while | 15:52 |
maxpaln | the implementation of mul_signed_overflow is at the heart of it | 15:52 |
maxpaln | Synplify tries to merge this into the instantiation of the DSP block but makes a mess of it and ends up having no clock enable on the ALU instance. I am wondering if the presence of the pseudo clock enable op_mul_i on the operand register is causing the issue!! | 15:53 |
maxpaln | hmm, gotta say- this is feeling like a synthesis bug rather than a mapping issue! | 16:07 |
maxpaln | ok, false alarm - misunderstanding on my part. Back on the hunt... | 16:11 |
maxpaln | actually, I've changed my mind again - I think this is a Synthesis bug. But the fact I can't make up my mind suggests I should come back to this with a fresh pair of eyes tomorrow. | 16:18 |
ysionneau | :) | 16:19 |
maxpaln | The problem appears to be that on one of the two DSP slices Synplify attaches neither a clock nor a clock enable to the ALU - this results in an all zero output almost certainly triggering the overall result of zero. I can't see how a coding style could produce this results - but a fresh look tomorrow should help. | 16:20 |
stekern | yeah, that sounds like a bug alright. | 16:22 |
stekern | but it's not like we don't have work-arounds for other bugs in the tools in the codebase ;) | 16:23 |
maxpaln | :-) yeah, I am sure I can work around it - but it's nice to have it reproduced in a small testcase. I think I'll submit it to the SW group and let them deal with SYnplify - it seems to get fixed quicker that way. I'll think about a way to work around it tomorrow... | 16:25 |
dalias | stekern, re: alignment, i think it's a mistake for any new ABI to define the alignment of any fundamental type as anything other than the largest power of two which divides the size of the type (which, for most types, is just the size of the type) | 17:05 |
dalias | (note: that's the max possible alignment) | 17:06 |
dalias | stekern, for maximum extensibility without having to change the ABI, you want the max possible alignment. | 17:06 |
dalias | unfortunately 'fixing' it for or1k looks like it would be pretty much work | 17:06 |
stekern | yes, I agree | 18:42 |
--- Log closed Thu Aug 21 00:00:16 2014 |
Generated by irclog2html.py 2.15.2 by Marius Gedminas - find it at mg.pov.lt!