--- Log opened Wed Jul 09 00:00:11 2014
dalias | stekern, re: your setjmp/longjmp, the offset in sigsetjmp.s has to match the size of __jmp_buf, not the portion that happens to be used | 00:22 |
---|---|---|
dalias | so you either need to change the size of __jmp_buf or revert that change, i think | 00:22 |
dalias | generally we try to match the ABI glibc uses (so, use whatever __jmp_buf size it has). that's less important for archs that aren't already in widespread use, but still probably the best approach. | 00:24 |
stekern | dalias: ah! I had a creeping feeling that there was something I had forgotten to change! | 02:15 |
stekern | I think in this case, it would almost make most sense to do the change and align glibc accordingly | 02:16 |
stekern | blueCmd: do you have any opinion on that? | 02:16 |
stekern | btw, I'm running libc-tests in or1ksim now, and I get failures in pthread tests there too. I'm on to finding out why. | 02:26 |
dalias | which ones? | 03:36 |
dalias | i could probably offer you some ideas | 03:37 |
stekern | pretty much all of them | 03:43 |
stekern | initial observation is that when a new thread is created, the data pointed at by the tp (r10) isn't correct | 03:44 |
stekern | the particular test I'm looking at now is pthread_tsd.c | 03:44 |
stekern | so, it fails in pthread_setspecific() when it's called from start() | 03:45 |
dalias | could the order of the clone args be wrong? | 03:46 |
dalias | strace could show you this | 03:46 |
stekern | that would be a sensible thing to look into, if nothing else | 03:47 |
stekern | to be even more specific, this is the line that fails: https://github.com/skristiansson/musl-or1k/blob/master/src/thread/pthread_setspecific.c#L8 | 03:47 |
dalias | it crashes there? | 03:49 |
stekern | yes, tsd == 0 there | 03:52 |
stekern | sorry, it's on the line above (L7) it crashes | 03:52 |
stekern | no, L8. it's the write that causes a segfault | 03:54 |
stekern | not so important though, more important is to find out why tsd == 0 | 03:54 |
dalias | probably __pthread_self() is returning the wrong thing | 03:58 |
dalias | possibly because clone set the wrong value for the thread pointer register | 03:58 |
stekern | yes | 04:00 |
dalias | can you tell if clone messed it up? | 04:09 |
stekern | I think clone is fine, I'm suspecting that my TP_ADJ in pthread_arch is perhaps not correct | 04:16 |
stekern | no, the clone args *are* screwed up | 04:25 |
stekern | I'm passing tls in r6, it should be in r7 | 04:25 |
dalias | :) | 04:26 |
dalias | that's one of the classic mistakes | 04:26 |
dalias | because the clone arg order is randomly permuted, for NO REASON, on every arch | 04:26 |
stekern | yeah... it's messy | 04:29 |
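The r6/r7 mix-up above can be sketched as a mapping from clone arguments to syscall argument registers. This is only an illustration under two assumptions not stated in the log: that or1k Linux passes syscall arguments in consecutive registers starting at r3, and that the kernel expects the generic clone order (flags, stack, parent_tid, child_tid, tls). Both are consistent with tls belonging in r7. The helper name and the argument labels are hypothetical.

```python
# Assumption: syscall arguments go in consecutive registers starting at r3.
FIRST_ARG_REG = 3

# Assumption: generic clone argument order, with tls last.
GENERIC_ORDER = ["flags", "stack", "parent_tid", "child_tid", "tls"]

# An order used by some other architectures, with tls before child_tid.
SWAPPED_ORDER = ["flags", "stack", "parent_tid", "tls", "child_tid"]


def arg_registers(order, first_reg=FIRST_ARG_REG):
    """Map each clone argument name to the register that would carry it."""
    return {name: f"r{first_reg + i}" for i, name in enumerate(order)}


generic = arg_registers(GENERIC_ORDER)
swapped = arg_registers(SWAPPED_ORDER)

print("generic order: tls in", generic["tls"])
print("swapped order: tls in", swapped["tls"])
```

Using the swapped order on an architecture that expects the generic one puts tls where child_tid should be, so the new thread would start with a wrong thread pointer.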
dalias | btw sh is even more fun.... | 04:32 |
dalias | even with the args in the right order it didn't work | 04:32 |
dalias | because the kernel has a bug, which has been present ever since the linux sh port was first added, where it clobbers the place where the syscall arguments are stored on the stack | 04:33 |
stekern | heh, why not? | 04:33 |
stekern | move that question above your answer ;) | 04:34 |
dalias | someone trying to be clever actually overlapped the pt_regs structure in the caller (the syscall entry point) with the argument slots on the stack in the callees (the functions that implement the syscalls) | 04:34 |
dalias | and didn't understand that the compiler is free to clobber its incoming argument space | 04:35 |
dalias | so the pt_regs structure gets clobbered | 04:35 |
stekern | fun times... | 04:35 |
dalias | sorry i didn't state that clearly -- what i mean is that the incoming argument slots on the stack are local objects of the callee and the callee is free to clobber them at any point where it no longer needs their values | 04:36 |
stekern | right | 04:37 |
dalias | anyway i think the fix is upstream now... :) | 04:37 |
dalias | but apparently nobody ever even tried using clone on sh... | 04:37 |
stekern | that sounds strange | 04:38 |
dalias | yeah i had a hard time believing it too, but it was all we could figure | 04:39 |
dalias | and making the syscall entry point adjust the stack properly so pt_regs doesn't overlap, and add local copies of the args, suddenly made clone work ! | 04:40 |
stekern | ok, with the args in the right order a lot of the pthread tests now pass | 04:48 |
stekern | still some problem with pthread_cancel | 04:50 |
dalias | what's failing? | 05:03 |
stekern | not sure yet, got some $dayjob work to do atm, I'll take a look at it in a while | 05:05 |
dalias | *nod* | 05:07 |
maxpaln | Hi, back debugging my Linux issue - | 07:55 |
maxpaln | I am starting to understand the mechanism behind this a bit more - | 07:55 |
maxpaln | I can see the EPCR switch to 0xC0004B14 - in the disassembled Linux binary this translates to a section in the <arch_local_irq_restore> section | 07:56 |
maxpaln | When this instruction is executed the OR1200 Instruction Bus address reads 0x00004B14 - which makes sense, the virtual Linux address of 0xC.. becomes a normal memory address of 0x0... | 07:58 |
maxpaln | it doesn't read from memory so I am guessing this is pulled from Cache | 07:58 |
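The address translation described above can be sketched as a fixed offset. This is a minimal sketch that assumes the kernel's linear mapping starts at virtual 0xC0000000 and maps to physical address 0, which matches the 0xC0004B14 to 0x00004B14 observation. The constant and function names are hypothetical.

```python
# Assumption: kernel linear mapping base, with physical memory starting at 0.
KERNEL_VIRT_BASE = 0xC0000000


def kernel_virt_to_phys(vaddr):
    """Translate a kernel linear-mapped virtual address to a physical one."""
    if vaddr < KERNEL_VIRT_BASE:
        raise ValueError("not a kernel linear-mapped address")
    return vaddr - KERNEL_VIRT_BASE


# The EPCR value seen on the logic analyser.
print(hex(kernel_virt_to_phys(0xC0004B14)))
```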
maxpaln | I wanted to check that the correct data is retrieved - make sure it matches with the expected instruction from the Linux binary | 07:59 |
maxpaln | is it a fair assumption that I can check the Instruction cache data output to see this instruction? | 07:59 |
maxpaln | It takes a little while to modify the logic Analyser code and if it is a pointless exercise I'd rather not bother... | 07:59 |
stekern | you can, but I wouldn't bother | 08:00 |
stekern | the instruction is supposed to write to SR, and it does that, so I expect it is correct | 08:01 |
stekern | what you want to do is backtrack why the value that is getting written to SR is wrong | 08:01 |
maxpaln | ah, ok - | 08:01 |
maxpaln | so, if I understand the assembler correctly, the value written to the SR is what eventually ends up in r3. | 08:14 |
maxpaln | which looking through the rest of the assembler is always written prior to jumping to this section of code. | 08:15 |
maxpaln | so I am guessing whatever is calling this section of code has somehow set r3 incorrectly. | 08:16 |
maxpaln | more digging required | 08:16 |
stekern | yes | 08:16 |
maxpaln | hmmm, well we jumped here in this case from a section of code called <console_unlock> | 08:21 |
maxpaln | ok, so we are getting to this point via a vprintk_emit call - but this isn't the first call to that function. I can see plenty of previous calls that don't trigger the same behaviour. Interestingly, there is no output on the UART throughout any of this. | 09:29 |
stekern | you can probably assume that this is an hw error, not an sw error | 09:32 |
maxpaln | yes, I had come to that conclusion | 09:32 |
maxpaln | not least because I did a compare between my known working linux kernel and this one - these sections are identical | 09:32 |
stekern | I would guess that a memory access goes wrong somewhere (or a bug in or1200), that makes context store/restore go wrong | 09:32 |
maxpaln | ...interesting | 09:33 |
maxpaln | The memory controller is the biggest change since I last had this working | 09:34 |
maxpaln | although to be fair there have been a LOT of changes. not least the switch to a whole new family of silicon | 09:34 |
maxpaln | but the device has been tested by our factory and the majority of the code remains identical | 09:35 |
maxpaln | the memory controller is entirely new - as is the DDR3 IP controller - so I guess these are the prime culprits | 09:35 |
stekern | ideally you want to find the spot where in the user space code r3 still holds correct value and the spot where it's wrong and then try to capture what's in between | 09:35 |
maxpaln | Yep, although knowing what is the 'correct' value isn't that easy :-) | 09:36 |
stekern | well, if you can find the corresponding arch_local_save_flags to that arch_local_irq_restore, you will now | 09:37 |
stekern | *know | 09:37 |
maxpaln | aha - good tip, going back in... | 09:38 |
stekern | not userspace code, but the non-preempted kernel code I meant up there | 09:40 |
maxpaln | ok, I need a little direction here - | 10:00 |
maxpaln | inside console_unlock there are two calls to arch_local_save_flags | 10:01 |
maxpaln | so the problem could be anywhere after the first (assuming the problem doesn't happen earlier in the code). | 10:01 |
maxpaln | What I'd like to do is trace the value of r3 and maybe a few others in HW - but I don't know where to find them or what code is involved in the context store/restore process. | 10:02 |
stekern | but isn't there also a second accompanying arch_local_irq_restore to that second arch_local_save_flags? | 10:05 |
maxpaln | yes, but let's approach from the other way around - the problem occurs during execution of arch_local_irq_restore, the call to this instance of that function occurred from address 0xC0042410 | 10:12 |
maxpaln | immediately prior to that call to arch_local_irq_restore there isn't an arch_local_save_flags | 10:13 |
maxpaln | the previous one is still within the console_unlock function but it is way back on address 0xc0042004 | 10:14 |
maxpaln | there doesn't appear to be a branch to this point in the code that has an associated arch_local_save_flags | 10:15 |
maxpaln | so I am concluding the call to arch_local_save_flags happens earlier in the console_unlock function. | 10:16 |
maxpaln | so far so good, except there are two calls to arch_local_save_flags | 10:17 |
stekern | ah, yeah I see what you mean, console_unlock isn't trivial to read in asm | 10:18 |
maxpaln | [glad it isn't just me :-) ] | 10:18 |
stekern | so, a better/easier approach to debug this is probably to keep the trigger you have, and try to capture r3 as far back as you can and see if you can get a lock on the spot where it's written wrongly | 10:20 |
maxpaln | yep, that was where I started to go - and then struggled to find r3 in HW | 10:20 |
maxpaln | I can find the SPRs | 10:21 |
maxpaln | but I am struggling to find the GPRs | 10:21 |
stekern | capturing register values is not entirely trivial though, the registers are contained in a RAM, so you have to either implement the registers as flops, or add a tap to the register file write to r3 | 10:21 |
maxpaln | aha - it's or1200_rf.v | 10:22 |
stekern | where they are in the source in or1200, beats me, I haven't looked at that source for at least a couple of years, and only briefly then | 10:22 |
stekern | ah, good you found it | 10:22 |
maxpaln | :-) | 10:23 |
maxpaln | aaargh, going a little crazy looking at assembler and logic analyser outputs!! | 13:08 |
maxpaln | I've made some progress - I am now capturing the GPRs as well as everything else. It's relatively straightforward to track the code being executed now. | 13:09 |
maxpaln | the assignment to r3 actually happens immediately before the jump to arch_local_irq_restore | 13:09 |
maxpaln | I don't use assembler often and I forgot the instruction after a jump gets executed before the jump | 13:09 |
maxpaln | so r3 actually gets loaded with the contents of the memory location held in r1: | 13:10 |
maxpaln | c0042410: 07 ff 09 bb l.jal c0004afc <arch_local_irq_restore> | 13:10 |
maxpaln | c0042414: 84 61 00 00 l.lwz r3,0x0(r1) | 13:10 |
maxpaln | Incidentally it gets assigned: 0x8701FFEC | 13:10 |
maxpaln | which ultimately gets OR'ed with r4 to form our problematic SR value. | 13:11 |
maxpaln | r4 is generated from a manipulation of the original SR value | 13:12 |
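The two disassembled words quoted above can be checked by hand. The sketch below decodes only the two OR1K instruction formats involved: l.jal (opcode 0x01, signed 26-bit word offset) and l.lwz (opcode 0x21, rD, rA, signed 16-bit immediate). It is an illustrative decoder, not a complete one, and the function names are hypothetical. Since the l.lwz sits in the delay slot of the l.jal, r3 is loaded before execution reaches arch_local_irq_restore.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(pc, word):
    """Decode the two instruction formats that appear in the trace."""
    opcode = word >> 26
    if opcode == 0x01:  # l.jal: target = pc + (signed 26-bit offset << 2)
        offset = sign_extend(word & 0x03FFFFFF, 26) << 2
        return ("l.jal", (pc + offset) & 0xFFFFFFFF)
    if opcode == 0x21:  # l.lwz rD, imm(rA)
        rd = (word >> 21) & 0x1F
        ra = (word >> 16) & 0x1F
        imm = sign_extend(word & 0xFFFF, 16)
        return ("l.lwz", rd, ra, imm)
    raise ValueError(f"opcode {opcode:#x} is not handled by this sketch")


jump = decode(0xC0042410, 0x07FF09BB)
load = decode(0xC0042414, 0x84610000)  # delay slot of the jump above

print(jump[0], hex(jump[1]))
print(load)
```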
-!- guilherme is now known as Guest94017 | 13:13 | |
maxpaln | but as you can gather - this is more 'understanding' progress than 'solving' progress %-) | 13:13 |
-!- Guest94017 is now known as guilhermeluz | 13:13 | |
maxpaln | [and I'm trying to keep myself motivated by telling myself this will all be useful in the end!] | 13:14 |
maxpaln | ok, this has made my head hurt enough for one day - back tomorrow! Adios all. | 13:27 |
-!- Netsplit *.net <-> *.split quits: blueCmd, arokux, jeremy_bennett, olofk | 13:48 | |
-!- Netsplit over, joins: arokux | 13:54 | |
olofk_ | stekern: Thanks for merging all those pull requests | 20:38 |
stekern | np, I saw that you acked one of them and mentioned vacation, so I just swept all that made sense to me when I was at that | 20:40 |
stekern | dalias: so, this is where I'm at debugging the pthread_cancel.c test: the child from this http://pastie.org/9372507#80 never returns from sys_clone | 20:48 |
stekern | which makes this loop forever: http://pastie.org/9372507#82 | 20:49 |
dalias | stekern, child is not supposed to return from clone | 20:59 |
stekern | no, not from clone, I mean from the clone syscall | 21:00 |
olofk_ | Could someone with a quartus installation convert this file to hdl for me? https://github.com/myriadrf/myriadrf-boards/blob/master/de0nano-interface/gateware/rx_acq.bdf | 21:18 |
dalias | stekern, how does the syscall fail to return?? | 21:59 |
--- Log closed Thu Jul 10 00:00:12 2014 |