IRC logs for #openrisc Wednesday, 2014-07-09

--- Log opened Wed Jul 09 00:00:11 2014
daliasstekern, re: your setjmp/longjmp, the offset in sigsetjmp.s has to match the size of __jmp_buf, not the portion that happens to be used00:22
daliasso you either need to change the size of __jmp_buf or revert that change, i think00:22
daliasgenerally we try to match the ABI glibc uses (so, use whatever __jmp_buf size it has). that's less important for archs that aren't already in widespread use, but still probably the best approach.00:24
stekerndalias: ah! I had a creeping feeling that there was something I had forgot to change!02:15
stekernI think in this case, it would almost make most sense to do the change and align glibc accordingly02:16
stekernblueCmd: do you have any opinion on that02:16
stekernbtw, I'm running libc-tests in or1ksim now, and I get failures in pthread tests there too. I'm on to finding out why.02:26
daliaswhich ones?03:36
daliasi could probably offer you some ideas03:37
stekernpretty much all of them03:43
stekerninitial observation is that when a new thread is created, the data pointed at by the tp (r10) isn't correct03:44
stekernthe particular test I'm looking at now is pthread_tsd.c03:44
stekernso, it fails in pthread_setspecific() when it's called from start()03:45
daliascould the order of the clone args be wrong?03:46
daliasstrace could show you this03:46
stekernthat would be a sensible to look into if nothing else03:47
stekernto be even more specific, this is the line that fails:
daliasit crashes there?03:49
stekernyes, tsd == 0 there03:52
stekernsorry, it's on the line above (L7) it crashes03:52
stekernno, L8. it's the write that causes a segfault03:54
stekernnot so important though, more important is to find out why tsd == 003:54
daliasprobably __pthread_self() is returning the wrong thing03:58
daliaspossibly because clone set the wrong value for the thread pointer register03:58
daliascan you tell if clone messed it up?04:09
stekernI think clone is fine, I'm suspecting that my TP_ADJ in pthread_arch is perhaps not correct04:16
stekernno, the clone args *are* screwed up04:25
stekernI'm passing tls in r6, it should be in r704:25
daliasthat's one of the classic mistakes04:26
daliasbecause the clone arg order is randomly permuted, for NO REASON, on every arch04:26
stekernyeah... it's messy04:29
daliasbtw sh is even more fun....04:32
daliaseven with the args in the right order it didn't work04:32
daliasbecause the kernel has a bug, which has been present ever since the linux sh port was first added, where it clobbers the place where the syscall arguments are stored on the stack04:33
stekernheh, why not?04:33
stekernmove that question above your answer ;)04:34
daliassomeone trying to be clever actually overlapped the pt_regs structure in the caller (the syscall entry point) with the argument slots on the stack in the callees (the functions that implement the syscalls)04:34
daliasand didn't understand that the compiler is free to clobber its incoming argument space04:35
daliasso the pt_regs structure gets clobbered04:35
stekernfun times...04:35
daliassorry i didn't state that clearly -- what i mean is that the incoming argument slots on the stack are local objects of the callee and the callee is free to clobber them at any point where it no longer needs their values04:36
daliasanyway i think the fix is upstream now... :)04:37
daliasbut apparently nobody ever even tried using clone on sh...04:37
stekernthat sounds strange04:38
daliasyeah i had a hard time believing it too, but it was all we could figure04:39
daliasand making the syscall entry point adjust the stack properly so pt_regs doesn't overlap, and add local copies of the args, suddenly made clone work !04:40
stekernok, with the args in right order a lot of the pthread tests now passes04:48
stekernstill some problem with pthread_cancel04:50
daliaswhat's failing?05:03
stekernnot sure yet, got some $dayjob work to do atm, I'll take a look at it in a while05:05
maxpalnHi, back debugging my Linux issue -07:55
maxpalnI am starting to understand the mechanism behind this a bit more -07:55
maxpalnI can see the EPCR switch to 0xC0004B14 - in the disassembled Linux binary this translates to a section in the <arch_local_irq_restore> section07:56
maxpalnWhen this instruction is executed the OR1200 Instruction Bus address reads 0x00004B14 - which makes sense, the virtual Linux address of 0xC.. becomes a normal memory address of 0x0...07:58
maxpalnit doesn't read from memory so I am guessing this is pulled from Cache07:58
maxpalnI wanted to check that the correct data is retreived - make sure it matches with the expected instruction from the Linux binary07:59
maxpalnis it a fair assumption that I can check the Instruction cache data output to see this instruction?07:59
maxpalnIt takes a little while to modify the logic Analyser code and if it is a pointless exercise I'd rather not bother...07:59
stekernyou can, but I wouldn't bother08:00
stekernthe instruction is supposed to write to SR, and it does that, so I expect it is correct08:01
stekernwhat you want to do is backtrack why the value that is getting written to SR is wrong08:01
maxpalnah, ok -08:01
maxpalnso, if I understand the assembler correctm the value written to the SR is what eventually ends up in r3.08:14
maxpalnwhich looking through the rest of the assembler is always written prior to jumping to this section of code.08:15
maxpalnso I am guessing whatever is calling this section of code has somehow set r3 incorrectly.08:16
maxpalnmore digging required08:16
maxpalnhmmm, well we jumped here in this case from a section of code called <console_unlock>08:21
maxpalnok, so we are getting to this point via a vprintk_emit call - but this isn't the first call to that function. I can see plenty of previous calls that don't trigger the same behaviour. Interestingly, there is no output on the UART throughout any of this.09:29
stekernyou can probably assume that this is an hw error, not an sw error09:32
maxpalnyes, I had come to that conclusion09:32
maxpalnnot least because I did a compare between my known working linux kernel and this one - these sections are identical09:32
stekernI would guess that a memory access goes wrong somewhere (or a bug in or1200), that makes context store/restore go wrong09:32
maxpalnThe memory controller is the biggest change since I last had this working09:34
maxpalnalthough to be fair there have been a LOT of changes. not least the switch to a whole new family of silicon09:34
maxpalnbut the device has been tested by our factory and the majority of the code remains identical09:35
maxpalnthe memory controller is entirely new - as is the DDR3 IP controller - so I guess these are the prime culprits09:35
stekernideally you want to find the spot where in the user space code r3 still holds correct value and the spot where it's wrong and then try to capture what's in between09:35
maxpalnYep, although knowing what is the 'correct' value isn't that easy :-)09:36
stekernwell, if you can find the corresponding arch_local_save_flags to that arch_local_irq_restore, you will now09:37
maxpalnaha - good tip, going back in...09:38
stekernnot userspace code, but the non-preemptied kernel code I meant up there09:40
maxpalnok, I need a little direction here -10:00
maxpalninside console_unlock there are two calls to arch_local_save_flags10:01
maxpalnso the problem could be anywhere after the first (assuming the problem doesn't happen earlier in the code).10:01
maxpalnWhat I'd like to do is trace the value of r3 and maybe a few others in HW - but I don't know where to find them or what code is involved in the context store/restore process.10:02
stekernbut aren't there also a second acompanying arch_local_irq_restore to that second arch_local_save_flags?10:05
maxpalnyes, but lets approach from the other way around - the problem occurs during execution of arch_local_irq_restore, the call to this instance of that function occurred from address 0xC004241010:12
maxpalnimmediately prior to that call to arch_local_iq_restore there isn't an arch_local_save_flags10:13
maxpalnthe previous one is still within the console_unlock function but it is way back on address 0xc004200410:14
maxpalnthere doesn't appear to be a branch to this point in the code that has an associated arch_local_store_flags10:15
maxpalnso I am concluding the call to arch_local_store_flags happens earlier in the console_unlock function.10:16
maxpalnso far so good, except there are two calls to arch_ocal_save_flags10:17
stekernah, yeah I see what you mean, console_unlock isn't trivial to read in asm10:18
maxpaln[glad it isn't just me :-) ]10:18
stekernso, a better/easier approach to debug this is probably to keep the trigger you have, and try to capture r3 as far back as you can and see if you can get a lock on the spot where it's written wrongly10:20
maxpalnyep, that was where I started to go - and then struggled to find r3 in HW10:20
maxpalnI can find the SPRS10:21
maxpalnbut I am struggling to find the GPRs10:21
stekerncapturing register values is not entirely trivial though, the registers are contained in a RAM, so you have to either implement the registers as flops, or add a tap to the register file write to r310:21
maxpalnaha - its or1200_rf.v10:22
stekernwhere they are in the source in or1200, bets me, I haven't looked at that source for at least a couple of years, and only briefly then10:22
stekernah, good you found it10:22
maxpalnaaargh, going a little crazy looking at assembler and logic analyser outputs!!13:08
maxpalnI've made some progress - I am now capturing the GPRs as well as everything else. It's relatively straight foward to track the code being executed now.13:09
maxpalnthe assignment to r3 actually happens immediately before the jump to arch_local_irq_restore13:09
maxpalnI don't use assembler often and I forgot the instruction after a jump gets executed before the jump13:09
maxpalnso rs actually gets loaded with the contents of the memory location held in r1:13:10
maxpalnc0042410:07 ff 09 bb l.jal c0004afc <arch_local_irq_restore>13:10
maxpalnc0042414:84 61 00 00 l.lwz r3,0x0(r1)13:10
maxpalnIncidentally it gets assigned: 0x8701FFEC13:10
maxpalnwhich ultimately gets OR'ed with r4 to form our problematic SR value.13:11
maxpalnr4 is generated from a manipulation of the original SR value13:12
-!- guilherme is now known as Guest9401713:13
maxpalnbut as you can gather - this is more 'understanding' progress than 'solving' progress %-)13:13
-!- Guest94017 is now known as guilhermeluz13:13
maxpaln[and I'm trying to keep myself motivated by telling myself this will all be useful in the end!]13:14
maxpalnok, this has made my head hurt enough for one day - back tomorrow! Adios all.13:27
-!- Netsplit *.net <-> *.split quits: blueCmd, arokux, jeremy_bennett, olofk13:48
-!- Netsplit over, joins: arokux13:54
olofk_stekern: Thanks for merging all those pull requests20:38
stekernnp, I saw that you acked one of them and mentioned vacation, so I just swept all that made sense to me when I was at that20:40
stekerndalias: so, this is where I'm at debugging the pthread_cancel.c test: the child from this never returns from sys_clone20:48
stekernwhich makes this loop forever:
daliasstekern, child is not supposed to return from clone20:59
stekernno, not from clone, I mean from the clone syscall21:00
olofk_Could someone with a quartus installation convert this file to hdl for me?
daliasstekern, how does the syscall fail to return??21:59
--- Log closed Thu Jul 10 00:00:12 2014

Generated by 2.15.2 by Marius Gedminas - find it at!