IRC logs for #openrisc Tuesday, 2014-07-15

--- Log opened Tue Jul 15 00:00:20 2014
-!- Netsplit *.net <-> *.split quits: mboehnert1, ssvb_, jungma02:42
-!- Netsplit over, joins: ssvb_02:42
-!- Netsplit *.net <-> *.split quits: clopez, _franck_, jonmasters, rah, jungma, javax, xlro, jeremy_bennett, dalias, mboehnert03:12
-!- Netsplit over, joins: jeremy_bennett, rah, _franck_, xlro, javax, jonmasters, dalias, clopez, mboehnert03:20
stekernblueCmd_: btw, since you've become such a git advocate, you should embrace it fully and start using present tense in your commit messages ;)03:25
stekernoh, and regarding the master->slave change, I think my notion was a mistake there actually03:32
stekernthat said, I've heard that the master and slave notion can be offensive to some people, so I suggest we change all occurrancies of 'master' to 'wife' and 'slave' to 'husband'03:33
stekerns/occurrancies/occurrences03:35
stekerndalias: when you have a minute, I'd like to pick your brain on this pthread_robust issue I've been investigating.05:27
stekernthe gist of it is that from what I understand, musl ends the robust_list with a null pointer. but the kernel expects the list to point back to itself in: http://lxr.free-electrons.com/source/kernel/futex.c#L291005:28
daliasstekern, sure05:30
stekernon x86_64 (for example), this is 'fine', since the get_user will silently exit exit_robust_list on a null pointer.05:30
stekernbut on or1k, the null pointer will be de-referenced05:30
dalias...05:30
daliaswhy is the null pointer dereferenced?05:31
stekernthis is at least what I have made out of it, I might be misunderstanding things...05:31
daliasi mean why does the kernel have a mapped zero page?05:31
daliasthat's a source of endless problems, many security-critical05:31
stekernwell, that's another issue... but our executables get mapped to page 005:31
dalias...05:31
stekernheh05:32
stekernyeah, it's a problem, but let's look past that for a second? ;)05:34
daliasDocumentation/robust-futex-ABI.txt says it stops scanning when a pointer in the linked list is not a valid userspace word address05:34
daliasit doesn't say anything about pointing back at itself05:35
stekernhmm, yeah... so the problem is really that we have a mapped zero page05:40
daliasmapped in userspace?05:40
daliasi could understand if there are arch-specific reasons you have a zero page mapped in kernelspace05:40
daliasbut zero page should absolutely never be mapped in userspace05:40
stekernright05:41
stekernso, let's look into fixing that ;)05:44
stekernanyway, I can consider that issue closed as far as the or1k musl port goes, that's what I wanted to conclude first.05:49
stekerndalias: once again, thanks for the help05:51
daliasok05:53
stekernnext - ipc failures06:18
daliasprobably wrong struct definitions in the bits headers06:19
daliasthat's a common issue in new ports06:19
daliasthese structs are the '64' versions06:19
stekernok, thanks for the hint, will look into that06:21
daliasgoing to sleep now, you can leave a message for me tho if you have any questions06:28
stekernnight06:42
stekernok, that was easy. we're not selecting ARCH_WANT_IPC_PARSE_VERSION06:51
stekernso, the IPC_64 wasn't masked out from the command06:52
stekernblueCmd_: I bet you don't have anything against this? http://pastie.org/939230408:15
jtdesousawhile running fusesoc with icarus and then using openocd to connect gdb we found that if the program is not loaded by fusesoc then gdb can load it but continuing never ends;09:21
jtdesousaif the program is loaded initially by fusesoc then it can be reloaded by gdb and continued.09:21
jtdesousaWith Verilator this issue is not present.09:22
blueCmd_stekern: nope, what does it do?09:33
blueCmd_stekern: do you have an example of present tense and where I use past tense?09:33
stekernblueCmd_: it moves our executables out of page zero09:33
blueCmd_stekern: +109:34
stekernblueCmd_: present tense: add <feature> past tense: added <feature>09:34
blueCmd_stekern: ack, input has been accepted. warning: changed configuration will only take effekt from now on and not affect past commits09:36
stekernheh09:37
blueCmd_on a related note: affect/effect is hard09:38
stekernyes, I can never tell which one should be used09:38
stekernmixing those up is probably minor to all my other mistakes though ;)09:39
stekernI like how you threw a swedificated 'effekt' in the mix there too! =P09:39
_franck__I'm not good at writing English but at least I know how to use affect/effect since it's mostly the same in French ;)09:40
blueCmd_stekern: aw crap09:41
blueCmd_something takes effect, but is being affected09:41
blueCmd_I think that's correct09:41
_franck__it is (at least AFAIK)09:42
blueCmd__franck__: I have a french 2m from he. yesterday he came dressed in striped cloths and had a baugget with him09:43
_franck__if one day you leasr French, you'll be able to use affect/effect :)09:43
blueCmd_(spelling is probably totally off 'baugget' - looks horrible)09:43
_franck__s/leasr/speak09:43
stekernbaguette09:44
stekernfinnish-swedish people call that a 'batong'09:44
_franck__yes baguette09:44
blueCmd_well I had the letters right anyway09:44
stekern...which means 'baton' in swedish09:45
stekernbut they just took the finnish word 'patonki' and turned it a bit swedish09:46
stekern'patonki' doesn't mean anything else than baguette (afaik)09:47
blueCmd_olofk: you spoke earlier about the need to refactor wb_intercon_gen - I agree, but I think you should in that case focus on the tool being standalone just as it is now, and make it just as easy to use without another tool. then your GUI graphical "point-and-drag" interface would just write the config for wb_intercon_gen09:55
-!- Netsplit *.net <-> *.split quits: clopez, _franck_, jonmasters, rah, javax, xlro, mafm, dalias, mboehnert, jeremy_bennett, (+1 more, use /NETSPLIT to show all of them)09:56
stekern+109:56
blueCmd_what I love with wb_intercon_gen is that it's easy to interop it in whatever workflow I have, input is one file, output is two - you could probably prove it mathematically if you really put your mind to it10:00
-!- Netsplit over, joins: maxpaln10:00
-!- Netsplit over, joins: _franck_, mafm10:01
daliasstekern, if or1k does not need IPC_64 we should just def that macro to 010:14
daliasno need for kernel change10:14
stekerndalias: you mean if we don't need to support IPC_OLD? ok, makes sense10:20
daliasif there's only one version of the ipc structs for the arch and thus IPC_64 is not needed to get the correct version, IPC_64 can just be defined to 010:21
daliaslike it is on x86_6410:21
stekernyeah, I see10:22
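
A minimal sketch of the IPC_64 handling discussed above, assuming the flag and command values from the Linux uapi headers (IPC_64 is 0x0100, IPC_STAT is 2). The helper names `parse_version` and `libc_cmd` are hypothetical and only illustrate the idea: if the kernel side does not mask the flag out, the command no longer matches; if the libc defines IPC_64 to 0, there is nothing to mask.

```c
/* Values as found in the Linux uapi IPC header. */
#define IPC_STAT 2
#define IPC_OLD  0
#define IPC_64   0x0100

/* Hypothetical stand-in for kernel-side version parsing: strip the
 * IPC_64 flag from the command and report which struct layout the
 * caller asked for. */
static int parse_version(int *cmd)
{
    if (*cmd & IPC_64) {
        *cmd &= ~IPC_64;
        return IPC_64;
    }
    return IPC_OLD;
}

/* Hypothetical libc-side helper: the command as it is handed to the
 * kernel. Passing ipc_64_flag = 0 models a libc that defines IPC_64
 * to 0 because the arch has only one version of the structs. */
static int libc_cmd(int cmd, int ipc_64_flag)
{
    return cmd | ipc_64_flag;
}
```

With the flag set, `libc_cmd(IPC_STAT, IPC_64)` yields 0x0102, which only equals IPC_STAT again after `parse_version` has run; with the flag defined to 0 the command arrives as plain IPC_STAT.
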
daliasdoes that resolve the remaining test failures?10:45
stekernoh, there's more, but it does resolve the ipc failures10:46
daliasah :)10:46
stekernsem_open fails, but I don't even have /dev/shm so it's no wonder. then there's some socket failures, but I think that's due to my kernel config in the test setup10:51
stekernthe stat test fails too10:51
stekernand then there's still a bunch of tests I haven't run yet10:52
blueCmd_olofk: I split the submodules pull request now10:58
blueCmd_that was much harder than I thought it would be. git does not like read HEAD~1 if it contains submodule changes10:58
daliasstat failing may be a wrong struct definition10:58
daliassem_open is probably just the missing /dev/shm mount10:59
maxpalnI've fixed the last bug (aren't they always so easy to fix once you've spent FOREVER finding them!)12:37
maxpalnI am now trying to track down why Linux isn't producing any output on the console12:37
maxpalnI can see that after a short period it is getting stuck in the last few instructions of the <die> function - obviously not good!12:38
maxpalnI've traced it back - it appears to be called from within <do_unaligned_access>12:38
blueCmd_yeah, that's bad :P12:38
blueCmd_don't misalign your data :)12:39
maxpaln:-) thanks12:39
maxpalnI'm tracing back through to find the original cause of this - but I am a little surprised these cause a terminal failure. I would have expected a straight forward exception for this...12:39
maxpalnmaybe there's a clue right there...12:39
blueCmd_well, linux has exception handlers for unaligned access12:40
maxpalnah, I see - so the kernel caught the exception instead of the processor12:42
maxpalnmakes sense I guess12:42
blueCmd_well no, the CPU jumps to 0xa00 (IIRC) on unaligned access, and linux has code there that handles that12:44
_franck__http://lxr.free-electrons.com/source/arch/openrisc/kernel/entry.S#L31312:44
maxpalnAh, I see - ok that makes even more sense12:44
maxpalnas an aside, I can see the printk code getting executed as part of the die process (during show_registers) but I am not getting any activity on the UART console12:45
maxpalnNot being a Linux expert I have no idea if I really should be able to see UART output (which would make debug a LOT easier) or whether the Linux kernel needs to get further into the boot process to be able to output over the console.12:46
maxpalnanyway, I'll keep searching.12:46
blueCmd_I feel I don't have enough state to guide you on this, it's weird that you get alignment errors in Linux unless you modified the kernel or put the UART on some odd address12:46
ysionneauwaaa there is a #if 0 in the upstreamed code?12:47
maxpalnwell, I guess it is possible the write to the UART is causing the alignment problem - but I can't be conclusive on this.12:48
maxpalnI am essentially debugging the OR1200 processor with a new DDR3 memory controller - written by me using our DDR3 IP.12:49
maxpalnso it is possible there is a bug in the memory controller - I have already found one of those.12:49
maxpalnbut I have also tried to strip back the kernel so it doesn't do too much in the way of peripheral loading -12:49
blueCmd_maxpaln: right, so in that case I would put my money on a stack restore getting the wrong register contents12:50
blueCmd_and then loading an address from there ends up using an address that's not aligned12:50
blueCmd_I've had that issue myself12:50
maxpalnok - that makes sense.12:50
maxpalnThe last bug was something similar actually12:50
blueCmd_ysionneau: yes, ugly - I agree12:51
maxpalnBut I don't think any of this gets me much closer to finding the root cause :-(12:51
blueCmd_maxpaln: for this I have written a newlib 'diagnostics' program that does a bunch of memory operations to test stuff like that12:51
maxpalnoooh,12:52
blueCmd_it's quite board specific, but it might be easier to drill down using something like that12:52
maxpalnalthough without a console I am not sure how I would get at the output!!!!12:52
blueCmd_maxpaln: right12:52
blueCmd_writing this in assembler wouldn't be hard though12:52
blueCmd_avoiding memory read/writes for UART12:52
maxpalnI'll continue on this route for now - I'm still making something that feels like progress12:53
blueCmd_maxpaln: let me know if you want to hack together a memory write / readback test for you12:53
blueCmd_that won't use memory for uart12:53
maxpalnok, thanks12:53
maxpalnIt is worth mentioning that the UART does work - I have a simple 'Hello World' program that prints to the UART - it works fine.12:54
blueCmd_yes, i would imagine so12:54
maxpaln:-)12:54
maxpalnbut it at least gives me confidence the HW works :-)12:54
maxpalnroughly12:54
blueCmd_what's your sys_clk speed and what baud are you using?12:55
maxpalnclock of 50 MHz, baud of 11520012:55
blueCmd_where is the uart? 0x90000000 ?12:55
maxpalnYep12:56
maxpalnI was just looking at the Linux assembler - I can see the alignment exception handler at 0x60013:07
maxpalnbut it doesn't seem to call the code that is being executed in my code13:07
maxpalnin fact the do_unaligned_access() function is only called from _alignment_handler()13:08
maxpalnbut nothing seems to call alignment handler - which puzzles me a lot13:08
maxpalnor rather, there are no jumps to _alignment_handler() or any code within it in the assembler!!13:09
_franck___alignment_handler is "called by the hardware". It's located at the exception vector. When an alignment error happens, the PC jumps to 0x600 automatically13:13
_franck__you have to look at the pc before it jumps to 0x60013:13
daliasstekern, btw what is the reason the zero page is mapped, and is it supposed to be mapped/visible to userspace?13:14
daliasperhaps i would have some ideas for fixing it if i knew why..13:14
maxpaln_franck__: thanks - although I don't think I quite follow the sequence. the Linux kernel has a section of code at 0x600 for alignment exceptions13:15
maxpalnbut there is a separate section of code at 0xc00053fc called _alignment_handler13:15
maxpalnthis is where my code ends up before entering die() and getting stuck in a loop13:16
maxpalnI was hoping to track back through the assembler to see what instruction(s) could jump to _alignment_handler() but there aren't any!!13:16
blueCmd_maxpaln: yes, the code at 0x600 is only a trampoline that jumps to the other function13:17
stekerndalias: probably for hysterical raisins13:17
blueCmd_maxpaln: or, maybe not - but that's how I thought it was13:18
maxpalnblueCmd_: I don't see anything in the code at 0x600 that could take it anywhere outside that block of code13:18
stekernbut there's no reason we should map our executables to page zero from now on13:18
maxpalnbut I don't doubt you are correct - it must be my understanding.13:18
daliasstekern, oh, it's not a kernel matter but just the ELF headers requesting an address of 0 ?13:19
stekerndalias: exactly13:19
daliasi'm surprised that would even be honored -- i thought 0 meant (like for mmap) assigned-by-kernel13:19
blueCmd_maxpaln: http://lxr.free-electrons.com/source/arch/openrisc/kernel/head.S#L32713:19
blueCmd_and that uses http://lxr.free-electrons.com/source/arch/openrisc/kernel/head.S#L124 to jump13:20
daliasalso, normally the kernel has an option to refuse mapping the zero page or any low pages...13:20
stekerndalias: maybe the actual issue is deeper, that the kernel allows that. but we kind of have to allow that for old binaries then.13:21
maxpalnhmmm, ok - I was approaching from a different angle. I disassembled the linux kernel and was reviewing the assembler. It isn't a straight forward comparison %-)13:22
daliaswell if the kernel doesn't have any explicit code to stop robust list processing at a null pointer, i think you either need to add this to the generic, arch-independent code as an explicit check...13:23
stekernanyway, I have a patch for the linker script in binutils, I'll push that upstream asap. that should at least catch null pointer dereferencing...13:23
dalias...or make the or1k futex atomics the kernel uses explicitly check for null and emulate a fault13:24
stekernand with that, the pthread_robust test passes (as expected)13:24
daliasideally the text address should be >64k or so13:25
maxpalnblueCmd_: but like I say, I am sure you are right - this is a weak area for me!13:25
dalias(vm.mmap_min_addr is recommended to be 64k or higher)13:25
stekernsem_open fails with /dev/shm present as well, but the reason for that seems to be stat related too, so I think that failure is what I should look into now13:26
daliasyes13:26
blueCmd_maxpaln: the way it uses to jump to the symbol wouldn't show up in a disassembler13:26
blueCmd_maxpaln: since we split it by hi/lo the disassembler wouldn't show you a text reference13:26
stekerndalias: ah, ok. I have to read into that more, I adjusted the text address to start at 0x2000 (or1k pages are 8192), maybe I should increase that even more then.13:26
daliasbasically there are lots of potential kernel vulns if the kernel accidentally dereferences a null pointer and the deref doesn't fault13:27
blueCmd_maxpaln: if you look at your 0x600 in your elf you should see code where the last four instructions should be loading the address, one l.mtspr and an l.rfe13:27
daliasand you want mmap_min_addr to be larger than any offset the kernel might happen to access13:28
daliasthe kernel folks recommend 64k or so and they probably have their reasons, but they might just be building in extra padding for safety13:28
maxpalnblueCmd_: yep, I see that - ah, and now you've pointed it out I can see that's what the code is doing.13:29
daliasstekern, i think i found your stat problems13:29
daliassee include/uapi/asm-generic/stat.h13:30
daliasyou have the st_ino in the wrong place and various wrong padding13:31
stekernyup, I see13:33
dalias(stat64 is the relevant one)13:34
stekernlooks like I just have copied that from microblaze and then forgot to change it :/13:34
stekernI think I'll go through the rest of the bits/ files tonight, might be more of that kind...13:37
dalias:)13:37
stekerngot to run now, bbl13:37
daliask13:37
blueCmd_maxpaln: how big is your memory?13:41
blueCmd_(how is the map? 0x0 - 0x????)13:42
maxpalnblueCmd: the memory itself is very big - its a DDR module so at least 1GB, I would have to check13:50
maxpalnbut I guess you are asking how much memory is available to Linux...13:50
maxpalnit's in the dts file as:13:51
maxpalnmemory@0 {13:51
maxpalndevice_type = "memory";13:51
maxpalnreg = <0x00000000 0x02000000>;13:51
maxpaln};13:51
maxpalnI should probably expand this but this should be enough13:51
blueCmd_http://openrisc.debian.net/tmp/maxpaln/13:53
blueCmd_that should write all addresses between 0x00000000 0x02000000 and read it back13:53
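
A rough C sketch of the write/readback idea, not the linked mem_stress.S itself (that is assembler and avoids using RAM for its UART output). The function name and the XOR pattern are assumptions made for illustration; on the board the range would be the 0x00000000 to 0x02000000 region from the dts, while on a host it can be pointed at an ordinary buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Write an address-derived pattern to every word in the range, then
 * read everything back. Returns the number of words that did not read
 * back as written. The pattern is a hypothetical choice. */
static size_t mem_readback_errors(volatile uint32_t *base, size_t words)
{
    size_t errors = 0;

    /* Pass 1: fill the whole range before checking anything, so that
     * aliasing between addresses has a chance to show up. */
    for (size_t i = 0; i < words; i++)
        base[i] = (uint32_t)(i * sizeof(uint32_t)) ^ 0xA5A5A5A5u;

    /* Pass 2: read back and compare against the expected pattern. */
    for (size_t i = 0; i < words; i++)
        if (base[i] != ((uint32_t)(i * sizeof(uint32_t)) ^ 0xA5A5A5A5u))
            errors++;

    return errors;
}
```

Deriving the pattern from the word's offset means a stuck or swapped address line produces mismatches, which a constant fill value would not catch.
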
maxpalnblueCmd_: Thanks - taking a look now14:05
blueCmd_the code is at https://github.com/bluecmd/mexiko/blob/master/src/bootrom/mem_stress.S if you feel like modifying it14:06
maxpalnso can I use the .v as the ROM code?14:06
blueCmd_maxpaln: I use it as rom code yes14:06
maxpalnSo presumably I should look out for EPCR changes against this code14:07
blueCmd_I place it in a rom at 0xf0000000 so that line 64 is at 0xf000010014:07
blueCmd_this code doesn't do any exception handling14:07
maxpalnthat works, my ROM is at 0xF0000000 too14:08
blueCmd_it doesn't do stacks either, it just reads/writes RAM and outputs progress on UART14:08
maxpalnah, ok - well that would be a good start14:08
maxpalnOut of interest, How does the .bin version work? Presumably you'd need to set the address range to be outside the program storage of the OR1200.14:09
maxpalndoes the code check if the read matches the write?14:09
maxpalnI will build this shortly and test it out :-)14:10
blueCmd_maxpaln: the bin and .v is linked to be loaded on 0xf000010014:10
maxpalnah, ok - I see - the .bin is the precursor to .v - makes sense14:11
blueCmd_as in https://github.com/bluecmd/mexiko/blob/master/src/bootrom/bootrom.ld14:11
blueCmd_maxpaln: yes, sort of. i use the ELF as source of truth, but I didn't know if you wanted it in verilog or binary blob14:11
maxpalnverilog will do fine :-)14:11
maxpalnok, I think I need a little help on this one - I have traced back my problem14:55
maxpalnit occurs because of a misaligned access to memory14:56
maxpalnit occurs when an instruction in the enable_mmu function tries to load a word from memory using the following instruction:14:58
maxpalnc01ba174:84 79 00 00 l.lwz r3,0x0(r25)14:58
maxpalnThe problem is that r25 at this stage contains 0xFF14:58
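
As a small illustration of why 0xFF trips the alignment exception on a word load: a 32-bit load needs an address whose low two bits are zero. The helper name below is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 32-bit word access is aligned when the address is a multiple of 4,
 * i.e. its two lowest bits are clear. */
static bool is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}
```

0xFF has both low bits set, so it fails this check, while 0x0 or 0x100 would pass it.
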
maxpalnI traced all the way back through the Linux boot and there is no previous assignment to r25 - the closest is in _start14:59
maxpalnwhen all the registers are initialised - except r2514:59
maxpalnc01ba064:e2 e0 00 04 l.or r23,r0,r014:59
maxpalnc01ba068:e3 00 00 04 l.or r24,r0,r014:59
maxpalnc01ba06c:e3 40 00 04 l.or r26,r0,r014:59
maxpalnc01ba070:e3 60 00 04 l.or r27,r0,r014:59
maxpalnIs this an error or deliberate - is r25 meant to contain something sensible at this stage?15:00
maxpalnoh, hang on15:00
maxpalnI was a little premature -15:00
maxpalnr25 is initialised at the top of _start()15:00
maxpalnsorry - my bad!15:01
maxpalngoing back in...15:01
maxpalnok, I can see the problem now15:09
maxpalnalthough I am really confused about the real root cause15:09
maxpalnthe Linux boots by running through the handful of instructions at 0x100:15:09
stekernmaxpaln: r25 isn't expected to hold a sensible value, but r315:09
maxpalnc0000100:19 e0 c0 1b l.movhi r15,0xc01b15:10
maxpalnc0000104:a9 ef a0 00 l.ori r15,r15,0xa00015:10
maxpalnc0000108:19 a0 40 00 l.movhi r13,0x400015:10
maxpalnc000010c:e1 ad 78 00 l.add r13,r13,r1515:10
maxpalnc0000110:44 00 68 00 l.jr r1315:10
maxpalnc0000114:15 00 00 00 l.nop 0x015:10
maxpalnThis sends it to _start() - which has this as its first instruction:15:10
maxpalnc01ba000 <_start>:15:10
maxpalnc01ba000:e3 20 18 04 l.or r25,r0,r315:10
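
The arithmetic of the entry sequence at 0x100 can be sketched in C. The helper names are hypothetical; the only assumptions are that l.movhi fills the upper 16 bits, l.ori ORs in a zero-extended 16-bit immediate, and l.add wraps at 32 bits.

```c
#include <stdint.h>

/* l.movhi rD,hi followed by l.ori rD,rD,lo gives rD = (hi << 16) | lo. */
static uint32_t movhi_ori(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* l.add on 32-bit registers: the result wraps modulo 2^32. */
static uint32_t add32(uint32_t a, uint32_t b)
{
    return a + b;
}
```

Here `movhi_ori(0xc01b, 0xa000)` is 0xc01ba000, the linked address of _start, and adding 0x40000000 wraps it to 0x001ba000, which is presumably the address of _start before the MMU is enabled. The same hi/lo split is why a disassembly shows no direct text reference to the jump target.
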
stekernit should contain a pointer to fdt15:10
stekern(or 0)15:10
maxpalnyeah, well r3 is never initialised from 0x100 onwards15:11
stekernthe kernel entry is kernel(void *fdt)15:11
maxpalnso that line in _start just takes whatever happens to be in r3 and puts in r2515:11
stekernyes, but you "call" the kernel from somewhere and pass arguments to it using the regular function arg passing ABI15:12
maxpalnstekern: I don't think I am getting that far - I can see the processor booting in the ELA15:12
maxpalnit goes through the ROM code loading from SPI to RAM15:12
maxpalnjumps to 0x10015:12
maxpalnexecutes the above code15:12
maxpalnand jumps to _start()15:12
stekernyes, r3 is expected to hold a sensible value already at the first line of code in the kernel15:13
stekerni.e. the fdt pointer is passed from a bootloader15:13
maxpalnAh, so the problem is in the ROM code - which leaves r3 holding a random value - probably the last byte of code from the SPI Flash15:13
maxpalnso what value should really be in r3?15:13
stekernsince you're not passing a pointer to fdt, 0 should be fine15:14
maxpalnok - (although I am not sure what the fdt pointer would be)15:14
stekernactually, any word aligned address that is accessible will do15:14
maxpalnok - well I'll try 0x015:14
stekernfdt = a compiled device tree blob15:14
maxpaln:-)15:15
maxpalnFor reference, this is the standard bootrom.S code that I am using - I am kinda surprised I didn't run into this problem before15:15
stekerni.e. you can feed the kernel a dynamic device tree configuration from a bootloader, instead of building it into the kernel itself15:15
maxpalnI must have gotten lucky and had the final byte in the SPI Flash contain something like 0x0015:16
maxpalnIt would be a sensible addition to the ROM code to reset all the registers it uses I guess15:16
stekernyeah, as I said, any accessible word-aligned value will "work"15:16
maxpalnstakern: ah I see now - during enable_mmu the contents of r25 (formerly r3 at the time of jumping to 0x100) become the reference to the fdt15:20
maxpalnhmmm, the bootrom.S really should include a reset of r3 in that case!! I wonder if the latest sample bootrom.S does this...15:20
maxpaln(ooops sorry - typo on your name stekern!)15:21
maxpalnWOOOOOOOOHOOOOOOOOOOO! After several months of work I have Linux booting on our new silicon!!! :-)17:44
maxpalnish - I am getting boot dialog on the UART  - but it is pausing mid boot. Something to debug tomorrow - I am very happy though!17:45
stekernhah, the stat failure was just silly... I had the wrong date set..20:32
stekernthe sem_open failure was however due to the wrong stat struct20:33
stekerndalias: when I started to look in to that, I remembered why I hadn't changed anything in it. I copied the microblaze one, and checked that microblaze use asm-generic/stat.h. Then I obviously got lazy and didn't check that it actually matched asm-generic/stat.h.20:36
stekernso I would assume that microblaze's bits/stat.h isn't correct either20:37
-!- Netsplit *.net <-> *.split quits: fotis2, rjo, trevorman21:07
-!- Netsplit over, joins: trevorman21:07
-!- Netsplit over, joins: fotis221:08
blueCmd_mboehnert: gratz! :)21:19
stekernall functional tests pass now, and all but malloc-brk-fail regression tests pass23:33
stekerna bunch of math tests fail though23:34
stekernthat's for tomorrow... zzz23:34
--- Log closed Wed Jul 16 00:00:21 2014
