IRC logs for #openrisc Sunday, 2016-08-28

--- Log opened Sun Aug 28 00:00:45 2016
SMDhome1ZipCPU: great news regarding UART, and not-so-great news about openrisc performance!02:20
SMDhome1Also, what's happening now at Google Summer of Code?03:29
ZipCPUSMDhome1: I'm not so sure I would declare the performance we measured "not so great."08:36
ZipCPUA better characterization is that it "makes sense".08:36
ZipCPUOpenRISC has no memcpy() or strcmp() instructions like the VAX did, so ... you'd expect a bit of a difference.08:37
ZipCPUIt's actually reflective of the RISC vs CISC tradeoff, and much to be expected.08:40
kc5tjaZipCPU: How so?  I was under the impression that a good RISC microarchitecture can compete with a CISC in pretty much all benchmarks.13:14
kc5tjaI guess I don't know too much about Dhrystones.13:14
ZipCPUkc5tja: We were actually comparing Dhrystone MIPS / MHz.  It's a clock independent measure of CPU speed.  Multiply it by your clock speed, and you get a measure of Dhrystone MIPS.13:15
kc5tjaWhat numbers are you getting for OR1K?13:16
ZipCPUDhrystone MIPS is a measure of your CPU speed, when compared with a VAX at 1 MHz clock speed, which is deemed to be 1 DMIPS.13:16
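The DMIPS/MHz arithmetic described in the messages above can be sketched in C. This is a minimal illustration, not code from the log: the function names are hypothetical, and the constant of 1757 Dhrystones per second for the VAX 11/780 reference machine is the commonly quoted figure, which the log itself does not state.

```c
#include <stdint.h>

/* Assumption: Dhrystones per second conventionally credited to the
 * VAX 11/780 reference machine, i.e. the definition of 1 DMIPS. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Hypothetical helper: clock-independent score from a cycle-counted run.
 * loops  = number of Dhrystone iterations executed
 * cycles = CPU clock cycles those iterations took */
double dmips_per_mhz(uint64_t loops, uint64_t cycles)
{
    if (cycles == 0)
        return 0.0; /* no measurement, avoid dividing by zero */

    /* Dhrystones per second the core would reach at a 1 MHz clock. */
    double dhry_per_sec_at_1mhz = (double)loops * 1e6 / (double)cycles;

    return dhry_per_sec_at_1mhz / VAX_DHRYSTONES_PER_SEC;
}

/* Hypothetical helper: multiply by the clock speed to get plain DMIPS. */
double dmips(double score_per_mhz, double clock_mhz)
{
    return score_per_mhz * clock_mhz;
}
```

Because the first function divides by cycles rather than by wall-clock time, two cores can be compared without knowing what clock rate either one finally reaches on a given FPGA.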
ZipCPUI'm convinced the OR1K and even the ZipCPU would beat the VAX at DMIPS alone, simply because the instruction sets are simpler and so the clock rates can be faster.13:17
kc5tjaThat much I know, but it's hard to keep it in perspective wrt other architectures.13:17
ZipCPUWell, consider this, RISC machines tend to have higher clock speeds than CISC machines, right?13:18
ZipCPUAnd CISC machines can do more logic per clock, no?13:18
kc5tjaMore logic, sure, but more useful logic remains debatable.13:19
kc5tjaThe papers describing RISC-I used VAX as their benchmark, and showed how RISC-I basically had a constant factor performance gain over VAX.13:20
ZipCPUYes ... but ... for the same clock rate?13:21
ZipCPU(BTW ... my plan is not to release OR1K's score until ORCONF ... sorry, but we can discuss hard numbers then)13:22
kc5tjaI'd have to check again; if there was a difference, it wasn't much.  Original RISCs were clocked around 12MHz, IIRC.13:22
ZipCPUHmm ... looking for numbers today, I've got Brakefield's comparisons.  Is a PDP11 or PDP8 architecture at all related to the VAX?13:24
kc5tjaPDP8 not so much, but the PDP-11 is the VAX's spiritual predecessor.13:27
ZipCPUNow, looking at Brakefield's work, there's a Spartan-6 implementation of a PDP-11 that can run at 64MHz.13:28
kc5tjaVAX in fact stands for Virtual Address eXtensions, as the original VAXes had the ability to run PDP-11 code in hardware.  The 32-bit CPU was always there of course, but VAX was to be an upgrade path from the PDP-11.13:28
ZipCPUThat implementation comes from the pdp11-34verilog project found (at least at one time) on www.heeltoe.com13:30
ZipCPUI know that I can run the ZipCPU at 80MHz on a Spartan-6, so ... there's a normalizing difference there.13:30
kc5tjaThis page: http://heather.cs.ucdavis.edu/RISC.pdf suggests that RISC-I/II software was no more than 50% bigger than an equivalent VAX program.  Put another way, the CISCness of the VAX should not contribute much to the DMIPS benchmarks.13:32
ZipCPUI disagree completely.  I got one benchmark score with no movc5 instruction, and then taught my DMA to do a multi-word move and got quite the performance boost.13:32
* ZipCPU is reading Norman Matloff's paper now13:37
* ZipCPU was just called away to lunch ...13:38
kc5tjaSure, but a RISC with unrolled loop would compete with your DMA transfer quite well.13:39
kc5tjaMove 20 or 30 words per iteration, and you reduce looping overhead to 1/20 or 1/30 of what it would be normally.  Hence, RISC code is bigger, but as fast.13:40
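The unrolling argument can be illustrated with a short C sketch. It is an assumption-laden example rather than anything posted in the channel: the function name is hypothetical, and it moves 4 words per iteration instead of 20 or 30 only to keep the listing short.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word copy, unrolled by 4. The loop test and branch run
 * once per 4 words moved instead of once per word. */
void copy_words_unrolled(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    size_t i = 0;

    /* Main body: 4 loads and 4 stores per loop-control overhead. */
    for (; i + 4 <= nwords; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Tail: the 0 to 3 words left over when nwords is not a multiple of 4. */
    for (; i < nwords; i++)
        dst[i] = src[i];
}
```

The cost is code size: the body grows with the unroll factor, which matches the "bigger, but as fast" trade-off.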
kc5tjaDMIPS/MHz is looking like a bogus benchmark.  What you want is DMIPS/total cycles executed.13:40
kc5tjahttps://en.wikipedia.org/wiki/Instructions_per_second -- that moment when you realize the 6502 is 4x faster than an equivalently clocked 68000.13:41
kc5tjaThis I *know* to be bogus, and amply demonstrates why I don't believe DMIPS/MHz.13:41
kc5tjaThe 6502 is about 1/2 the speed of a 65816 in 16-bit mode, and the 65816 only gets about 80% the performance of a 68000.13:42
kc5tja(and, yes, the 65816 has a block move instruction that takes 7 cycles per byte transferred.)13:42
kc5tjaCycle per cycle, this makes OR1K and RISC-V vastly more efficient at block moves with only their basic instruction set.13:43
SMDhome1I think I've found what's wrong w/ openrisc dhrystone results13:59
SMDhome1ZipCPU uses the cycle count printed after the simulation is over, but that includes the cycles spent printing results, etc.14:01
SMDhome1In this case we have the following options: either we delete the printfs or we increase the number of dhrystone loops to make the printfs' influence negligible14:01
SMDhome1I'm running 1M dhrystone loops now, but for 200k I got better results than ZipCPU14:02
kc5tjaAnother question is which version of Dhrystone is being used.  1.0, 1.1, and 2.1 will all report different values for the same architecture.14:03
SMDhome1I'm using 2.114:03
ZipCPUSMDhome1: Not so.  I counted cycles (including printfs) for 20k loops, and cycles (including printfs) for 10k loops.  I then took the difference as the time it took to do 10k loops--the printf time and reset time should've been otherwise constant between the two runs.14:31
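The two-run differencing described here can be written down as a small C helper. This is a sketch under the stated assumption that startup, reset, and printf costs are the same in both runs; the function name and any numbers fed to it are hypothetical, not measurements from the log.

```c
#include <stdint.h>

/* Hypothetical helper: cycles per Dhrystone loop from two runs of
 * different length. Any fixed overhead common to both runs cancels
 * in the subtraction. Returns -1.0 if the inputs cannot be differenced. */
double cycles_per_loop(uint64_t loops_short, uint64_t cycles_short,
                       uint64_t loops_long,  uint64_t cycles_long)
{
    if (loops_long <= loops_short || cycles_long < cycles_short)
        return -1.0; /* the longer run must have more loops and no fewer cycles */

    return (double)(cycles_long - cycles_short)
         / (double)(loops_long - loops_short);
}
```

For example, with made-up totals of 5,050,000 cycles for 10k loops and 10,050,000 cycles for 20k loops, the 50,000 cycles of fixed overhead drop out and the helper returns 500 cycles per loop.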
ZipCPUkc5tja: Let's discuss loop unrolling for a moment.  When measuring the ZipCPU's performance, I unrolled the loops of the strcmp, strcpy, and memcpy manually.14:38
ZipCPUWhile the Dhrystone benchmark states that the code must be compilable, must come from GCC, it doesn't necessarily state that the library routines can't be hand-optimized.14:39
SMDhome1ZipCPU you can unroll loops as you want, I guess14:39
ZipCPUWell ... not quite.  Dhrystone is not meant to be hand optimized.  I'm sure there are those that do it, but it's *supposed* to be a measure that includes compiler performance.14:40
SMDhome1seems like false alarm, I'm rechecking15:25
_franck_ZipCPU, kc5tja : there are dhrystone numbers here: http://www.juliusbaxter.net/openrisc-irc/search?q=Dhrystone15:38
_franck_coming from stekern_15:39
kc5tjagcc can be made to unroll loops, so don't feel too bad over it.  :)15:47
kc5tjaAlso, it's a common complaint against Dhrystone that you're really testing the compiler's standard library performance more than you are the CPU itself.15:47
kc5tjaGeez.  1.2 to 1.5 are not all that bad.  In fact, it's positively stellar compared to many other, more commercially successful CPUs.15:52
kc5tja*cough* Intel *cough*15:52
kc5tja(Although, to be fair, they did soundly destroy the 68060 when the Pentium came out.)15:53
kc5tjaNow, see, I want to find out what Dhrystone ranking I get with my own RISC-V core, as well as with the S64X7.  Should be enlightening.  :)15:55
ZipCPU_franck_: That's all well and good, but what I need is something that I can repeat and therefore observe.  I'd like to know how that was done.  So far, all I've heard about prior runs of the benchmark is that they are not to be trusted.16:15
ZipCPUstekern: You were the one who ran Dhrystone last: Do you have any of the system, software, and/or assembly left behind from when you did it?16:16
ZipCPUI was also told that the prior number I was using was on an unrealistic simulator ... not through actual logic.16:17
stekern_ZipCPU: http://oompa.chokladfabriken.org/tmp/dhry/16:21
olofkIf you want to get rid of the printf overhead when running in simulations, maybe you should use the l.nop method to print bytes16:27
stekern_surely the printf's should not be part of the measurements16:28
-!- Netsplit *.net <-> *.split quits: Amadiro16:29
olofkI think that sounds strange too16:30
olofkAnd have we copied the optimized memset routine I wrote for the Linux port to newlib?16:30
olofkOr are we still using the byte-by-byte copies?16:31
olofkSMDhome1: GSoC is just finished. I submitted my final evaluation earlier today16:33
stekern_looking through old irc logs, the last dhrystone result I've mentioned seems to be 1.4416:34
olofkSMDhome1: Also, your pull request looks a bit odd. The first patch adds the pcu support, which isn't in upstream mor1kx yet, and the other adds the branch predictor files, which are already present in the .core file16:35
olofkstekern_: While you're here, when can we expect another mor1kx release? :)16:35
stekern_any day ;)16:37
olofk:)16:37
stekern_olofk: what's happened to you? booking hotels more than a month in advance.16:46
olofkstekern_: Yeah. It's kind of cheating, I know16:49
stekern_to keep up the traditions, I've booked a room there as well now ;)17:10
ZipCPUstekern_: Thank you!  That's what I've been looking for.  What score did you say it achieved?  1.44 was it?19:55
ZipCPUstekern_: Looking through your code, two questions come to mind: 1) Is there a particular reason that you combined the two files?  and 2) Why did you skip the Proc_6 processing?22:34
kc5tjaThat moment when you're trying to add interrupt support to your RISC-V core, and realize it can trap in only two cycles.22:45
kc5tjaEat it, 6502!  ;)22:45
SMDhome1stekern_: yeah, I know, I've failed the intelligence check and I have to admit, I don't know how to use git properly23:13
SMDhome1olofk: the previous message should have been a reply to you23:15
stekern_ZipCPU: I just took that from somewhere else23:23
-!- stekern_ is now known as stekern23:24
stekernI never really looked inside it too carefully23:25
--- Log closed Mon Aug 29 00:00:47 2016
