Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!purdue!decwrl!labrea!rutgers!apple!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: i860 Dhrystones
Keywords: i860 N10 Floating Point Dhrystones
Message-ID: <15226@winchester.mips.COM>
Date: 14 Mar 89 12:39:12 GMT
References: <654@cimcor.mn.org> <93088@sun.uucp> <701@pcrat.UUCP> <93452@sun.uucp> <15074@winchester.mips.COM> <210@intelca.intel.com>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 60

In article <210@intelca.intel.com> clif@intelca.intel.com (Ken Shoemaker) writes:
...
>The i860 CPU benchmark report had a TYPO the Dhrystone benchmark used
>the Greenhill C compiler not FORTRAN.
>Sorry to dissappoint everyone who thought that we were getting great
>Dhrystone numbers by rewritting the benchmark in FORTRAN.
>
>As for the simulated numbers versus actual numbers. We have an excellent
>correlation (within 3%) between simulated numbers and actual numbers.
>
>My speculation (note the word speculation) as to why the the Dhrystone
>numbers are so good is:
>
>	Clock Frequency
>	128-bit loads for string instructions
>	The clocks/instruction is 1 (I imagine other RISC chips
>	approach 1 clock/instruction but don't actually obtain it)

Thanx for the correction; that certainly saves wasting some time.

1) Can you say any more words on simulations? I.e., everybody understands
that the memory system is irrelevant for almost-100%-cache-hit programs
[Dhrystone, Stanford, Whetstone], but we'd be surprised that a
5-wait-state machine (the measured one) and the zero-wait-state machine
(the simulated one) would be within 3% on DP LINPACK, given the speed of
the basic FP ops. Could the zero-wait-state thing also be a typo?

2) OK, I give up. There must be something unbelievably clever going on
to use 128-bit loads for C-language string operations.
I've looked at the i860 Programmer's Reference Manual a bunch, trying
to figure out how to use either the FP unit or the graphics unit to do
this. The string copy on page 9-5 of the manual is the "natural" strcpy
(which doesn't use anything but byte load/store, and takes about
5 cycles/byte). I haven't been able to find anything like "branch on any
byte zero", and the 860 doesn't have unaligned word operations.
For a fair test, you MUST use str* routines that assume only byte
alignment of operands, and you can't inline the str* calls.
The only place I can think of using 128-bit loads is in the
structure-copy, and it shouldn't be used there, unless structures whose
largest entities are words are always aligned to 4-word boundaries,
which seems unlikely.

3) Anyway, various people at various companies still can't figure out
why the number can reasonably be this high, under the normal rules,
UNLESS there's some really slick trick for getting strcpy and strcmp
down around 2 cycles/byte. There just aren't enough differences between
an R3000 and an 860, on this benchmark, to account for this otherwise.
[Everything fits in the caches; an 860 wins in some places, an R3000
wins in some places; the R3000 has essentially no write stalls on this
benchmark, so the difference between write-thru and writeback is
irrelevant; etc. Since something like 40% of the time is spent in str*,
and the rest is spread around, str* is really the major place to look.]

Maybe somebody at Intel would care to post the str* routines and
educate us?
-- 
-john mashey DISCLAIMER:
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
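
To make point 2 concrete, here is a rough C sketch of the two kinds of strcpy under discussion: the "natural" byte-at-a-time copy, and a word-at-a-time copy that tests for "any byte zero" in software. This is only an illustration. It is not the manual's page 9-5 code and not Intel's routine; the names strcpy_bytewise, strcpy_wordwise and has_zero_byte are made up here, and no cycle counts for any particular machine are implied.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "Natural" strcpy: one byte load, one byte store, one test per byte. */
char *strcpy_bytewise(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}

/* Software stand-in for a "branch on any byte zero" test:
 * returns nonzero if any of the four bytes in w is zero. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Word-at-a-time sketch (hypothetical). Operands are only assumed to be
 * byte aligned: bytes are copied until src reaches a 4-byte boundary,
 * and memcpy is used for the word moves so dst may stay unaligned.
 * Caveat: the aligned word read can touch up to 3 bytes past the
 * terminator (within the same aligned word), which strict C does not
 * permit for arbitrary objects. */
char *strcpy_wordwise(char *dst, const char *src)
{
    char *d = dst;

    /* Byte steps until the source pointer is word aligned. */
    while (((uintptr_t)src & 3u) != 0) {
        if ((*d++ = *src++) == '\0')
            return dst;
    }

    /* Copy whole words while none of their bytes is the terminator. */
    for (;;) {
        uint32_t w;
        memcpy(&w, src, sizeof w);
        if (has_zero_byte(w))
            break;
        memcpy(d, &w, sizeof w);
        src += 4;
        d += 4;
    }

    /* Finish the last (partial) word byte by byte, including the NUL. */
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}
```

Even in this form the zero test costs several ALU operations per word, and the destination stores still have to cope with arbitrary alignment, which is why a hardware "any byte zero" branch or unaligned word operations would matter for getting the per-byte cost down.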