Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!purdue!decwrl!labrea!rutgers!apple!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: i860 Dhrystones
Keywords: i860 N10 Floating Point Dhrystones
Message-ID: <15226@winchester.mips.COM>
Date: 14 Mar 89 12:39:12 GMT
References: <654@cimcor.mn.org> <93088@sun.uucp> <701@pcrat.UUCP> <93452@sun.uucp> <15074@winchester.mips.COM> <210@intelca.intel.com>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 60

In article <210@intelca.intel.com> clif@intelca.intel.com (Ken Shoemaker) writes:
...
>The i860 CPU benchmark report had a TYPO the Dhrystone benchmark used
>the Greenhill C compiler not FORTRAN.
>Sorry to disappoint everyone who thought that we were getting great
>Dhrystone numbers by rewriting the benchmark in FORTRAN.
>
>As for the simulated numbers versus actual numbers.  We have an excellent
>correlation (within 3%) between simulated numbers and actual numbers.
>
>My speculation (note the word speculation) as to why the Dhrystone
>numbers are so good is: 
>
>	Clock Frequency
>	128-bit loads for string instructions
>	The clocks/instruction is 1 (I imagine other RISC chips
>	approach 1 clock/instruction but don't actually obtain it)

Thanx for the correction; that certainly saves wasting some time.

1) Can you say any more words on simulations?  I.e., everybody
understands that the memory system is irrelevant for almost-100%-cache-hit
programs [Dhrystone, Stanford, Whetstone], but we'd be surprised
if a 5-wait-state machine (the measured one) and the zero-wait-state
machine (the simulated one) were within 3% on DP LINPACK, given the
speed of the basic FP ops.  Could the zero-wait-state thing also be a typo?

2) OK, I give up.  There must be something unbelievably clever going on
to use 128-bit loads for C-language string operations. I've looked
at the i860 Programmer's Reference Manual a bunch, trying to figure
out how to use either the FP unit or the graphics unit to do this.
The string copy on page 9-5 of the manual is the "natural" strcpy
(which doesn't use anything but byte load/store, and takes about 5 cycles/byte).
I haven't been able to find anything like "branch on any byte zero", and the 860
doesn't have unaligned word operations.  For a fair test, you MUST
use str* routines that assume only byte alignment of operands, and
you can't inline the str*.  The only place I can think of to use 128-bit
loads is the structure copy, and they shouldn't be used there
unless structures whose largest entities are words are always aligned
to 4-word boundaries, which seems unlikely.
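
To make the question concrete, here is a rough C sketch of the two
styles of strcpy under discussion. It is NOT Intel's routine and not
the code from the manual; the names byte_strcpy, word_strcpy and
has_zero_byte are made up for illustration. The first is the "natural"
byte loop. The second assumes only byte alignment on entry, copies
bytes until the source is word-aligned, and then tests a whole 32-bit
word for a zero byte in software, which is the sort of thing you are
left with when there is no "branch on any byte zero". The memcpy store
hides the unaligned-destination case; on a machine without unaligned
word stores that side would need shift-and-merge code or byte stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The "natural" strcpy: one byte load, one byte store, one test per byte. */
char *byte_strcpy(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}

/* Nonzero iff at least one byte of w is zero (subtract-and-mask test). */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Word-at-a-time copy that assumes only byte alignment of its operands.
   Caveat: the aligned word load may read up to 3 bytes past the NUL.
   That stays inside one aligned word, so it cannot fault on typical
   hardware, but strict ISO C does not bless it. */
char *word_strcpy(char *dst, const char *src)
{
    char *d = dst;
    uint32_t w;

    /* Byte copy until src reaches a 4-byte boundary. */
    while (((uintptr_t)src & 3u) != 0) {
        if ((*d++ = *src++) == '\0')
            return dst;
    }
    /* Aligned word loads from src; stop at the word holding the NUL. */
    for (;;) {
        memcpy(&w, src, sizeof w);
        if (has_zero_byte(w))
            break;
        memcpy(d, &w, sizeof w);    /* dst may still be unaligned */
        d += 4;
        src += 4;
    }
    /* Finish the last partial word byte by byte. */
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}
```

Even in this form, the zero-byte test costs several ALU operations per
word, and the destination alignment problem remains, so it is not
obvious how this gets anywhere near 128-bit loads.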

3) Anyway, various people at various companies still can't figure
out why the number can reasonably be this high, under the
normal rules, UNLESS there's some really slick trick for
getting strcpy and strcmp down around 2 cycles/byte.
There just aren't enough differences between an R3000 and an 860,
on this benchmark, to account for this otherwise. [Everything
fits in the caches; an 860 wins in some places, an R3000 wins in
others; the R3000 has essentially no write stalls on this benchmark,
so the difference between write-thru and writeback is irrelevant; etc.
Since something like 40% of the time is spent in str*, and the rest is
spread around, str* is really the major place to look.]
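
As a back-of-the-envelope check on why str* matters so much, the
usual fraction-of-time arithmetic can be written down as a small C
function. The 40% and the 5 and 2 cycles/byte figures are the rough
ones quoted above; treating the 40% as measured on a machine whose
str* runs at about 5 cycles/byte is my assumption, and the name
overall_speedup is made up for this sketch.

```c
/* Overall speedup when a fraction f of the run time (0 <= f <= 1)
   is made s times faster and the remaining 1 - f is unchanged. */
double overall_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.40 and s = 5.0/2.0, this gives 1/0.76, or about 1.32, so
under those assumptions getting str* from 5 down to 2 cycles/byte
would be worth roughly a 30% higher Dhrystone number by itself.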

Maybe somebody at Intel would care to post the str* routines
and educate us?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086