Path: utzoo!attcan!uunet!aplcen!samsung!usc!henry.jpl.nasa.gov!elroy.jpl.nasa.gov!gryphon!scarter
From: scarter@gryphon.COM (Scott Carter)
Newsgroups: comp.arch
Subject: Re: 55 MIPS & 66 MIPS
Message-ID: <22514@gryphon.COM>
Date: 21 Nov 89 00:23:48 GMT
References: <1358@bnr-rsc.UUCP> <31329@winchester.mips.COM> <22303@gryphon.COM> <3024@brazos.Rice.edu>
Reply-To: scarter@gryphon.COM (Scott Carter)
Organization: Trailing Edge Technology, Redondo Beach, CA
Lines: 75

In article <3024@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <22303@gryphon.COM> scarter@gryphon.COM (Scott Carter) writes:
>
>>ISSUE one instruction per cycle.  The 960 CA can issue three instructions per
>>cycle to the chosen three of four execute units.  I believe Intel has figures
>>showing that on the average they could in fact issue two instructions per clock
>>_average_ [over what program set?], hence the 960CA can legitimately be called
>>66 Native MIPS average with 99 Native MIPS peak.  
>
>I think that's too optimistic.
>We've played some with an i860 on an evaluation board.
>The supplied compilers didn't attempt to issue more than 
>1 instruction/cycle (out of a max of three).  
>
>On a simple matrix multiply (single precision fp), 
>
>  multiplying 2 100x100 matrices took .52 seconds (3.8 MFlops)
>  multiplying 2 400x400 matrices took 86 seconds  (1.5 MFlops)
>
>versus a peak of 66 MFlops.  The poor performance on the larger
>size shows the effect of the small on-chip data cache.
>
>Using the VAST front-end, with hand coded vector primitives
>gives about 8.5 MFlops.
>
>Reworking by hand, being especially careful of the cache,
>gives about 26.5 MFlops, for either size.
>(This can be improved, but I think only slightly).
>This is fairly hot, though still not 66 MFlops.
>
>The challenge is getting compilers to take advantage of
>tiny caches and long pipelines and multi-instruction issue,
>as discovered below
>
>>>discovered that the R3000 was usually more than twice as fast on hand-coded
>>>programs, and overall was more than five times faster on compiled programs.
>
>Sounds like the MIPS compilers are more mature.  Certainly it's an
>easier target.
>
>Preston Briggs
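
As a rough sanity check on the matrix-multiply figures quoted above, here is a
small sketch.  It ASSUMES the usual 2*n^3 operation count for an n x n multiply
(one multiply and one add per inner-loop step); the quoted post does not say
how the flops were counted, and the function name is mine.

```python
# Sanity check of the quoted matrix-multiply figures.
# Assumption (not stated in the post): an n x n multiply is counted as
# 2 * n**3 floating-point operations (one multiply, one add per inner step).

def mflops(n, seconds):
    """Return millions of floating-point operations per second."""
    flops = 2 * n ** 3
    return flops / seconds / 1e6

small = mflops(100, 0.52)   # quoted as 3.8 MFlops
large = mflops(400, 86.0)   # quoted as 1.5 MFlops
PEAK = 66.0                 # quoted peak, in MFlops

for label, rate in (("100x100", small), ("400x400", large),
                    ("hand-tuned", 26.5)):
    print(f"{label}: {rate:.1f} MFlops, {100 * rate / PEAK:.0f}% of peak")
```

Under that assumption both quoted rates come out as posted, and the hand-tuned
26.5 MFlops is roughly 40% of the quoted peak.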

[How did I wind up writing something which could be interpreted as
defending the 960?]

1) Thanks for the _Data_ on the 860.  It's on the order of what I would have
guessed - nice to have it confirmed by someone with actual knowledge :).

2) I'm not sure that any meaningful extrapolation can be made from the 860 to
the 960CA, given that their instruction parallelism mechanisms are utterly
different.  Comparison to something like the Super Titan (on integer codes)
would be rather more appropriate.

3) Agreed that comparisons on Real Programs (tm) [or at least Real Benchmarks
(tm?)] are the only thing to go on.  I merely pointed out that for Intel to
claim 66 Native MIPS is not a priori any more illegitimate than most other
vendors' native MIPS claims.  Kudos to Mips for trying not to mention anything
other than Real Program numbers.

4) I would disagree about the Mips _Ada_ compiler being better than the
Intel/Biin 960 Ada compiler (agree wholeheartedly on C/Pascal/FORTRAN).  We
found that the performance ratio between the R3000 and the 960XA was much
wider on our own [somewhat larger than JIAWG] benchmarks in C, Pascal, and
FORTRAN than in Ada, on either JIAWG or some other internal benchmarks.

5) Based on the code generated for the 960XA for the JIAWG benchmarks, I have
to say I can't believe in two instructions per clock for the 960CA on this
set (this is a GUESS only - any data I might have cannot be posted), but I
do think the 960CA might well do twice as many useful instructions per clock ON
THIS BENCHMARK SET as an R3000, given what their Ada compilers generated.
Your mileage will undoubtedly vary.

6) If we need to express our religious loyalty, mine is with the R3000.
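
For what it's worth, the native-MIPS arithmetic behind point 3 can be sketched
as clock rate times instructions issued per clock.  The 33 MHz clock below is
an ASSUMPTION: it is simply what the quoted 99 peak at three issues per cycle
implies, not a figure from either post.  The function name is also mine.

```python
# Native MIPS = clock rate (MHz) * instructions issued per clock.
# Assumption: the clock is whatever the quoted 99 native MIPS peak at
# three issues per cycle implies; neither post states it directly.

def native_mips(clock_mhz, issue_per_clock):
    """Millions of native instructions per second at a given issue rate."""
    return clock_mhz * issue_per_clock

CLOCK_MHZ = 99 / 3  # implied by the quoted peak figure

for ipc in (1, 2, 3):
    print(f"{ipc} instr/clock -> {native_mips(CLOCK_MHZ, ipc):.0f} native MIPS")
```

This says nothing about how often two or three instructions can actually be
issued on real code, which is the whole argument.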

Scott Carter