Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!sun-barr!sun!chiba!khb
From: khb%chiba@Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS)
Newsgroups: comp.arch
Subject: Re: Register Scoreboarding
Message-ID: <104999@sun.Eng.Sun.COM>
Date: 16 May 89 04:48:34 GMT
References: <24821@lll-winken.LLNL.GOV> <GRUNWALD.89May9113443@flute.cs.uiuc.edu> <3288@orca.WV.TEK.COM> <19463@winchester.mips.COM> <170@dg.dg.com> <19661@winchester.mips.COM>
Sender: news@sun.Eng.Sun.COM
Reply-To: khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS)
Organization: Sun Microsystems, Mountain View
Lines: 74

In article <19661@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>ASSERTION 3: well, if interlocking works after all, then scoreboarding
>is better for performance reasons.
>	3A: in supercomputers
>	3B: in current microprocessors
>------ 

.... 

>ASSERTION 3
>	3A: supercomputers
>		Probably true

???

I can't think of a supercomputer which _does_ use scoreboarding, at
least not as I understand the term. Supercomputers tend not to have
data caches... instead they have very fast, high-bandwidth memory
systems (using many banks of memory) and rely on the fact that real
scientific programs tend to have key loops that are well behaved in
some sense (i.e. sort of vectorizable). The fact that these are
typically FORTRAN machines alters the program statistics somewhat. As has been
demonstrated by many vendors, lots of registers, software and
sometimes hardware pipelines, loop transformations (percolation
scheduling, etc.) and other unnatural acts are key to getting good
performance. Note that these remarks apply to machines like the
CDC6600, the various Crays, misc. Japanese vector machines, the Cydra
5 and the Multiflow machines (which is why I said "sort of
vectorizable"). 
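
To make the "many banks of memory" point concrete, here is a toy model
(a sketch only, not a description of any particular machine). The
function name, the 16-bank count and the 8-cycle bank busy time are
all assumptions picked for illustration. Words are interleaved across
banks, and a bank that has just been hit cannot take another request
until its busy time has passed.

```python
def cycles_for_stream(n_words, stride, n_banks=16, bank_busy=8):
    """Cycles needed to issue n_words loads at a fixed word stride.

    Assumed model: at most one request issues per cycle, word address a
    lives in bank (a % n_banks), and a bank stays busy for bank_busy
    cycles after accepting a request.
    """
    free_at = [0] * n_banks  # cycle at which each bank can accept a request
    cycle = 0
    for i in range(n_words):
        bank = (i * stride) % n_banks
        cycle = max(cycle, free_at[bank])  # stall until the bank is free
        free_at[bank] = cycle + bank_busy
        cycle += 1  # the request occupies this issue slot
    return cycle


if __name__ == "__main__":
    print(cycles_for_stream(64, stride=1))   # unit stride spreads over all banks
    print(cycles_for_stream(64, stride=16))  # every access lands in the same bank
```

In this model a unit-stride loop never waits on a bank, while a stride
equal to the bank count serializes on a single bank. That is the sense
in which such a memory system leans on well-behaved key loops.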

These machines typically have interlocks, but nothing like the
scoreboard scheme of the 88K (unless my memory is very leaky this
week). 
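
For readers who want something concrete to argue about, here is a
minimal sketch of register scoreboarding in the generic textbook
sense: one "result pending" entry per register, and an in-order issue
stage that stalls while any source or destination register is still
pending. It is not the 88K's actual logic; the function name, the
instruction tuple format and the latencies are all assumptions.

```python
def issue_cycles(program, latency):
    """Return the cycle at which each instruction issues, in program order.

    program: list of (op, dest, srcs) tuples (hypothetical format).
    latency: dict mapping op name to cycles until its result is ready.
    """
    ready_at = {}  # register -> cycle at which its pending result arrives
    cycle = 0
    issued = []
    for op, dest, srcs in program:
        # Stall while any source or the destination is still in flight.
        for reg in list(srcs) + [dest]:
            cycle = max(cycle, ready_at.get(reg, 0))
        issued.append(cycle)
        ready_at[dest] = cycle + latency[op]  # mark the destination pending
        cycle += 1  # at most one instruction issues per cycle
    return issued


if __name__ == "__main__":
    program = [
        ("load", "r1", ["r9"]),
        ("add", "r2", ["r8", "r7"]),   # independent of the load, issues at once
        ("add", "r3", ["r1", "r2"]),   # must wait for the load result in r1
    ]
    print(issue_cycles(program, {"load": 3, "add": 1}))
```

The point of the sketch is only that the independent add slips in
behind the load without waiting, and the stall is charged to the first
instruction that actually needs r1.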

>	3B: current microprocessors
>		Seems unlikely until they start having memory systems
>		like 3A.

Scoreboarding probably works, but there seems to be a certain lack of
evidence that it is necessary. Seems overly complex to me .... but
what do I know... I studied math and grew up working in Kalman
filtering applications ... :>

>
>
>Like both 88K and MIPS, SPARC is defined to allow different-latency
>FP implementations, and in fact, 3 different ones are already extant.
>Perhaps the SPARC guys would care to join the fun and talk about
>differences in latencies, overlap, etc.  [If you haven't noticed it,
>Sun-4s recently got the FPU2 that raised FP performance in the same
>systems.]

I don't know what to say... other than that it works just fine. All
the binaries I tested (and it was more than a few) worked up and down
the line. It is true that it is possible to tickle the compiler into
generating code which is better for one chip or another and this is
likely to continue. There will always be a compatible mode (which most
people will use exclusively) which will have good performance for the
most popular implementations and different implementations will have
some special code ordering/other implementor neat stuff ... 

As John pointed out in an earlier posting, SUN has chosen to give
implementors more leeway (the N-design teams notion), and folks at
Prisma probably will have lots of neat stories to tell around Jan 1990
(4nsec SPARC ... design goal of 100Mflops) about how well SPARC really
scales, and how the cleverly picked interlocks were just right (or
caused them endless grief :>).


Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)