Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!cs.utexas.edu!sun-barr!sun!chiba!khb
From: khb@chiba.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS)
Newsgroups: comp.arch
Subject: Re: fast memories (war superscalar)
Message-ID: <108340@sun.Eng.Sun.COM>
Date: 6 Jun 89 23:08:32 GMT
References: <5128@pt.cs.cmu.edu> <26450@lll-winken.LLNL.GOV> <40985@bbn.COM>
Sender: news@sun.Eng.Sun.COM
Reply-To: khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS)
Distribution: usa
Organization: Sun Microsystems, Mountain View
Lines: 98

In article <40985@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>
>I have seen many applications (the big problems for which people want
>the heavy iron) which don't utilize a cache well, even with a
>half-megabyte cache.  For example, a matrix multiplication processes
>one matrix down columns and the other across rows.  Some cases like this
>can actually get poorer performance with a large line size.

Many of those big iron applications run just fine with a modest cache
(say 64-128K) combined with a sensible implementation. Consider
languages which are array savvy (say APL and f88), or libraries that
are ... then cache can be effectively utilized.
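
To make "sensible implementation" concrete for the matrix multiply
case quoted above, here is a minimal sketch of a blocked (tiled)
multiply. It is not taken from any code discussed in this post; the
function name, the row-major flat-list layout and the default block
size are assumptions chosen for illustration. The idea is that each
tile of the operands is reused many times while it is still resident
in cache, and the innermost loop runs at unit stride.

```python
def matmul_blocked(a, b, n, block=32):
    """Multiply two n x n matrices stored as row-major flat lists.

    Illustrative only: `block` is a hypothetical tile size that would be
    tuned so that three block x block tiles fit in the cache.
    """
    c = [0.0] * (n * n)
    for ii in range(0, n, block):
        i_end = min(ii + block, n)
        for kk in range(0, n, block):
            k_end = min(kk + block, n)
            for jj in range(0, n, block):
                j_end = min(jj + block, n)
                # Work on one tile: every element touched here is
                # reused before the loops move on to the next tile.
                for i in range(ii, i_end):
                    for k in range(kk, k_end):
                        aik = a[i * n + k]
                        # Unit-stride walk along row k of b and row i of c.
                        for j in range(jj, j_end):
                            c[i * n + j] += aik * b[k * n + j]
    return c
```

The arithmetic is the same as in the textbook triple loop; only the
order in which memory is visited changes, so no operand has to be
walked down a column with a stride of n.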

Or consider

				CONMAN

Code from CalTech, authors: Arthur Raefsky, Scott D. King, and
Bradford Hager.

This code is a "well vectorized" _F_inite _E_lement analysis code. The
algorithm is numerically stable, robust, and computationally
efficient. It has been hosted on numerous scalar machines, hypercubes and
vector machines.
..... stuff deleted for space reasons

				Machines

CrayXMP 4/8 ctss, cft77               65Mflops (measured for BM1,2) 45Mflops(BM3)
    YMP 8/16 UNICOS, one processor    95Mflops (measured for BM1,2)
Cray-2 (one processor) UNICOS     
Convex C1XP                     
Sun 4/260 FPU1, f77 v1.2 
Sun 4/330 FPU2, f77 v1.2 dalign and not
Campus          f77 v1.2 dalign and not
The Cray Mflops figures were measured (not computed) by the hardware
speedometer. 

It should be noted that the earlier generation of code (which this
code is meant to replace) ran slower in VECTOR mode than this code
does in SCALAR mode .... it is the authors' contention that this is
often the case...that a good implementation often does well on many
different machines.

....
                               Timing table

                 bm1                bm2                   bm3
          scalar     vector    scalar   vector     scalar   vector

Cray-2      -        180.2      -        178.0       -       573.4
XMP       2869.3     153.2      -        154.3     6094.7    398.6
YMP         -         92.0      -         91.2       -       233.7

convex    9383.2    2021.2    9383.3    1979.8    14513.8   4383.2

4/330               6808                6871.6              7678.7
4/330dalign         5975.56             5993.0              6145.31

ss-1                8880.84             9044.51            10177.09
ss-1dalign          7804.50             7779.85             9946.17  

4/260              10290.91            10263.47            12613.88
4/280fpu2           8445.43             8428.12            10228.52
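
For the rows that report both a scalar and a vector time, the
scalar/vector ratio can be read straight off the table. The short
sketch below only divides numbers copied from the table above; the
dictionary layout and the function name are mine. The ratios are
unitless, so nothing is assumed about the units of the timings.

```python
# (scalar, vector) times copied from the timing table above.
TIMES = {
    "XMP bm1": (2869.3, 153.2),
    "XMP bm3": (6094.7, 398.6),
    "convex bm1": (9383.2, 2021.2),
    "convex bm2": (9383.3, 1979.8),
    "convex bm3": (14513.8, 4383.2),
}


def speedup(scalar, vector):
    """Ratio of scalar time to vector time on the same machine."""
    return scalar / vector


def report(times=TIMES):
    """Return one formatted line per table entry."""
    return [f"{name}: {speedup(s, v):.1f}x" for name, (s, v) in times.items()]


if __name__ == "__main__":
    print("\n".join(report()))
```

On these figures the vector runs come out roughly 15x to 19x faster
than scalar on the XMP, and roughly 3x to 5x faster on the Convex.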

.....

The point being that Arthur's (et al.) algorithm runs quite nicely on
well designed vector machines (i.e. achieves good vectorization rates)
_and_ on scalar machines. A later implementation employs a better
vectorized matrix factorization step, which increases the overall
vectorization considerably.

The key is that this is a modified skyline direct solver ... so cache
works quite nicely.
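
As a rough illustration of why skyline storage is cache friendly, here
is a minimal sketch of plain skyline (profile) storage for a symmetric
matrix, with a matrix-vector product over it. This is not CONMAN's
code and not the "modified" scheme mentioned above, and it is a
product rather than a factorization; the function names and the
(values, start, first) layout are assumptions made for illustration.
Each column keeps only the entries from its topmost nonzero down to
the diagonal, packed contiguously, so the inner loop walks a short
unit-stride segment.

```python
def to_skyline(dense):
    """Pack the upper triangle of a symmetric matrix column by column.

    Column j keeps rows first[j]..j, where first[j] is the row of its
    topmost nonzero. Column j occupies values[start[j]:start[j + 1]].
    """
    n = len(dense)
    values, start, first = [], [0], []
    for j in range(n):
        top = j  # default: only the diagonal entry is stored
        for i in range(j + 1):
            if dense[i][j] != 0.0:
                top = i
                break
        first.append(top)
        values.extend(dense[i][j] for i in range(top, j + 1))
        start.append(len(values))
    return values, start, first


def skyline_matvec(values, start, first, x):
    """Compute y = A x for a symmetric A held in skyline form."""
    n = len(first)
    y = [0.0] * n
    for j in range(n):
        top = first[j]
        # One contiguous slice per column: unit-stride access.
        for offset, v in enumerate(values[start[j]:start[j + 1]]):
            i = top + offset
            y[i] += v * x[j]      # stored entry A[i][j]
            if i != j:
                y[j] += v * x[i]  # mirrored entry A[j][i]
    return y
```

Zeros that lie inside a column's profile are stored explicitly, which
is the price paid for the contiguous layout.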

Arthur can be reached at arthur@oasis.stanford.edu for more details
about the science involved.

>
>>>Why bend over backwards (inter-company contracts, risk, design cost, etc)
>>>for a 100% when you can have an easy 90% solution? The marginal gain isn't
>The best case is where a company makes both the uP and the memory (88000
>for example).
>The gain isn't marginal, either.  You statements may be OK now, but next
>generation will see the CPU in the 5 to 20ns range, with srams in the
>20ns and drams in the 80ns range?  Clearly needs work.
>-Stan  "Do I have an opinion yet?"

Well, a large register file has been described as a compiler managed
cache :>


Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)