Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!cs.utexas.edu!sun-barr!sun!chiba!khb
From: khb@chiba.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS)
Newsgroups: comp.arch
Subject: Re: fast memories (war superscalar)
Message-ID: <108340@sun.Eng.Sun.COM>
Date: 6 Jun 89 23:08:32 GMT
References: <5128@pt.cs.cmu.edu> <26450@lll-winken.LLNL.GOV> <40985@bbn.COM>
Sender: news@sun.Eng.Sun.COM
Reply-To: khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS)
Distribution: usa
Organization: Sun Microsystems, Mountain View
Lines: 98

In article <40985@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>
>I have seen many applications (the big problems for which people want
>the heavy iron) which don't utilize a cache well, even with a
>half-megabyte cache.  For example, a matrix multiplication processes
>one matrix down columns and the other across rows.  Some cases like this
>can actually get poorer performance with a large line size.

Many of those big iron applications run just fine with a modest cache
(say 64-128K) combined with a sensible implementation. Consider
languages which are array savvy (say APL and f88) or libraries that
are ... then cache can be effectively utilized.

Or consider

CONMAN Code from CalTech, authors: Arthur Raefsky, Scott D. King, and
Bradford Hager.

This code is a "well vectorized" _F_inite _E_lement analysis code.
The algorithm is numerically stable, robust, and computationally
efficient. It has been hosted on numerous scalar machines, hypercubes
and vector machines.

..... stuff deleted for space reasons

Machines

    Cray XMP 4/8            ctss, cft77
                            65 Mflops (measured for BM1,2)
                            45 Mflops (BM3)
    YMP 8/16                UNICOS, one processor
                            95 Mflops (measured for BM1,2)
    Cray-2 (one processor)  UNICOS
    Convex C1XP
    Sun 4/260               FPU1, f77 v1.2
    Sun 4/330               FPU2, f77 v1.2   dalign and not
    Campus                  f77 v1.2         dalign and not

The Cray Mflops figures were measured (not computed) by the hardware
speedometer.
It should be noted that the earlier generation of code (which this
code is meant to replace) ran slower in VECTOR mode than this code
does in SCALAR mode .... it is the authors' contention that this is
often the case ... that a good implementation often does well on many
different machines.

....

Timing table

                      bm1                 bm2                 bm3
                 scalar   vector     scalar   vector     scalar   vector
    Cray-2          -      180.2        -      178.0        -      573.4
    XMP          2869.3    153.2        -      154.3     6094.7    398.6
    YMP             -       92.0        -       91.2        -      233.7
    convex       9383.2   2021.2     9383.3   1979.8    14513.8   4383.2
    4/330        6808                6871.6              7678.7
    4/330dalign  5975.56             5993.0              6145.31
    ss-1         8880.84             9044.51            10177.09
    ss-1dalign   7804.50             7779.85             9946.17
    4/260       10290.91            10263.47            12613.88
    4/280fpu2    8445.43             8428.12            10228.52

.....

The point being that Arthur's (et al.) algorithm runs quite nicely on
well designed vector machines (i.e. achieves good vectorization rates)
_and_ on scalar machines. A later implementation employs a better
vectorized matrix factorization step, which increases the overall
vectorization considerably.

The key is that this is a modified skyline direct solver ... so cache
works quite nicely.

Arthur can be reached at arthur@oasis.stanford.edu for more details
about the science involved.

>
>>>Why bend over backwards (inter-company contracts, risk, design cost, etc)
>>>for a 100% when you can have an easy 90% solution?  The marginal gain isn't
>The best case is where a company makes both the uP and the memory (88000
>for example).
>The gain isn't marginal, either.  You statements may be OK now, but next
>generation will see the CPU in the 5 to 20ns range, with srams in the
>20ns and drams in the 80ns range?  Clearly needs work.
>-Stan  "Do I have an opinion yet?"

Well, a large register file has been described as a compiler managed
cache :>

Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     | Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    | Languages and Performance Tools.
Opus (* strange as it may seem, I do more engineering now *)
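
To make the matrix multiplication point above concrete, here is a
minimal sketch of what a "sensible implementation" might look like:
the textbook loop nest, which walks one matrix across rows and the
other down columns, next to a blocked (tiled) version that works on
small sub-blocks so the working set can fit in a modest cache. The
names matmul_naive, matmul_blocked and block are illustrative
assumptions; nothing here is taken from CONMAN, which the post
describes only as a modified skyline direct solver.

```python
# Sketch only: function names and the `block` parameter are hypothetical
# and are not taken from the post or from CONMAN.

def matmul_naive(a, b):
    """C = A*B with the textbook i-j-k loop order.

    The inner loop walks a row of `a` but a column of `b`, which is the
    access pattern described in the quoted article.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=32):
    """Same product, computed tile by tile.

    Each (ii, kk, jj) step touches only block-by-block tiles of a, b and
    c, and the innermost loop runs along rows of both b and c.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    row_a = a[i]
                    row_c = c[i]
                    for k in range(kk, min(kk + block, m)):
                        aik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

Python lists will not show the timing difference; the sketch is only
meant to show the loop structure. In Fortran, where arrays are stored
column-major, the roles of rows and columns are swapped, and the block
size would be tuned to the cache at hand.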