Path: utzoo!attcan!uunet!wuarchive!zaphod.mps.ohio-state.edu!think!snorkelwacker!apple!amdahl!tetons!bdg
From: bdg@tetons.UUCP (Blaine Gaither)
Newsgroups: comp.arch
Subject: Re: Cache Size
Message-ID: <4568@tetons.UUCP>
Date: 1 Mar 90 17:24:31 GMT
References: <7393@pdn.paradyne.com> <76700146@p.cs.uiuc.edu> <1990Feb26.022057.28461@Neon.Stanford.EDU> <8189@pt.cs.cmu.edu> <8848@boring.cwi.nl> <100324@convex.convex.com>
Organization: Amdahl Corp., Rexburg, ID
Lines: 64
In-reply-to: patrick@convex.com's message of 28 Feb 90 17:37:05 GMT

>In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>>In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman)
>>writes:
>>1. No vector registers, bypass cache (Cyber 995).
>>2. Vector registers, bypass cache (i know none).
>>3. No vector registers, through cache (again, i know none).
>>4. Vector registers, through cache (IBM 3090, Convex, Alliant, Gould).
>
>Actually, in the Convex C2 series, vector loads and stores bypass the cache.
>Scalar operations use the cache.  One argument for this design is that
>the large amount of data used in vector operations would quickly invalidate
>most scalar data in the cache, such as local variables, loop limits, etc.

I have to agree with the arguments stated by Pat.  The Gould NP1 was
(still is) a cache based vector mini-super.  All vector operations went
through a store-through cache.  The cache was small (16KB data).  It
turned out to perform rather poorly on some scientific codes.  The basic
machine was about 2x its competition at scalar, and about equal on
vector codes if their data fit in the data cache.  If an algorithm's
data didn't fit in the cache, it ran at 0.3 to 0.5 of the in-cache
speed.  There were several other factors in the hw/sw design that were
not optimal for the scientific market, but the cache based design did
not help.
One problem is that the smaller the cache, the greater the emphasis
that must be placed upon algorithms that make better use of the memory
hierarchy.  If users and 3rd party software vendors use vendor
libraries, they can get good performance (if the vendor bothers to do
the requisite algorithm work).  If they don't, then it is an uphill
battle for compiler writers to recognize poor algorithms and substitute
better ones.  These tools must be ready before 3rd party ports start.

Cache based vector machines can in general be assumed to get the same
number of cache faults as a machine executing in scalar (with the same
algorithm) would get.  The big problem is that vector machines get them
over a shorter period of time, so the faults become a relatively more
serious problem than they would be on a typical scalar machine.

Another usability problem with a cache based approach is the
discontinuity it causes in performance when plotted vs problem size.
This is very disconcerting.  If you multiply two NxN matrices and then
multiply two (N+5)x(N+5) matrices, you can experience a startling
slowdown (m sec vs 3m sec).  This causes some very interesting
measurement problems in scientific benchmarks which attempt to find
[N sub 1/2], i.e. you have to find the peak vector performance (and it
ain't on 1000x1000 matrices).  So measurement techniques that expect
monotonically increasing (or worse, continuous) values of performance
vs problem size get wrong answers.

Well, 16K is a very small cache for scientific problems.  What about
larger caches?  Well, larger is better.  It moves the discontinuity so
it affects fewer problems.  At some point you might not care.  But if
the reason you are buying the architecture (as well as the
implementation) is that you see ever increasing problem sizes, then
even if it works fast enough on today's problems, will it be fast
enough for tomorrow's problem sizes?

What can a cache based vector machine do?

	Use a dual ported cache with parallel prefetch.
	Use a very large cache.
	Use a very very large secondary cache to reduce miss penalty.
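
To put rough numbers on the two effects described above (the cliff in
performance vs problem size, and the same faults hurting more when the
arithmetic finishes sooner), here is a back-of-envelope sketch in
Python.  It is not a model of the NP1 or of any real machine.  The
8-byte element size, the "all three matrices must stay resident" rule,
the absence of conflict misses, and the 10x vector speedup are all
assumptions made up for illustration, and the function names are
hypothetical.

```python
# Back-of-envelope sketch, not a model of the Gould NP1 or any real machine.
# Assumptions (not from the post): 8-byte elements, all three matrices of
# C = A * B must stay resident, and no conflict misses in the cache.

CACHE_BYTES = 16 * 1024   # the 16KB data cache mentioned in the post
ELEM_BYTES = 8            # assumed: double-precision elements


def working_set_bytes(n, elem_bytes=ELEM_BYTES):
    """Bytes touched by an n x n matrix multiply: A, B and C."""
    return 3 * n * n * elem_bytes


def fits_in_cache(n, cache_bytes=CACHE_BYTES, elem_bytes=ELEM_BYTES):
    """True if all three n x n matrices fit in the cache at once."""
    return working_set_bytes(n, elem_bytes) <= cache_bytes


def largest_fitting_n(cache_bytes=CACHE_BYTES, elem_bytes=ELEM_BYTES):
    """Largest n whose working set still fits in the cache."""
    n = 0
    while fits_in_cache(n + 1, cache_bytes, elem_bytes):
        n += 1
    return n


def miss_share(compute_time, misses, miss_penalty):
    """Fraction of total run time spent waiting on cache misses."""
    stall = misses * miss_penalty
    return stall / (compute_time + stall)


if __name__ == "__main__":
    n = largest_fitting_n()
    print("largest n that fits:", n)
    print("n + 5 fits:", fits_in_cache(n + 5))

    # Same miss count and penalty in both cases, but the vector unit
    # finishes the arithmetic 10x sooner (a made-up ratio).
    print("scalar miss share:", miss_share(100.0, 10, 5.0))
    print("vector miss share:", miss_share(10.0, 10, 5.0))
```

Under these assumptions the largest problem that fits in 16KB is 26x26,
and 31x31 does not fit, so a step of 5 in N is enough to cross the edge.
With the made-up timings, the same ten faults account for about a third
of the scalar run time but most of the vector run time.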