Path: utzoo!attcan!uunet!wuarchive!zaphod.mps.ohio-state.edu!think!snorkelwacker!apple!amdahl!tetons!bdg
From: bdg@tetons.UUCP (Blaine Gaither)
Newsgroups: comp.arch
Subject: Re: Cache Size
Message-ID: <4568@tetons.UUCP>
Date: 1 Mar 90 17:24:31 GMT
References: <7393@pdn.paradyne.com> <76700146@p.cs.uiuc.edu> <1990Feb26.022057.28461@Neon.Stanford.EDU> <8189@pt.cs.cmu.edu> <8848@boring.cwi.nl> <100324@convex.convex.com>
Organization: Amdahl Corp., Rexburg, ID
Lines: 64
In-reply-to: patrick@convex.com's message of 28 Feb 90 17:37:05 GMT

>In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>>In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) 
>>writes:
>>1.  No vector registers, bypass cache (Cyber 995).
>>2.  Vector registers, bypass cache (i know none).
>>3.  No vector registers, through cache (again, i know none).
>>4.  Vector registers, through cache (IBM 3090, Convex, Alliant, Gould).
>
>Actually, in the Convex C2 series, vector loads and stores bypass the cache.
>Scalar operations use the cache.  One argument for this design is that
>the large amount of data used in vector operations would quickly invalidate
>most scalar data in the cache, such as local variables, loop limits, etc.

I have to agree with the arguments stated by Pat.  The Gould NP1 was
(still is) a cache based vector mini-super.  All vector operations went
through a store-through cache.  The cache size was small (16KB data).
It turned out to perform rather poorly on some scientific codes.  The
basic machine was about 2x its competition at scalar, and about equal
on vector codes if their data fit in the data cache.  If an algorithm's
data didn't fit in the cache, it ran at .3 to .5 of the
in-cache speed.

There were several other factors of the hw/sw design that were not
optimal for the scientific market, but the cache based design did not
help.  One problem is that the smaller the cache, the greater the
emphasis that must be placed upon algorithms to make better use of the
memory hierarchy.  If users and 3rd party software vendors use vendor
libraries, they can get good performance (if the vendor bothers to do
the requisite algorithm work).  If they don't, then it is an uphill battle
for compiler writers to recognize poor algorithms and substitute
better ones.  These tools must be ready before 3rd party ports start.

Cache based vector machines can in general be assumed to get the same
number of cache faults as a machine executing in scalar (with the same
algorithm) would get.  The big problem is that vector machines get them
over a shorter period of time, so the faults become a relatively more
serious problem than they would be in a typical scalar machine.
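
To put rough numbers on that, here is a minimal sketch in Python.  The
timings, the miss count and the 10x vector speedup are all made up for
illustration (none of them are NP1 measurements); the only thing taken
from the argument above is that the miss count stays the same while the
compute time shrinks.

```python
def miss_time_fraction(compute_time, misses, miss_penalty):
    """Fraction of total run time spent waiting on cache misses."""
    stall = misses * miss_penalty
    return stall / (compute_time + stall)

# All values below are hypothetical, in arbitrary time units.
scalar_compute = 100.0   # time spent computing in scalar mode
vector_speedup = 10.0    # assumed speedup of the compute portion only
misses = 20              # same miss count for scalar and vector runs
miss_penalty = 1.0       # time lost per miss

scalar_frac = miss_time_fraction(scalar_compute, misses, miss_penalty)
vector_frac = miss_time_fraction(scalar_compute / vector_speedup,
                                 misses, miss_penalty)

print("scalar: %.0f%% of time in misses" % (100 * scalar_frac))
print("vector: %.0f%% of time in misses" % (100 * vector_frac))
```

Under these assumptions the same misses go from a modest share of the
scalar run time to most of the vector run time, which is the effect
described above.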

Another usability problem with a cache based approach is the
discontinuity it causes in performance when plotted vs problem
size.  This is very disconcerting.  If you do a matrix multiply
of 2 NxN matrices and then multiply 2 (N+5)x(N+5) matrices, you can
experience a startling slowdown (m sec vs 3m sec).  This causes some
very interesting measurement problems in scientific benchmarks which
attempt to find [N sub 1/2], i.e. you have to find the peak vector
performance (and it ain't on 1000x1000 matrices).  So measurement
techniques that expect monotonically increasing (or worse, continuous)
values of performance vs problem size get wrong answers.
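
A crude way to see where the cliff lands is sketched below in Python.
The 16KB figure and the .3 to .5 out-of-cache range come from the
numbers above; the 8-byte element size, the idea that all three
matrices (two operands plus the result) must be resident at once, and
the 0.4 midpoint are my assumptions, and a real machine's behavior
will depend on the algorithm and the cache organization.

```python
CACHE_BYTES = 16 * 1024   # 16KB data cache, as quoted above
ELEMENT_BYTES = 8         # assumption: 64-bit floating point elements

def working_set_bytes(n, matrices=3):
    """Bytes needed to hold `matrices` n x n matrices at once."""
    return matrices * n * n * ELEMENT_BYTES

def fits_in_cache(n):
    return working_set_bytes(n) <= CACHE_BYTES

def modeled_relative_speed(n, out_of_cache_factor=0.4):
    """Step-function model: full speed in cache, a fraction of it outside."""
    return 1.0 if fits_in_cache(n) else out_of_cache_factor

def largest_in_cache_n():
    """Largest n whose working set still fits, under this model."""
    n = 1
    while fits_in_cache(n + 1):
        n += 1
    return n

n_edge = largest_in_cache_n()
for n in (n_edge, n_edge + 5):
    print(n, working_set_bytes(n), modeled_relative_speed(n))
```

The point of the step function is only that a benchmark sweeping n
sees a drop instead of a smooth curve, so a search for [N sub 1/2] that
assumes monotonic performance can settle on the wrong side of the edge.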

Well, 16K is a very small cache for scientific problems.  What about
larger caches?  Well, larger is better.  It moves the discontinuity so
it affects fewer problems.  At some point you might not care.  But if
the reason you are buying the architecture (as well as the
implementation) is because you see ever increasing problem size, then
even if it works fast enough on today's problems, will it be fast
enough for tomorrow's problem sizes?

What can a cache based vector machine do?

	Use a dual ported cache with parallel prefetch.

	Use a very large cache.

	Use a very very large secondary cache to reduce miss penalty.
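
As a sketch of why the secondary cache helps, the usual average access
time arithmetic can be written out in Python.  Every number here (hit
times, miss rates, memory latency) is hypothetical and chosen only to
show the shape of the calculation, not to describe any real machine.

```python
def avg_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

def miss_penalty_with_secondary(l2_hit_time, l2_miss_rate, memory_time):
    """Penalty seen by a primary miss when a secondary cache sits below it."""
    return l2_hit_time + l2_miss_rate * memory_time

# Hypothetical values, in cycles.
primary_hit = 1.0
primary_miss_rate = 0.10
memory_time = 50.0
l2_hit = 5.0
l2_miss_rate = 0.20

without_l2 = avg_access_time(primary_hit, primary_miss_rate, memory_time)
with_l2 = avg_access_time(
    primary_hit, primary_miss_rate,
    miss_penalty_with_secondary(l2_hit, l2_miss_rate, memory_time))

print("average access time without secondary cache:", without_l2)
print("average access time with secondary cache:   ", with_l2)
```

The primary miss rate is unchanged in this model; the secondary cache
only shrinks what each miss costs, which is the sense of "reduce miss
penalty" in the last item.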