Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!asuvax!ncar!mephisto!udel!udccvax1!mccalpin
From: mccalpin@vax1.acs.udel.EDU (John D Mccalpin)
Newsgroups: comp.arch
Subject: Re: Cache Size
Keywords: garbage collection, locality of reference, cache size
Message-ID: <5791@udccvax1.acs.udel.EDU>
Date: 28 Feb 90 18:23:37 GMT
References: <7393@pdn.paradyne.com> <76700146@p.cs.uiuc.edu> <1990Feb26.022057.28461@Neon.Stanford.EDU> <8189@pt.cs.cmu.edu> <8848@boring.cwi.nl>
Reply-To: mccalpin@vax1.acs.udel.EDU (John D Mccalpin)
Organization: College of Marine Studies, Univ. of Delaware
Lines: 41

In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
> > So, that's why most supercomputers seem to use vector register
> > files instead of caches for their vector units.
> > 
>Well, no (depends on your definition of supercomputer of course; let us
>assume vector processor).  There are systems without cache that use vector
>registers (Cray, NEC), or have memory to memory operations (Cyber 205).
>And there are processors with cache.  Possibilities are:
>1.  No vector registers, bypass cache (Cyber 995).
>2.  Vector registers, bypass cache (i know none).
>3.  No vector registers, through cache (again, i know none).
>4.  Vector registers, through cache (IBM 3090, Convex, Alliant, Gould).
>So, no, vector registers are not a replacement for cache.
>-- 
>dik t. winter, cwi, amsterdam, nederland
>dik@cwi.nl

I believe that the Convex C-2 series bypasses the cache when loading
and storing the vector registers, so it should be in category 2, not
category 4.

The ETA-10 is a special case that fits none of the above categories.
It had no vector registers (being a memory-to-memory machine), but it
had a set of cache-like registers (the "short-stop" registers) that
were used to cache only short vectors.  I never got a definitive
answer from any of my ETA friends about exactly how big this register
set was or exactly how the hardware decided to use it.  I do know from
direct observation that repeated access to short vectors showed a
lower start-up cost than the equivalent long-vector operations.

The impression that I got was that this use is different from category
3, since the short-stop registers were only used for vector operands.
I think that short vectors were stuffed into these registers in
parallel with their first use, and then they could be re-loaded
directly from the short-stop registers if needed again soon.  The
output of the vector unit seemed also to be loaded into the short-stop
registers (if short enough) to make back-to-back vector operations run
with slightly less overhead; this is in the spirit of chaining.
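
To make that guess concrete, here is a toy model in Python.  It is only
a sketch of the behaviour described above, not a description of the
real ETA-10 hardware.  Every number in it is invented for illustration:
the longest "short" vector (max_len), the number of vectors held
(slots), and both start-up costs are assumptions, as are the
oldest-first replacement policy and the class and method names.

```python
from collections import OrderedDict


class ShortStopModel:
    """Toy model of a short-vector operand buffer (all sizes/costs assumed)."""

    def __init__(self, max_len=64, slots=4, mem_startup=50, buf_startup=10):
        self.max_len = max_len          # longest vector the buffer accepts
        self.slots = slots              # how many vectors it holds at once
        self.mem_startup = mem_startup  # start-up cycles when fed from memory
        self.buf_startup = buf_startup  # start-up cycles when fed from buffer
        self.buf = OrderedDict()        # vector name -> length, oldest first

    def _capture(self, name, length):
        """Copy a vector into the buffer as a side effect of its use."""
        if length > self.max_len:
            self.buf.pop(name, None)    # too long: drop any stale copy
            return
        self.buf[name] = length
        self.buf.move_to_end(name)
        while len(self.buf) > self.slots:
            self.buf.popitem(last=False)  # evict the oldest entry

    def _operand_startup(self, name, length):
        """Start-up cycles to begin streaming one input operand."""
        if self.buf.get(name) == length:
            self.buf.move_to_end(name)
            return self.buf_startup
        self._capture(name, length)     # captured in parallel with first use
        return self.mem_startup

    def vector_op(self, dst, srcs, length):
        """Cycles for one vector operation: start-up plus one per element."""
        startup = max(self._operand_startup(s, length) for s in srcs)
        self._capture(dst, length)      # short results are kept for reuse
        return startup + length


model = ShortStopModel()
first = model.vector_op("c", ["a", "b"], 32)    # operands come from memory
second = model.vector_op("d", ["c", "a"], 32)   # operands come from buffer
long1 = model.vector_op("z", ["x", "y"], 1000)  # too long to be captured
long2 = model.vector_op("z", ["x", "y"], 1000)  # so no benefit on reuse
```

With these made-up costs the second short operation starts sooner than
the first because both of its operands (one of them the previous
result) are already in the buffer, while the repeated long operation
sees no change.  That is the qualitative pattern reported above; the
magnitudes mean nothing.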

Corrections from informed sources welcome....