Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!mcsun!mcvax!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: How Caches Work
Message-ID: <8399@boring.cwi.nl>
Date: 11 Sep 89 20:12:06 GMT
References: <21936@cup.portal.com> <1082@cernvax.UUCP> <16306@watdragon.waterloo.edu> <RANG.89Sep10184900@derby.cs.wisc.edu>
Organization: CWI, Amsterdam
Lines: 42

In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
 > In article <16306@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
 > >+      SUM = 0.0
 > >+      DO 10 I = 1, 1000000
 > >+	SUM = SUM + VEC(I)
 > >+   10 CONTINUE
 > >+
 > >+A data cache is *no use at all* for this problem.  You will get a
 > >+cache miss on every data access.  
 > >
 > >I thought caches respected the principle of locality.  And this code has
 > >really good locality.
 > 
 >   The instruction fetches here have good locality; the data fetches
 > don't (1 MW of data is being accessed).  The problem here is that 1
 > million words (assuming 1 word per vector element) of memory is being
 > accessed.  No matter HOW your cache is arranged, you are going to have
 > to go to main memory for every one of those words (conceivably barring
 > a few in the cache when you start).
 >   In fact, a (data) cache can hurt in this kind of code, because you
 > have 1 million data fetch times, plus the overhead of the cache
 > (small, but significant for very large problems).
This is not entirely true.  There are systems where a cache acts more or
less similar to the paging of VM.  Take for instance the system where
a cache line consists of 32 bytes and the width of the bus between
memory and cache is 32 bytes (i.e. in memory a word is 265 (8*32) bits,
but for the processor it is still  16 or 32 bits).  In that case fetching
a single operand in the processor will cache a memory word (or a number
of processor words).  So assuming the elements of array VEC are 4 bytes
there will be a cache miss every eighth element or a cache hit rate of
87.5%.  Changing the index in the loop to (33*I)mod 1000000 will reduce
the hit rate to 0%.  Because the bus between the cache and memory is wide
enough, no speed penalty is involved (except for cache overhead.
This is also true (although a bit less) if the bus between memory and
cache is smaller, but the cache is refreshed in lines of multiple words,
because the refreshment of the cache goes in general much faster than you
would see with individual memory access for each word (typically something
like one word per cycle, or faster).
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax