Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!mcsun!mcvax!dik From: dik@cwi.nl (Dik T. Winter) Newsgroups: comp.arch Subject: Re: How Caches Work Message-ID: <8399@boring.cwi.nl> Date: 11 Sep 89 20:12:06 GMT References: <21936@cup.portal.com> <1082@cernvax.UUCP> <16306@watdragon.waterloo.edu> Organization: CWI, Amsterdam Lines: 42 In article rang@cs.wisc.edu (Anton Rang) writes: > In article <16306@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes: > >+ SUM = 0.0 > >+ DO 10 I = 1, 1000000 > >+ SUM = SUM + VEC(I) > >+ 10 CONTINUE > >+ > >+A data cache is *no use at all* for this problem. You will get a > >+cache miss on every data access. > > > >I thought caches respected the principle of locality. And this code has > >really good locality. > > The instruction fetches here have good locality; the data fetches > don't (1 MW of data is being accessed). The problem here is that 1 > million words (assuming 1 word per vector element) of memory is being > accessed. No matter HOW your cache is arranged, you are going to have > to go to main memory for every one of those words (conceivably barring > a few in the cache when you start). > In fact, a (data) cache can hurt in this kind of code, because you > have 1 million data fetch times, plus the overhead of the cache > (small, but significant for very large problems). This is not entirely true. There are systems where a cache acts more or less similar to the paging of VM. Take for instance the system where a cache line consists of 32 bytes and the width of the bus between memory and cache is 32 bytes (i.e. in memory a word is 265 (8*32) bits, but for the processor it is still 16 or 32 bits). In that case fetching a single operand in the processor will cache a memory word (or a number of processor words). So assuming the elements of array VEC are 4 bytes there will be a cache miss every eighth element or a cache hit rate of 87.5%. Changing the index in the loop to (33*I)mod 1000000 will reduce the hit rate to 0%. Because the bus between the cache and memory is wide enough, no speed penalty is involved (except for cache overhead. This is also true (although a bit less) if the bus between memory and cache is smaller, but the cache is refreshed in lines of multiple words, because the refreshment of the cache goes in general much faster than you would see with individual memory access for each word (typically something like one word per cycle, or faster). -- dik t. winter, cwi, amsterdam, nederland INTERNET : dik@cwi.nl BITNET/EARN: dik@mcvax