Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!think.com!mintaka!bloom-beacon!eru!hagbard!sunic!mcsun!cernvax!chx400!chx400!bernina!neptune!inf.ethz.ch!brandis
From: brandis@inf.ethz.ch (Marc Brandis)
Newsgroups: comp.arch
Subject: Re: cache pre-load/no-load instructions
Message-ID: <27671@neptune.inf.ethz.ch>
Date: 22 Mar 91 08:07:03 GMT
References: <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> <765@ajpo.sei.cmu.edu> <1991Mar21.161044.2898@rice.edu>
Sender: news@neptune.inf.ethz.ch
Reply-To: brandis@inf.ethz.ch (Marc Brandis)
Organization: Departement Informatik, ETH, Zurich
Lines: 48

In article <1991Mar21.161044.2898@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>The RS/6000 includes 2 interesting possibilities.
>An instruction that zeroes a line in the data cache (without
>fetching it).  May be used like (2 above); additionally handy for zeroing
>big chunks of memory.  They also include an "invalidate line"
>instruction which says: "don't bother writing this one back to memory."
>

Unfortunately, IBM made these instructions privileged. They had good reason
to do so, as the instructions ignore lock and protection bits. I do not know,
however, why they could not make the instructions check those bits.

I am not sure whether having these instructions in user mode would be a great
advantage. DCLZ (data cache line set to zero) can be used to initialize large
chunks of memory, of course. The other obvious use for the DCLZ and CLI
(cache line invalidate) instructions is to control the allocation and
deallocation of procedure frames on the stack, so that no memory references
are generated for newly allocated stack space and no deallocated stack space
is written back to memory.

I do not think that this mechanism would really improve the performance of
current programs. Many programs consume only a few kilobytes of stack space
and exhibit a large amount of spatial locality in their references. The number
of frames on the stack is almost constant over large fractions of many programs,
and so is the top of the stack. From this point of view, it is very unlikely
that stack references cause cache misses, so this 'optimization' would not
reduce the number of cache misses at all.

Now consider the cost. Taking the static overhead of a procedure frame on the
RS/6000 (6 words of header, at least 8 words for output parameters), the
typical number of saved registers (I assume 16 words) and some additional
local stack space (I assume another 16 words), a frame is about 46 words or
184 bytes. The cache line size on the RS/6000 is 128 bytes, so you would need
two additional instructions at each procedure entry and two at each procedure
exit (or three plus three for the cost-reduced CPU in the models 320 and 520,
with its 64-byte line size), adding some overhead to each procedure call.
While the overhead is not large, it may well eat up the benefits that we get
from the scheme.
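
The arithmetic above can be checked with a few lines of C. This is only a
sketch under the assumptions stated in the text (4-byte words, the guessed
16+16 words for saved registers and locals, and a frame that starts on a
cache line boundary). The names frame_words and lines_min are made up for
illustration and do not come from any IBM documentation.

```c
#include <assert.h>
#include <stddef.h>

enum { WORD_BYTES = 4 };  /* assumption: 32-bit words */

/* Frame size in words: header + output parameters + saved registers
 * + local space. The caller supplies the figures assumed in the post. */
size_t frame_words(size_t header, size_t out_params,
                   size_t saved_regs, size_t locals)
{
    return header + out_params + saved_regs + locals;
}

/* Minimum number of cache lines needed to cover `bytes`, assuming the
 * frame starts exactly on a line boundary (rounds up). This is also the
 * number of per-line instructions needed at entry, and again at exit. */
size_t lines_min(size_t bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return (bytes + line_bytes - 1) / line_bytes;
}
```

With the numbers from the text, frame_words(6, 8, 16, 16) gives 46 words,
i.e. 46 * WORD_BYTES = 184 bytes, and lines_min gives 2 lines at 128 bytes
per line and 3 lines at 64 bytes per line.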

Note that in order to make the same program run on machines with different
cache line sizes, some additional overhead to parametrize the entry and exit
code would have to be paid.
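
To give a rough idea of what such parametrized entry/exit code would have to
do, here is a sketch in C of a loop that walks over every cache line touched
by a frame, with the line size passed in at run time. It only counts the
per-line operations instead of issuing them. The function name
frame_line_ops is hypothetical, and the sketch assumes a power-of-two line
size. It also ignores the fact that a line only partly covered by the frame
holds neighbouring data as well, so it could not simply be zeroed or
invalidated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the per-line cache operations needed to cover the frame
 * [base, base + bytes) when the line size is known only at run time.
 * Assumes line_bytes is a power of two and that base + bytes does not
 * wrap around the address space. */
size_t frame_line_ops(uintptr_t base, size_t bytes, size_t line_bytes)
{
    assert(line_bytes != 0 && (line_bytes & (line_bytes - 1)) == 0);
    if (bytes == 0)
        return 0;

    uintptr_t mask  = ~(uintptr_t)(line_bytes - 1);
    uintptr_t first = base & mask;                /* line holding first byte */
    uintptr_t last  = (base + bytes - 1) & mask;  /* line holding last byte  */

    size_t ops = 0;
    for (uintptr_t line = first; line <= last; line += line_bytes) {
        /* placeholder: one cache-line instruction on `line` would go here */
        ops++;
    }
    return ops;
}
```

For a line-aligned 184-byte frame this gives 2 operations with 128-byte lines
and 3 with 64-byte lines, as in the estimate above. A frame that does not
start on a line boundary can touch one more line (for example 3 and 4 lines
for a frame starting at byte offset 100), and the loop control itself is the
extra overhead.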


Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch