Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!think.com!mintaka!bloom-beacon!eru!hagbard!sunic!mcsun!cernvax!chx400!chx400!bernina!neptune!inf.ethz.ch!brandis
From: brandis@inf.ethz.ch (Marc Brandis)
Newsgroups: comp.arch
Subject: Re: cache pre-load/no-load instructions
Message-ID: <27671@neptune.inf.ethz.ch>
Date: 22 Mar 91 08:07:03 GMT
References: <765@ajpo.sei.cmu.edu> <1991Mar21.161044.2898@rice.edu>
Sender: news@neptune.inf.ethz.ch
Reply-To: brandis@inf.ethz.ch (Marc Brandis)
Organization: Departement Informatik, ETH, Zurich
Lines: 48

In article <1991Mar21.161044.2898@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>The RS/6000 includes 2 interesting possibilities.
>An instruction that zeroes a line in the data cache (without
>fetching it). May be used like (2 above); additionally handy for zeroing
>big chunks of memory. They also include an "invalidate line"
>instruction which says: "don't bother writing this one back to memory."
>

Unfortunately, IBM made these instructions privileged. They had some good
reasons to do so, since the instructions ignore lock and protection bits. I do
not know why the instructions could not be made to check the bits, however.

I am not sure whether having these instructions in user mode would be a great
advantage. DCLSZ (data cache line set zero) can be used to initialize large
chunks of memory, of course. The other obvious use of the DCLSZ and CLI
(cache line invalidate) instructions is to control the allocation and
deallocation of procedure frames on the stack, so that no memory references
are generated for newly allocated stack space and no deallocated stack space
is written back to memory.

I do not think that this mechanism would really improve the performance of
current programs. Many programs consume only a few kilobytes of stack space
and exhibit a large amount of spatial locality in their references.
The number of frames on the stack is almost constant over large fractions of
many programs, and so is the top of the stack. From this point of view, it is
very unlikely that stack references cause cache misses, so this
'optimization' would not reduce the number of cache misses at all.

Now consider its cost. Taking the static overhead of a procedure frame on the
RS/6000 (6 words of header, at least 8 words for output parameters), the
typical number of saved registers (I assume 16 words), and some additional
local stack space (I assume another 16 words), a frame is about 46 words or
184 bytes large. The cache line size on the RS/6000 is 128 bytes, so a frame
of this size covers two lines, and you would need two additional instructions
at each procedure entry and two at each procedure exit (or three plus three
for the cost-reduced CPU in the models 320 and 520, whose 64-byte line size
makes the same frame cover three lines), adding some overhead to each
procedure call. While the overhead is not large, it may well eat up the
benefits that we would get from the scheme.

Note that in order to make the same program run on machines with different
cache line sizes, some additional overhead to parametrize the entry and exit
code would have to be paid.

Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch
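
P.S. The frame-size arithmetic above can be checked with a few lines of
Python. This is only a sketch under stated assumptions: words are 4 bytes
(which is what makes 46 words come out to 184 bytes), each frame starts on a
cache line boundary, and one cache-line instruction is needed per line at
entry and again at exit. The function and parameter names are made up for
illustration and do not correspond to anything in the RS/6000 tool chain.

```python
WORD_BYTES = 4  # assumption: 32-bit words, matching 46 words = 184 bytes


def frame_bytes(header_words=6, out_param_words=8,
                saved_reg_words=16, local_words=16):
    """Size in bytes of the example frame, using the word counts from the post."""
    words = header_words + out_param_words + saved_reg_words + local_words
    return words * WORD_BYTES


def lines_covered(frame_size, line_size):
    """Number of cache lines covered by a frame.

    Assumption: the frame starts on a line boundary. A frame that is not
    aligned may straddle one more line than this count.
    """
    return (frame_size + line_size - 1) // line_size  # integer ceiling


if __name__ == "__main__":
    size = frame_bytes()
    print(f"frame size: {size} bytes")
    for line_size in (128, 64):
        n = lines_covered(size, line_size)
        print(f"{line_size}-byte lines: {n} extra instructions at entry, "
              f"{n} at exit")
```

Making the line size a parameter, as here, is also the kind of
parametrization that the entry and exit code itself would need in order to
run unchanged on machines with different cache line sizes.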