Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!rice!ariel.rice.edu!preston
From: preston@ariel.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: cache pre-load/no-load instructions
Message-ID: <1991Mar22.160421.27462@rice.edu>
Date: 22 Mar 91 16:04:21 GMT
References: <765@ajpo.sei.cmu.edu> <1991Mar21.161044.2898@rice.edu> <27671@neptune.inf.ethz.ch>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 43

I wrote:
>>The RS/6000 includes 2 interesting possibilities.
>>An instruction that zeroes a line in the data cache (without
>>fetching it).  May be used like (2 above); additionally handy for zeroing
>>big chunks of memory.  They also include an "invalidate line"
>>instruction which says: "don't bother writing this one back to memory."

and brandis@inf.ethz.ch (Marc Brandis) writes:

>Unfortunately, IBM made these instructions privileged. They had some good
>reasons to do it, as the instructions ignore lock and protection bits. I do
>not know the reasons why they could not make them check the bits, however.
>
>I am not sure whether having these instructions in user mode would be a great
>advantage. DCLSZ (data cache line set zero) can be used to initialize large
>chunks of memory, of course. The other obvious target for the DCLSZ and CLI
>(cache line invalidate) instructions is to control the allocation and 
>deallocation of procedure frames on the stack so that no memory references
>are generated for newly allocated stack space and that no deallocated stack
>space will be written back to memory. 

Implementing Fortran, I would have used them on large arrays.  When a
BLAS routine merely overwrites its destination, we can save a lot of
cache misses by not fetching the destination first.  Similarly, when
we're done with a temporary workspace, we may simply invalidate it
rather than write it back.

The difficulty is alignment.  With long cache lines, an array rarely
begins and ends on line boundaries, so it seems hard to ensure that
nothing extraneous is accidentally zeroed (or invalidated) at the edges.
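
To make the alignment problem concrete, here is a rough sketch in C of
the bookkeeping a compiler or runtime would need: given a byte range,
find the cache lines that lie wholly inside it.  Only those lines could
safely be zeroed or invalidated; the ragged edges need ordinary stores.
The names (LINE_BYTES, whole_lines) are made up for illustration, and
the 128-byte line length is an assumption, not a statement about any
particular machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed line length for illustration; the real value depends on
   the implementation, which is exactly the problem. */
#define LINE_BYTES 128u

/* The portion of a byte range that consists of whole cache lines. */
struct line_span {
    uintptr_t first;   /* address of the first fully covered line */
    size_t    count;   /* number of fully covered lines (may be 0) */
};

/* Return the whole lines inside [start, start + len).
   Bytes before `first` and after the last whole line share a line
   with unrelated data and must not be zeroed or invalidated. */
static struct line_span whole_lines(uintptr_t start, size_t len)
{
    const uintptr_t mask = (uintptr_t)LINE_BYTES - 1;
    uintptr_t end   = start + len;            /* one past the range   */
    uintptr_t first = (start + mask) & ~mask; /* round start up       */
    uintptr_t last  = end & ~mask;            /* round end down       */
    struct line_span s = { first, 0 };

    if (last > first)
        s.count = (size_t)(last - first) / LINE_BYTES;
    return s;
}
```

For a 300-byte range starting at address 100, only the two lines at 128
and 256 qualify; the remaining 44 bytes at the edges do not.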

Brandis also makes the point that the compiler would have to be parameterized
to account properly for cache line length.  True!  Generally, compilers
are written to the architecture, not the implementation; cache is usually
part of the implementation.  However, instruction schedulers are bending
this idea already, and various cache blocking techniques (often
applied at the source level) bend it further still.  You have to work
hard for performance.
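
As a small illustration of source-level cache blocking, here is a
sketch of a blocked matrix transpose in C.  The tile size TILE is the
implementation-dependent knob; the value 16 is an arbitrary assumption,
and the routine name is made up for this example.

```c
#include <stddef.h>

/* Tile edge in elements.  A good value depends on cache size and
   line length; 16 is only a placeholder. */
#ifndef TILE
#define TILE 16
#endif

/* Transpose the n-by-n row-major matrix a into b, one tile at a
   time, so that each tile of a and b is reused while still cached. */
void transpose_blocked(size_t n, const double *a, double *b)
{
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clip the tile at the matrix edge. */
            size_t imax = (ii + TILE < n) ? ii + TILE : n;
            size_t jmax = (jj + TILE < n) ? jj + TILE : n;

            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    b[j * n + i] = a[i * n + j];
        }
    }
}
```

Retuning for a different machine means changing TILE, which is the
kind of implementation knowledge that leaks into the source.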

Summarizing, I can't argue that the RS/6000's instructions are practical
as they stand, and I don't have compiler techniques to use them yet.
However, they (along with HP's cache instructions) are interesting ideas
and probably worth some study.

Preston Briggs