Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!tut.cis.ohio-state.edu!sei.cmu.edu!ajpo!carters
From: carters@ajpo.sei.cmu.edu (Scott Carter)
Newsgroups: comp.arch
Subject: Re: cache pre-load/no-load instructions
Summary: Not so easy to make use of with GP compilers
Message-ID: <765@ajpo.sei.cmu.edu>
Date: 21 Mar 91 04:22:19 GMT
References: <JONATHAN.91Mar17034438@speedy.cs.pitt.edu>
Reply-To: carter%csvax.decnet@mdcgwy.mdc.com
Organization: McDonnell Douglas Electronic Systems Company
Lines: 58

In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from 
>HP are:
>
>1)  cache pre-load instructions (the compiler inserts these into the
>instr stream, and hopefully, the appropriate cache line will be available
>by the time it's needed, avoiding delays and speeding up single-task 
>execution) 
>
>2) cache no-load hints as a part of store instructions (useful to avoid
>useless cache loading for initialization statements, for faster program
>startup, and perhaps in other situations too)
>
>How effective are these optimizations likely to be?  (While they aren't going
>to give the same kind of speedup as making the system super-scalar or 
>super-pipelined, they strike me as effective tweaks.)  
>

A military machine we're working on (whose name may Not be Mentioned) has some
similar capabilities.  BTW, a load to R0 (the hardwired-zero register) may well
turn out to act as a cache preload in some RISCs, depending on how the pipeline
control is implemented.  Definitely an unsupported feature :}
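
To put rough numbers on what a preload buys, here is a toy cycle count.  It is
only a sketch: the function name, the latencies, and the assumption that any
number of preloads can be in flight at once are all mine, not properties of any
real machine.

```python
def total_cycles(n_lines, latency, work, distance):
    """Cycles to walk n_lines cache lines in order.

    latency  -- cycles from issuing a load until the line arrives (assumed)
    work     -- compute cycles spent on each line once it is available
    distance -- how many lines ahead a non-blocking preload is issued
                (0 means no preloading at all)
    Hypothetical model: unlimited outstanding preloads, no bus contention.
    """
    ready = {}   # line number -> cycle at which its preloaded data arrives
    now = 0
    for i in range(n_lines):
        # Issue a preload for a later line; it does not stall the pipeline.
        if distance > 0 and i + distance < n_lines:
            ready[i + distance] = now + latency
        # Demand access to line i: wait for the preload, or take a full miss.
        arrival = ready.get(i, now + latency)
        now = max(now, arrival)
        now += work
    return now


if __name__ == "__main__":
    print("no preload:      ", total_cycles(8, 10, 5, 0))
    print("preload 2 ahead: ", total_cycles(8, 10, 5, 2))
```

The first two lines still take full misses because nothing was preloaded for
them; after that the preloads stay just far enough ahead to hide the latency.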

The technology we had to use had small caches and fairly slow memory, so
minimizing the miss penalty certainly counted, as did not evicting a useful
line in favor of one with poor locality from which you needed only a single
word.

Both individual loads and pages can be marked as no-allocate (on a hit the
data comes from the cache; on a miss the load avoids the cache lockup [and
replacement]).  The cache is physically addressed, so we can have allocating
and non-allocating aliases of important data structures.  This is mostly
useful with large arrays which are sometimes addressed row-wise and sometimes
column-wise.  The performance gain on matrix-multiplication-type operations
(which we spend a lot of time doing) is fairly good versus just treating the
off-stride access as uncached.
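
A toy simulation of the aliasing idea, for anyone who wants to play with it.
Everything here is an assumption for illustration (a tiny direct-mapped cache,
a 4x4 matrix, the policy names); it is not a model of our machine.  Row-wise
loads always allocate; column-wise loads of the same array use one of three
policies.

```python
LINE_WORDS = 2   # words per cache line (hypothetical)
NUM_LINES = 4    # direct-mapped cache of 8 words (hypothetical)
N = 4            # N x N matrix, row-major, base address 0


def run(column_policy):
    """Count cache hits for an A*A style loop nest.

    column_policy applies to the column-wise loads only:
      "allocate"    -- ordinary load, fills the cache on a miss
      "no_allocate" -- hits if the line is present, never fills on a miss
      "uncached"    -- bypasses the cache completely
    """
    lines = [None] * NUM_LINES   # line number held at each cache index
    hits = 0

    def load(addr, policy):
        nonlocal hits
        if policy == "uncached":
            return                       # never looks in the cache
        line = addr // LINE_WORDS
        index = line % NUM_LINES
        if lines[index] == line:
            hits += 1
        elif policy == "allocate":
            lines[index] = line          # miss: replace the resident line
        # "no_allocate" miss: data comes from memory, cache left untouched

    for i in range(N):
        for j in range(N):
            for k in range(N):
                load(i * N + k, "allocate")      # A[i][k], row-wise
                load(k * N + j, column_policy)   # A[k][j], column-wise
    return hits


results = {p: run(p) for p in ("allocate", "no_allocate", "uncached")}

if __name__ == "__main__":
    for policy, hits in results.items():
        print(policy, hits, "hits out of", 2 * N ** 3, "loads")
```

In this small setup the no-allocate alias gets the most hits: the column walk
no longer evicts the row being reused, yet it still hits on whatever the
row-wise alias has already brought in, which the uncached version cannot do.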

The pipeline control to handle aborting a cache preload when a real miss comes
along is fairly unpleasant.

There's no store hinting because our cache is write-through and non-allocating
on stores.

Note that the application programmer has to insert compiler pragmas to
make use of these features (though the pipeline scheduler does have some
heuristics about which loads to promote the most).  Methinks fully heuristic
compilers for this are still a research topic.  See CARP at Purdue, for example.

>Does anyone else have them?  I seem to recall a posting to the effect that
>the RS/6000 POWER architecture does not.  What about MIPS, SPARC, etc?  Is
>this a me-too feature?

I don't know of any other general-purpose (GP) ISAs that have this.

Scott Carter - McDonnell Douglas Electronic Systems Company
carter%csvax.decnet@mdcgwy.mdc.com (preferred and faster) - or -
carters@ajpo.sei.cmu.edu		 (714)-896-3097
The opinions expressed herein are solely those of the author, and are not
necessarily those of McDonnell Douglas.