Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!tut.cis.ohio-state.edu!sei.cmu.edu!ajpo!carters
From: carters@ajpo.sei.cmu.edu (Scott Carter)
Newsgroups: comp.arch
Subject: Re: cache pre-load/no-load instructions
Summary: Not so easy to make use of with GP compilers
Message-ID: <765@ajpo.sei.cmu.edu>
Date: 21 Mar 91 04:22:19 GMT
References:
Reply-To: carter%csvax.decnet@mdcgwy.mdc.com
Organization: McDonnell Douglas Electronic Systems Company
Lines: 58

In article jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from
>HP are:
>
>1) cache pre-load instructions (the compiler inserts these into the
>instr stream, and hopefully, the appropriate cache line will be available
>by the time it's needed, avoiding delays and speeding up single-task
>execution)
>
>2) cache no-load hints as a part of store instructions (useful to avoid
>useless cache loading for initialization statements, for faster program
>startup, and perhaps in other situations too)
>
>How effective are these optimizations likely to be? (While they aren't going
>to give the same kind of speedup as making the system super-scalar or
>super-pipelined, they strike me as effective tweaks.)
>

A military machine we're working on (whose name may Not be Mentioned) has
some similar capabilities. BTW, load to R0 may well turn out to be a cache
preload in some RISCs, depending on how the pipe control is implemented.
Definitely an unsupported feature :}

The technology we had to use had small caches and fairly slow memory, so
minimizing miss penalty certainly counted, as did not knocking a line out
with another line from which you only needed one word, and whose locality
was poor. Both individual loads and pages can be marked as no-allocate (the
data comes from the cache if there is a hit, but avoids the cache lockup
[and replace] on a miss).
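
For readers who want something concrete, here is a rough sketch of a preload with a "don't bother keeping this line" hint, as it might be written by hand in C. It is an illustration only and assumes GCC or Clang: `__builtin_prefetch` is their builtin, not an instruction of the HP machines or of the machine described above, and its locality argument of 0 is at best a loose analogue of a no-allocate load. The function name `sum_column` and the constant `PRELOAD_DISTANCE` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many rows ahead to issue the preload. */
enum { PRELOAD_DISTANCE = 16 };

/*
 * Sum one column of a row-major rows x cols matrix.  Each access touches
 * a different cache line and uses only one word of it, which is the
 * poor-locality, off-stride case discussed above.
 */
double sum_column(const double *m, size_t rows, size_t cols, size_t col)
{
    assert(m != NULL && col < cols);

    double sum = 0.0;
    for (size_t r = 0; r < rows; r++) {
        if (r + PRELOAD_DISTANCE < rows) {
            /* rw = 0: preload for reading.
             * locality = 0: little reuse expected, so the line need not
             * be kept around.  This is only a hint; the compiler and the
             * hardware are free to ignore it. */
            __builtin_prefetch(&m[(r + PRELOAD_DISTANCE) * cols + col], 0, 0);
        }
        sum += m[r * cols + col];
    }
    return sum;
}
```

The bounds check keeps the preload address inside the array. Whether a distance of 16 helps or hurts depends entirely on the memory latency and cache of the machine at hand, so it should be treated as something to measure, not as a recommendation.
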
The cache is physically addressed, so we can have allocating and
non-allocating aliases of important data structures. This is mostly useful
with large arrays which are sometimes addressed row-wise and sometimes
column-wise. The performance gain on matrix-multiplication type operations
(which we spend a lot of time doing) is fairly good versus just treating
the off-stride access as uncached.

The pipeline control to handle aborting a cache preload when a real miss
comes along is fairly unpleasant. There's no store hinting because our
cache is writethrough, non-allocating.

Note that the application programmer has to insert compiler pragmas to make
use of this (though the pipeline scheduler does have some heuristics about
which loads to promote the most). Methinks fully heuristic compilers for
this are still a research topic. See CARP at Purdue, for example.

>Does anyone else have them? I seem to recall a posting to the effect that
>the RS/6000 POWER architecture does not. What about MIPS, SPARC, etc? Is
>this a me-too feature?

I don't know of any other GP ISAs that have this.

Scott Carter - McDonnell Douglas Electronic Systems Company
carter%csvax.decnet@mdcgwy.mdc.com (preferred and faster) - or -
carters@ajpo.sei.cmu.edu
(714)-896-3097

The opinions expressed herein are solely those of the author, and are not
necessarily those of McDonnell Douglas.