Newsgroups: comp.arch
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!wuarchive!mit-eddie!uw-beaver!rice!ariel.rice.edu!preston
From: preston@ariel.rice.edu (Preston Briggs)
Subject: Re: cache pre-load/no-load instructions
Message-ID: <1991Mar21.161044.2898@rice.edu>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
References: <765@ajpo.sei.cmu.edu>
Date: Thu, 21 Mar 91 16:10:44 GMT

jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from
>>
>>1) cache pre-load instructions (the compiler inserts these into the
>>instr stream, and hopefully, the appropriate cache line will be available
>>by the time it's needed, avoiding delays and speeding up single-task
>>execution)
>>
>>2) cache no-load hints as a part of store instructions (useful to avoid
>>useless cache loading for initialization statements, for faster program
>>startup, and perhaps in other situations too)

At the upcoming ASPLOS, there's a paper called "Software Prefetching",
by Callahan, Kennedy, and Porterfield, describing compiler mechanisms
to take advantage of cache pre-fetch instructions (1 above).
These mechanisms seem very effective for scientific code.

The RS/6000 includes 2 interesting possibilities. One is an instruction
that zeroes a line in the data cache (without fetching it). It may be used
like (2 above) and is additionally handy for zeroing big chunks of memory.
The other is an "invalidate line" instruction, which says:
"don't bother writing this one back to memory."

>>How effective are these optimizations likely to be? (While they aren't going
>>to give the same kind of speedup as making the system super-scalar or
>>super-pipelined, they strike me as effective tweaks.)

This sort of thing can be very important. One of the basic problems
of the i860 (for an example) is its low off-chip memory bandwidth,
at least in relation to its FP performance.
Instruction-level parallelism (pipelines, wide instructions, superscalar,
speculative execution) is ok for getting the FP performance up,
but the processor will starve without lots of bandwidth.

Preston Briggs
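
P.S. As a rough illustration of item (1) above, here is what a pre-load
hint inserted into a simple loop might look like at the source level.
This is only a sketch under stated assumptions, not the method from the
Callahan/Kennedy/Porterfield paper and not HP's compiler output. It assumes
GCC or Clang, whose __builtin_prefetch builtin stands in for a hardware
pre-load instruction. The names sum_with_prefetch and PREFETCH_DISTANCE
are hypothetical, and the distance of 16 elements is an arbitrary guess
that would need tuning against real memory latency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to request.
   The right value depends on memory latency and loop cost. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting the cache about data that will be
   needed a few iterations from now (assumes GCC or Clang). */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);

    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Only a hint: the hardware may ignore it, and the result
           is the same either way. Args: address, 0 = read access,
           3 = keep in all cache levels. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        s += a[i];
    }
    return s;
}
```

The prefetch changes only timing, never the value computed, so the loop
can be checked against a plain sum. Whether it actually helps depends on
the machine; on hardware with its own stride prefetcher it may make no
measurable difference.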