Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!mips!winchester!mash From: mash@mips.COM (John Mashey) Newsgroups: comp.arch Subject: Re: Cache Line Fills -- Critical Word First Message-ID: <41856@mips.mips.COM> Date: 2 Oct 90 04:18:47 GMT References: <34275@cup.portal.com> <14780@cbmvax.commodore.com> Sender: news@mips.COM Reply-To: mash@mips.COM (John Mashey) Organization: MIPS Computer Systems, Inc. Lines: 128 In article <14780@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes: >In article <34275@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes: >>For example, if the miss is on address xxx1, the 486 will fill the cache line >>in the order 1-0-3-2. On SPARC or the 68040, the read continues to the end >>of the block and wrapsaround, i.e. 1-2-3-0. > >>Does this little design tweak pay off? Just recently, Motorola has provided >>proof! They've introduced two static RAM's with on-chip burst control logic >>optimized for use as cache memories. The MCM62940 is compatible with >>conventional burst ordering. The MCM62486 uses Intel-style burst ordering. >>The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns. >>(Max. access time.) The MCM62486 is offered in: 14, 19, and 24 ns. Intel >>wins by a nanosecond! > > However, does having to wait for the last word of the burst for the >next entry (#2) cost you anything? For instructions at least, there is >a definite tendency towards consecutive access (and even somewhat in data, >when operating on blocks such as strings or copies). This is discussed in Patterson & Hennessy, near end of Chapter 8. They note it doesn't gain as much as you'd think. How much this is worth depends a lot on the following attributes, at least: a) I-cache versus D-cache b) the pipeline structure instruction latencies issue parallelism c) the memory system cache structure & buffering main memory latency & number of pipes to it For the I-cache, when you have a cache miss: I1) You can stall the machine, fetch the entire cache block, then restart. This is clearly the simplest. I2) You can do "early restart", where you begin executing as soon as the requested word is available. This is sometimes called "Instruction Streaming" (in the MIPS R3000), i.e., when you cache miss: start fetching at word 0 of the block stall until the needed word is fetched, then stream if you branch elsewhere before the end of the block, stop streaming, stall the pipeline until block filled also, if a load/store causes a cache miss, complete the I-cache refill, then handle the D-cache miss I3) You can do "out-of-order fetch" in addition to early restart, and then do "wrapped fetch", so that you wrap-around to complete. Cache miss penalty = access time + transfer time. In all of these cases, the cache miss penalty is the same, but I2 and I3 try to "cover" some of the transfer time by converting stalls into issues. (It's hard to do much about access time for I-cache misses :-) I2 and I3 are the same, except that I3 gets to issue instructions earlier. This doesn't do much for short-latency operations, but can help those with longer latency, especially if there are multiple independent units. In practice, this means that: - The approach probably helps floating point codes more than integer - It helps pipelined-FP units more than low-latency scalar ones - It can REALLY help the class of program that consists of a giant loop filled with sequential code larger than the cache size. (Now you see the real reason why such features are included, as this class of program happens be very important to chip designers, and they take care of themselves :-) For the I-cache, it is known that the first word of a cache block accounts for an unusually high share of the misses, due to fall-thru. I think this is true across a wide range of programs. Hence, there is even less difference between I2 and I3 than you'd expect, since they behave identically in ths case. I can't recall the numbers, but I think the simulations done when the R3000 was being desinged showed relatively little difference, but note that the R3000's FP operations are fairly low latency also, hence woudl helped less than some others designed then. For the D-cache, it's more of the same, except with more options, and more complexity, as there are many more options for D-cache implementation. Here are few of the relevant items: - Unlike the I-cache, where word 0 of a block is more likely cause a cache miss, and this is true of almost every program, the effect is much more program-dependent for the D-cache. Some programs make many sequential references likely to cause cache misses, others much less so. - Unlike the I-cache, where successive instructions continue accessing the same block, loads and stores may well access different blocks, and this implies extra complexity or buffering. For example, if you have: load r1,a load r2,a+4 load r3,a+8 etc then using read-data-streaming works real well, as does: load r1,a load r2,a+4 div r1,r3 etc where you get to start the divide early On the other hand, load r1,a load r2,b b in different cache block store r1,c c in third block gets awkward. Although a load of a+4 is probably referencing an internal cache-line buffer, the load of b is referencing the cache, and would do so at the same time as the cache is being refilled. Should the refill have priority, or the load of b? Some of your choices depend on how many memory pipes you have. Big machines, with long memory latencies (in terms of cycle counts), heavy pipelining for FP, and multiple memory pipes, can contemplate more parallelism, than makes sense for a micro with 1 memory pipe, where most attempts to "get ahead" quickly run into the constriction of the single pipe. All of this, especially the D-cache choices, is likely to be very benchmark-dependent. I'm sure there are programs where you can get 2X better by wrap-around with early restart, and otheres where you get little or nothing. All of this is just a long way of saying there is no simple answer to the question "is this worth it?" and that finding the answer takes detailed simulation of many, big programs running on simulators that know every detail of the pipeline, cache, etc. (I've often commented in public talks that we had something like 400-500 VAX-mips used for the design of some future chip. The designers beat me up, saying it's more like 2X that these days...) -- -john mashey DISCLAIMER: UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086