Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Cache Line Fills -- Critical Word First
Message-ID: <41856@mips.mips.COM>
Date: 2 Oct 90 04:18:47 GMT
References: <34275@cup.portal.com> <14780@cbmvax.commodore.com>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 128

In article <14780@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
>In article <34275@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>>For example, if the miss is on address xxx1, the 486 will fill the cache line
>>in the order 1-0-3-2.  On SPARC or the 68040, the read continues to the end
>>of the block and wrapsaround, i.e. 1-2-3-0.
>
>>Does this little design tweak pay off?  Just recently, Motorola has provided
>>proof!  They've introduced two static RAM's with on-chip burst control logic
>>optimized for use as cache memories.  The MCM62940 is compatible with
>>conventional burst ordering.  The MCM62486 uses Intel-style burst ordering.
>>The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns.
>>(Max. access time.)  The MCM62486 is offered in:  14, 19, and 24 ns.  Intel
>>wins by a nanosecond!
>
>	However, does having to wait for the last word of the burst for the
>next entry (#2) cost you anything?  For instructions at least, there is
>a definite tendency towards consecutive access (and even somewhat in data,
>when operating on blocks such as strings or copies).

This is discussed in Patterson & Hennessy, near end of Chapter 8.
They note it doesn't gain as much as you'd think.

How much this is worth depends a lot on the following attributes, at least:
a) I-cache versus D-cache
b) the pipeline structure
	instruction latencies
	issue parallelism
c) the memory system
	cache structure & buffering
	main memory latency & number of pipes to it

For the I-cache, when you have a cache miss:
	I1) You can stall the machine, fetch the entire cache block,
	then restart.  This is clearly the simplest.
	I2) You can do "early restart", where you begin executing as soon
	as the requested word is available.  This is sometimes called
	"Instruction Streaming" (in the MIPS R3000), i.e.,  when you
	cache miss:
		start fetching at word 0 of the block
		stall until the needed word is fetched, then stream
		if you branch elsewhere before the end of the block,
		stop streaming, stall the pipeline until block filled
		also, if a load/store causes a cache miss, complete
		the I-cache refill, then handle the D-cache miss
	I3) You can do "out-of-order fetch" in addition to early restart,
	and then do "wrapped fetch", so that you wrap-around to complete.

Cache miss penalty = access time + transfer time.
In all of these cases, the cache miss penalty is the same, but I2 and I3
try to "cover" some of the transfer time by converting stalls into
issues. (It's hard to do much about access time for I-cache misses :-)
I2 and I3 are the same, except that I3 gets to issue
instructions earlier.  This doesn't do much for short-latency operations,
but can help those with longer latency, especially if there are multiple
independent units.  In practice, this means that:
	- The approach probably helps floating point codes more than integer
	- It helps pipelined-FP units more than low-latency scalar ones
	- It can REALLY help the class of program that consists of
	a giant loop filled with sequential code larger than the cache
	size.  (Now you see the real reason why such features are included,
	as this class of program happens be very important to chip designers,
	and they take care of themselves :-)

For the I-cache, it is known that the first word of a cache block accounts
for an unusually high share of the misses, due to fall-thru. I think this
is true across a wide range of programs.  Hence, there is even less
difference between I2 and I3 than you'd expect, since they behave
identically in ths case.  I can't recall the numbers, but I think the
simulations done when the R3000 was being desinged showed relatively little
difference, but note that the R3000's FP operations are fairly low
latency also, hence woudl helped less than some others designed then.

For the D-cache, it's more of the same, except with more options,
and more complexity, as there are many more options for D-cache
implementation.  Here are few of the relevant items:

- Unlike the I-cache, where word 0 of a block is more likely cause a cache miss,
and this is true of almost every program, the effect is much more
program-dependent for the D-cache.  Some programs make many sequential
references likely to cause cache misses, others much less so.

- Unlike the I-cache, where successive instructions continue accessing
the same block, loads and stores may well access different blocks,
and this implies extra complexity or buffering.  For example, if you have:
	load	r1,a
	load	r2,a+4
	load	r3,a+8
	etc
then using read-data-streaming works real well, as does:
	load	r1,a
	load	r2,a+4
	div	r1,r3
	etc
where you get to start the divide early

On the other hand,
	load	r1,a
	load	r2,b		b in different cache block
	store	r1,c		c in third block
gets awkward.  Although a load of a+4 is probably referencing an
internal cache-line buffer, the load of b is referencing the cache,
and would do so at the same time as the cache is being refilled.
Should the refill have priority, or the load of b?
Some of your choices depend on how many memory pipes you have.

Big machines, with long memory latencies (in terms of cycle counts),
heavy pipelining for FP, and multiple memory pipes, can contemplate
more parallelism, than makes sense for a micro with 1 memory pipe,
where most attempts to "get ahead" quickly run into the constriction
of the single pipe.

All of this, especially the D-cache choices, is likely to be very
benchmark-dependent.  I'm sure there are programs where you can get
2X better by wrap-around with early restart, and otheres where you
get little or nothing.

All of this is just a long way of saying there is no simple answer
to the question "is this worth it?"  and that finding the answer
takes detailed simulation of many, big programs running on
simulators that know every detail of the pipeline, cache, etc.
(I've often commented in public talks that we had something like 400-500
VAX-mips used for the design of some future chip.  The designers
beat me up, saying it's more like 2X that these days...)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086