Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!gatech!amdcad!tim
From: tim@amdcad.UUCP
Newsgroups: comp.arch
Subject: Re: Branch Target Cache vs. Instruction Cache
Message-ID: <16687@amdcad.AMD.COM>
Date: Sat, 16-May-87 14:02:07 EDT
Article-I.D.: amdcad.16687
Posted: Sat May 16 14:02:07 1987
Date-Received: Sat, 16-May-87 22:02:24 EDT
References: <3810030@nucsrl.UUCP> <491@necis.UUCP> <3530@spool.WISC.EDU> <397@dumbo.UUCP>
Organization: Advanced Micro Devices, Inc., Sunnyvale, Ca.
Lines: 163

In article <397@dumbo.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:

+-----
| My point was that the four words of information in the BTC is not sufficient
| to cover the latency of DRAM accesses in a real system environment. I would
| freely acknowlege that it _is_ sufficient if the next level store is SRAM -
| however, in that case, the BTC is a _suppliment_ to the SRAM-based
| instruction cache. In fact, the system that is simulated by AMD in their
| performance simulation has an external cache.
+-----
You correctly point out one of the uses of the BTC -- a supplement to an
external instruction cache.  However, it is certainly not the case that
all Am29000 systems will be Unix supermicros with 32 MB of main memory
and ECC.  In many potential applications, a small (1-6 MB) amount of
main memory is sufficient.  In these applications, video-DRAM based
memory designs can perform 4-cycle accesses with single-cycle burst
accesses for the instruction stream.  The BTC is then used to cover the
latency of a branch while the first instruction access is performed to
the memory.

+-----
| I quote from the McFarling and Hennessey paper "Reducing the cost of
| branches," (13th Annual Int. Symp. on Computer Architecture, June 1986) the
| following statement: "In our simulations, we noticed that a direct mapped
| BTB and an instruction cache of the same size had about the same hit ratio."
+-----
It is very hard to generalize the relative performance of a BTC vs a
standard instruction cache for all machines and cache sizes. The term
"hit ratio" doesn't even mean the same thing.  When we calculate hit
ratio for the BTC, it is done only for branch instructions and other
"branch-like" operations, such as interrupt vectoring and returning.  If
there is a miss, the subsequent instructions which are then cached are
*not* counted as hits, as would be the case for a standard instruction cache.
Given the limited chip area we could afford, the BTC gives us better
overall *performance* than a similar-sized instruction cache.
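
A toy illustration of why the two "hit ratios" are not comparable (a
sketch only, not the simulator used for the numbers in this post; the
trace and both function names are made up for the example).  The same
eight fetches are scored once with the BTC convention and once with the
usual instruction-cache convention:

```python
# Toy illustration only: a made-up fetch trace, not data from any real run.
# Each entry is (is_branch_target, found_in_cache).

def btc_hit_ratio(trace):
    """BTC convention: score only the fetches that are branch targets."""
    targets = [hit for is_target, hit in trace if is_target]
    return sum(targets) / len(targets)

def icache_hit_ratio(trace):
    """Instruction-cache convention: score every instruction fetch."""
    return sum(hit for _, hit in trace) / len(trace)

# A branch target that misses, followed by three sequential fetches that
# hit, then a branch target that hits, followed by three more hits.
trace = [
    (True, False), (False, True), (False, True), (False, True),
    (True, True),  (False, True), (False, True), (False, True),
]

print(btc_hit_ratio(trace))     # 0.5
print(icache_hit_ratio(trace))  # 0.875
```

The sequential fetches after the first miss raise the instruction-cache
figure but are never counted by the BTC figure, so equal "hit ratios"
under the two conventions do not imply equal performance.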

+-----
| Time and considerable further development will occur before AMD can supply
| performance data from running large benchmarks under real system conditions,
| and I understand that we'll have to make do with what we've got.  However
| the only benchmark of potentially meaningful size from AMD (sipasm) performs
| substantially less than 17 (780-relative) MIPS with an external cache as
| well as burst-mode memory, and as I understand it, these results do not take
| multiprogramming cache effects and finite memory write speed into account.
+-----

Here are some numbers for some "substantial" programs.

	(*NOTE* -- all MIPS numbers are Am29000 MIPS.  17 Am29000 MIPS ~=
	 15 VAX 11/780 MIPS, but again, this is hard to generalize. 
	 Craig correctly points out that multiprogramming cache effects
	 were not taken into account, mainly because to do so accurately
	 requires a full multitasking kernel simulation and a "general"
	 simulated machine load, whatever that is.)

	First system specification:
	
	separate, 64K byte external instruction & data caches -- 2 cycle
	access with single-cycle burst capability.  4-cycle main memory
	*access time* with single-cycle burst (i.e., video DRAM).  Data
	cache is write-through with a *single* write-buffer (finite
	memory write speed *is* taken into account).

Assembler:
Statistics of "simasm" simulation:

User Mode:		  377268 cycles	(0.01509072 seconds)
Supervisor Mode:	   25497 cycles	(0.00101988 seconds)
Total:			  402765 cycles	(0.01611060 seconds)
Simulation speed:	 16.25 MIPS (1.54 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nroff:
Statistics of "nroff" simulation:

User Mode:		  126780 cycles	(0.00507120 seconds)
Supervisor Mode:	    3226 cycles	(0.00012904 seconds)
Total:			  130006 cycles	(0.00520024 seconds)
Simulation speed:	 17.99 MIPS (1.39 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff:
Statistics of "diff" simulation:

User Mode:		  397135 cycles	(0.01588540 seconds)
Supervisor Mode:	     610 cycles	(0.00002440 seconds)
Total:			  397745 cycles	(0.01590980 seconds)
Simulation speed:	 18.33 MIPS (1.36 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	Second system specification:
	
	NO external caches. Video-DRAM main memory, 4-cycle access,
	single-cycle instruction burst. 4-cycle loads and stores. (Loads
	and stores are not, as yet, scheduled by the compiler.)

Assembler:
Statistics of "simasm" simulation:

User Mode:		  478285 cycles	(0.01913140 seconds)
Supervisor Mode:	   35565 cycles	(0.00142260 seconds)
Total:			  513850 cycles	(0.02055400 seconds)
Simulation speed:	 12.90 MIPS (1.94 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
nroff:
Statistics of "nroff" simulation:

User Mode:		  151836 cycles	(0.00607344 seconds)
Supervisor Mode:	    3998 cycles	(0.00015992 seconds)
Total:			  155834 cycles	(0.00623336 seconds)
Simulation speed:	 14.62 MIPS (1.71 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff:
Statistics of "diff" simulation:

User Mode:		  526944 cycles	(0.02107776 seconds)
Supervisor Mode:	     650 cycles	(0.00002600 seconds)
Total:			  527594 cycles	(0.02110376 seconds)
Simulation speed:	 13.15 MIPS (1.90 cycles per instruction)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As you can see, a substantial fraction of the potential Am29000
performance (roughly 72-81%, going by the MIPS figures above) can be had
*without caches* (and most of the difference would be made up by adding
a data cache only).
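
The arithmetic behind the two sets of tables can be checked with a few
lines of Python.  This is only a sketch over the figures printed above:
the "cached"/"uncached" labels and the function names are mine, and the
clock rate is inferred from the cycle and time totals rather than stated
anywhere in the tables.

```python
# Figures copied from the simulation tables in this post:
# (total cycles, total seconds, reported MIPS) per program and system.
runs = {
    "simasm": {"cached": (402765, 0.01611060, 16.25),
               "uncached": (513850, 0.02055400, 12.90)},
    "nroff":  {"cached": (130006, 0.00520024, 17.99),
               "uncached": (155834, 0.00623336, 14.62)},
    "diff":   {"cached": (397745, 0.01590980, 18.33),
               "uncached": (527594, 0.02110376, 13.15)},
}

def clock_mhz(cycles, seconds):
    """Clock rate implied by a cycle count and its elapsed time."""
    return cycles / seconds / 1e6

def fraction_retained(cached_mips, uncached_mips):
    """Share of the cached system's MIPS that the cacheless system reaches."""
    return uncached_mips / cached_mips

for name, cfg in runs.items():
    cycles, seconds, cached_mips = cfg["cached"]
    _, _, uncached_mips = cfg["uncached"]
    print(f"{name}: {clock_mhz(cycles, seconds):.1f} MHz, "
          f"{fraction_retained(cached_mips, uncached_mips):.0%} of cached MIPS")
```

Every row works out to a 40 ns cycle (25 MHz), and the cacheless system
reaches about 79%, 81% and 72% of the cached MIPS figure for simasm,
nroff and diff respectively.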

+-----
| I'm at a disadvantage in quoting performance data for the AMD part, but the
| branch target cache miss rate on the larger benchmarks is in the 50% range,
| is it not? That would mean that much of the branch behaviour of these larger
| programs is not simple loops, and the analogy of the AMD 29k BTC holding 32
| loops of any size isn't really valid.
+-----
Yes, we attempt to cache *all* branch targets, not just loops.  They are
a mix of conditional branch, unconditional branch, and procedure call
targets, as well as interrupt vector heads and instructions following a
page boundary crossing.
 
More numbers:

Assembler:	Branch cache hit ratio:	 42.59% (our lowest)
nroff:		Branch cache hit ratio:	 59.88%
diff:		Branch cache hit ratio:	 86.49%


+-----
|                                   ... What looks bad for the AMD part is
| that branches that miss in the BTC get a four-cycle delay. (Please correct
| me if I'm wrong on this, but I'm assuming that the 4 words in each BTC entry
| reflect that it will take at least 4 cycles to start up again after a BTC
| miss.) At risk of contradicting statement made earlier in this message,
| that would cause the average branch time to be about 3.5 cycles.
+-----
No, a BTC miss stalls the pipeline only for the time it takes to fetch
the first instruction from external memory.
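
As a rough back-of-the-envelope model (a sketch, not simulator output),
the expected stall per BTC lookup can be estimated from the hit ratios
quoted above.  Two assumptions here are mine and are not stated in this
post: a BTC hit costs no stall cycles, and a miss stalls for exactly the
first-access latency of whatever sits behind the processor (2 cycles for
the external cache, 4 cycles for the video DRAM).

```python
# Assumptions (not from the post): a BTC hit costs zero stall cycles, and a
# miss stalls for exactly the first-access latency of the external memory.

def expected_stall_per_branch(hit_ratio, first_access_cycles):
    """Average stall cycles per BTC lookup under the assumptions above."""
    return (1.0 - hit_ratio) * first_access_cycles

# Branch cache hit ratios as quoted in this post.
hit_ratios = {"simasm": 0.4259, "nroff": 0.5988, "diff": 0.8649}

for name, h in hit_ratios.items():
    with_cache = expected_stall_per_branch(h, 2)  # 2-cycle external cache
    dram_only = expected_stall_per_branch(h, 4)   # 4-cycle video DRAM
    print(f"{name}: {with_cache:.2f} (external cache) / "
          f"{dram_only:.2f} (video DRAM) stall cycles per branch")
```

Under these assumptions even the worst case (simasm against video DRAM)
averages about 2.3 stall cycles per branch, and the figure drops quickly
as the hit ratio rises.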

+-----
| If I could ask, what happens to the Am29k bus at 40 to 55 MHz?
+-----
It turns into a giant antenna, wiping out everyone's radio reception for
a radius of 10 miles ;-)

	Tim Olson
	Processor Strategic Planning
	Advanced Micro Devices
	(tim@amdcad.amd.com)