Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!ames!oliveb!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: Branch Target Cache vs. Instruction Cache
Message-ID: <781@apple.UUCP>
Date: Sun, 17-May-87 21:52:02 EDT
Article-I.D.: apple.781
Posted: Sun May 17 21:52:02 1987
Date-Received: Mon, 18-May-87 04:01:52 EDT
References: <3810030@nucsrl.UUCP> <491@necis.UUCP> <3530@spool.WISC.EDU> <767@apple.UUCP> <397@dumbo.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 155

In article <397@dumbo.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>Let me start by apologizing for the tone of my last message - I'll try to be
>less caustic and more informative this time.

Yeah, I should apologize too for over-reacting.

>In article <767@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
> (about how a branch target cache performs comparably to an instruction
>    cache, in a real system environment)
>> it can perform comparably because, logically, part of the instruction cache
>> is in the external rams.  the branch target cache is only there to cover the
>> latency of the initial access in that external ram
>My point was that the four words of information in the BTC is not sufficient
>to cover the latency of DRAM accesses in a real system environment. I would
>freely acknowlege that it _is_ sufficient if the next level store is SRAM -
>however, in that case, the BTC is a _suppliment_ to the SRAM-based
>instruction cache. In fact, the system that is simulated by AMD in their
>performance simulation has an external cache.

How about 80ns 256K DRAMs, interleaved two-way.  With these RAMs (available
at Fry's for $7 each, so probably for $4 from a real supplier), the Am29000
need have only *two* instructions in each branch target entry.

We're not even talking about video DRAMs, which are probably cheaper than
these high-speed guys.  But video DRAMs are available in 120 ns versions
at least, and 120 ns is three 25 MHz Am29000 cycles, so it should still be
possible, even considering that extra control logic may add a cycle to
the first access.  I know that such RAMs are more expensive than
pedestrian 256K DRAMs (by about a factor of 2), and this is a real
consideration for some people, but a full main memory system can be built
for real computers with these parts (look inside a Macintosh Plus or
Mac II (UNIX, but not fantastic performance) for a lesson in minimalist
design).
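
The arithmetic behind "two instructions" and "three cycles" can be sketched
as below.  This is only a rough model: it treats the quoted ns figure as the
first-access time, ignores interleaving, refresh, and precharge, and the
function name and the extra_cycles parameter are made up for illustration.

```python
def latency_cycles(access_ns, clock_mhz, extra_cycles=0):
    """Whole processor cycles needed to cover one initial memory access.

    access_ns    -- RAM access time in nanoseconds (assumed first-access time)
    clock_mhz    -- processor clock in MHz (25 for the parts discussed here)
    extra_cycles -- hypothetical allowance for control-logic overhead
    """
    # cycles = ceil(access_ns / cycle_ns), with cycle_ns = 1000 / clock_mhz.
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-(access_ns * clock_mhz) // 1000) + extra_cycles


print(latency_cycles(80, 25))                    # 80 ns DRAM at 25 MHz
print(latency_cycles(120, 25))                   # 120 ns video DRAM at 25 MHz
print(latency_cycles(120, 25, extra_cycles=1))   # plus one cycle of control logic
```

Under these assumptions the number of words a branch target entry has to
hold is just the cycle count returned, which is why slower RAM or extra
control logic eats into the four words available.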

>As to defending a traditional instruction cache relative to the branch
>target cache, this is a apples-and-orange comparison. The instruction cache
>leaves main memory bandwidth free for other purposes, such as data
>references, memory refresh, and I/O traffic. What I could compare is the

But, with separate instruction and data paths to the processor (the case
of the Am29000), leaving main memory bandwidth free for other purposes
is much less of an issue.  Things like I/O traffic had better not be directly
visible on the Am29000 bus (or the processor/memory bus of ANY high speed
processor chip) if high performance is to be obtained.

>the sign bit of a register, so we have to add time for the comparison
>operations (.36 cycles for compare equal/not equal, and .08 for full
>arithmetic compares), so the real total is 1.38+.36+.08 = 1.82 cycles per
>branch. (Numbers are from the paper, and assume an 80% hit ratio in the BTC)

It is not fair to add the time taken to do the compare to the time taken
for the branch:  sometimes the compare is overlapped with something else
(it can fit in the delay slot of another delayed branch or can be overlapped
with a load or store).  However, since this overlapping will not nullify
all compares (or perhaps even a significant portion of them), you are right
to charge some cost for them relative to a MIPSCO-type branch.
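
To make the point concrete, here is a small sketch of how the quoted
1.38 + .36 + .08 total changes if some fraction of the compares is hidden.
The overlap_fraction parameter is purely hypothetical (nobody in this thread
has measured it); only the three cycle figures come from the quoted numbers.

```python
def branch_cost(base, compare_costs, overlap_fraction=0.0):
    """Average cycles per branch when some compare cost is overlapped.

    base             -- cycles per branch before compares (1.38 as quoted)
    compare_costs    -- per-branch compare charges (.36 and .08 as quoted)
    overlap_fraction -- hypothetical share of compare cost hidden in delay
                        slots or behind loads and stores (0.0 = none hidden)
    """
    return base + (1.0 - overlap_fraction) * sum(compare_costs)


print(round(branch_cost(1.38, [0.36, 0.08]), 2))        # nothing overlapped
print(round(branch_cost(1.38, [0.36, 0.08], 0.25), 2))  # a quarter overlapped
```

With nothing overlapped this reproduces the 1.82 figure; any nonzero overlap
pulls the total back toward 1.38, but never below it.
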

>In summary:
>
>Branch Scheme		average cycles per branch
>=======================================================
>Fast Compare (R2k)		1.59
>64-entry BTC/sign compare (29k)	1.82
>
>Selecting between these schemes, holding all else constant gives the fast
>compare scheme about a 2% edge in overall performance over the BTC scheme.
>However, McFarling & Hennessey give this warning about the fast compare
>scheme: "...the timing of the simple compare is a concern, because it must
>complete in time to change the instruction address going out in the next
>cycle. This could easily end up as the critical path controlling instruction
>cache access and, hence, cycle time." They are right; the timing was a
>concern, and we had to put in special hardware to perform the fast compare
>and to quickly translate branch addresses to physical addresses.  It is this
>translation hardware (Micro-TLB) that adds about 1% of a cycle per
>instruction or 6% per branch.  However, the critical paths are on-chip ones,
>and can be expected to scale with the technology.

Yeah, MIPSCO may have done a good thing here; we'll all have to wait for
the next technologies to know for sure.  I also think the MIPSCO approach
reflects the facts that: (1) you guys are going for a slightly (if not
significantly) different market and (2) you guys have control over almost
everything in your system environment, by virtue of being able to write
the compilers, write the operating systems, and build the system hardware.
This is a significant advantage in the UNIX market.  At AMD, things were
much less under control, so we opted for features that are easy to
understand, clearly scalable with technology (no ifs, ands, or buts),
and modular (i.e., it is easy to take the TLB out of the Am29000; maybe
it is just as easy to take it out of the MIPS; I haven't really thought
about it).  The separate instruction and data buses clearly scale easily with
clock speed, at least up to a point.  30 MHz Am29000s will be no problem
and should be available (I am speculating) soon after 25 MHz parts (one
has to wonder if the RAMs can keep up, but this isn't just the Am29000's
problem).

>Time and considerable further development will occur before AMD can supply
>performance data from running large benchmarks under real system conditions,
>and I understand that we'll have to make do with what we've got.  However
>the only benchmark of potentially meaningful size from AMD (sipasm) performs
>substantially less than 17 (780-relative) MIPS with an external cache as
>well as burst-mode memory, and as I understand it, these results do not take
>multiprogramming cache effects and finite memory write speed into account.

I think Tim responded to this with some hard data.  Perhaps there was some
confusion (and I didn't do anything to clear it up before):  the Am29000/
VDRAM combination *IS* lower performance than the Am29000/cache combination,
at least on average.  However, for some graphics benchmarks, at least,
the Am29000/VDRAM combination would surprise you (anyway, it surprised
me and I tend (of late) to be optimistic!).  The Am29000 probably makes
a good UNIX-box CPU too, but this is much more debatable until a proof-by-
existence can be constructed.

>I'm at a disadvantage in quoting performance data for the AMD part, but the
>branch target cache miss rate on the larger benchmarks is in the 50% range,
>is it not? That would mean that much of the branch behaviour of these larger
>programs is not simple loops, and the analogy of the AMD 29k BTC holding 32
>loops of any size isn't really valid.

Well, holding 32 loops of any size isn't really a big win anyway since we
all know that loops tend to be smaller than "any size."  I guess I
overstated that one a bit.

>What looks bad for the AMD part is
>that branches that miss in the BTC get a four-cycle delay. (Please correct
>me if I'm wrong on this, but I'm assuming that the 4 words in each BTC entry
>reflect that it will take at least 4 cycles to start up again after a BTC
>miss.) At risk of contradicting statement made earlier in this message,
>that would cause the average branch time to be about 3.5 cycles.

You are off here.  The 4 words in each BTC entry are there so that *up to*
4 cycles of initial latency can be overlapped.  If the external memory
responds to the initial request sooner, so much the better, especially
when the BTC misses.  So, a BTC miss incurs the latency of the external
instruction memory, not a fixed 4 cycles.
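
A minimal sketch of that distinction, counting stall cycles only.  This is
a simplified model and not the Am29000's actual pipeline timing; the
function and parameter names are invented for illustration.

```python
def expected_branch_stall(hit_rate, mem_latency, btc_words=4):
    """Expected stall cycles per taken branch under a simple BTC model.

    hit_rate    -- fraction of branches that hit in the BTC (0.0 to 1.0)
    mem_latency -- cycles before external instruction memory responds
    btc_words   -- instructions held per BTC entry (4 as discussed above)
    """
    # On a hit, the cached words hide up to btc_words cycles of latency.
    hit_stall = max(0, mem_latency - btc_words)
    # On a miss, the processor waits for the external memory itself,
    # however long that is; there is no fixed four-cycle charge.
    miss_stall = mem_latency
    return hit_rate * hit_stall + (1.0 - hit_rate) * miss_stall


print(expected_branch_stall(0.8, 2))   # fast external memory, 80% hit ratio
print(expected_branch_stall(0.5, 4))   # slower memory, 50% hit ratio
```

In this model the miss penalty shrinks along with the memory latency, so a
two-cycle memory makes a miss cost two cycles, not four.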

>As to what MIPS will do to make external caches work at 25, 30, 35 MHz,
>I'm afraid this isn't the right forum to be discussing our future
>products.

Understood; anyway I didn't mean that the MIPS bus was un-fixable.
Clearly there are (some easy) things which will fix the problem.

>We have publicly stated that we will improve the performance
>of our products at a rate of doubling performance every eighteen months,
>and our current product plans are running faster than that rate.
>If I could ask, what happens to the Am29k bus at 40 to 55 MHz?

Well, if it were left alone, then its cycle time would scale and a device
would have only 25 to 18 ns to respond.  25 is sorta reasonable, but
18 sounds pretty silly.  Probably the buses will have to be made wider
so that adequate bandwidth can be had with reasonable (bus) cycle times.
By the time technology (at least at AMD, and I am speculating now since
I don't work there anymore) gets to Am29000s with 55 MHz clocks, there
will be better packaging technology in the mainstream (which is where
it must be if commodity parts like the Am29000 are to use it).
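
The numbers above are just the clock period, and the "wider bus" idea is a
straight bandwidth trade.  A quick sketch, assuming one transfer per bus
cycle and a 32-bit bus as the starting point; the 64-bit half-rate bus is a
hypothetical example, not anything AMD has announced.

```python
def cycle_ns(clock_mhz):
    """Clock period in nanoseconds for a given clock rate in MHz."""
    return 1000.0 / clock_mhz


def bandwidth_mb_per_s(width_bits, bus_mhz):
    """Peak bus bandwidth in MB/s, assuming one transfer per bus cycle."""
    return (width_bits / 8) * bus_mhz


for mhz in (25, 40, 55):
    print(mhz, "MHz ->", round(cycle_ns(mhz), 1), "ns per cycle")

# A bus twice as wide, clocked at half the rate, moves the same data
# while giving external devices twice as long to respond.
print(bandwidth_mb_per_s(32, 55), bandwidth_mb_per_s(64, 27.5))
```

So a 64-bit bus running at half the processor clock would match the peak
bandwidth of a 32-bit bus at the full 55 MHz while leaving about 36 ns per
bus cycle instead of 18.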

    bcase