Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!ames!oliveb!pyramid!voder!apple!bcase
From: bcase@apple.UUCP (Brian Case)
Newsgroups: comp.arch
Subject: Re: Branch Target Cache vs. Instruction Cache
Message-ID: <781@apple.UUCP>
Date: Sun, 17-May-87 21:52:02 EDT
Article-I.D.: apple.781
Posted: Sun May 17 21:52:02 1987
Date-Received: Mon, 18-May-87 04:01:52 EDT
References: <3810030@nucsrl.UUCP> <491@necis.UUCP> <3530@spool.WISC.EDU> <767@apple.UUCP> <397@dumbo.UUCP>
Reply-To: bcase@apple.UUCP (Brian Case)
Organization: Apple Computer Inc., Cupertino, USA
Lines: 155

In article <397@dumbo.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>Let me start by apologizing for the tone of my last message - I'll try to be
>less caustic and more informative this time.

Yeah, I should apologize too for over-reacting.

>In article <767@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
> (about how a branch target cache performs comparably to an instruction
> cache, in a real system environment)
>> it can perform comparably because, logically, part of the instruction cache
>> is in the external rams. the branch target cache is only there to cover the
>> latency of the initial access in that external ram
>My point was that the four words of information in the BTC is not sufficient
>to cover the latency of DRAM accesses in a real system environment. I would
>freely acknowledge that it _is_ sufficient if the next level store is SRAM -
>however, in that case, the BTC is a _supplement_ to the SRAM-based
>instruction cache. In fact, the system that is simulated by AMD in their
>performance simulation has an external cache.

How about 80 ns 256K DRAMs, interleaved two-way? With these RAMs
(available at Fry's for $7 each, so probably for $4 from a real
supplier), the Am29000 need have only *two* instructions in each branch
target entry. We're not even talking about video DRAMs, which are
probably cheaper than these high-speed guys.
But, video DRAMs are available at least in 120 ns versions, three 25 MHz
Am29000 cycles, so it should still be possible, even considering that
extra control logic may add a cycle to the first access. I know that
such RAMs are more expensive than pedestrian 256K DRAMs (by about a
factor of 2), and this is a real consideration for some people, but a
full main memory system can be built for real computers with these parts
(look inside a Macintosh Plus or Mac II (UNIX, but not fantastic
performance) for a lesson in minimalist design).

>As to defending a traditional instruction cache relative to the branch
>target cache, this is an apples-and-oranges comparison. The instruction cache
>leaves main memory bandwidth free for other purposes, such as data
>references, memory refresh, and I/O traffic. What I could compare is the

But, with separate instruction and data paths to the processor (the case
of the Am29000), leaving main memory bandwidth free for other purposes
is much less of an issue. Things like I/O traffic had better not be
directly visible on the Am29000 bus (or the processor/memory bus of ANY
high speed processor chip) if high performance is to be obtained.

>the sign bit of a register, so we have to add time for the comparison
>operations (.36 cycles for compare equal/not equal, and .08 for full
>arithmetic compares), so the real total is 1.38+.36+.08 = 1.82 cycles per
>branch. (Numbers are from the paper, and assume an 80% hit ratio in the BTC)

It is not fair to add the time taken to do the compare to the time taken
for the branch: sometimes the compare is overlapped with something else
(it can fit in the delay slot of another delayed branch or can be
overlapped with a load or store). However, since this overlapping will
not nullify all compares (or perhaps even a significant portion of
them), you are right to charge some cost for them relative to a
MIPSCO-type branch.
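
To make the "charge some cost" point concrete, here is a rough sketch of
the per-branch arithmetic. The three constants are the figures quoted
above from the paper; the overlap_fraction parameter is a hypothetical
knob for how many compares get hidden in a delay slot or behind a load
or store. No measured value for it appears in this thread.

```python
# Rough sketch of the per-branch cost arithmetic discussed above.
# The three constants are the figures quoted from the paper; the
# overlap_fraction parameter is hypothetical, not a measured value.

BTC_BRANCH_CYCLES = 1.38      # 64-entry BTC, 80% hit ratio (quoted figure)
EQ_COMPARE_CYCLES = 0.36      # compare equal/not equal, per branch (quoted)
ARITH_COMPARE_CYCLES = 0.08   # full arithmetic compares, per branch (quoted)


def btc_cycles_per_branch(overlap_fraction=0.0):
    """Average cycles per branch when a fraction of compares is hidden.

    overlap_fraction = 0.0 charges every compare to the branch;
    overlap_fraction = 1.0 assumes every compare is scheduled into a
    delay slot or overlapped with a load or store.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    compare_cost = EQ_COMPARE_CYCLES + ARITH_COMPARE_CYCLES
    return BTC_BRANCH_CYCLES + (1.0 - overlap_fraction) * compare_cost


if __name__ == "__main__":
    for frac in (0.0, 0.25, 0.5, 1.0):
        cycles = btc_cycles_per_branch(frac)
        print(f"overlap {frac:.2f}: {cycles:.2f} cycles per branch")
```

With no overlap this reproduces the 1.82 figure, and with every compare
hidden it falls back to 1.38. The real number lies somewhere in between,
depending on how well the compiler schedules compares.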
>In summary:
>
>Branch Scheme                      average cycles per branch
>=======================================================
>Fast Compare (R2k)                 1.59
>64-entry BTC/sign compare (29k)    1.82
>
>Selecting between these schemes, holding all else constant gives the fast
>compare scheme about a 2% edge in overall performance over the BTC scheme.
>However, McFarling & Hennessy give this warning about the fast compare
>scheme: "...the timing of the simple compare is a concern, because it must
>complete in time to change the instruction address going out in the next
>cycle. This could easily end up as the critical path controlling instruction
>cache access and, hence, cycle time." They are right; the timing was a
>concern, and we had to put in special hardware to perform the fast compare
>and to quickly translate branch addresses to physical addresses. It is this
>translation hardware (Micro-TLB) that adds about 1% of a cycle per
>instruction or 6% per branch. However, the critical paths are on-chip ones,
>and can be expected to scale with the technology.

Yeah, MIPSCO may have done a good thing here; we'll all have to wait for
the next technologies to know for sure. I also think the MIPSCO approach
reflects the facts that: (1) you guys are going for a slightly (if not
significantly) different market and (2) you guys have control over
almost everything in your system environment, by virtue of being able to
write the compilers, write the operating systems, and build the system
hardware. This is a significant advantage in the UNIX market. At AMD,
things were much less under control, so we opted for features that are
easy to understand, clearly scalable with technology (no ifs, ands, or
buts), and modular (i.e., it is easy to take the TLB out of the Am29000;
maybe it is as easy to take it out of the MIPS, I haven't really thought
about it). The separate instruction and data buses clearly scale easily
with clock speed, at least up to a point.
30 MHz Am29000s will be no problem and should be available (I am
speculating) soon after 25 MHz parts (one has to wonder if the RAMs can
keep up, but this isn't just the Am29000's problem).

>Time and considerable further development will occur before AMD can supply
>performance data from running large benchmarks under real system conditions,
>and I understand that we'll have to make do with what we've got. However
>the only benchmark of potentially meaningful size from AMD (sipasm) performs
>substantially less than 17 (780-relative) MIPS with an external cache as
>well as burst-mode memory, and as I understand it, these results do not take
>multiprogramming cache effects and finite memory write speed into account.

I think Tim responded to this with some hard data. Perhaps there was
some confusion (and I didn't do anything to clear it up before): the
Am29000/VDRAM combination *IS* lower performance than the Am29000/cache
combination, at least on average. However, for some graphics benchmarks,
at least, the Am29000/VDRAM combination would surprise you (anyway, it
surprised me and I tend (as of late) to be optimistic!). The Am29000
probably makes a good UNIX-box CPU too, but this is much more debatable
until a proof-by-existence can be constructed.

>I'm at a disadvantage in quoting performance data for the AMD part, but the
>branch target cache miss rate on the larger benchmarks is in the 50% range,
>is it not? That would mean that much of the branch behaviour of these larger
>programs is not simple loops, and the analogy of the AMD 29k BTC holding 32
>loops of any size isn't really valid.

Well, holding 32 loops of any size isn't really a big win anyway since
we all know that loops tend to be smaller than "any size." I guess I
overstated that one a bit.

>What looks bad for the AMD part is
>that branches that miss in the BTC get a four-cycle delay.
>(Please correct
>me if I'm wrong on this, but I'm assuming that the 4 words in each BTC entry
>reflect that it will take at least 4 cycles to start up again after a BTC
>miss.) At risk of contradicting statement made earlier in this message,
>that would cause the average branch time to be about 3.5 cycles.

You are off here. The 4 words in each BTC entry are there so that *up
to* 4 cycles of initial latency can be overlapped. If the external
memory responds to the initial request sooner, so much the better,
especially when the BTC misses. So, a BTC miss incurs the latency of the
external instruction memory, not a fixed 4 cycles.

>As to what MIPS will do to make external caches work at 25, 30, 35 MHz,
>I'm afraid this isn't the right forum to be discussing our future
>products.

Understood; anyway I didn't mean that the MIPS bus was un-fixable.
Clearly there are (some easy) things which will fix the problem.

>We have publicly stated that we will improve the performance
>of our products at a rate of doubling performance every eighteen months,
>and our current product plans are running faster than that rate.
>If I could ask, what happens to the Am29k bus at 40 to 55 MHz?

Well, if it were left alone, then its cycle time would scale and a
device would have only 25 to 18 ns to respond. 25 is sorta reasonable,
but 18 sounds pretty silly. Probably the buses will have to be made
wider so that adequate bandwidth can be had with reasonable (bus) cycle
times. By the time technology (at least at AMD, and I am speculating now
since I don't work there anymore) gets to Am29000s with 55 MHz clocks,
there will be better packaging technology in the mainstream (which is
where it must be if commodity parts like the Am29000 are to use it).

    bcase
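
For anyone who wants to check the latency and cycle-time figures in this
post, a back-of-the-envelope sketch follows. It models only the initial
access latency that a BTC entry has to cover, and ignores interleaving,
refresh, and precharge time. The function names and the extra_cycles
parameter are made up for illustration.

```python
# Back-of-the-envelope latency arithmetic for the figures in this post.
# Assumption: only the initial access latency matters for the branch
# target cache (BTC); interleaving, refresh and precharge are ignored.


def cycle_time_ns(clock_mhz):
    """Clock period in nanoseconds for a clock frequency in MHz."""
    return 1000.0 / clock_mhz


def cycles_to_cover(access_ns, clock_mhz, extra_cycles=0):
    """Whole processor cycles needed to hide an initial memory access.

    Takes integer nanoseconds and integer MHz and uses integer ceiling
    division, so exact multiples (80 ns at 25 MHz) are not rounded up
    by floating-point error. extra_cycles models added control-logic
    delay and is a hypothetical parameter.
    """
    # ceil(access_ns / period_ns), where period_ns = 1000 / clock_mhz
    return -(-(access_ns * clock_mhz) // 1000) + extra_cycles


if __name__ == "__main__":
    for access_ns in (80, 120):
        print(f"{access_ns} ns at 25 MHz:",
              cycles_to_cover(access_ns, 25), "cycles")
    print("120 ns plus one control cycle:",
          cycles_to_cover(120, 25, extra_cycles=1), "cycles")
    for mhz in (25, 40, 55):
        print(f"{mhz} MHz clock: {cycle_time_ns(mhz):.1f} ns per cycle")
```

At 25 MHz this gives 2 cycles for 80 ns parts and 3 cycles for 120 ns
parts, or 4 with one extra control cycle, which still fits in a 4-word
BTC entry. At 40 and 55 MHz the cycle time comes out at 25 ns and about
18 ns, the figures used in the last paragraph.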