Path: utzoo!utgpu!water!watmath!clyde!bellcore!faline!thumper!ulysses!andante!princeton!udel!gatech!ukma!nrl-cmf!ames!oliveb!pyramid!prls!mips!mash
From: mash@mips.UUCP
Newsgroups: comp.arch
Subject: Re: RISC machines and scoreboarding
Message-ID: <2485@winchester.mips.COM>
Date: 1 Jul 88 06:07:20 GMT
References: <1082@nud.UUCP> <2438@winchester.mips.COM> <1098@nud.UUCP> <2465@winchester.mips.COM> <1111@nud.UUCP>
Reply-To: mash@winchester.UUCP (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 64

In article <1111@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
>In article <2465@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>> a) Are there indeed 2 latency cycles (i.e., that instruction 3
> Yes, 2 on loads. 0 latency on stores.

>> b) If so, what is the reason for the second latency slot?
> Address calculation. The 88k provides a basic set of three addressing
>modes. The most frequently used construct for referencing memory is
>"register pointer + offset" (offset is rarely 0). Without providing this basic
>addressing mode, the code would have to do something like:
>    add ptr,ptr,offset ;
>    ld dest,ptr,0
>    sub ptr,ptr,offset ; If pointer must be preserved.
>for most loads and stores. This overhead is worse than a 2 tick latency
>on loads. Also note that there is 0 latency on 88k store instructions*.
>Since many lds have a corresponding st instruction somewhere, the averaged 88k
>latency will be < 2 ticks.

We agree on the desirability of non-zero offsets 100% (although our
friends down the street at AMD may not :-) For this kind of design,
I can't imagine that people would do add/ld/sub, but would use a
temp register:
     add temp,ptr,offset ;
>    ld dest,temp,0
Anyway, maybe I'm missing something: I still don't understand where
the second latency cycle comes from in the hardware design. You
compute the address and get it on the bus, and the third instruction
following can actually use the data.
Moto info says that the CMMU returns data in one cycle, so it almost
sounds like it's costing a cycle on the CPU to make the data available
(i.e., forwarding logic is not fast enough, or something). Anyway, I
still don't understand what address-mode computation has to do with it,
at least not from the given example [which is identical to what R2000s
do]. So please say some more.

>* What is the latency on a R3000 store instruction?
0.

>>Note that our numbers say that in our machines, it would cost us
>>10-15% in overall performance to go from 1 cycle latency to 2,
> Assuming the addressing modes and other aspects of the machine remain
>the same, this figure is in the ballpark (although a little high).
>However, I think the static machine assumption is not a valid one to make.

Can you say what your numbers are, i.e., why is this high?
My reasoning is as follows:
a) Loads are about 20% of the instructions.
b) You get to schedule approx. 70% of the 1st delay slots.
c) You get to schedule approx 30% of the 2nd delay slots, i.e., the
   penalty for a second delay slot is the 70% that you don't get to
   schedule:
        .7 * 20% = 14%
Now, in this area, R2000s and 88Ks are similar enough that I think the
statistics carry over, but counterarguments would be useful.
--
-john mashey DISCLAIMER:
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
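
For anyone who wants to plug in their own statistics, the a)/b)/c)
estimate above can be sketched as a few lines of Python. This is only
an illustration of the arithmetic in the post. The function name, the
parameter names, and the base CPI of 1.0 are assumptions made for the
sketch, not measured figures for either machine.

```python
def second_slot_penalty(load_fraction, second_slot_filled, base_cpi=1.0):
    """Estimate the slowdown from adding a second load delay slot.

    load_fraction      -- fraction of instructions that are loads (a)
    second_slot_filled -- fraction of 2nd delay slots the scheduler fills (c)
    base_cpi           -- cycles per instruction before the change
                          (assumed 1.0 here; hypothetical, not from the post)
    """
    # Every load whose second slot cannot be filled costs one extra cycle.
    extra_cpi = load_fraction * (1.0 - second_slot_filled)
    return extra_cpi / base_cpi


# Figures quoted in the post: 20% loads, about 30% of 2nd slots scheduled.
penalty = second_slot_penalty(load_fraction=0.20, second_slot_filled=0.30)
print(f"estimated penalty: {penalty:.0%}")
```

With the numbers from the post this reproduces the .7 * 20% = 14%
figure. A base CPI above 1.0 (cache misses, unfilled first slots)
would make the relative penalty somewhat smaller.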