Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!sun-barr!texsun!texbell!vector!killer!elg
From: elg@killer.Dallas.TX.US (Eric Green)
Newsgroups: comp.arch
Subject: Re: 80486 vs. 68040 code size [really: how many regs]
Message-ID: <8125@killer.Dallas.TX.US>
Date: 18 May 89 01:58:06 GMT
References: <948@aber-cs.UUCP>
Distribution: eunet,world
Organization: The Unix(R) Connection, Dallas, Texas
Lines: 119

in article <948@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:
> In article <8082@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
> I program a 68000 a lot. I suspect a 68000 is a fairly typical
> reg-memory machine. I write a lot of "C" code. The "C" compiler I'm
> using is fairly PCC-like, i.e. loses all register values between
> expressions. In performance-critical portions of the code, I end up
> This is because you are one of the many C programmers that do not understand
> "register" variables. Too bad for you and your work.

Excuse me? This is a rather outrageous accusation, considering that
you have seen neither my code nor the results from my compiler.

As a matter of fact, I am VERY aware of the use of "register"
variables. I have to be, when writing performance-critical code. I
also have to be aware of how many registers I have, because I end up
doing by hand what the compiler won't: keeping the results of
intermediate calculations in registers (btw: 4 pointer registers, 5
data registers; the rest are used inside expressions). The portability
problems are obvious (all my carefully common-subexpression-eliminated
code is worthless on another processor, or even with a different
compiler).

If I could reuse data registers freely, I could write code that a
global optimizer couldn't touch (albeit with a helluva lot of work).
But there's one problem: types. On a 68000, shorts and ints are 16
bits, longs are 32 bits. What this means is that if I declare a
register int xyz, I can't put a long into it -- the "C" compiler
generates a "move.w" instead of a "move.l".
If I declare everything as register long xyz, the "C" compiler
generates an "add.l" instead of an "add.w", i.e. I just lost all the
time I'd saved.

> > used little. In a reg-mem architecture little-used variables in memory do not
> > carry costs as high when you use them.
> Foo. If you use the variable three times, you've saved 4 memory
> fetches (2 addresses, 2 data) by keeping it in a register. No matter
> what kind of machine you're using.
> The cost is in adding instructions. John Mashey, who does understand architectures,
> said that this is not so important because you can use delay slots
> etc...

Excuse me? I just said you'd save 4 memory fetches (2 address fetches,
2 data fetches). Where do you get the "added instructions"? Yes, you
have an initial "move.l" to get it into the register... but subsequent
instructions are normal 16-bit register-to-register instructions, not
a 16-bit opcode plus a 32-bit address (you're not counting the ADDRESS
as part of the instruction stream? SHAME!).

> Moreover your statement demonstrates a shallow understanding of codes and
> chips:

Excuse me? Sounds as if you're acting from insufficient information.
I've used some of the same arguments that John Mashey et al. used in
the old RISC vs. CISC wars and MIPS vs. SPARC wars. I have not said
anything particularly revolutionary, just common net.knowledge, as
supported by real.knowledge (i.e., I've read most of the RISC papers
that've come down the pike).

> Point 2: competent architecture designers (save for those that work for Intel
> and other companies that can afford 1.2 million transistors :-/) know
> that adding registers beyond what is needed has a cost, both (small) in chip
> complexity, and in speed (because if you use them you have to save/restore
> them at some point in time), which is well known to those that use C
> competently: using "register" inappropriately can *slow* your program (on a
> reg-mem machine).

"Smart" compilers only save/restore registers as needed.
For example, the MIPS compiler doesn't save/restore registers for
"leaf" functions, i.e., those functions that call no other functions
(I believe the MIPS folks said that "leaf" functions account for 40%
or so of all functions, but you'll have to ask them). As for the rest
of your arguments, that's why they invented windowed register stacks,
and AMD 29000-style explicit windowing. A flat register architecture
isn't the only way to go, although MIPS has found a flat register
architecture with 32 registers to be quite adequate.

> Again, as to the last point, if you have enough registers, the register file
> becomes a kind of first level memory, so you can keep your stack, and globals
> there. But then you pay in system terms at context switching time.

A hundred context switches per second vs. several thousand subroutine
calls per second? I'll optimize the subroutine calls, thank you! We
hashed all this out a couple of years ago, during the last
RISC/CISC/MIPS/SPARC wars.

> stack-stack must be reg-reg in order to be adequately fast.
> ^^^^
> But note that program-memory bandwidth is the one thing there's no
> shortage of.
>
> Enough of this unsupported nonsense... Try reading something about the machines
> above. Don't believe every urban legend you hear :->.

I've read the CRISP paper, and several other stack-stack papers. In
all of them they mention that caching the top entries of the stack in
hardware registers was a Big Win performance-wise. All I said was that
a register is a register, whether it is accessed as a "stack" or
explicitly as a register.
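
To make the top-of-stack caching point concrete, here is a minimal
sketch in C of a toy stack machine. It is purely illustrative and is
not taken from the CRISP paper or any real machine: the three-opcode
instruction set, the names run_plain and run_cached, and the way
stack-memory accesses are counted are all assumptions made for this
example. The "register" is just a local variable holding the top entry.

```c
#include <assert.h>

/* Hypothetical toy stack machine, for illustration only. */
enum op { PUSH, ADD, MUL };
struct insn { enum op op; long val; };

#define STACK_MAX 16

/* Plain version: every push and pop goes to the in-memory stack. */
long run_plain(const struct insn *prog, int n, long *accesses)
{
    long stack[STACK_MAX];
    int sp = 0, i;

    for (i = 0; i < n; i++) {
        if (prog[i].op == PUSH) {
            assert(sp < STACK_MAX);
            stack[sp++] = prog[i].val;          /* 1 write */
            *accesses += 1;
        } else {
            long a, b;
            assert(sp >= 2);
            b = stack[--sp];                    /* 2 reads */
            a = stack[--sp];
            stack[sp++] = (prog[i].op == ADD) ? a + b : a * b; /* 1 write */
            *accesses += 3;
        }
    }
    assert(sp == 1);
    *accesses += 1;                             /* read the result back */
    return stack[0];
}

/* Cached version: the top entry lives in 'tos' (the assumed register);
   only the entries below it are kept in memory. */
long run_cached(const struct insn *prog, int n, long *accesses)
{
    long stack[STACK_MAX];
    long tos = 0;
    int have_tos = 0;       /* is 'tos' holding a valid entry? */
    int sp = 0, i;          /* sp counts the entries below tos */

    for (i = 0; i < n; i++) {
        if (prog[i].op == PUSH) {
            if (have_tos) {
                assert(sp < STACK_MAX);
                stack[sp++] = tos;              /* spill old top: 1 write */
                *accesses += 1;
            }
            tos = prog[i].val;
            have_tos = 1;
        } else {
            long a;
            assert(have_tos && sp >= 1);
            a = stack[--sp];                    /* 1 read; other operand is tos */
            *accesses += 1;
            tos = (prog[i].op == ADD) ? a + tos : a * tos;
        }
    }
    assert(have_tos && sp == 0);
    return tos;                                 /* result never touched memory */
}
```

Under these assumptions, the program PUSH 2, PUSH 3, ADD, PUSH 4,
PUSH 5, ADD, MUL (that is, (2+3)*(4+5)) yields 45 from both routines,
with 14 stack-memory accesses in the plain version and 6 in the cached
one. The instruction stream is identical in both cases; only where the
top entry lives has changed.
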

As for the statement "program-memory bandwidth is the one thing
there's no shortage of", I point you towards:
  a) large cache memories,
  b) locality of reference (>90% hit rates, with a large cache), and
  c) interleaved memories, which allow you to execute sequentially,
     using slow memories, with no performance hit (but when you hit a
     branch, you may have a major performance hit -- which is why
     reducing the number of branches in a RISC is a Good Thing).

None of this is particularly new or revolutionary. Seymour Cray has
been doing it since the late 60's. The only new thing is that these
techniques are now being used in desktop computers, by, amongst
others, MIPS, AMD, Sun (SPARC) and Motorola (68040, 88000).

So while "essentially unlimited" is perhaps a bit strong, I (and most
RISC advocates) still maintain that the number of instructions, and
the size of each instruction, are NOT the limiting factor insofar as
performance goes.

Good reference to "How fast can we fetch opcodes?":
  "A VLIW architecture for a trace scheduling compiler"
  Robert P. Colwell et al., CAM proceedings vol 15 #5 p180

--
|    // Eric Lee Green               P.O. Box 92191, Lafayette, LA 70509    |
|   //  ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849     |
|  //    Join the Church of HAL, and worship at the altar of all computers  |
|\X/   with three-letter names (e.g. IBM and DEC). White lab coats optional.|