Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!sun-barr!texsun!texbell!vector!killer!elg
From: elg@killer.Dallas.TX.US (Eric Green)
Newsgroups: comp.arch
Subject: Re: 80486 vs. 68040 code size [really: how many regs]
Message-ID: <8125@killer.Dallas.TX.US>
Date: 18 May 89 01:58:06 GMT
References: <948@aber-cs.UUCP>
Distribution: eunet,world
Organization: The Unix(R) Connection, Dallas, Texas
Lines: 119

in article <948@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:
> In article <8082@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>     I program a 68000 a lot. I suspect a 68000 is a fairly typical
>     reg-memory machine. I write a lot of "C" code. The "C" compiler I'm
>     using is fairly PCC-like, i.e. loses all register values between
>     expressions. In performance-critical portions of the code, I end up
> This is because you are one of the many C programmers that do not understand
> "register" variAbles. Too bad for you and your work.

Excuse me? This is a rather outrageous accusation, considering that
you have seen neither my code, nor the results from my compiler. As a
matter of fact, I am VERY aware of the use of "register" variables. I
have to be, when writing performance-critical code. I also have to be
aware of how many registers I have, so that I can keep the results of
intermediate calculations in registers by hand (btw: 4 pointer
registers, 5 data registers, the rest are used inside expressions). 
    The portability problems are obvious (all my carefully
common-subexpression-eliminated code is worthless on another
processor, or even with a different compiler). 
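
To make "by hand" concrete, here is a minimal sketch. It is not the
code under discussion: the struct, the function names and the
arithmetic are all hypothetical. The first function repeats a product
in two expressions, which a PCC-like compiler would recompute; the
second hoists it into a "register" local by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record, invented for this illustration. */
struct sample { short gain; short offset; };

/* Naive form: gain * scale is written (and, with a compiler that
   forgets registers between expressions, computed) twice. */
long mix_naive(const struct sample *table, size_t n, short scale)
{
    long sum = 0;
    size_t i;

    assert(table != NULL || n == 0);
    for (i = 0; i < n; i++) {
        sum += (long)table[i].gain * scale + table[i].offset;
        sum += (long)table[i].gain * scale - table[i].offset;
    }
    return sum;
}

/* Hand-optimized form: the common subexpression and the walking
   pointer are kept in "register" variables chosen by the programmer. */
long mix_by_hand(const struct sample *table, size_t n, short scale)
{
    register const struct sample *p = table;
    register long sum = 0;
    register long scaled;
    register size_t left;

    assert(table != NULL || n == 0);
    for (left = n; left != 0; left--, p++) {
        scaled = (long)p->gain * scale;   /* computed once per element */
        sum += scaled + p->offset;
        sum += scaled - p->offset;
    }
    return sum;
}
```

Both functions return the same result; whether the second is faster
depends entirely on the compiler and on how many registers it really
has to hand out, which is the portability problem.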
    If I could reuse data registers freely, I could write code that a
global optimizer couldn't touch (albeit with a helluva lot of work).
But there's one problem: Types. With my 68000 compiler, shorts and
ints are 16 bits, longs are 32 bits. What this means is that if I
declare a register int xyz, I can't put a long into it -- the "C"
compiler generates a "move.w" instead of a "move.l". If I declare
everything as register long xyz, the "C" compiler generates an "add.l"
instead of an "add.w", i.e. I just lost all the time I'd saved. 

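A small sketch of the type problem, assuming a compiler with 16-bit
ints and 32-bit longs as described above. The fixed-width types from
<stdint.h> stand in for those sizes so the behaviour is the same on any
host, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Like reusing a 16-bit "register int" for a long: the store keeps
   only the low word, as a move.w would. Conversion to an unsigned
   type is well defined in C, so the result is portable. */
uint32_t through_word_register(uint32_t value)
{
    register uint16_t reg16 = (uint16_t)value;
    return reg16;
}

/* Like declaring everything "register long": nothing is lost, but
   arithmetic on 16-bit data is now carried out at 32 bits (add.l
   rather than add.w on the compiler described in the post). */
uint32_t sum_in_long_register(const uint16_t *v, unsigned n)
{
    register uint32_t acc = 0;
    unsigned i;

    assert(v != 0 || n == 0);
    for (i = 0; i < n; i++)
        acc += v[i];
    return acc;
}
```

Passing 0x12345 through the word-sized register yields 0x2345: the
upper half is silently dropped.
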
>     > used little. In a reg-mem architecture little use variables in memory do not
>     > carry costs as high when you use them.
>     Foo. If you use the variable three times, you've saved 4 memory
>     fetches (2 addresses, 2 data) as vs. keeping it in a register. No matter
>     what kind of machine you're using.
> The cost is in adding instructions. John Mashey, that does understand architectures,
> said that this is not so important because you can use delay slots
> etc...

Excuse me? I just said you'd save 4 memory fetches (2 address fetches,
2 data fetches). Where do you get the "added instructions"? Yes, you
have an initial "move.l" to get it into the register... but subsequent
instructions are normal 16-bit register-to-register instructions, not
16-bit plus 32-bit address (you're not counting the ADDRESS as part of
the instruction stream? SHAME!). 
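
The counting above can be sketched as follows. This is a simplified
model, not a cycle-accurate one: it assumes absolute long addressing
(one 16-bit opcode word plus a 32-bit address), counts each 32-bit
address as a single fetch, and the function names are hypothetical.

```c
#include <assert.h>

enum {
    OPCODE_BYTES   = 2,  /* one 68000 opcode word */
    ABS_LONG_BYTES = 4   /* 32-bit absolute address extension */
};

/* Operand fetches (one address plus one datum) per memory use. */
unsigned fetches_from_memory(unsigned uses)
{
    return 2 * uses;
}

/* With a register copy, only the initial move touches memory. */
unsigned fetches_with_register(unsigned uses)
{
    assert(uses > 0);
    return 2;
}

unsigned fetches_saved(unsigned uses)
{
    return fetches_from_memory(uses) - fetches_with_register(uses);
}

/* Instruction-stream bytes when every use names the memory address. */
unsigned stream_bytes_memory(unsigned uses)
{
    return uses * (OPCODE_BYTES + ABS_LONG_BYTES);
}

/* Instruction-stream bytes with one initial move from memory followed
   by register-to-register instructions. */
unsigned stream_bytes_register(unsigned uses)
{
    return (OPCODE_BYTES + ABS_LONG_BYTES) + uses * OPCODE_BYTES;
}
```

Under this model three uses save 4 operand fetches (2 addresses, 2
data), and the instruction stream is 12 bytes instead of 18 even with
the extra move counted.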

> Moreover your statement demonstrates of shallow understanding of codes and
> chips:

Excuse me? Sounds as if you're acting from insufficient information.
I've used some of the same arguments that John Mashey et al. used in
the old RISC vs. CISC wars and MIPS vs. SPARC wars. I have not said
anything particularly revolutionary, just common net.knowledge, as
supported by real.knowledge (i.e., I've read most of the RISC papers
that've come down the pike). 

> Point 2: competent architecture designers (save for those that work for Intel
> and other companies that can afford 1.2 millions of transistors :-/) know
> that adding registers beyond what is needed has a cost, both (small) in chip
> complexity, and in speed (because if you use them you have to save/restore
> them at some point in time), which is well known to those that use C
> competently:  using "register" inappropriately can *slow* your program (on a
> reg-mem machine).

"Smart" compilers only save/restore registers as needed. For example,
the MIPS compiler doesn't save/restore registers for "leaf" functions,
i.e., those functions that call no other functions (I believe the MIPS
folks said that "leaf" functions account for 40% or so of all
functions, but you'll have to ask them). As for the rest of your
arguments, that's why they invented windowed register stacks, and
AMD29000 style explicit windowing. A flat register architecture isn't
the only way to go, although MIPS has found a flat register
architecture with 32 registers to be quite adequate.

> Again, as to the last point, if you have enough registers, the register file
> becomes a kind of first level memory, so you can keep your stack, and globals
> there. But then you pay in system terms at context switching time.

A hundred context switches per second vs. several thousand subroutine
calls per second? I'll optimize the subroutine calls, thank you! We
hashed all this out a couple of years ago, during the last
RISC/CISC/MIPS/SPARC wars.
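
A back-of-the-envelope sketch of that comparison. The event rates
follow the rough figures above; the register counts used in the note
below (128 and 8) are purely hypothetical, and the function name is
invented for illustration.

```c
#include <assert.h>

/* Register words moved to or from memory per second, assuming each
   event saves and later restores regs_per_event registers. */
unsigned long register_traffic(unsigned long events_per_sec,
                               unsigned regs_per_event)
{
    assert(regs_per_event <= 1024);  /* sanity bound for this sketch */
    return events_per_sec * regs_per_event * 2UL;
}
```

With 100 context switches per second each moving a 128-register file,
register_traffic(100, 128) gives 25600 words per second; 3000 calls
per second each saving 8 registers through memory gives 48000. The
exact numbers are assumptions, but they show why the call path is the
one worth optimizing.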

>     stack-stack must be reg-reg in order to be adequately fast.
> 		^^^^
>     But note that program-memory bandwidth is the one thing there's no
>     shortage of.
> 
> Enough of this unsupported nonsense... Try read something about the machines
> above. Don't believe every urban legend you hear :->.

I've read the CRISP paper, and several other stack-stack papers. In
all of them they mention that caching the top <n> entries of the stack
in hardware registers was a Big Win performance-wise. All I said was
that a register is a register, whether it is accessed as a "stack" or
explicitly as a register. As for the statement "program-memory
bandwidth is the one thing there's no shortage of", I point you
towards: a) large cache memories, b) locality of reference (>90%, with
a large cache), and c) interleaved memories, which allow you to
execute sequentially, using slow memories, with no performance hit
(but when you hit a branch, you may have a major performance hit --
which is why reducing the number of branches in a RISC is a Good
Thing). None of this is particularly new or revolutionary. Seymour Cray
has been doing it since the late 60's. The only new thing is that
these techniques are now being used in desktop computers, by, amongst
others, MIPS, AMD, Sun (SPARC) and Motorola (68040, 88000).
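
The interleaving claim (point c) can be checked with a toy model. This
is only a sketch under stated assumptions: word addresses map to banks
round-robin, a bank stays busy for a fixed number of cycles after each
access, at most one request is issued per cycle, and there is no
cache. All names and parameters are hypothetical.

```c
#include <assert.h>

#define MAX_BANKS 16

/* Cycles needed to fetch `count` words starting at address 0 with
   the given stride, from `banks` interleaved banks that are each
   busy for `busy` cycles per access. */
unsigned long fetch_cycles(unsigned banks, unsigned busy,
                           unsigned long stride, unsigned count)
{
    unsigned long ready[MAX_BANKS] = {0};  /* cycle each bank frees up */
    unsigned long now = 0, finish = 0;
    unsigned i;

    assert(banks > 0 && banks <= MAX_BANKS);
    for (i = 0; i < count; i++) {
        unsigned bank = (unsigned)((i * stride) % banks);
        unsigned long start = now > ready[bank] ? now : ready[bank];

        ready[bank] = start + busy;  /* bank is tied up until then */
        finish = ready[bank];
        now = start + 1;             /* next request, one cycle later */
    }
    return finish;
}
```

With 4 banks that are each busy for 4 cycles, 16 sequential words take
19 cycles, about one word per cycle. If every access lands in the same
bank (stride 4, or a single bank), the same 16 words take 64 cycles.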

So while "essentially unlimited" is perhaps a bit strong, I (and most
RISC advocates) still maintain that the number of instructions, and
the size of each instruction, are NOT the limiting factor insofar as
performance goes. 

Good reference to "How fast can we fetch opcodes?": 
   "A VLIW architecture for a trace scheduling compiler"
   Robert P. Colwell et al., CAM proceedings vol 15 #5 p180

--
|    // Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     |
|   //  ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849     |
|  //    Join the Church of HAL, and worship at the altar of all computers  |
|\X/   with three-letter names (e.g. IBM and DEC). White lab coats optional.|