Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: The old, tired tirade on "more registers"
Message-ID:
Date: 25 Dec 90 21:50:14 GMT
References: <3058@crdos1.crd.ge.COM> <1990Dec19.052338.3911@kithrup.COM> <3068@crdos1.crd.ge.COM> <1990Dec19.223934.1568@kithrup.COM>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 200
Nntp-Posting-Host: teachk
In-reply-to: sef@kithrup.COM's message of 19 Dec 90 22:39:34 GMT
X-Old-Subject: Re: Let's pretend
Posting-Front-End: GNU Emacs 18.55.4 of Thu Nov 23 1989 on athene (berkeley-unix)

On 19 Dec 90 22:39:34 GMT, sef@kithrup.COM (Sean Eric Fagan) said:

sef> In article <3068@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com
sef> (bill davidsen) writes:

sef> I meant what I said - "quantify" rather than qualify. Yes optimization
sef> would be better and memory accesses would be down, but how much?

Not a lot... We both know that.

sef> Well... I could suggest you go read any recent (last decade or so)
sef> papers on compiler optimization techniques, which would be chock
sef> full of them.

This is an old diatribe. What I find in the literature is that (for
single threaded architectures) the average "integer" procedure hardly
requires more than 4 spare integer registers, and the average "floating
point" procedure hardly requires more than 4 spare floating point
registers, as the number of values being frequently referenced at any
one time is usually smaller than that. More registers _can_ be used,
but the values so cached are usually accessed infrequently enough that
it does not matter a lot as, once you are past the knee of the curve,
frequency of use drops precipitously.

If you have multiple functional units (superscalar, vector, VLIW, deep
pipelining) multiply by the number of functional units that you can
keep busy at any one time (again, for general purpose codes, the knee
of the curve here is 4), as each functional unit can make use of what
is essentially a separate register set.

Please do not reply with the standard tired, insignificant objections
like "what about loop unrolling, it eats registers fast!". Loop
unrolling is not a goal in itself -- it should be done only when
useful, and it is useful only when there are multiple functional units
that can work in parallel on different stages of the unrolled loop.
Ergo, back to the multiple functional unit case, in which case you
indeed need what are in effect multiple register sets, even if they
are addressable in the same bank.
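To make the point concrete, here is a sketch of my own (not taken from
any of the papers) of the only situation in which unrolling earns its
keep -- a reduction split into independent strands, one accumulator
per strand:

	/* Sketch only: summing an array, unrolled 4 ways with 4
	   independent partial sums.  Each strand wants its own
	   accumulator register, so register demand grows with the
	   number of functional units one hopes to keep busy; on a
	   machine with a single adder the strands just serialize
	   and the extra registers buy nothing. */
	double
	sum4 (const double *v, int n)
	{
	    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
	    int i;

	    for (i = 0; i + 3 < n; i += 4) {
		s0 += v[i];
		s1 += v[i + 1];
		s2 += v[i + 2];
		s3 += v[i + 3];
	    }
	    for (; i < n; i++)		/* leftover elements */
		s0 += v[i];

	    return (s0 + s1) + (s2 + s3);
	}

Four strands, four extra accumulators: the register appetite of
unrolling is just the multiple functional unit case in disguise.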
sef> Also read papers on the RISC chips, and why the register and
sef> instruction sets were chosen.

Reading these papers carefully reveals the two reasons for which more
registers have been found to be "useful" in (single thread) RISC
architectures.

The first is that, since what are currently called RISCs have a
load-store architecture, the cost of accessing operands not in
registers (a "miss" in the statically managed cache which is a
register set) is much higher than for CISC machines (for example the
486 has a cache that essentially doubles as a large register set, as
it is just one cycle away from the main register file), because one
needs two instructions to do it (and the load is likely to stall,
because its slot is going to be needed soon). So, to obtain the same
_average_ access time to a value, one has to have better hit rates.
This requires *many* more registers, because once the knee of the
curve has been passed (and that knee is at 4), hit rates increase
very slowly.

The second is that historically (but this is becoming less and less
true) fashionable RISCs have relied on "clever" compiler technology to
optimize reference patterns. The problem was that such "clever"
technology optimized _static_ reference patterns, by putting into
registers values frequently mentioned in the program *text*, i.e.
reducing code size, not _dynamic_ reference patterns, by putting into
registers values frequently referenced during execution, i.e. reducing
run time.

Normally there is little correlation between the number of static and
dynamic references to a value, so in practice the best way to make
static reference optimization also optimize dynamic references has
been to have register sets so large that they cache all the values
statically referenced in a function. In practice this means that even
static reference optimization itself is usually unnecessary, because
equivalent results could be obtained by just register'ing all values
without any static or dynamic analysis. Papers exist to show that the
number of values referenced at all in any one section of code rarely
exceeds 20, so a 32 element register set is virtually guaranteed to
result in optimal caching by simply caching all values, without
requiring any analysis whatsoever.

It can be argued that both reasons above are *problems/misdesigns*
with what currently passes for RISC, and thus are no reason to argue
for large register sets. One defense however can be made: RISC wins
because it is about streamlined and opportunistic implementation
(getting the CPI down for frequently executed instructions), and a
load-store architecture helps with faster decoding. Other alternative
organizations are conceivable that also have fast decoding and keep
the CPI down (multiple stacks, almost-zero-address machines).
Load-store does however deliver faster decoding, even if at the price
of requiring loads of otherwise unnecessary registers. "Clever"
optimization is just laughable -- papers exist to demonstrate that
simple heuristics (not even feedback profiling like MIPS or MultiFlow)
usually predict fairly well which few values are going to be
frequently referenced at run time.

Now let's look at the example that you put forward to show that more
registers would be useful on the 386:

sef> Here is a sample of code:

sef>	r2 = r3 = inb (0x3b8);
sef>	r2 |= 8;
sef>	outb (0x3b8, r2);

sef> (r0 through r7 are declared locally as 'unsigned long r0, r1, ...;', and
sef> inb and outb are declared as 'static inline unsigned char ...', and written
sef> using inline assembly)
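For concreteness, since the definitions of "inb" and "outb" were not
posted, I will assume they look something like the following GNU CC
extended asm sketch (a guess at their shape, not the actual code):

	static inline unsigned char
	inb (unsigned short port)
	{
	    unsigned char value;
	    /* "=a": the byte read comes back in %al; "d": port in %dx */
	    __asm__ __volatile__ ("inb %1,%0" : "=a" (value) : "d" (port));
	    return value;
	}

	static inline void
	outb (unsigned short port, unsigned char value)
	{
	    /* "a": value goes out from %al; "d": port in %dx */
	    __asm__ __volatile__ ("outb %0,%1" : : "a" (value), "d" (port));
	}

If the real definitions are anything of this shape, the constraints
already pin the port to "%dx" and the byte to "%al", which makes the
spilling in the generated code below look all the more gratuitous.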
OK, I hate defending the 386, but here we shall see that the alleged
problems simply do not exist.

sef> Here is the code gcc generates for that:

sef>	movl $952,-220(%ebp)
sef>	movw -220(%ebp),%dx
sef>	inb (%dx)

This is "inb(0x3b8)". If the code generator were cleverer this could
be dramatically simplified.

sef>	movb %al,-216(%ebp)
sef>	movzbl -216(%ebp),%eax
sef>	movl %eax,-216(%ebp)
sef>	movzbl -216(%ebp),%eax
sef>	movl %eax,-212(%ebp)
sef>	movl %eax,-216(%ebp)

This unfortunate code sequence stores the byte returned by "inb" into
"r2" and "r3". It is simply a classic example of bad code generation.
The same thing can be done in next to nothing with better coding. A
problem with the code generator again, not with the 386 instruction
set.

sef>	movl -220(%ebp),%eax
sef>	movl %eax,-220(%ebp)

This just does not make any sense. It is pure bad code generation.

sef>	movl -216(%ebp),%eax
sef>	orl $8,%eax
sef>	movl %eax,-216(%ebp)

This does the "r2 |= 8". It seems reasonable to me. Surely though "r2"
could have been kept in a register, had it been so declared.

sef>	movw -220(%ebp),%dx
sef>	movb -216(%ebp),%al
sef>	outb (%dx)

This does the "outb(0x3b8,r2)". Again it seems reasonable to me. The
compiler however does not recognize that there is no need to reload
"r2" into "al", because it is already there; the same is true for
"0x3b8" in "dx". Frankly both "r3" and "r2" could have been kept in
registers. Try using the AT&T 'cc' and the "register" storage class.

For example a good code generator could rewrite this as:

	/ %ebx caches r3, %ecx caches r2, %edx caches 0x3b8
	movl $952,%edx
	xor %eax,%eax
	inb (%dx)
	movl %eax,%ebx
	movl %eax,%ecx
	orl $8,%ecx
	movl %ecx,%eax
	outb (%dx)

Code of this sort is attainable with the AT&T rcc for the 386 (I will
try it at home as soon as I get back). I once posted a CoreCopy()
function, which is highly time critical. I have looked at the code
generated by rcc on the 386 for it, and it seems pretty OK to me.

The problems above do not stem from a lack of registers, but rather
from the fact that unfortunately the 386 code generator for GCC does a
pretty poor job. This is for two reasons: the GCC code generator
tables for the 386 are not well tuned, and the GCC code generation
scheme is ill suited to the 386.

If I were to list the problems of the 386 architecture, lack of
registers would not feature prominently. I would rather point to the
awkwardness of using three lengths of integers (the 386 easily
supports only either 8/16 or 8/32 bit integer modes) and the awkward
way floating point registers must be managed (both GCC and the AT&T
rcc do a poor job here).

The one case where one would really need more registers on the 386 is
when running an interpreter, to keep the state of the interpreted
machine in registers. This is still sort of "vertical" multithreading
though, in that one essentially wants one set of registers for the
interpreter program, and one for the interpreted program.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk