Path: utzoo!attcan!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Compilers taking advantage of architectural enhancements
Message-ID: <PCG.90Oct14180009@odin.cs.aber.ac.uk>
Date: 14 Oct 90 18:00:09 GMT
References: <1990Oct9> <3300194@m.cs.uiuc.edu>
	<AGLEW.90Oct11144920@dwarfs.crhc.uiuc.edu>
	<1990Oct11.223224.26604@rice.edu>
	<AGLEW.90Oct11222801@treflan.crhc.uiuc.edu>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 124
Nntp-Posting-Host: odin
In-reply-to: aglew@crhc.uiuc.edu's message of 12 Oct 90 03:28:01 GMT

On 12 Oct 90 03:28:01 GMT, aglew@crhc.uiuc.edu (Andy Glew) said:

	[ ... some comments on large numbers of registers being useful
	... ]

Bah, as usual. If you use them as a static cache, yes. But isn't a dynamic
cache as good and less trouble? Yes, if you don't use a load-store
architecture.

In a load-store architecture, addressing a line in the cache takes an
extra load or store instruction, and potentially a delay slot;
addressing a line in the register bank takes just a wide register number
field in the current instruction. Note that some compilers are indeed
starting to treat cache lines as registers, by scheduling code to have
optimal cache reference patterns.

So, given that one wants an intermediate cache between the input and
output ports of the CPU functional units, and the main memory, we could
have three alternatives:

1) a dynamic cache addressed with aliases of main memory addresses.

2) a static cache in a separate, much smaller, address space.

3) a cache with multiple stacks, only top of stacks have addresses.

If your architecture can address the main memory address space
"efficiently", 1) is better than 2); if it cannot, 2) is better than 1);
in all cases, IMNHO, 3) is better than either 1) or 2), because it is
dynamic, just like 1), and does not require long addresses, just like 2).
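
To make alternative 3) concrete, here is a toy model in Python. It is
only a sketch under assumptions, not a description of any real machine:
the class name StackCache, the number of stacks and the cached depth are
all made up for illustration. Instructions name only a stack, so the
address field is short as in 2), while what stays close to the CPU is
decided at run time by pushes and pops, as in 1).

```python
class StackCache:
    """Toy model of alternative 3): several stacks, only the tops addressable.

    Hypothetical illustration; names and sizes are assumptions.
    """

    def __init__(self, n_stacks=4, cached_depth=2):
        self.cached_depth = cached_depth
        self.top = [[] for _ in range(n_stacks)]      # fast part, near the CPU
        self.spilled = [[] for _ in range(n_stacks)]  # slow part, in main memory
        self.spills = 0
        self.fills = 0

    def push(self, s, value):
        # An instruction names only the stack number `s`: a short address.
        self.top[s].append(value)
        if len(self.top[s]) > self.cached_depth:
            # The oldest cached entry leaves the fast part dynamically.
            self.spilled[s].append(self.top[s].pop(0))
            self.spills += 1

    def pop(self, s):
        value = self.top[s].pop()
        if not self.top[s] and self.spilled[s]:
            # Refill the fast part from memory when it runs dry.
            self.top[s].append(self.spilled[s].pop())
            self.fills += 1
        return value


def address_bits(n_names):
    """Bits needed for an instruction field that selects one of n_names."""
    return max(1, (n_names - 1).bit_length())
```

With four stacks the operand field needs address_bits(4) bits, against
address_bits(2**32) for a full 32-bit memory address; the spill and fill
counters show that the contents of the fast part are chosen dynamically.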

aglew> I agree with you --- I really don't understand why heterogenous
aglew> register files are so hard to handle.  But homogenous register
aglew> files are one thing that compiler people have gone into
aglew> rhapsodies wrt. RISC about.

That's actually not difficult to comprehend, IMNHO, as soon as you
realize that registers, as currently (mis)understood, perform two
completely different functions (at least -- there are others):

1) inputs and outputs to functional units ("accumulators")

2) statically managed cache ("temporaries").

The former function of registers means that they are essentially entry
and exit ports of a queueing network. In order to generate efficient
code for a queueing network you must analyze flows into it, or something
equivalent. This seems harder than just considering problem 2), which is
already hard enough.

aglew> Here's one example: the Intel 80x86 is basically a heterogenous
aglew> register file machine. Specific registers were tied to the
aglew> outputs and inputs of specific functional units in the original
aglew> hardware.  Compiler people hated targetting this architecture,
aglew> and there are very few compilers that can produce machine code
aglew> comparable to hand-coded assembly on this architecture.

Oh yes. But this is simply because current compiler technology is mostly
based on the belief that registers are there only to be a statically
managed cache. Hence ridiculous things like graph coloring, which
minimizes the *static* costs (e.g. code size) rather than run time,
unless there are so many registers that essentially all values,
including those that are dynamically important, have a chance of ending
up in a register. There are plenty of research papers that show that:

1) the number of dynamically important values is very small, for a
single functional unit.

2) large numbers of registers are useful under graph coloring and
on machines that have a huge gap between register file and cache.

The two sets of results can only be reconciled by observing that:

* vector, superscalar, etc. machines have in effect multiple functional
units

* graph coloring wastes a large number of registers on dynamically
unimportant values, and load/store architectures have a huge gap between
register file and cache.

Not surprising, eh?
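
The difference between static and dynamic cost can be illustrated with a
deliberately naive sketch. Everything in it is an assumption made for
the example: the four values, their reference counts, and a greedy
colorer that knows nothing about execution frequency. Real
graph-coloring allocators are more elaborate and often weight spill
decisions by estimated frequency, so this shows only the distinction
being drawn, not any particular compiler.

```python
def greedy_color(order, interferes, k):
    """Assign one of k registers to each value in `order`, or None (spill)."""
    assignment = {}
    for v in order:
        # Registers already held by neighbours that did get a register.
        taken = {assignment[u] for u in interferes[v]
                 if assignment.get(u) is not None}
        free = [r for r in range(k) if r not in taken]
        assignment[v] = free[0] if free else None
    return assignment


def dynamic_cost(assignment, dyn_refs):
    """Run-time memory references caused by values left out of registers."""
    return sum(dyn_refs[v] for v, reg in assignment.items() if reg is None)


# Hypothetical numbers: four values live at the same time, two registers.
values = ["a", "b", "c", "d"]
interferes = {v: [u for u in values if u != v] for v in values}
dyn_refs = {"a": 3, "b": 2, "c": 1000, "d": 900}  # c and d sit in a hot loop

static_order = values                                        # textual order
dynamic_order = sorted(values, key=dyn_refs.get, reverse=True)

blind = greedy_color(static_order, interferes, k=2)
aware = greedy_color(dynamic_order, interferes, k=2)
```

Both allocations leave two of the four values in memory, so they look
identical statically; dynamic_cost(blind, dyn_refs) is 1900 references
while dynamic_cost(aware, dyn_refs) is 5.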

I reckon that fully specialized registers (e.g. having input-only and
output-only registers that map directly onto functional unit ports) are
best, and that caching temporaries ought not to be done with registers.

I would like a more dataflow-like architecture, in which the input and
output ports of the functional units (and maybe the relative delays) are
directly exposed, and separate.

Caching, IMNHO, ought to be performed using multiple cached stacks, or
anyhow using dynamic caching (e.g. like in the i386/i486, where the
onchip cache is almost a large associative register bank).

Naturally, exposing the functional units and their input and output ports
(hints of VLIW here) means that the number of architecturally visible
ports varies with the number of functional units in different
implementations. This is a problem anyhow; one can solve it in several
ways, e.g.:

0) recompiling for different implementations
1) lengthening of the instruction word (VLIW)
2) register renaming (RS/6000)
3) dynamic queuing (MU dataflow)

You may argue that 0) is not a solution; but consider: it is probably
the best way to take advantage of the specificities of a particular
implementation. 1) is a slightly easier way of doing 0). 2) ensures
binary portability, but I don't see how it could work over a large range
of functional unit numbers. 3) is guaranteed to exploit any number of
functional units nearly optimally, but requires sophisticated hardware.
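
For readers unfamiliar with option 2), here is a minimal sketch of the
general idea of register renaming. It is not a description of the
RS/6000's actual mechanism: the function name, the instruction format
(dest, src1, src2) and the unlimited supply of physical registers (no
free list, no reclamation) are simplifying assumptions.

```python
def rename(instrs, n_arch):
    """Rename (dest, src1, src2) triples over architectural registers.

    Every write gets a fresh physical register, so only true
    (read-after-write) dependencies remain between instructions.
    """
    mapping = {r: r for r in range(n_arch)}  # architectural -> physical
    next_phys = n_arch
    out = []
    for dest, src1, src2 in instrs:
        p1, p2 = mapping[src1], mapping[src2]  # read current names first
        mapping[dest] = next_phys              # then allocate a new name
        out.append((next_phys, p1, p2))
        next_phys += 1
    return out


# r1 is written twice; after renaming the two writes use different
# physical registers, so the second pair of instructions no longer has
# to wait for the first pair to finish with r1.
program = [(1, 2, 3), (4, 1, 1), (1, 5, 6), (7, 1, 1)]
renamed = rename(program, n_arch=8)
```

The binary names the same architectural registers whatever the
implementation; how many physical registers and functional units sit
behind them is left to the hardware.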

aglew> But heterogenous register files are much easier to make fast.

Because, if you choose one of solutions 0) to 2) above, you do not have
to put in logic that maps the register file, used as a static cache,
onto the input and output ports of the functional units. Arguably
solution 3) is so flexible that its potential complexity/speed
disadvantage can be offset by adding extra functional units, even if
there are hints that the inherent parallelism in many applications does
not require a lot of functional units (my rule of thumb is '4').
--
Piercarlo "Peter" Grandi           | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk