Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Register Count
Message-ID: <PCG.91Jan15225851@teachk.cs.aber.ac.uk>
Date: 15 Jan 91 22:58:51 GMT
References: <PCG.91Jan10162301@odin.cs.aber.ac.uk> <11566@pt.cs.cmu.edu>
	<1991Jan14.163739.10786@rice.edu>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 170
Nntp-Posting-Host: teachk
In-reply-to: preston@ariel.rice.edu's message of 14 Jan 91 16:37:39 GMT

On 14 Jan 91 16:37:39 GMT, preston@ariel.rice.edu (Preston Briggs) said:

preston> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

	[ ... register coloring has the wrong goal; it works only
	because there are often so many registers that *any* register
	allocation policy would win, by not being used ... ]

preston> The idea of managing registers/memory via graph coloring is an
preston> old one (at least back to '61).  The basic idea is to minimize
preston> the register requirements (as in "find a minimal coloring").
preston> Spilling isn't a part of graph coloring; it's a part of
preston> register allocation.

Well, this is a bit misleading -- graph coloring and spilling may be
different modules of a compiler, but the *goal* of graph coloring is to
minimize the number of spills.

preston> The advantage of graph coloring is that it provides a simple
preston> abstraction from the complexities of register allocation on
preston> various interesting machines.  The reason it works well (and it
preston> _does_) is that many machines have enough registers to run most
preston> programs.

In other words that several modern architectures have many more
registers than they would otherwise need, just to make an inappropriate
compiler technology work, by never de facto exercising it.

Many machines as a result have ironically enough registers that running
graph coloring itself is pointless, because there is no need to minimise
register usage. The problem is that registers are not free.

It reminds me of the System V swapper/pager -- its policies are so bad
that the only way to make it work well is to have enough memory (ten
times the working set) that swapping/paging never occurs :-).

preston> Lots of people complain about the difficulty of implementation
preston> and suggest some simpler alternative.  My experience is that a
preston> graph coloring allocator is not that hard to build,

I agree, the graph coloring allocator is not difficult to build -- it
can be done in a few dozen lines of code. Complaints dismissed :-). But
the tedious and dangerous part is to do the global analysis that builds
the graph.

preston> Further, every register allocator based on graph coloring that
preston> I've read about (and I read aggressively in this area) uses
preston> heuristics to estimate the dynamic cost of spills.  In
preston> addition, some use of profiling info.  Spill choices are always
preston> based on these dynamic estimates.

Are we sure we are not thinking of the same papers, only that I read
them as in "if you have heuristics/profiling graph coloring is virtually
useless" and you read them as "graph coloring allocators do make use of
heuristics/profiling"? :-)

	[ ... on Wall's papers ... ]

preston> It's important to be a little sceptical though, and look for
preston> holes (reasons why his measurements might not apply to
preston> particular situations).

My reckoning is that his papers are essentially valid for architectures
like that of his test machine, where the internal degree of parallelism
is small, and the applications are general purpose. More research is
needed to try to see how the knee of the curve shifts along two
dimensions; type of application and internal degree of parallelism of
the machine.

Not along the degree of "optimization" dimension; either the code
generator does a good job of matching the application dataflow pattenrs
to those of the CPU, and then it is simply doing competent code
generation, or it is badly designed, or if the machine and application
dataflow patterns are not well matched, no additional "optimization"
will buy anything.

But first I think more research is needed into application and machine
characterization in themselves... The only serious amount of work in
this field has been done by the parallel computer people, for medium and
coarse level parallelism. Microparallelism has not been as widely
studied. Some work, at the language level, has been done by the
functional language people (not surprisingly, they often are the same
ones that like dataflow and graph reduction architectures), but that's
hardly mainstream.

preston> For his 86 paper, I wonder about:
preston> 1) The balance of his machine (cost of add versus cost of load).
preston>    In the 5 years since the 86 paper, spills have gotten more
preston>    expensive.

Yes. But the question is: is a bigger flat register file the best way to
provide extra buffering against increased memory latency? *Assuming* so
sounds like CISC designers assuming that ever more sophisticated
instructions are the key to greater speed.

preston>    Consider also various forms of low-level parallelism
preston>    (pipelines and such).  These aren't recent inventions, but
preston>    they change the shape of the code.

They are important issues -- and I seem to have written that the number
of useful registers is a more or less linear function of the exploitable
degree of parallelism, for spare registers, and of the frequency of
reuse of values, for cache registers.

For contemporary microcomputer implementations, with low limits on both,
this means that far less registers are needed than "optimizer" mythology
assumes. For future implementations, which are supposed to have much
greater internal degrees of parallelism, more registers will be needed.

But high degrees of parallelism imply well predictable and independent
dataflow patterns -- separate register sets seem the best bet here, as
SIMD/MIMD architectures for even higher degress of parallelism.

NOTE: I have *never* written that there is an *absolute* upper limit to
the number of useful registers for *all* possible applications or
machine architectures. I have nothing to complain about the size of the
huge vector register banks of a CRAY.

preston> 2) His optimizer.  It's not mentioned beyond "The compiler generates
preston>    good code that does not load a variable twice in the same basic
preston>    block..."  Sounds like value numbering to me. Optimization needs
preston>    registers.  More powerful optimization needs more registers (and
preston>    the payoff is faster programs).

This is an *assumption*, and a bizarre one. "Optimization" does not
create the need for more registers -- the need for registers is strictly
function of the intrinsic microparallelism of the application and
machine, as evinced from their dataflow patterns. These, for general
purpose applications on contemporary general purpose microprocessors,
imply fairly low register requirements. "Optimization" cannot get around
this.

I have detected, also in e-mail discussions, that there is some
misudenrstanding of the intention of my postings:

My postings sneer at the "more registers!" and "more optimization!"
fashions, by pointing out that:

* optimal number of registers is a functions of machine and application
dataflow patterns, not of "optimization";

* there is some evidence that leads to believe that such optimal number
is suprinsingly low for contemporary processors with limited internal
parallelism and not that much greater levels of parallelism are
exploitable, for general purpose applications;

* for non general purpose applications SIMD/MIMD machines and their
special rules are the best bet;

* much of global analysis that makes "optimization" hard and hazardous
can be obviated by a little bit more thoughtfulness at the language
level, such as using language paradigms of appropriate expressive power,
instead of reverse engineering in the "optimizer" poor choices made at
higher layers;

* just to make my criticism constructive, I hazard showing that credible
(in line of principle) specific alternatives are possible, so that my
call for more research seem more alluring;

* perversely, a lot of research effort is spent instead on "more
registers!" and "more optimization!", comparatively little (as
published) on characterization of low level dataflow patterns and on
careful language design.  Dusty decks (or ideas) rule?

The objections I see to my articles seem to be totally off the mark with
respect to the above points.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk