Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!purdue!ames!lll-winken!vette!brooks
From: brooks@vette.llnl.gov (Eugene Brooks)
Newsgroups: gnu.gcc
Subject: Re: Porting gcc to the new Sun SPARCstation 1 and SPARCstation 300 series
Message-ID: <24614@lll-winken.LLNL.GOV>
Date: 4 May 89 05:14:25 GMT
References: <8905030323.AA27122@jato.Jpl.Nasa.Gov> <5441@cs.utexas.edu> <GRUNWALD.89May3093539@flute.cs.uiuc.edu>
Sender: usenet@lll-winken.LLNL.GOV
Reply-To: brooks@maddog.llnl.gov (Eugene Brooks)
Distribution: gnu
Organization: Lawrence Livermore National Laboratory
Lines: 46

In article <GRUNWALD.89May3093539@flute.cs.uiuc.edu> grunwald@flute.cs.uiuc.edu writes:
>
>I've been curious about the talk of scheduling code mentioned here and
>elsewhere.
>
>Other than the possibility of schedules affecting register allocation,
>is there any reason to do schedules in the compiler? Would this be
>better supported by the assembler and/or an object-code-to-object-code
>scheduler?

The key problem for a "post processor" scheduler is the compiler reusing
a register in a way that prevents efficient scheduling.  For instance,
given the following code and a load/store architecture
	a = b + c;
	d = e + f;
GCC will emit something along the lines of
	load r0,c
	load r1,b
	add r0,r1
	store r0,a
	load r0,f
	load r1,e
	add r0,r1
	store r0,d
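A scheduler has a hard time getting much out of this, because the
second statement reuses r0 and r1 and so cannot be overlapped with the
first.  The resource conflicts come from the register allocation, not
from the source code.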
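
To make the conflict concrete, here is a rough sketch (not anything GCC
itself does) that lists the write-after-read and write-after-write
register conflicts between the two statements.  The instruction
encoding, the names REUSED, RENAMED, false_deps and cross_statement,
and the reading of "add r0,r1" as r0 = r0 + r1 are all assumptions made
for the illustration.  The pairwise check is deliberately conservative.

```python
# Each instruction: (text, registers_read, registers_written).
# Assumption: two-operand "add r0,r1" means r0 = r0 + r1.
REUSED = [
    ("load r0,c",  set(),        {"r0"}),
    ("load r1,b",  set(),        {"r1"}),
    ("add r0,r1",  {"r0", "r1"}, {"r0"}),
    ("store r0,a", {"r0"},       set()),
    ("load r0,f",  set(),        {"r0"}),
    ("load r1,e",  set(),        {"r1"}),
    ("add r0,r1",  {"r0", "r1"}, {"r0"}),
    ("store r0,d", {"r0"},       set()),
]

# Same code, but the second statement gets fresh registers r2 and r3.
RENAMED = REUSED[:4] + [
    ("load r2,f",  set(),        {"r2"}),
    ("load r3,e",  set(),        {"r3"}),
    ("add r2,r3",  {"r2", "r3"}, {"r2"}),
    ("store r2,d", {"r2"},       set()),
]


def false_deps(code):
    """Return (i, j, kind) for each WAR/WAW register conflict with i < j."""
    deps = []
    for j, (_, _reads_j, writes_j) in enumerate(code):
        for i in range(j):
            _, reads_i, writes_i = code[i]
            if reads_i & writes_j:
                deps.append((i, j, "WAR"))  # j overwrites what i reads
            if writes_i & writes_j:
                deps.append((i, j, "WAW"))  # j overwrites what i wrote
    return deps


def cross_statement(deps, split=4):
    """Keep only conflicts that tie the second statement to the first."""
    return [d for d in deps if d[0] < split <= d[1]]


if __name__ == "__main__":
    print("reused: ", cross_statement(false_deps(REUSED)))
    print("renamed:", cross_statement(false_deps(RENAMED)))
```

With the reused registers the list is non-empty (for example, the load
of f into r0 must wait for the store of r0 to a).  With fresh registers
it is empty, so the two statements can be interleaved freely.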

We faced this problem for a simulated load/store architecture which had
quite a few registers, using a PCC-based compiler, and solved it by
changing the scratch register allocator to run "round robin" around the
available registers when looking for one.  I doubt it is "optimal" but
the simple heuristic works very well in practice.  A post-processing
optimizer now does not run into the conflict problem above, and in fact
can be taught to do all kinds of clever common subexpression and
redundant load removals easily with simple pattern matching.  For our
simulated architecture the operations were of the three-operand form
op dest,op1,op2, so an operation did not destroy the contents of its
source registers.  This was quite useful in improving optimizer
effectiveness.
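
The difference between the two allocation policies can be sketched as
follows.  This is a minimal illustration, not the PCC or GCC code; the
class names, the alloc/free interface and the helper
registers_for_two_statements are all hypothetical.

```python
class FirstFreeAllocator:
    """Always hands out the lowest-numbered free scratch register."""

    def __init__(self, nregs):
        self.busy = [False] * nregs

    def alloc(self):
        for r, in_use in enumerate(self.busy):
            if not in_use:
                self.busy[r] = True
                return r
        raise RuntimeError("no free scratch register")

    def free(self, r):
        self.busy[r] = False


class RoundRobinAllocator(FirstFreeAllocator):
    """Resumes the search just past the last register handed out."""

    def __init__(self, nregs):
        super().__init__(nregs)
        self.next = 0  # where the next search starts

    def alloc(self):
        n = len(self.busy)
        for step in range(n):
            r = (self.next + step) % n
            if not self.busy[r]:
                self.busy[r] = True
                self.next = (r + 1) % n
                return r
        raise RuntimeError("no free scratch register")


def registers_for_two_statements(allocator):
    """Allocate two scratch registers per statement, freeing in between."""
    used = []
    for _ in range(2):
        x = allocator.alloc()
        y = allocator.alloc()
        used.append((x, y))
        allocator.free(x)
        allocator.free(y)
    return used


if __name__ == "__main__":
    print(registers_for_two_statements(FirstFreeAllocator(16)))
    print(registers_for_two_statements(RoundRobinAllocator(16)))
```

The first-free policy gives both statements registers 0 and 1, which
reproduces the conflict shown earlier.  The round-robin policy gives
the second statement registers 2 and 3, and only wraps back to a
recently used register after going around the whole scratch set.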

I have checked that the register allocator for GCC can be changed in
the same way.  You do not need many registers for the "round robin"
trick to work well; 16 scratch registers show a good effect and 32 is
more than you need for typical code.  Of course, unrolled loops and long
pipeline latencies could use LOTS of registers, but that is what VECTOR
registers are for.


brooks@maddog.llnl.gov, brooks@maddog.uucp