Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!purdue!ames!lll-winken!vette!brooks
From: brooks@vette.llnl.gov (Eugene Brooks)
Newsgroups: gnu.gcc
Subject: Re: Porting gcc to the new Sun SPARCstation 1 and SPARCstation 300 series
Message-ID: <24614@lll-winken.LLNL.GOV>
Date: 4 May 89 05:14:25 GMT
References: <8905030323.AA27122@jato.Jpl.Nasa.Gov> <5441@cs.utexas.edu>
Sender: usenet@lll-winken.LLNL.GOV
Reply-To: brooks@maddog.llnl.gov (Eugene Brooks)
Distribution: gnu
Organization: Lawrence Livermore National Laboratory
Lines: 46

In article grunwald@flute.cs.uiuc.edu writes:
>
>I've been curious about the talk of scheduling code mentioned here and
>elsewhere.
>
>Other than the possibility of schedules affecting register allocation,
>is there any reason to do schedules in the compiler? Would this be
>better supported by the assembler and/or an object-code-to-object-code
>scheduler?

The key problem for a "post processor" scheduler is the compiler reusing
a register in a way which prevents efficient scheduling. For instance,
given the following code and a load/store architecture

    a = b + c;
    d = e + f;

GCC will emit something along the lines of

    load  r0,c
    load  r1,b
    add   r0,r1
    store r0,a
    load  r0,f
    load  r1,e
    add   r0,r1
    store r0,d

A scheduler will have a hard time getting much out of this, because of
the resource conflicts generated by the register allocation: the second
statement reuses r0 and r1, so its loads must wait until the first
statement is done with those registers.

We faced this problem for a simulated load/store architecture which had
quite a few registers, using a PCC-based compiler, and solved it by
changing the scratch register allocator to run "round robin" around the
available registers when looking for one. I doubt it is "optimal", but
the simple heuristic works very well in practice. A post-processing
optimizer now does not run into the conflict problem above, and in fact
can easily be taught to do all kinds of clever common subexpression and
redundant load removals with simple pattern matching.
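To make the effect of the "round robin" search concrete, here is a small
sketch in Python. It is not the PCC or GCC allocator; the class and
function names (LowestFree, RoundRobin, emit, schedule_length) are
invented for illustration. It assumes unit latency, unlimited issue
width, and memory operands that never alias, so only register
dependencies constrain the schedule.

```python
class LowestFree:
    """Hypothetical baseline: always hand out the lowest free register."""
    def __init__(self, nregs):
        self.free = set(range(nregs))

    def alloc(self):
        r = min(self.free)
        self.free.remove(r)
        return r

    def release(self, r):
        self.free.add(r)


class RoundRobin:
    """Hypothetical round-robin policy: resume the search just past
    the register handed out last time."""
    def __init__(self, nregs):
        self.nregs = nregs
        self.free = set(range(nregs))
        self.next = 0

    def alloc(self):
        for i in range(self.nregs):
            r = (self.next + i) % self.nregs
            if r in self.free:
                self.free.remove(r)
                self.next = (r + 1) % self.nregs
                return r
        raise RuntimeError("out of scratch registers")

    def release(self, r):
        self.free.add(r)


def emit(statements, allocator):
    """Emit load/load/add/store for each (dest, x, y) meaning dest = x + y."""
    code = []
    for dest, x, y in statements:
        r0 = allocator.alloc()
        r1 = allocator.alloc()
        code.append(("load", r0, y))
        code.append(("load", r1, x))
        code.append(("add", r0, r1))      # two-operand form: r0 = r0 + r1
        code.append(("store", r0, dest))
        allocator.release(r0)
        allocator.release(r1)
    return code


def reads_writes(insn):
    """Return (registers read, registers written) for one instruction."""
    op, reg, arg = insn
    if op == "load":
        return set(), {reg}
    if op == "add":
        return {reg, arg}, {reg}
    if op == "store":
        return {reg}, set()
    raise ValueError(op)


def schedule_length(code):
    """Cycles needed when every register dependency (RAW, WAR, WAW)
    forces the later instruction into a later cycle."""
    cycle = []
    for j, insn in enumerate(code):
        rj, wj = reads_writes(insn)
        start = 0
        for i in range(j):
            ri, wi = reads_writes(code[i])
            if (wi & rj) or (ri & wj) or (wi & wj):
                start = max(start, cycle[i] + 1)
        cycle.append(start)
    return max(cycle) + 1 if cycle else 0


stmts = [("a", "b", "c"), ("d", "e", "f")]
print(schedule_length(emit(stmts, LowestFree(16))))   # reuses r0, r1
print(schedule_length(emit(stmts, RoundRobin(16))))   # moves on to r2, r3
```

Under these assumptions the two statements above need 6 cycles with the
lowest-free policy and 3 with round robin, since the second statement no
longer shares any register with the first.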
For our simulated architecture the operations were of the form

    op dest,op1,op2

which did not destroy register contents. This was quite useful in
improving optimizer effectiveness. I have checked that you can suitably
change the register allocator for GCC. You do not need many registers
for the "round robin" trick to work well; 16 scratch registers show a
good effect and 32 is more than you need for typical code. Of course,
unrolled loops and long pipeline latencies could use LOTS of registers,
but that is what VECTOR registers are for.

brooks@maddog.llnl.gov, brooks@maddog.uucp