Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg From: pcg@cs.aber.ac.uk (Piercarlo Grandi) Newsgroups: comp.arch Subject: Re: Register Count Message-ID: Date: 15 Jan 91 22:58:51 GMT References: <11566@pt.cs.cmu.edu> <1991Jan14.163739.10786@rice.edu> Sender: pcg@aber-cs.UUCP Organization: Coleg Prifysgol Cymru Lines: 170 Nntp-Posting-Host: teachk In-reply-to: preston@ariel.rice.edu's message of 14 Jan 91 16:37:39 GMT On 14 Jan 91 16:37:39 GMT, preston@ariel.rice.edu (Preston Briggs) said: preston> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: [ ... register coloring has the wrong goal; it works only because there are often so many registers that *any* register allocation policy would win, by not being used ... ] preston> The idea of managing registers/memory via graph coloring is an preston> old one (at least back to '61). The basic idea is to minimize preston> the register requirements (as in "find a minimal coloring"). preston> Spilling isn't a part of graph coloring; it's a part of preston> register allocation. Well, this is a bit misleading -- graph coloring and spilling may be different modules of a compiler, but the *goal* of graph coloring is to minimize the number of spills. preston> The advantage of graph coloring is that it provides a simple preston> abstraction from the complexities of register allocation on preston> various interesting machines. The reason it works well (and it preston> _does_) is that many machines have enough registers to run most preston> programs. In other words that several modern architectures have many more registers than they would otherwise need, just to make an inappropriate compiler technology work, by never de facto exercising it. Many machines as a result have ironically enough registers that running graph coloring itself is pointless, because there is no need to minimise register usage. The problem is that registers are not free. It reminds me of the System V swapper/pager -- its policies are so bad that the only way to make it work well is to have enough memory (ten times the working set) that swapping/paging never occurs :-). preston> Lots of people complain about the difficulty of implementation preston> and suggest some simpler alternative. My experience is that a preston> graph coloring allocator is not that hard to build, I agree, the graph coloring allocator is not difficult to build -- it can be done in a few dozen lines of code. Complaints dismissed :-). But the tedious and dangerous part is to do the global analysis that builds the graph. preston> Further, every register allocator based on graph coloring that preston> I've read about (and I read aggressively in this area) uses preston> heuristics to estimate the dynamic cost of spills. In preston> addition, some use of profiling info. Spill choices are always preston> based on these dynamic estimates. Are we sure we are not thinking of the same papers, only that I read them as in "if you have heuristics/profiling graph coloring is virtually useless" and you read them as "graph coloring allocators do make use of heuristics/profiling"? :-) [ ... on Wall's papers ... ] preston> It's important to be a little sceptical though, and look for preston> holes (reasons why his measurements might not apply to preston> particular situations). My reckoning is that his papers are essentially valid for architectures like that of his test machine, where the internal degree of parallelism is small, and the applications are general purpose. More research is needed to try to see how the knee of the curve shifts along two dimensions; type of application and internal degree of parallelism of the machine. Not along the degree of "optimization" dimension; either the code generator does a good job of matching the application dataflow pattenrs to those of the CPU, and then it is simply doing competent code generation, or it is badly designed, or if the machine and application dataflow patterns are not well matched, no additional "optimization" will buy anything. But first I think more research is needed into application and machine characterization in themselves... The only serious amount of work in this field has been done by the parallel computer people, for medium and coarse level parallelism. Microparallelism has not been as widely studied. Some work, at the language level, has been done by the functional language people (not surprisingly, they often are the same ones that like dataflow and graph reduction architectures), but that's hardly mainstream. preston> For his 86 paper, I wonder about: preston> 1) The balance of his machine (cost of add versus cost of load). preston> In the 5 years since the 86 paper, spills have gotten more preston> expensive. Yes. But the question is: is a bigger flat register file the best way to provide extra buffering against increased memory latency? *Assuming* so sounds like CISC designers assuming that ever more sophisticated instructions are the key to greater speed. preston> Consider also various forms of low-level parallelism preston> (pipelines and such). These aren't recent inventions, but preston> they change the shape of the code. They are important issues -- and I seem to have written that the number of useful registers is a more or less linear function of the exploitable degree of parallelism, for spare registers, and of the frequency of reuse of values, for cache registers. For contemporary microcomputer implementations, with low limits on both, this means that far less registers are needed than "optimizer" mythology assumes. For future implementations, which are supposed to have much greater internal degrees of parallelism, more registers will be needed. But high degrees of parallelism imply well predictable and independent dataflow patterns -- separate register sets seem the best bet here, as SIMD/MIMD architectures for even higher degress of parallelism. NOTE: I have *never* written that there is an *absolute* upper limit to the number of useful registers for *all* possible applications or machine architectures. I have nothing to complain about the size of the huge vector register banks of a CRAY. preston> 2) His optimizer. It's not mentioned beyond "The compiler generates preston> good code that does not load a variable twice in the same basic preston> block..." Sounds like value numbering to me. Optimization needs preston> registers. More powerful optimization needs more registers (and preston> the payoff is faster programs). This is an *assumption*, and a bizarre one. "Optimization" does not create the need for more registers -- the need for registers is strictly function of the intrinsic microparallelism of the application and machine, as evinced from their dataflow patterns. These, for general purpose applications on contemporary general purpose microprocessors, imply fairly low register requirements. "Optimization" cannot get around this. I have detected, also in e-mail discussions, that there is some misudenrstanding of the intention of my postings: My postings sneer at the "more registers!" and "more optimization!" fashions, by pointing out that: * optimal number of registers is a functions of machine and application dataflow patterns, not of "optimization"; * there is some evidence that leads to believe that such optimal number is suprinsingly low for contemporary processors with limited internal parallelism and not that much greater levels of parallelism are exploitable, for general purpose applications; * for non general purpose applications SIMD/MIMD machines and their special rules are the best bet; * much of global analysis that makes "optimization" hard and hazardous can be obviated by a little bit more thoughtfulness at the language level, such as using language paradigms of appropriate expressive power, instead of reverse engineering in the "optimizer" poor choices made at higher layers; * just to make my criticism constructive, I hazard showing that credible (in line of principle) specific alternatives are possible, so that my call for more research seem more alluring; * perversely, a lot of research effort is spent instead on "more registers!" and "more optimization!", comparatively little (as published) on characterization of low level dataflow patterns and on careful language design. Dusty decks (or ideas) rule? The objections I see to my articles seem to be totally off the mark with respect to the above points. -- Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk