Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: The old, tired tirade on "more registers"
Message-ID:
Date: 25 Dec 90 21:50:14 GMT
References: <3058@crdos1.crd.ge.COM> <1990Dec19.052338.3911@kithrup.COM> <3068@crdos1.crd.ge.COM> <1990Dec19.223934.1568@kithrup.COM>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 200
Nntp-Posting-Host: teachk
In-reply-to: sef@kithrup.COM's message of 19 Dec 90 22:39:34 GMT
X-Old-Subject: Re: Let's pretend
Posting-Front-End: GNU Emacs 18.55.4 of Thu Nov 23 1989 on athene (berkeley-unix)

On 19 Dec 90 22:39:34 GMT, sef@kithrup.COM (Sean Eric Fagan) said:

sef> In article <3068@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com
sef> (bill davidsen) writes:

sef> I meant what I said - "quantify" rather than qualify. Yes optimization
sef> would be better and memory accesses would be down, but how much?

Not a lot... We both know that.

sef> Well... I could suggest you go read any recent (last decade or so)
sef> papers on compiler optimization techniques, which would be chock
sef> full of them.

This is an old diatribe. What I find in the literature is that (for
single threaded architectures) the average "integer" procedure hardly
requires more than 4 spare integer registers, and the average "floating
point" procedure hardly requires more than 4 spare floating point
registers, as the number of values being frequently referenced at any
one time is usually smaller than that. More registers _can_ be used,
but the values so cached are usually accessed infrequently enough that
it does not matter a lot as, once you are past the knee of the curve,
frequency of use drops precipitously.

If you have multiple functional units (superscalar, vector, VLIW, deep
pipelining) multiply by the number of functional units that you can
keep busy at any one time (again, for general purpose codes, the knee
of the curve here is 4), as each functional unit can make use of what
is essentially a separate register set.

Please do not reply with the standard tired, insignificant objections
like "what about loop unrolling, it eats registers fast!". Loop
unrolling is not a goal in itself -- it should be done only when
useful, and it is useful only when there are multiple functional units
that can work in parallel on different stages of the unrolled loop.
Ergo, back to the multiple functional unit case, in which case you
indeed need what are in effect multiple register sets, even if they
are addressable in the same bank.
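To make the point concrete, here is a sketch of my own (not taken from
any of the papers) of the only situation in which unrolling earns its
keep -- a reduction split into independent strands, one accumulator
per strand:

	/* Sketch only: summing an array, unrolled 4 ways with 4
	   independent partial sums.  Each strand wants its own
	   accumulator register, so register demand grows with the
	   number of functional units one hopes to keep busy; on a
	   machine with a single adder the strands just serialize
	   and the extra registers buy nothing. */
	double
	sum4 (const double *v, int n)
	{
	    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
	    int i;

	    for (i = 0; i + 3 < n; i += 4) {
		s0 += v[i];
		s1 += v[i + 1];
		s2 += v[i + 2];
		s3 += v[i + 3];
	    }
	    for (; i < n; i++)		/* leftover elements */
		s0 += v[i];

	    return (s0 + s1) + (s2 + s3);
	}

Four strands, four extra accumulators: the register appetite of
unrolling is just the multiple functional unit case in disguise.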
sef> Also read papers on the RISC chips, and why the register and
sef> instruction sets were chosen.

Reading these papers carefully reveals the two reasons for which more
registers have been found to be "useful" in (single thread) RISC
architectures.

The first is that, since what are currently called RISCs have a
load-store architecture, the cost of accessing operands not in
registers (a "miss" in the statically managed cache which is a
register set) is much higher than for CISC machines (for example the
486 has a cache that essentially doubles as a large register set, as
it is just one cycle away from the main register file), because one
needs two instructions to do it (and the load is likely to stall,
because its slot is going to be needed soon). So, to obtain the same
_average_ access time to a value, one has to have better hit rates.
This requires *many* more registers, because once the knee of the
curve has been passed (and that knee is at 4), hit rates increase
very slowly.

The second is that historically (but this is becoming less and less
true) fashionable RISCs have relied on "clever" compiler technology to
optimize reference patterns. The problem was that such "clever"
technology optimized _static_ reference patterns, by putting into
registers values frequently mentioned in the program *text*, i.e.
reducing code size, not _dynamic_ reference patterns, by putting into
registers values frequently referenced during execution, i.e. reducing
run time.

Normally there is little correlation between the number of static and
dynamic references to a value, so in practice the best way to make
static reference optimization also optimize dynamic references has
been to have register sets so large that they cache all the values
statically referenced in a function. In practice this means that even
static reference optimization itself is usually unnecessary, because
equivalent results could be obtained by just register'ing all values
without any static or dynamic analysis. Papers exist to show that the
number of values referenced at all in any one section of code rarely
exceeds 20, so a 32 element register set is virtually guaranteed to
result in optimal caching by simply caching all values, without
requiring any analysis whatsoever.

It can be argued that both reasons above are *problems/misdesigns*
with what currently passes for RISC, and thus are no reason to argue
for large register sets. One defense however can be made: RISC wins
because it is about streamlined and opportunistic implementation
(getting the CPI down for frequently executed instructions), and a
load-store architecture helps with faster decoding. Other alternative
organizations are conceivable that also have fast decoding and keep
the CPI down (multiple stacks, almost-zero-address machines).
Load-store does however deliver faster decoding, even if at the price
of requiring loads of otherwise unnecessary registers. "Clever"
optimization is just laughable -- papers exist to demonstrate that
simple heuristics (not even feedback profiling like MIPS or MultiFlow)
usually predict fairly well which few values are going to be
frequently referenced at run time.

Now let's look at the example that you put forward to show that more
registers would be useful on the 386:

sef> Here is a sample of code:

sef>	r2 = r3 = inb (0x3b8);
sef>	r2 |= 8;
sef>	outb (0x3b8, r2);

sef> (r0 through r7 are declared locally as 'unsigned long r0, r1, ...;', and
sef> inb and outb are declared as 'static inline unsigned char ...', and written
sef> using inline assembly)
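For concreteness, since the definitions of "inb" and "outb" were not
posted, I will assume they look something like the following GNU CC
extended asm sketch (a guess at their shape, not the actual code):

	static inline unsigned char
	inb (unsigned short port)
	{
	    unsigned char value;
	    /* "=a": the byte read comes back in %al; "d": port in %dx */
	    __asm__ __volatile__ ("inb %1,%0" : "=a" (value) : "d" (port));
	    return value;
	}

	static inline void
	outb (unsigned short port, unsigned char value)
	{
	    /* "a": value goes out from %al; "d": port in %dx */
	    __asm__ __volatile__ ("outb %0,%1" : : "a" (value), "d" (port));
	}

If the real definitions are anything of this shape, the constraints
already pin the port to "%dx" and the byte to "%al", which makes the
spilling in the generated code below look all the more gratuitous.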
OK, I hate defending the 386, but here we shall see that the alleged
problems simply do not exist.

sef> Here is the code gcc generates for that:

sef>	movl $952,-220(%ebp)
sef>	movw -220(%ebp),%dx
sef>	inb (%dx)

This is "inb(0x3b8)". If the code generator were cleverer this could
be dramatically simplified.

sef>	movb %al,-216(%ebp)
sef>	movzbl -216(%ebp),%eax
sef>	movl %eax,-216(%ebp)
sef>	movzbl -216(%ebp),%eax
sef>	movl %eax,-212(%ebp)
sef>	movl %eax,-216(%ebp)

This unfortunate code sequence stores the byte returned by "inb" into
"r2" and "r3". It is simply a classic example of bad code generation.
The same thing can be done in next to nothing with better coding. A
problem with the code generator again, not with the 386 instruction
set.

sef>	movl -220(%ebp),%eax
sef>	movl %eax,-220(%ebp)

This just does not make any sense. It is pure bad code generation.

sef>	movl -216(%ebp),%eax
sef>	orl $8,%eax
sef>	movl %eax,-216(%ebp)

This does the "r2 |= 8". It seems reasonable to me. Surely though "r2"
could have been kept in a register, had it been so declared.

sef>	movw -220(%ebp),%dx
sef>	movb -216(%ebp),%al
sef>	outb (%dx)

This does the "outb(0x3b8,r2)". Again it seems reasonable to me. The
compiler however does not recognize that there is no need to reload
"r2" into "al", because it is already there; the same is true for
"0x3b8" in "dx". Frankly both "r3" and "r2" could have been kept in
registers. Try using the AT&T 'cc' and the "register" storage class.

For example a good code generator could rewrite this as:

	/ %ebx caches r3, %ecx caches r2, %edx caches 0x3b8
	movl $952,%edx
	xor %eax,%eax
	inb (%dx)
	movl %eax,%ebx
	movl %eax,%ecx
	orl $8,%ecx
	movl %ecx,%eax
	outb (%dx)

Code of this sort is attainable with the AT&T rcc for the 386 (I will
try it at home as soon as I get back). I once posted a CoreCopy()
function, which is highly time critical. I have looked at the code
generated by rcc on the 386 for it, and it seems pretty OK to me.

The problems above do not stem from a lack of registers, but rather
from the fact that unfortunately the 386 code generator for GCC does a
pretty poor job. This is for two reasons: the GCC code generator
tables for the 386 are not well tuned, and the GCC code generation
scheme is ill suited to the 386.

If I were to list the problems of the 386 architecture, lack of
registers would not feature prominently. I would rather point to the
awkwardness of using three lengths of integers (the 386 easily
supports only either 8/16 or 8/32 bit integer modes) and the awkward
way floating point registers must be managed (both GCC and the AT&T
rcc do a poor job here).

The one case where one would really need more registers on the 386 is
when running an interpreter, to keep the state of the interpreted
machine in registers. This is still sort of "vertical" multithreading
though, in that one essentially wants one set of registers for the
interpreter program, and one for the interpreted program.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk