Path: utzoo!attcan!uunet!decwrl!bacchus.pa.dec.com!jmd From: jmd@decwrl.dec.com (John Danskin) Newsgroups: comp.arch Subject: Re: Compiler Costs Message-ID: Date: 26 Jul 90 17:09:18 GMT References: <40052@mips.mips.COM> <628@dg.dg.com> <64045@sgi.sgi.com> <640@dg.dg.com> Sender: news@wrl.dec.com (News) Organization: DEC Advanced Technology Development, Palo Alto, CA Lines: 110 I just wanted to put my two cents in on this. A lot of people make generalizations based on the kind of coding they do, and state that the kind of coding other people are paid to do is stupid. I work on low level 3D graphics for digital equipment in palo alto. We build medium end 3D workstations (whatever those are). Somewhere in the workstation, there is alwoays a goemetry accelerator (if there isn't, some other group does the work). So far, for us, the geometry accelerator is some kind of RISC chip with hot floating point and a reasonably good C compiler. We have used weitek Accell parts and the Intel i860. I have been the person who examines chips, and says how fast they will go, so the people who decide how fast we need to go and how much we want to pay for a chip can make up their minds. To do this, I spend a few weeks writing machine/assembler code for the chip for the graphics primitives we think are interesting, and count the bus transactions, and talk to the chip company and the designers etc. Eventually, I have some pretty good pieces of code that represent a speed of light for the chip and the proposed memory subsystem and communications bottleneck. This might take a few weeks. We evaluate a lot more chips than we build systems out of. When we finally decide to build a system, we start out by coding everything in C (or copying the old C code from the last project). It runs with simulated hardware and communications at a fairly hokey level for a while while we wait for prototype real hardware. Eventually, we get real hardware, port to it, and the stuff runs in C. With pretty good C at this point (the algorithms are straightforward, well known, and we know them by now) the code runs between 10 and 30 times slower than the speed of light for the configuration, which is of course, what we are committed to achieving. The next 90% of the schedule is budgeted to optimization. We achieve this improvement by: recoding in assembler. Some routines are recoded several times with slightly different tradeoffs and degrees of aggresiveness in loop unrolling and inlining. This is extremely expensive. Incidentally, we do start with a few really important expensive routines, but those unimportant data shuffling, branching, procedure calling routines rapidly get important as we run out of things to do to those nifty little floating point routines. Sometimes we just recode in C a few times until we hit on a way to get the compiler to generate reasonable code. We seem to get about 5X performance out of this phase. Other things. I wrote a cache optimizing linker. Using a call graph and some human intervention (a hand generated sed script), It reorders modules so that there will be no cache misses in a direct mapped cache (it knows the size) for code paths that we think are important. Usually these are the paths for the graphics primitives with some of the infrequent stuff edited out of the call graph, and things that cannot happen in the same primitive flagged as OK to overlap in the cache. On the system I developed this for, we got a 2X performance increase. (Before, we had about 90% hits with a penalty for a miss of 5 cycles, since one miss implies another if you are thrashing on a direct cache, this means the penalty was really 10 cycles.) 10% of operations taking 10 times as long means the whole thing takes twice as long. What clued us in to this cache problem, was a routine that we recoded in assembler. We brought this critical routine from 400 cycles to 40 cycles, but the code ran slightly slower. We looked at the link map, and sure enough, the resize had caused it to overlap with this other critical routine... To recoding in assembly CAN slow down your code, but usually there is a lesson more interesting than "Assembly coding is bad". Usually we improve the hardware a little bit as we have better data on which parts *really* are bottlenecks. Why do we do this? Basically because we are a hardware company, and people view the part of the system which we produce as part of the hardware, and because it seems to be cheaper than including 10X faster hardware on every system and charging the same. People do full custom VLSI too. If everybody refused to get their hands dirty and look at some bits, nobody would build hardware ( a much lower level process than assembly or microcoding ). The development costs are staggering. The maintanence costs are staggering, but still it is all quite a bit cheaper than 10000 copies of 10X faster hardware. Another reason is because our code is floating point intensive, but not exactly vectorizable, and the compilers (MIPS included) do a relatively poor job of scheduling around multicycle instructions. The final reason, is that if we didn't do it, nobody would buy our boxes, and I would have to get another job, and I *like* working here. Incidentally, my favorite language this year is scheme. It doesn't run as fast as C, but sometimes you need to implement programs fast instead of implementing fast programs. -- John Danskin | jmd@decwrl.dec.com or decwrl!jmd DEC Advanced Technology Development | (415) 853-6724 100 Hamilton Avenue | My comments are my own. Palo Alto, CA 94301 | I do not speak for DEC.