Path: utzoo!attcan!uunet!decwrl!bacchus.pa.dec.com!jmd
From: jmd@decwrl.dec.com (John Danskin)
Newsgroups: comp.arch
Subject: Re: Compiler Costs
Message-ID: <JMD.90Jul26130918@elf5.stars.flab.Fujitsu.JUNET>
Date: 26 Jul 90 17:09:18 GMT
References: <40052@mips.mips.COM> <628@dg.dg.com> <64045@sgi.sgi.com> <640@dg.dg.com>
Sender: news@wrl.dec.com (News)
Organization: DEC Advanced Technology Development, Palo Alto, CA
Lines: 110


I just wanted to put my two cents in on this.  A lot of people make
generalizations based on the kind of coding they do, and state that
the kind of coding other people are paid to do is stupid.

I work on low level 3D graphics for digital equipment in palo alto.
We build medium end 3D workstations (whatever those are). Somewhere in
the workstation, there is alwoays a goemetry accelerator (if there
isn't, some other group does the work). So far, for us, the geometry
accelerator is some kind of RISC chip with hot floating point and a
reasonably good C compiler. We have used weitek Accell parts and the
Intel i860.

I have been the person who examines chips, and says how fast they will
go, so the people who decide how fast we need to go and how much we
want to pay for a chip can make up their minds.

To do this, I spend a few weeks writing machine/assembler code for the
chip for the graphics primitives we think are interesting, and count
the bus transactions, and talk to the chip company and the designers
etc.  Eventually, I have some pretty good pieces of code that
represent a speed of light for the chip and the proposed memory
subsystem and communications bottleneck. This might take a few weeks.
We evaluate a lot more chips than we build systems out of.

When we finally decide to build a system, we start out by coding
everything in C (or copying the old C code from the last project). It
runs with simulated hardware and communications at a fairly hokey
level for a while while we wait for prototype real hardware.
Eventually, we get real hardware, port to it, and the stuff runs in C.
With pretty good C at this point (the algorithms are straightforward,
well known, and we know them by now) the code runs between 10 and 30
times slower than the speed of light for the configuration, which is
of course, what we are committed to achieving. The next 90% of the schedule
is budgeted to optimization. We achieve this improvement by:

	recoding in assembler. Some routines are recoded several times with
	slightly different tradeoffs and degrees of aggresiveness in
	loop unrolling and inlining. This is extremely expensive.
	
	Incidentally, we do start with a few really important expensive
	routines, but those unimportant data shuffling, branching,
	procedure calling routines rapidly get important as we run out
	of things to do to those nifty little floating point routines.

	Sometimes we just recode in C a few times until we hit on a way to
	get the compiler to generate reasonable code.

	We seem to get about 5X performance out of this phase.

	Other things. I wrote a cache optimizing linker. Using a call
	graph and some human intervention (a hand generated sed script),
	It reorders modules so that there will be no cache misses
	in a direct mapped cache (it knows the size) for code paths
	that we think are important. Usually these are the paths for
	the graphics primitives with some of the infrequent stuff edited
	out of the call graph, and things that cannot happen in the same
	primitive flagged as OK to overlap in the cache.

	On the system I developed this for, we got a 2X performance increase.
	(Before, we had about 90% hits with a penalty for a miss of 5 cycles,
	since one miss implies another if you are thrashing on a direct cache,
	this means the penalty was really 10 cycles.) 10% of operations taking
	10 times as long means the whole thing takes twice as long.

	What clued us in to this cache problem, was a routine that we recoded in
	assembler. We brought this critical routine from 400 cycles to 40 cycles,
	but the code ran slightly slower. We looked at the link map, and sure enough,
	the resize had caused it to overlap with this other critical routine...
	To recoding in assembly CAN slow down your code, but usually there is a
	lesson more interesting than "Assembly coding is bad".

	Usually we improve the hardware a little bit as we have better
	data on which parts *really* are bottlenecks.


Why do we do this?
	Basically because we are a hardware company, and people
	view the part of the system which we produce as part of the
	hardware, and because it seems to be cheaper than including
	10X faster hardware on every system and charging the same.

	People do full custom VLSI too. If everybody refused to get
	their hands dirty and look at some bits, nobody would build
	hardware ( a much lower level process than assembly or
	microcoding ).

	The development costs are staggering. The maintanence costs
	are staggering, but still it is all quite a bit cheaper than
	10000 copies of 10X faster hardware.


	Another reason is because our code is floating point intensive,
	but not exactly vectorizable, and the compilers (MIPS included)
	do a relatively poor job of scheduling around multicycle instructions.


	The final reason, is that if we didn't do it, nobody would buy
	our boxes, and I would have to get another job, and I *like*
	working here.


Incidentally, my favorite language this year is scheme. It doesn't
run as fast as C, but sometimes you need to implement programs fast
instead of implementing fast programs.
--
John Danskin				| jmd@decwrl.dec.com  or decwrl!jmd
DEC Advanced Technology Development	| (415) 853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94301			| I do not speak for DEC.