Path: utzoo!attcan!uunet!cs.utexas.edu!rice!titan.rice.edu!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: Compilers taking advantage of architectural enhancements
Message-ID: <1990Oct11.223224.26604@rice.edu>
Date: 11 Oct 90 22:32:24 GMT
References: <1990Oct9> <3300194@m.cs.uiuc.edu> <AGLEW.90Oct11144920@dwarfs.crhc.uiuc.edu>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 95

In article <AGLEW.90Oct11144920@dwarfs.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:

>Perhaps we can start
>a discussion that will lead to a list of possible hardware
>architectural enhancements that a compiler can/cannot take advantage of?

I'll comment on a couple of features from the list.

>Register file - large (around 128 registers, or more)
>    Most compilers do not get enough benefit from these to justify
>    the extra hardware, or the slowed down register access.

In the proceedings of Sigplan 90, there's a paper about how to chew
lots of registers.

	Improving Register Allocation for Subscripted Variables
	Callahan, Carr, Kennedy

I suggested the subtitle "How to use all those FP registers"
but nobody was impressed.  Also, there's a limit to how many registers
you need, at least for scientific Fortran.  It depends on the speed
of memory and cache, the speed of the FPU, and the actual applications.
The idea is that once the FPU is running at full speed,
additional registers are wasted.
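
To make the register appetite concrete, here is a minimal sketch of
scalar replacement on a three-point stencil.  It is an illustration
only, not the algorithm from the paper; the function names and the
load counter are made up for the example.  Each value that is reused
across iterations (a0, a1, a2) stands in for one FP register.

```python
def smooth_naive(a, b):
    """b[i] = a[i-1] + a[i] + a[i+1], reloading every operand from memory."""
    loads = 0
    for i in range(1, len(a) - 1):
        b[i] = a[i - 1] + a[i] + a[i + 1]
        loads += 3                      # three array loads per iteration
    return loads


def smooth_scalar_replaced(a, b):
    """Same computation, but reused elements are carried in scalars.

    a0, a1, a2 play the role of FP registers: a wider stencil (more
    reuse) would need more of them, which is where large register
    files get used up.
    """
    if len(a) < 3:
        return 0
    a0, a1 = a[0], a[1]                 # prime the "registers"
    loads = 2
    for i in range(1, len(a) - 1):
        a2 = a[i + 1]                   # the only new load per iteration
        loads += 1
        b[i] = a0 + a1 + a2
        a0, a1 = a1, a2                 # rotate values for the next iteration
    return loads
```

The scalar version does one load per iteration instead of three.  Once
the loads no longer hold up the FPU, a fourth or fifth scalar buys
nothing, which is the limit described above.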

>Heterogenous register file
>    Few compilers have been developed that can take advantage of a
>    truly heterogenous register file, one in which, for example, the
>    divide unit writes to registers D1..D2, the add unit writes to
>    registers A1..A16, the shift unit writes to registers S1..S4 -- 
>    even though such hardware conceivably has a cycle time advantage
>    over homogenous registers, even on VLIW machines where data can
>    easily be moved to generic registers when necessary.
>    DIFFICULTY: hard. 

At first glance, the problem seems susceptible to graph coloring.
Perhaps I'm missing something.
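
As a rough sketch of what that might look like: give every value a
list of registers it is allowed to occupy (determined by the unit that
produces it) and color the interference graph under that restriction.
The code below is a simple greedy colorer, not a Chaitin-style
allocator, and the register names and data layout are assumptions for
the example.  It also ignores the copies to generic registers that the
quoted text mentions.

```python
def color(interference, allowed):
    """Greedy coloring with per-value register classes.

    interference: dict mapping each value to the set of values live
                  at the same time (must be symmetric).
    allowed:      dict mapping each value to the list of registers its
                  producing unit can write.
    Returns a dict value -> register, or None where no register in the
    value's class is free (a real allocator would spill or copy here).
    """
    assignment = {}
    # Handle the most constrained values first: fewest legal registers,
    # then most neighbors, then name for a deterministic order.
    order = sorted(
        interference,
        key=lambda v: (len(allowed[v]), -len(interference[v]), v),
    )
    for v in order:
        taken = {assignment[n] for n in interference[v] if n in assignment}
        assignment[v] = next((r for r in allowed[v] if r not in taken), None)
    return assignment
```

For example, a divide result restricted to D1..D2 and two add results
restricted to A1..A2 each receive a register from their own class, and
interfering values never share one.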


>Data cache - software managed consistency
>    This reportedly has been done, but certainly isn't run-of-the-mill.
>    DIFFICULTY: needs a skilled compiler expert.

At Rice (and other places), people are considering the perhaps
related problems of trying to manage cache usage for a single
processor.  I'm personally turned on by the topic because of
the big performance gains possible and the potential impact on
architecture.  Questions like: Can we get away with no D-cache?
Perhaps we don't need cache for FP only?
Can we get away with only direct mapped cache?  What does a compiler
do with set associativity?  How can we do prefetches to cache?
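
One way to see why the direct-mapped question matters to a compiler is
a toy miss counter.  This is a sketch under assumed parameters (64
lines of 4 words, word addressing, a dot-product access pattern); none
of the names or numbers come from the work mentioned here.

```python
def misses(addresses, num_lines=64, line_words=4):
    """Count misses for a word-address trace in a direct-mapped cache."""
    tags = [None] * num_lines           # which memory block each line holds
    count = 0
    for addr in addresses:
        block = addr // line_words
        index = block % num_lines       # direct mapped: one possible line
        if tags[index] != block:
            tags[index] = block
            count += 1
    return count


def dot_trace(base_x, base_y, n):
    """Addresses touched by: for i in range(n): s += x[i] * y[i]"""
    trace = []
    for i in range(n):
        trace.append(base_x + i)
        trace.append(base_y + i)
    return trace


# x and y exactly one cache size apart: they fight over the same lines.
aligned = misses(dot_trace(0, 256, 64))
# Shifting y by one line (a layout choice a compiler could make)
# removes the conflict.
padded = misses(dot_trace(0, 260, 64))
```

With these parameters every access misses in the aligned layout (128
misses), while the padded layout misses only once per line touched (32
misses).  A compiler that controls array placement can pick the second
layout without any associativity in the hardware.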

Porterfield did a thesis here that discusses some of these questions.
Additionally, Callahan and Porterfield (both at Tera) have a paper
in Supercomputing 90 on (perhaps) similar topics.


>Multiple functional units - heterogenous - VLIW or superscalar
>    DIFFICULTY: complex.
>Multiple functional units - homogenous - VLIW or superscalar
>    DIFFICULTY: moderately complex
>    	Easier than the heterogenous case, and the packing algorithms
>    	are considerably easier.

I had never thought to distinguish the two cases and
I'm not sure why the scheduling algorithms should be much different.


>Special hardware instructions - scalar
>    Taking advantage of simple instructions like abs(), conditional 
>    exchange, etc.
>    DIFFICULTY:
>    	(1) When treated not as a compiler problem, but as a problem of simply
>    	    writing libraries to inline optimized machine code, EASY
>    	    Requires inlining support.

For intrinsics, I follow the PL.8 example.
That is, have intermediate language instructions
for ABS etc. so the optimizer can try to hoist them or perhaps strength
reduce them (e.g. SIN).  Then expand to a simple form (perhaps with branches
and so forth), and let the optimizer get at the guts of each operation.
Some, like ABS, might be available as basic instructions and so need not
be expanded to a lower level form.  This seems to require that the
front-end recognize certain calls as intrinsics.  Naturally, this
works fine with Fortran, but compilers for other languages could
easily adopt the same approach.  Some probably already have for C.
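
A minimal sketch of the expansion step, assuming a made-up tuple IR
(the opcodes "abs", "mov", "neg", "bge" and "label" are invented for
the example and are not PL.8's).  ABS stays a single instruction while
the optimizer hoists it; a later pass lowers it to a branchy form only
if the target lacks a native instruction.  The small interpreter is
there just to check that the two forms agree.

```python
def lower_abs(code, has_native_abs):
    """Expand ("abs", dst, src) into compare/branch/negate if needed."""
    out = []
    counter = 0
    for ins in code:
        if ins[0] == "abs" and not has_native_abs:
            _, dst, src = ins
            done = f"L_abs{counter}"
            counter += 1
            out += [
                ("mov", dst, src),
                ("bge", dst, 0, done),  # skip the negate when dst >= 0
                ("neg", dst, dst),
                ("label", done),
            ]
        else:
            out.append(ins)             # leave everything else untouched
    return out


def run(code, env):
    """Tiny interpreter for the toy IR; returns the final variable values."""
    env = dict(env)
    labels = {ins[1]: i for i, ins in enumerate(code) if ins[0] == "label"}
    pc = 0
    while pc < len(code):
        ins = code[pc]
        op = ins[0]
        if op == "mov":
            env[ins[1]] = env[ins[2]]
        elif op == "neg":
            env[ins[1]] = -env[ins[2]]
        elif op == "abs":
            env[ins[1]] = abs(env[ins[2]])
        elif op == "bge" and env[ins[1]] >= ins[2]:
            pc = labels[ins[3]]         # taken branch: jump to the label
            continue
        pc += 1                         # labels and untaken branches fall through
    return env
```

After lowering, the branch and negate are ordinary instructions, so
later passes can work on them like any other code.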

This isn't wonderfully extensible, but people have worked on
variations that might be worth exploring.  In particular,
the Experimental Compiler System (ECS) project at IBM hoped to
achieve the same effect in a more extensible fashion.

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu