Path: utzoo!attcan!uunet!cs.utexas.edu!rice!titan.rice.edu!preston
From: preston@titan.rice.edu (Preston Briggs)
Newsgroups: comp.arch
Subject: Re: Compilers taking advantage of architectural enhancements
Message-ID: <1990Oct11.223224.26604@rice.edu>
Date: 11 Oct 90 22:32:24 GMT
References: <1990Oct9> <3300194@m.cs.uiuc.edu>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 95

In article aglew@crhc.uiuc.edu (Andy Glew) writes:

>Perhaps we can start
>a discussion that will lead to a list of possible hardware
>architectural enhancements that a compiler can/cannot take advantage of?

I'll comment on a couple of features from the list.

>Register file - large (around 128 registers, or more)
>    Most compilers do not get enough benefit from these to justify
>    the extra hardware, or the slowed down register access.

In the proceedings of Sigplan 90, there's a paper about how to chew up
lots of registers.

    Improving Register Allocation for Subscripted Variables
    Callahan, Carr, Kennedy

I suggested the subtitle "How to use all of those FP registers"
but nobody was impressed.

Also, there's a limit to how many registers you need, at least for
scientific Fortran.  It depends on the speed of memory and cache,
the speed of the FPU, and the actual applications.  The idea is that
once the FPU is running at full speed, more registers are wasted.

>Heterogenous register file
>    Few compilers have been developed that can take advantage of a
>    truly heterogenous register file, one in which, for example, the
>    divide unit writes to registers D1..D2, the add unit writes to
>    registers A1..A16, the shift unit writes to registers S1..S4 --
>    even though such hardware conceivably has a cycle time advantage
>    over homogenous registers, even on VLIW machines where data can
>    easily be moved to generic registers when necessary.
>    DIFFICULTY: hard.

At first glance, the problem seems susceptible to coloring.
Perhaps I'm missing something.
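
To make the coloring remark concrete, here is a minimal sketch, not anything from a real compiler.  It assumes each live range carries a list of the registers its defining unit is able to write, and it uses a plain greedy heuristic rather than a full Chaitin-style allocator.  The names color, interference, and allowed, and the register lists DIV, ADD, and SHIFT (shortened from the quoted example), are all made up for illustration.

```python
# Hypothetical sketch: greedy coloring where every live range is restricted
# to the registers its producing functional unit can write.

def color(interference, allowed):
    """interference: dict mapping each live range to the set of its neighbors.
    allowed: dict mapping each live range to the registers it may occupy.
    Returns a dict mapping each live range to a register, or to None when
    no permitted register is free (the range would have to be spilled)."""
    # Handle the most constrained ranges first: fewest permitted registers,
    # then highest degree in the interference graph.
    order = sorted(interference,
                   key=lambda n: (len(allowed[n]), -len(interference[n])))
    assignment = {}
    for node in order:
        # Registers already taken by neighbors that have been colored.
        taken = {assignment[m] for m in interference[node]
                 if assignment.get(m) is not None}
        assignment[node] = next(
            (r for r in allowed[node] if r not in taken), None)
    return assignment

# Assumed register classes, loosely following the quoted example.
DIV = ["D1", "D2"]
ADD = ["A1", "A2", "A3"]
SHIFT = ["S1", "S2"]

# q and r come from the divide unit, t from the adder, u from the shifter.
interference = {
    "q": {"r", "t"},
    "r": {"q", "t"},
    "t": {"q", "r", "u"},
    "u": {"t"},
}
allowed = {"q": DIV, "r": DIV, "t": ADD, "u": SHIFT}

result = color(interference, allowed)
```

With disjoint classes like these, ranges from different units never compete for the same register, so the graph effectively splits into one small coloring problem per unit.  The harder cases are presumably the ones where a value could live in several classes, or where copies to generic registers have to be inserted.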

>Data cache - software managed consistency
>    This reportedly has been done, but certainly isn't run-of-the-mill.
>    DIFFICULTY: needs a skilled compiler expert.

At Rice (and other places), people are considering the perhaps
related problems of trying to manage cache usage for a single
processor.  I'm personally turned on by the topic because of
big performance gains possible and the possible impact on
architecture.  Questions like:

    Can we get away with no D-cache?
    Perhaps we don't need cache for FP only?
    Can we get away with only direct mapped cache?
    What does a compiler do with set associativity?
    How can we do prefetches to cache?

Porterfield did a thesis here that talks some about these questions.
Additionally, Callahan and Porterfield (both at Tera) have a paper
in Supercomputing 90 on (perhaps) similar topics.

>Multiple functional units - heterogenous - VLIW or superscalar
>    DIFFICULTY: complex.
>Multiple functional units - homogenous - VLIW or superscalar
>    DIFFICULTY: moderately complex
>    Easier than the heterogenous case, and the packing algorithms
>    are considerably easier.

I had never thought to distinguish the two cases and I'm
not sure why the scheduling algorithms should be much different.

>Special hardware instructions - scalar
>    Taking advantage of simple instructions like abs(), conditional
>    exchange, etc.
>    DIFFICULTY:
>    (1) When treated not as a compiler problem, but as a problem of simply
>        writing libraries to inline optimized machine code, EASY
>        Requires inlining support.

For intrinsics, I follow the PL.8 example.
That is, have intermediate language instructions
for ABS etc. so the optimizer can try and hoist them or perhaps strength
reduce them (e.g. SIN).  Then expand to a simple form (perhaps with branches
and so forth), and let the optimizer get at the guts of each operation.
Some like ABS might be available as basic instructions and so need not
be expanded to a lower level form.
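
A rough sketch of the expand-late idea, under heavy assumptions: the tuple-based intermediate language, the opcode names (MOV, BGE, NEG, LABEL), the target_has_abs flag, and the toy interpreter run are all invented here for illustration and are not PL.8's actual IL.

```python
# Hypothetical IL: instructions are tuples such as ("ABS", dst, src).

def expand_abs(instr, target_has_abs=False):
    """Lower an ABS intrinsic into compare/branch/negate form, unless the
    (assumed) target provides ABS as a basic instruction."""
    if instr[0] != "ABS" or target_has_abs:
        return [instr]
    _, dst, src = instr
    label = "L_abs_" + dst
    return [
        ("MOV", dst, src),
        ("BGE", dst, 0, label),   # skip the negate when dst >= 0
        ("NEG", dst, dst),
        ("LABEL", label),
    ]

def run(code, env):
    """Toy interpreter for the made-up IL, used only to check the lowering."""
    labels = {ins[1]: i for i, ins in enumerate(code) if ins[0] == "LABEL"}
    pc = 0
    while pc < len(code):
        ins = code[pc]
        if ins[0] == "MOV":
            env[ins[1]] = env[ins[2]]
        elif ins[0] == "NEG":
            env[ins[1]] = -env[ins[2]]
        elif ins[0] == "ABS":
            env[ins[1]] = abs(env[ins[2]])
        elif ins[0] == "BGE" and env[ins[1]] >= ins[2]:
            pc = labels[ins[3]]
            continue
        pc += 1
    return env

lowered = expand_abs(("ABS", "y", "x"))
native = expand_abs(("ABS", "y", "x"), target_has_abs=True)
```

The point of keeping ABS as one IL instruction until this step is that earlier passes can hoist or eliminate it as a unit; after expansion, the ordinary optimizer sees the branch and the negate like any other code.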

This seems to require that the front-end recognize certain calls
as intrinsics.  Naturally, this works fine with Fortran, but compilers for
other languages could easily adopt the same approach.  Probably have for C.

This isn't wonderfully extensible, but people have worked on
variations that might be worth exploring.  In particular,
the Experimental Compiler System (ECS) project at IBM hoped to
achieve the same effect in a more extensible fashion.

--
Preston Briggs                          looking for the great leap forward
preston@titan.rice.edu