Path: utzoo!attcan!uunet!know!zaphod.mps.ohio-state.edu!uwm.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew From: aglew@crhc.uiuc.edu (Andy Glew) Newsgroups: comp.arch Subject: Compilers taking advantage of architectural enhancements Message-ID: Date: 11 Oct 90 19:49:20 GMT References: <162639@<1990Oct9> <3300194@m.cs.uiuc.edu> Sender: news@ux1.cso.uiuc.edu (News) Organization: Center for Reliable and High-Performance Computing University of Illinois at Urbana Champaign Lines: 135 In-Reply-To: gillies@m.cs.uiuc.edu's message of 10 Oct 90 21:51 CDT >[gillies@m.cs.uiuc.edu] >One of the problems in new CPU designs is that designers don't realize >which architecture enhancements are pointless, because we don't and >may never have the optimization technology to take advantage of them. Ah, a potentially interesting and useful topic. Perhaps we can start a discussion that will lead to a list of possible hardware architectural enhancements that a compiler can/cannot take advantage of? Maybe the real compiler guys (preston, David C.) will help us out? By the way, it is useful in such discussions to distinguish between run-of-the-mill optimizations that can pretty much be added to any existing compiler by a fairly good CS grad, and optimizations that require so much background that only the "experts" in the field can reasonably be expected to do this. For example, many of the parallelizing FORTRAN loop optimizations can reasonably only be expected to be done by people from Rice or UIUC CSRD (note that I'm UIUC, but CRHC, not CSRD), so unless you are willing to pay for these guys (or their assorted front companies) you aren't likely to get them onto your machine. Cost of compiler development can be significant. Sometimes a company might put a hardware feature in even though they know a compiler approach would be better, because they can't afford the compiler guys (or the compiler guys already have exclusive contracts with a competitor). Let me list a few things to start off the discussion. I hope and expect to be shot down on a few of them. It's easy to list a few of the hardware enhancements that we already know compilers can take advantage of. Branch Delay Slots - small number One or two branch delay slots can be filled quite easily. DIFFICULTY: a reasonably good CS grad can do it. Branch Delay slots - large number Most groups report difficulty filling a large number of branch delay slots, although my group's compiler guy (UIUC IMPACT, Prof Weh-M Hwu) has been able to get good results for up to 8 delay slots, on slightly different hardware (forward semantic, with profiling feedback). DIFFICULTY: given the slightly different hardware, and exposure to the necessary ideas, a reasonably good CS grad can do it. Register file - moderate sized (up to 32 registers) Most compilers can use these efficiently. DIFFICULTY: a reasonably good CS grad can do it. Avoiding IBM's patents is a slight problem. Register file - large (around 128 registers, or more) Most compilers do not get enough benefit from these to justify the extra hardware, or the slowed down register access. Multiple, hierarchical, registers sets Like Cray's fast S registers and slower T registers. It took a very long time for compilers to take advantage of these multiple register sets. DIFFICULTY: complex. Separate floating point register file Most compilers can take advantage of this. DIFFICULTY: easy Heterogenous register file Few compilers have been developed that can take advantage of a truly heterogenous register file, one in which, for example, the divide unit writes to registers D1..D2, the add unit writes to registers A1..A16, the shift unit writes to registers S1..S4 -- even though such hardware conceivably has a cycle time advantage over homogenous registers, even on VLIW machines where data can easily be moved to generic registers when necessary. DIFFICULTY: hard. Instruction cache Many groups have reported compilers that can take advantage of small instruction caches by rearranging code to minimize I-cache misses. Eg. HP, UIUC IMPACT. DIFFICULTY: moderately complex - a good CS grad can do it, but it requires tools and background that take a while to develop, and are not widely available. One of the complicating factors is that this sort of optimization is most useful when it is done in cooperaton with the linker/loader. Which is something that your average compiler hacker doesn't have background in (although it isn't too hard). Data cache - software managed consistency This reportedly has been done, but certainly isn't run-of-the-mill. DIFFICULTY: needs a skilled compiler expert. Micro-scheduling parallelism (like CONVEX's ASAP) Apparently isn't too hard to develop compiler support for this. Hardware is rather expensive, though. DIFFICULTY: moderately complex Vectorizing Done in lots of places, but high performance vectorizing compilers still seem to be a problem for a lot of companies. DIFFICULTY: complex. The easy things are easy, but milking the last few FLOPs out of your vector unit requires compiler experts. Parallelizing - fine grain, small numbers of processors (like Alliant's concurrent hardware) Done, eg. by Alliant and Cray DIFFICULTY: to be competitive with this sort of hardware, you need real compiler experts on your team. Harder than vectorizing. Parallelizing, fine grain, large numbers of processors. What Tera is trying to do. DIFFICULTY: very complex. Leading edge. Good luck, Tera. Multiple functional units - heterogenous - VLIW or superscalar DIFFICULTY: complex. Complexity of the packing algorithms grows rapidly as you try to go beyond basic block parallelism Multiple functional units - homogenous - VLIW or superscalar DIFFICULTY: moderately complex Easier than the heterogenous case, and the packing algorithms are considerably easier. Special hardware instructions - scalar Taking advantage of simple instructions like abs(), conditional exchange, etc. DIFFICULTY: (1) When treated not as a compiler problem, but as a problem of simply writing libraries to inline optimized machine code, EASY Requires inlining support. (2) When treated as a compiler problem, eg. compiler should recognize IF a < 0 THEN a = -a and automatically convert it to a = ABS(a) MODERATELY COMPLEX. I expect Herman Rubin to comment on this. -- Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]