Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Why The Move To RISC Architectures? ('386 vs. RISC)
Message-ID: <37285@mips.mips.COM>
Date: 23 Mar 90 04:35:02 GMT
References: <28012@cup.portal.com> <1990Mar20.175843.2612@utzoo.uucp> <5303@scolex.sco.COM> <1268@m3.mfci.UUCP> <1990Mar22.184122.7917@ultra.com> <8912@boring.cwi.nl>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 80

In article <8912@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <1990Mar22.184122.7917@ultra.com> shj@ultra.com (Steve Jay) writes:
> > By 1970, however, CDC had a new
> > compiler, FTN, which did rearrange instructions to optimize usage
> > of the multiple functional units.  The technology of both local and
> > global optimization in the FTN compiler was continously improved,
> > and by mid to late 70's, it was difficult to beat the compiler even
> > with hand tuned assembly language.
>And then came the problem.  CDC came with newer versions of their machine,
>and newer versions of their compiler.  The problem was that different
>machines had different requirements with respect to scheduling.  So a
>program fully optimized for a 7600 was not optimal for a 170/750.  There
>were switches in the compiler to tune for the different models, but at...
>This is in general a problem if the compiler has too much to do.
>Newer models of the machine require a different compiler.  And not
>only newer models, but if you have a range of models differing only in
>price and performance, you may have introduced different scheduling
>requirements for the different models.  Although your architecture can
>be such that object code compiled for one model is valid for another
>model, it may be sub-optimal.  And think next about the hassle to
>maintain different versions of the compiler!

This issue, of course, almost certainly applies to every line of
computers that:
	a) Has multiple distinct implementations at the same time.
	b) Evolves over time by anything but clock-rate changes to the
	same implementation.

Product families for which optimal code differs among models include
at least:
	a) IBM S/360 and derivatives.  Even amongst the first round of
	S/360s, optimal code differed.  (Note that IBM compiler folks
	observed that pipeline scheduling was useful on some machines...)
	b) DEC VAXen
	c) Intel 80x86
	d) Motorola 680x0
	e) SPARC (different FPU timings already, for example, and if the
	next generation has multiple different styles of pipelines...)
	f) MIPS Rx000 (R2000s always had 1-cycle writes; R3000s with
	approp. mode bit use 2-cycle write-partial-words; R6000s have
	different FP timings, etc).

Fortunately for the simpler architectures:
	a) Integer instructions are fairly simple, understandable, and
	maybe even the same with regard to timing amongst different
	implementations.
	b) Floating point operations are much more likely to vary, but
	they're probably less likely to be interchangeable, so you do
	what you can.
	c) If you're lucky, the pipeline constraints may be such that you:
		1) Want to work harder for things with deeper pipelines,
		in terms of spreading operations apart to lessen stalls.
		2) Want to work harder for more aggressive machines that
		have more concurrency.

Fortunately, at least in some cases, there are optimizations for the
more aggressive machines that help them, but certainly don't hurt the
less aggressive machines much, if at all.  For instance, if machine (n+1)
has longer-latency loads than (n), trying harder to move references to
the loaded data later (i.e., further from the load) probably won't
hurt (n).
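
To make the load-latency point concrete, here is a toy sketch. Everything in it is an assumption for illustration: a single-issue, in-order machine that stalls until its source registers are ready, a made-up five-instruction sequence, and made-up load latencies of 1 cycle for "machine (n)" and 3 cycles for "machine (n+1)". These are not the timings of any real R2000/R3000/R6000 or of any other machine named above.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str                 # register written
    srcs: Tuple[str, ...]     # registers read
    is_load: bool = False     # loads use load_latency, others alu_latency


def cycles(schedule, load_latency, alu_latency=1):
    """Cycles needed to issue `schedule` on a hypothetical single-issue,
    in-order machine that stalls while any source register is not ready."""
    ready = {}    # register -> first cycle at which its value is usable
    cycle = 0     # earliest cycle at which the next instruction may issue
    for ins in schedule:
        # Stall until every source operand is available.
        issue = max([cycle] + [ready.get(r, 0) for r in ins.srcs])
        latency = load_latency if ins.is_load else alu_latency
        ready[ins.dest] = issue + latency
        cycle = issue + 1
    return cycle


# Each load is used by the very next instruction.
NAIVE = [
    Instr("r1", ("a",), is_load=True),
    Instr("r3", ("r1", "r0")),
    Instr("r2", ("b",), is_load=True),
    Instr("r4", ("r2", "r0")),
    Instr("r5", ("r3", "r4")),
]

# Same work, but the uses of the loaded data are moved later.
SPREAD = [
    Instr("r1", ("a",), is_load=True),
    Instr("r2", ("b",), is_load=True),
    Instr("r3", ("r1", "r0")),
    Instr("r4", ("r2", "r0")),
    Instr("r5", ("r3", "r4")),
]

if __name__ == "__main__":
    for name, lat in (("machine (n)", 1), ("machine (n+1)", 3)):
        print(name, "naive:", cycles(NAIVE, lat), "spread:", cycles(SPREAD, lat))
```

Under these assumptions the spread-out schedule costs machine (n) nothing (5 cycles either way), while on machine (n+1) it drops from 9 cycles to 6. A real scheduler also has to worry about register pressure, which this sketch ignores; that is one reason for the "probably" in the paragraph above.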

At least you don't have to fight with issues like:
	-Model A has a (multi-cycle) serial shifter, and every shift
	position costs a cycle, but B has a barrel shifter, where the
	cost is constant, regardless of shift count, and both have
	multipliers of differing speeds, so the optimal sequences to
	do multiplies by constants are completely different, and the
	cutover from shifts+add/subtract to actual multiply is
	completely different.
	-On Model A, to copy 8 bytes from here to there, use a
	move-character, because it has narrow data paths anyway and
	microcode, but on model B, use load/store, because THOSE are
	hardwired, and go faster than doing move-character, because
	the startup time dominates....

Anyway, CDC was hardly alone in this...it's a fact of life for everybody
that does multiple implementations.
-- 
-john mashey	DISCLAIMER:
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253, or 408-524-7015
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086