Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Why The Move To RISC Architectures?  ('386 vs. RISC)
Message-ID: <37285@mips.mips.COM>
Date: 23 Mar 90 04:35:02 GMT
References: <28012@cup.portal.com> <1990Mar20.175843.2612@utzoo.uucp> <5303@scolex.sco.COM> <1268@m3.mfci.UUCP> <1990Mar22.184122.7917@ultra.com> <8912@boring.cwi.nl>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Inc.
Lines: 80

In article <8912@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <1990Mar22.184122.7917@ultra.com> shj@ultra.com (Steve Jay) writes:
> >                                     By 1970, however, CDC had a new
> > compiler, FTN, which did rearrange instructions to optimize usage
> > of the multiple functional units.  The technology of both local and
> > global optimization in the FTN compiler was continuously improved,
> > and by mid to late 70's, it was difficult to beat the compiler even
> > with hand tuned assembly language.
>And then came the problem.  CDC came with newer versions of their machine,
>and newer versions of their compiler.  The problem was that different
>machines had different requirements with respect to scheduling.  So a
>program fully optimized for a 7600 was not optimal for a 170/750.  There
>were switches in the compiler to tune for the different models, but at...

>This is in general a problem if the compiler has too much to do.
>Newer models of the machine require a different compiler.  And not
>only newer models, but if you have a range of models differing only in
>price and performance, you may have introduced different scheduling
>requirements for the different models.  Although your architecture can
>be such that object code compiled for one model is valid for another
>model, it may be sub-optimal.  And think next about the hassle to
>maintain different versions of the compiler!

This issue, of course, almost certainly arises for every line of
computers that
	a) Has multiple distinct implementations at the same time.
	b) Evolves over time by anything but clock-rate changes to the
	same implementation.

Product families for which optimal code differs among models include
at least:
	a) IBM S/360 and derivatives.  Even amongst the first round of
	S/360s, optimal code differed.  (Note that IBM compiler folks
	observed that pipeline scheduling was useful on some machines...)
	b) DEC VAXen
	c) Intel 80x86
	d) Motorola 680x0
	e) SPARC (different FPU timings already, for example, and if the
	next generation has multiple different styles of pipelines...)
	f) MIPS Rx000  (R2000s always had 1-cycle writes; R3000s with the
	appropriate mode bit set use 2-cycle partial-word writes; R6000s
	have different FP timings, etc).

Fortunately for the simpler architectures:
	a) Integer instructions are fairly simple, understandable, and maybe
	even the same with regard to timing amongst different implementations.
	b) Floating point operations are much more likely to vary, but they're
	probably less likely to be interchangeable, so you do what you can.
	c) If you're lucky, the pipeline constraints may be such that
	you:
		1) Want to work harder for things with deeper pipelines,
		in terms of spreading operations apart to lessen stalls.
		2) Want to work harder for more aggressive machines that
		have more concurrency.
	Fortunately, at least in some cases, there are optimizations
	for the more aggressive machines that help them, but certainly
	don't hurt the less aggressive machines much, if at all.
	For instance, if machine (n+1) has longer-latency loads than (n),
	trying harder to move references to the data later probably won't
	hurt (n).
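
To make the load-latency point concrete, here is a rough sketch of a
toy cost model. Everything in it is an assumption for illustration: a
single-issue, in-order machine, the 2- and 3-cycle load latencies, the
register names, and the cycles() helper itself. None of it describes a
real MIPS part.

```python
# Toy model (hypothetical): single-issue, in-order machine. A load's result
# can be read `load_latency` cycles after the load issues; every other
# instruction produces its result one cycle after it issues.
def cycles(schedule, load_latency):
    """Return total cycles for a list of (dest, sources, is_load) tuples."""
    ready = {}   # register -> first cycle at which its value can be read
    clock = 0    # earliest cycle at which the next instruction may issue
    for dest, sources, is_load in schedule:
        # Stall until every source operand is available.
        issue = max([clock] + [ready.get(s, 0) for s in sources])
        ready[dest] = issue + (load_latency if is_load else 1)
        clock = issue + 1
    return clock

LOAD = ("r1", [], True)                # r1 <- memory
USE = ("r2", ["r1"], False)            # r2 <- r1 + r1 (depends on the load)
OTHER1 = ("r3", ["r4", "r5"], False)   # independent work
OTHER2 = ("r6", ["r7", "r8"], False)   # independent work

naive = [LOAD, USE, OTHER1, OTHER2]    # use immediately follows the load
spread = [LOAD, OTHER1, OTHER2, USE]   # use moved as late as possible

for latency in (2, 3):                 # machine (n) and machine (n+1)
    print(latency, cycles(naive, latency), cycles(spread, latency))
# prints: 2 5 4
#         3 6 4
```

In this toy case the schedule tuned for the longer-latency machine also
runs without stalls on the shorter-latency one, which is the happy
situation described above. A real scheduler has many more constraints,
so treat this only as an illustration of the direction of the effect.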

At least you don't have to fight with issues like:
	-Model A has a (multi-cycle) serial shifter, and every shift position
	costs a cycle, but B has a barrel shifter, where the cost is constant,
	regardless of shift count, and both have multipliers of differing
	speeds, so the optimal sequences to do multiplies by constants
	are completely different, and the cutover from shifts+add/subtract
	to actual multiply is completely different.
	-On Model A, to copy 8 bytes from here to there, use a move-character,
	because it has narrow data paths and microcode anyway, but on
	Model B, use load/store, because THOSE are hardwired, and go faster
	than doing move-character, whose startup time dominates....
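
The multiply-by-constant case can be sketched with a small cost model.
All the numbers and names here are made up for illustration: the
12-cycle and 5-cycle multipliers, the two shifter cost functions, and
the simplification that each set bit of the constant gets its own shift
from the original operand (no subtract tricks, no reuse of partial
shifts). A real code generator would search a much larger space.

```python
# Hypothetical cost model for multiplying by a positive integer constant.
def shift_add_cost(constant, shift_cost):
    """Cycles for a shift+add sequence: one shift per set bit, plus adds.

    shift_cost(n) is the assumed cycle count for a shift by n positions.
    """
    bits = [i for i in range(constant.bit_length()) if (constant >> i) & 1]
    shifts = sum(shift_cost(i) for i in bits if i > 0)  # shift by 0 is free
    adds = len(bits) - 1                                # one add per extra term
    return shifts + adds

def serial_shifter(n):
    return n      # Model A: every shift position costs a cycle

def barrel_shifter(n):
    return 1      # Model B: constant cost regardless of shift count

def best_sequence(constant, shift_cost, multiply_cycles):
    """Pick the cheaper of shift+add and the hardware multiply."""
    cost = shift_add_cost(constant, shift_cost)
    if cost < multiply_cycles:
        return ("shift+add", cost)
    return ("multiply", multiply_cycles)

# Assumed multiplier speeds: 12 cycles on Model A, 5 cycles on Model B.
for constant in (10, 320):
    print(constant,
          best_sequence(constant, serial_shifter, 12),
          best_sequence(constant, barrel_shifter, 5))
# prints: 10 ('shift+add', 5) ('shift+add', 3)
#         320 ('multiply', 12) ('shift+add', 3)
```

Under these assumed timings, multiplying by 320 wants the real multiply
on Model A but shifts and an add on Model B, so the cutover point moves
with the implementation even though the instruction set is the same.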

Anyway, CDC was hardly alone in this...it's a fact of life for everybody
that does multiple implementations.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253, or 408-524-7015
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086