Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!bloom-beacon!apple!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Object Translator (was Re: Register Scoreboarding)
Message-ID: <19162@winchester.mips.COM>
Date: 10 May 89 12:12:59 GMT
References: <GRUNWALD.89May9113443@flute.cs.uiuc.edu> <491@bnr-fos.UUCP>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 82

In article <491@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>In article <GRUNWALD.89May9113443@flute.cs.uiuc.edu> grunwald@flute.cs.uiuc.edu writes:
>>In his picture, when you have a new architecture with, say, more registers,
>>different delay costs or deeper pipelines, you translate your .o files and/or
>>your final binaries. This is essentially what scoreboarding is doing, albeit
>>dynamically. There are some limits to this approach, some obvious, some not

>As far as object translation as related to comp.arch, I think it is at best
>a kludge. Some architectures may well need it since, as you (or Hennessy)
>point out, different delay costs and lack of scoreboarding make life very
>interesting over the long term.
 
>Basically, the object code (for any architecture) is a very poor medium
>for communicating algorithm or program intent. Most emulators have trouble
>just faithfully mimicking the target system! Optimizing translators sound
>very hard to me. :-)

Actually, a lot of object-code translation has been used already:
	1) MOXIE: MIPs On a vaX Instruction Emulator - 
		translated MIPS code -> VAX code to let us get fast execution
		before MIPS chips existed (i.e., faster than a regular
		simulator)
	2) PIXIE - MIPS -> MIPS, adding profiling information and
		address-trace gathering code
	3) Various things at Ardent Computer for debugging and other
		reasons
	4) A PIXIE variant done at Berkeley to convert MIPS code -> SPARC
		code (!)

Needless to say, MS-DOS emulation is important commercially.
Finally, the most important commercial application I know of is
HP's use of various techniques to run HP3000 object code on the HP PA machines.

On the other hand, be careful in interpreting John Hennessy's remarks as a
claimed intent for what MIPSco is doing.
In particular, Motorola & co are persistent in claiming that the world
will fall apart for MIPS if the timings of the floating-point operations
change, despite the fact that it has clearly been stated many times
that we have complete interlocking on ALL of the multi-cycle operations.
Really, the only things that don't have interlocking are loads and
equivalents (i.e., move-from-coprocessor), and they all have a 1-cycle
delay that is predictable to the compilers.  The (Without) in
Microprocessor (Without) Interlocked Pipeline Stages, which may have
been appropriate for the Stanford MIPS, is pretty much irrelevant
when it comes to MIPSco MIPS.
As I've said here before, if we ended up with loads that had another
cycle of latency, we'd build a machine with an interlock on the extra
cycle.  If we decided to put in load interlocks, that would
be upward-compatible, although we'd likely compile 3rd-party executables
with R3000-style forever. (Of course, if we did add load interlocks at
some point, and if there got to be more of those machines around, at some
point maybe we'd start advising people to compile for that, and then do
a reverse-translate on R3000-machines!)
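
A minimal sketch of what such a reverse-translate could look like, written
in Python purely for illustration and not taken from any MIPS tool: code
compiled on the assumption that loads are interlocked is rewritten for a
machine with the R3000's 1-cycle load delay by placing a nop after any load
whose very next instruction reads the loaded register. The Insn tuple, the
LOADS set and fill_load_delays are hypothetical names; branches, branch
delay slots and the address fix-ups needed after inserting instructions are
all ignored here.

```python
from typing import List, NamedTuple, Tuple

class Insn(NamedTuple):
    """Toy instruction record (an assumption, not a real object format)."""
    op: str                 # mnemonic, e.g. "lw", "addu", "nop"
    dest: str               # register written, "" if none
    srcs: Tuple[str, ...]   # registers read

NOP = Insn("nop", "", ())

# Loads and "equivalents" such as move-from-coprocessor (toy subset).
LOADS = {"lw", "lh", "lb", "mfc1"}

def fill_load_delays(code: List[Insn]) -> List[Insn]:
    """Insert a nop after each load whose successor uses the loaded register.

    A load at the very end of the list is left alone; a real translator
    would have to be conservative there because the successor is unknown.
    """
    out: List[Insn] = []
    for i, insn in enumerate(code):
        out.append(insn)
        nxt = code[i + 1] if i + 1 < len(code) else None
        if insn.op in LOADS and nxt is not None and insn.dest in nxt.srcs:
            out.append(NOP)
    return out

# Code as it might be scheduled for a machine WITH load interlocks.
code = [
    Insn("lw",   "t0", ("sp",)),
    Insn("addu", "t1", ("t0", "t2")),   # uses t0 right after the load
    Insn("lw",   "t3", ("sp",)),
    Insn("addu", "t4", ("t1", "t2")),   # independent of t3: fills the slot
    Insn("addu", "t5", ("t3", "t4")),
]
fixed = fill_load_delays(code)
print([i.op for i in fixed])
```

Only the first load needs a nop; the second already has an independent
instruction sitting in its delay slot, so the translated code grows by a
single instruction.
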
If the timings of floating-point operations
are different (and they are) in forthcoming products, the existing object
code works fine.  However, even with completely interlocked and/or
scoreboarded code, you STILL want the compilers to be as aggressive
as possible.  Fortunately, the way most of these things work, if you
try to optimize for the version with the longest latencies, it usually
works pretty well for ones with shorter latencies as well.  To see this,
suppose you had a 5-cycle FP multiply, and so you'd been generating code
that tried to issue 4 more instructions before using the result of the
multiply.  If the multiply expanded to 10 cycles, the compiler folks
would work harder to find more things to do while the multiply
was running, which wouldn't usually hurt the machine with the 5-cycle
multiply.  It's just a question of the number of stall cycles, and
it almost always pays to move the computation of a multi-cycle result
and the use of that result as far apart as possible.
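
The arithmetic behind this can be sketched with a toy model. It assumes a
single-issue, fully interlocked pipeline that issues one instruction per
cycle, so that a result with latency N can be used without stalling once
N-1 independent instructions have been issued after the producer (this
matches the 5-cycle multiply with 4 instructions in between). The function
name stall_cycles and the model itself are illustrative assumptions, not a
description of any particular MIPS implementation.

```python
def stall_cycles(latency: int, gap: int) -> int:
    """Stall cycles seen by the consumer of a multi-cycle result.

    latency: cycles until the producer's result is available
    gap:     independent instructions issued between producer and consumer
    """
    return max(0, latency - 1 - gap)

# Code scheduled for the 5-cycle multiply (gap of 4) versus code scheduled
# for the 10-cycle multiply (gap of 9), run on both machines.
for gap in (4, 9):
    print(f"gap={gap}: 5-cycle machine stalls {stall_cycles(5, gap)}, "
          f"10-cycle machine stalls {stall_cycles(10, gap)}")
```

Under this model the old schedule (gap of 4) still runs correctly on the
10-cycle machine but stalls for 5 cycles there, while the schedule tuned for
the longer latency (gap of 9) stalls on neither machine.
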

This, of course, is not remotely a new issue: any of the long-lived
computer product lines has faced this, especially those that
cover a range of implementation technologies, such as VAXen or S/360s.
The solutions are the same, except that the simplicity of RISC-style
instructions makes it marginally easier to manipulate object code.
Our experience with these methods tends to make us more willing to
consider object code translation as one more trick to use when it makes
sense, and it's really not that weird once you get used to it.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086