Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!bloom-beacon!apple!vsi1!wyse!mips!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: Object Translator (was Re: Register Scoreboarding)
Message-ID: <19162@winchester.mips.COM>
Date: 10 May 89 12:12:59 GMT
References: <491@bnr-fos.UUCP>
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems, Sunnyvale, CA
Lines: 82

In article <491@bnr-fos.UUCP> schow%BNR.CA.bitnet@relay.cs.net (Stanley Chow) writes:
>In article grunwald@flute.cs.uiuc.edu writes:
>>In his picture, when you have a new architecture with, say, more registers,
>>different delay costs or deeper pipelines, you translate your .o files and/or
>>your final binaries. This is essentially what scoreboarding is doing, albeit
>>dynamically. There are some limits to this approach, some obvious, some not
>
>As far as object translation as related to comp.arch, I think it is at best
>a kludge. Some architectures may well need it since as you (or Hennessy)
>point out, different delay costs and lack of scoreboarding make life very
>interesting over the long term.
>
>Basically, the object code (for any architecture) is a very poor medium
>for communicating algorithm or program intent. Most emulators have trouble
>just faithfully mimicking the target system! Optimizing translators sound
>very hard to me. :-)

Actually, a lot of object-code translation has been used already:
1) MOXIE: MIPs On a vaX Instruction Emulator - translated MIPS code -> VAX
   code to let us get fast execution before MIPS chips existed (i.e.,
   faster than a regular simulator)
2) PIXIE - MIPS -> MIPS, adding profiling information and address-trace
   gathering code
3) Various things at Ardent Computer for debugging and other reasons
4) A PIXIE variant done at Berkeley to convert MIPS code -> SPARC code (!)

Needless to say, the MS/DOS emulation is important commercially.
Finally, the most important commercial application I know of is HP's use
of various techniques to run HP3000 object code on the HP PA machines.

On the other hand, be careful in interpreting John's remarks as a claimed
intent for what MIPSco is doing. In particular, Motorola & co are
persistent in claiming that the world will fall apart for MIPS if the
timings of the floating-point operations change, despite the fact that it
has clearly been stated many times that we have complete interlocking on
ALL of the multi-cycle operations. Really, the only things that don't
have interlocking are loads and equivalents (i.e., move-from-coprocessor),
and they all have a 1-cycle delay that is predictable to the compilers.
The (Without) in Microprocessor (Without) Interlocking Pipeline Stages,
which may have been appropriate for the Stanford MIPS, is pretty much
irrelevant when it comes to MIPSco MIPS.

As I've said here before, if we ended up with loads that had another
cycle of latency, we'd build a machine with an interlock on the extra
cycle. If we decided to put in load interlocks, that would be
upward-compatible, although we'd likely compile 3rd-party executables
R3000-style forever. (Of course, if we did add load interlocks at some
point, and if there got to be more of those machines around, at some
point maybe we'd start advising people to compile for that, and then do
a reverse-translate on R3000-machines!)

If the timings of floating-point operations are different (and they are)
in forthcoming products, the existing object code works fine. However,
even with completely interlocked and/or scoreboarded code, you STILL want
the compilers to be as aggressive as possible. Fortunately, the way most
of these things work, if you try to optimize for the version with the
longest latencies, it usually works pretty well for ones with shorter
latencies as well.
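A rough way to check the longest-latency claim is to count stall cycles in
a toy model. The sketch below is only an illustration: the function name
and the pipeline model (single-issue, in-order, fully interlocked, with the
result usable `latency` cycles after the producer issues) are assumptions,
not a description of any MIPS part. It also assumes the compiler can
actually find the independent instructions, which is not always true.

```python
def stall_cycles(latency, independent_instrs):
    """Stall cycles seen by the first use of a multi-cycle result.

    Assumed model (hypothetical, for illustration only): the producer
    issues in cycle 0 and its result is usable in cycle `latency`. The
    independent instructions issue in cycles 1..k, so the consumer would
    like to issue in cycle k + 1. An interlock holds it until the result
    is ready.
    """
    return max(0, latency - (independent_instrs + 1))


if __name__ == "__main__":
    # Schedule once for each assumed latency, then run that same code on
    # machines with either latency and report the stalls.
    for scheduled_for in (5, 10):
        k = scheduled_for - 1  # instructions placed between producer and use
        for actual in (5, 10):
            print(f"scheduled for {scheduled_for}-cycle op, "
                  f"run with {actual}-cycle op: "
                  f"{stall_cycles(actual, k)} stall cycle(s)")
```

In this model, code scheduled for the longer latency never stalls on the
shorter one, while code scheduled for the shorter latency still runs
correctly on the longer one and only pays extra stall cycles.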
To see this, suppose you had a 5-cycle FP multiply, and so you'd been
generating code that tried to issue 4 more instructions before using the
result of the multiply. If the multiply expanded to 10 cycles, the
compiler folks would try to work harder and find more things to do while
the multiply was running, which wouldn't usually hurt the machine with
the 5-cycle multiply. It's just a question of the number of stall cycles,
and it's obvious that it almost always pays to spread the computation of
a multi-cycle result and the use of that result as far apart as possible.

This, of course, is not remotely a new issue: any of the long-lived
computer product lines has faced this, especially those that cover a
range of implementation technologies, such as VAXen or S/360s. The
solutions are the same, except that the simplicity of RISC-style
instructions makes it marginally easier to manipulate object code. Our
experience with these methods tends to make us more willing to consider
object code translation as one more trick to use when it makes sense, and
it's really not that weird once you get used to it.
-- 
-john mashey	DISCLAIMER:
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086