Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ukma!xanth!lll-winken!ubvax!ardent!mrk!mac
From: mac@mrk.ardent.com (Michael McNamara)
Newsgroups: comp.arch
Subject: [HS]W interlocks (was: Fujitsu SPARC Interlocks)
Message-ID: <3007@ardent.UUCP>
Date: 11 Feb 89 22:13:55 GMT
References: <28200269@mcdurb> <28200273@mcdurb>
Sender: news@ardent.UUCP
Reply-To: mac@mrk.ardent.com (Michael McNamara)
Organization: Ardent Computer Corporation, Sunnyvale, CA
Lines: 48

I like a machine with hardware interlocks for compatibility, and a
compiler with instruction scheduling so that code compiled with the
newest compiler on the newest box runs as fast as possible.

It may seem like sacrilege for the compiler not to rely completely on
the expensive hardware interlocks you built, but the compiler can craft
faster code by scheduling instructions so that they rarely or never
experience a hardware interlock.* Every cycle that an instruction is
delayed by a hardware interlock is a cycle in which the machine could
be issuing other operations.

Then when the new box comes out, the old binary will run on the new
machine and generate the same answers as before, but not as quickly as
if the code were recompiled with the new compiler (or the old compiler
with a new machine description table). The old binary will either

  1) experience hardware interlocks, because some operation, memory
     access, or cache read/fill is relatively slower on the new
     machine, or

  2) issue operations later than it could have, because the new
     machine's operation/memory/cache speed ratios differ from the
     old machine's.

If the compiler inserts only the architecturally required nops (empty
branch/load delay slots), then delays due to 2) will be reduced; this
is certainly a reasonable place for the compiler to take advantage of
hardware interlocks. I.e., the compiler should delay the issue of a
data-dependent instruction only by moving other instruction(s) that
are NOT constrained by that data dependence above it.
If there isn't anything else to move before the instruction, DON'T
insert nops; let the hardware interlock scoreboard this operation, so
that a later, faster machine can run the same binary faster.

---------
* I observed first hand, while at Cydrome, the benefits of code
constructed by a compiler that really understood its machine. The
machine/compiler pair got 15 MFLOPS out of a peak 25 on 100x100
Linpack, an efficiency of 60%; it got 5.8 MFLOPS out of a peak 25
MFLOPS on the 24 Livermore Loops, 23% efficiency. Few other machines
come close to these efficiencies.

Of course, it takes a while to write a compiler that so completely
understands a machine, and if you try to build both the compiler and
the hardware as a startup company, you can run out of time. Other
companies have been more successful by taking an academic research
compiler and building a machine around that [Hi Bob C.]

[disclaimer]

Michael McNamara
mac@ardent.com
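
The nop-versus-interlock argument above can be made concrete with a toy
cycle count. What follows is only a sketch under loud assumptions: a
hypothetical in-order, single-issue machine, made-up opcodes and
registers, and made-up latencies (a 3-cycle load on the "old" box, a
1-cycle load on the "new" one). It models no real SPARC, Cydrome, or
Ardent pipeline. It compares three binaries: naive order, scheduled
with no padding, and scheduled plus a nop tuned to the old machine.

```python
# Toy model: in-order, single-issue machine whose scoreboard stalls an
# instruction until its source registers are ready (a hardware interlock).
# Opcodes, registers, and latencies are illustrative assumptions.

def run(program, latency):
    """Return the cycle at which the last result of `program` is available."""
    ready = {}       # register -> cycle at which its value is available
    next_slot = 0    # earliest cycle at which the next instruction may issue
    finish = 0
    for op, dest, srcs in program:
        # Interlock: wait for the issue slot and for every source operand.
        issue = max([next_slot] + [ready.get(r, 0) for r in srcs])
        done = issue + latency[op]
        if dest is not None:
            ready[dest] = done
        finish = max(finish, done)
        next_slot = issue + 1
    return finish

NOP = ("nop", None, ())

# Each load is immediately followed by its consumer.
NAIVE = [
    ("load", "r1", ()),
    ("add",  "r2", ("r1",)),
    ("load", "r3", ()),
    ("add",  "r4", ("r3",)),
]

# The compiler hoists the independent load above the first add; no nops.
SCHEDULED = [
    ("load", "r1", ()),
    ("load", "r3", ()),
    ("add",  "r2", ("r1",)),
    ("add",  "r4", ("r3",)),
]

# Same schedule, plus a nop that covers the old machine's load latency.
PADDED = [
    ("load", "r1", ()),
    ("load", "r3", ()),
    NOP,
    ("add",  "r2", ("r1",)),
    ("add",  "r4", ("r3",)),
]

OLD = {"load": 3, "add": 1, "nop": 1}   # assumed old-machine latencies
NEW = {"load": 1, "add": 1, "nop": 1}   # assumed faster-load new machine

if __name__ == "__main__":
    for name, prog in (("naive", NAIVE), ("scheduled", SCHEDULED),
                       ("padded", PADDED)):
        print(name, "old:", run(prog, OLD), "new:", run(prog, NEW))
```

In this toy model the scheduled and padded binaries tie on the old
machine, so the nop buys nothing there. On the new machine the padded
binary still pays for its nop, while the unpadded one just runs faster
because the interlock no longer fires.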