Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ukma!xanth!lll-winken!ubvax!ardent!mrk!mac
From: mac@mrk.ardent.com (Michael McNamara)
Newsgroups: comp.arch
Subject: [HS]W interlocks (was: Fujitsu SPARC Interlocks)
Message-ID: <3007@ardent.UUCP>
Date: 11 Feb 89 22:13:55 GMT
References: <28200269@mcdurb> <28200273@mcdurb>
Sender: news@ardent.UUCP
Reply-To: mac@mrk.ardent.com (Michael McNamara)
Organization: Ardent Computer Corporation, Sunnyvale, CA
Lines: 48


	I like a machine with hardware interlocks for compatibility,
and a compiler with instruction scheduling so that code compiled with
the newest compiler on the newest box runs as fast as possible.

	It may seem like sacrilege for the compiler not to rely
completely on the expensive hardware interlocks you built, but the
compiler can craft faster code by scheduling instructions so that they
rarely/never experience a hardware interlock.*  Every cycle that an
instruction is delayed by a hardware interlock is a cycle in which the
machine could be issuing other operations.
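
	A toy illustration of the point, not a model of any real
machine: assume a single-issue, in-order pipeline whose interlock
stalls issue until every source register is ready.  The function name,
the register names and the latencies (load = 3 cycles, add = 1) are
all made up for the example.

```python
def run_cycles(program, latency):
    """Cycles needed to issue every instruction in `program`.

    program: list of (dest, op, sources) tuples.
    latency: op -> cycles from issue until the result can be read.
    Hypothetical machine: one issue per cycle, in order, with an
    interlock that holds an instruction until its sources are ready.
    """
    ready = {}   # register -> first cycle its value can be read
    cycle = 0    # next free issue slot
    for dest, op, sources in program:
        # Interlock: wait for the slowest source operand.
        start = max([cycle] + [ready.get(s, 0) for s in sources])
        ready[dest] = start + latency[op]
        cycle = start + 1
    return cycle


latency = {"load": 3, "add": 1}   # assumed latencies

# Each add immediately follows the load it depends on.
naive = [
    ("r1", "load", ["a"]),
    ("r2", "add", ["r1", "r1"]),
    ("r3", "load", ["b"]),
    ("r4", "add", ["r3", "r3"]),
]

# Same work, with the independent load hoisted above the first add.
scheduled = [
    ("r1", "load", ["a"]),
    ("r3", "load", ["b"]),
    ("r2", "add", ["r1", "r1"]),
    ("r4", "add", ["r3", "r3"]),
]

print(run_cycles(naive, latency), run_cycles(scheduled, latency))
```

	Under these assumptions the naive order spends its stall cycles
doing nothing, while the scheduled order fills some of them with the
second load.  Both orders compute the same values; only the number of
interlock stalls differs.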

	Then when the new box comes out, the old binary will run on
the new machine and generate the same answers as before, but not as
quickly as if the code were recompiled with the new compiler (or the
old compiler with a new machine description table). The old binary will
either 1) experience hardware interlocks because some operation, memory
access, or cache read/fill is relatively slower on the new machine, or
2) issue operations later than it could have, because the new machine's
operation/memory/cache speed ratios differ from the old machine's.

	If the compiler inserts only the architecturally required nops
(empty branch/load delay slots) then delays due to 2) will be reduced;
this is certainly a reasonable place for the compiler to take advantage
of hardware interlocks.  I.e., the compiler should delay the issue of
a data dependent instruction only by moving other instruction(s), ones
NOT constrained by that dependence, ahead of it. If there isn't
anything else to move ahead of the instruction, DON'T insert nops; let
the hardware interlock scoreboard this operation, so that a later,
faster machine can run the same binary faster.
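
	A rough sketch of the nop argument, again on a made-up
single-issue, in-order machine with interlocks and no architecturally
required load delay slot.  The two "binaries", the function name and
the load latencies (3 on the old box, 2 or 4 on a new one) are
assumptions for illustration only.

```python
def issue_cycles(binary, load_latency):
    """Cycles to issue `binary` on a hypothetical interlocked machine.

    binary: list of (op, dest, sources); op is "load", "add" or "nop".
    A nop occupies an issue slot and does nothing else.
    """
    ready = {}   # register -> first cycle its value can be read
    cycle = 0    # next free issue slot
    for op, dest, sources in binary:
        # Interlock: stall until every source register is ready.
        start = max([cycle] + [ready.get(s, 0) for s in sources])
        if op == "load":
            ready[dest] = start + load_latency
        elif op != "nop":
            ready[dest] = start + 1
        cycle = start + 1
    return cycle


# Compiled for the old box (load latency 3), nothing useful to hoist.
padded = [("load", "r1", []), ("nop", None, []), ("nop", None, []),
          ("add", "r2", ["r1"])]
# Same code, but the compiler leaves the wait to the interlock.
bare = [("load", "r1", []), ("add", "r2", ["r1"])]

for name, lat in [("old box", 3), ("faster loads", 2), ("slower loads", 4)]:
    print(name, issue_cycles(padded, lat), issue_cycles(bare, lat))
```

	With these numbers the two binaries tie on the old box.  When
loads get faster, the padded binary still pays for its nops while the
bare one speeds up; when loads get relatively slower (case 1 above),
the interlock stalls both and the answers stay the same.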

---------
	* While at Cydrome I observed first hand the benefits of code
constructed by a compiler that really understood its machine.  The
machine/compiler pair got 15 MFLOPS out of a peak 25 on 100x100
Linpack, an efficiency of 60%; it got 5.8 MFLOPS out of peak 25 MFLOPS
on the 24 Livermore Loops, 23% efficiency.  Few other machines come
close to these efficiencies.  
	Of course, it takes a while to write a compiler that so
completely understands a machine, and if you try to build both the
compiler and the hardware as a startup company, you can run out of
time.  Other companies have been more successful by taking an academic
research compiler, and building a machine around that [Hi Bob C.]

[disclaimer]
Michael McNamara 
  mac@ardent.com