Xref: utzoo comp.misc:2100 comp.arch:3926
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!shukra!ram
From: ram%shukra@Sun.COM (Renu Raman, Taco-Bell Microsystems (A Bell Operating Co.))
Newsgroups: comp.misc,comp.arch
Subject: Re: Instruction Scheduling
Message-ID: <45292@sun.uucp>
Date: 14 Mar 88 07:22:28 GMT
References: <12513@sgi.SGI.COM> <12560@sgi.SGI.COM> <12678@sgi.SGI.COM>
Sender: news@sun.uucp
Reply-To: ram@sun.UUCP (Renu Raman, Taco-Bell Microsystems (A Bell Operating Co.))
Organization: Sun Microsystems, Mountain View
Lines: 106
Keywords: optimization   pipeline constraints  code re-organization

In article <12512@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>This isn't really an architectural issue, so probably doesn't
>really belong in comp.arch.  However, it does have interesting

     [This IS an architectural issue: expand the realm of processors that
     you have mentioned to include vector processors and processors with
     multiple functional units.]

>architectural impact.  (For example, a machine similar to the
>Multi-FLow VLIW (tm I think) machine could probably have been
>BUILT years ago, but without something like a trace-scheduler,
>
>Bron Nelson  bron@sgi.com
>Don't blame my employers for my opinions

 Nor mine.

 [Hi Bron] 

     Great!  I have been wanting to write about scoreboarding, and
     here comes an opening.  Ardent's new machine, as well as Motorola's
     (probably announced) and Apollo's pre-announced machines, all have some
     HW dedicated to scoreboarding (in Ardent's case, 1/3 of the vector
     control gate count is dedicated to scoreboarding; I don't have any
     info on Apollo's & Motorola's scoreboarding yet).

     In Ardent's case the scoreboarding is dedicated to the vector
     ops (correct me if I am wrong; the info that I have is very hazy), but
     in the case of Apollo & Moto it is for multiple scalar FUs.

     HW support for dynamic scheduling of code first appeared in the
     CDC 6600 (Thornton) and later in the IBM 360/??, specifically for the
     FP section.  (I think Tomasulo was involved or wrote about it.)

     What is amazing is that the Cray-1, which had multiple functional
     units, issued at most 1 instruction per cycle (the gravest source of
     imbalance).  Weiss & Smith [1] (Astronautics Corp.) showed that by
     tagging the register file on the Cray to do the dependency checks,
     performance could be improved by as much as 43% (bare bones, without
     any compiler optimizations through static re-organization).  [No
     estimates on the likely increase in Si area.]  In any case, Cray
     avoided any sort of scoreboarding!!!
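
     To make the blocking-vs-tagged issue distinction concrete, here is a
     toy timing model.  It is only a sketch under my own assumptions, not
     Weiss & Smith's actual mechanism: the Instr tuple, the register names
     and the latencies are all made up, functional unit conflicts are
     ignored, and the tagged variant assumes unlimited waiting slots.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register, tuple of source
# registers, and result latency in cycles (all values are illustrative).
Instr = namedtuple("Instr", "dest srcs latency")

def blocking_issue(program):
    """In-order issue, one per cycle; an instruction whose operands (or
    destination) are not yet available blocks everything behind it."""
    ready_at = {}          # register -> cycle its value becomes available
    cycle = 0              # earliest cycle the next instruction may issue
    finish = 0
    for ins in program:
        start = cycle
        for reg in ins.srcs + (ins.dest,):
            start = max(start, ready_at.get(reg, 0))
        done = start + ins.latency
        ready_at[ins.dest] = done
        finish = max(finish, done)
        cycle = start + 1
    return finish

def tagged_issue(program):
    """In-order issue, one per cycle, but an instruction with pending
    operands waits at its functional unit instead of blocking issue."""
    ready_at = {}
    finish = 0
    for issue_cycle, ins in enumerate(program):
        start = issue_cycle
        for reg in ins.srcs:
            start = max(start, ready_at.get(reg, 0))
        done = start + ins.latency
        ready_at[ins.dest] = done
        finish = max(finish, done)
    return finish

program = [
    Instr("r1", (), 5),        # load
    Instr("r2", ("r1",), 2),   # add, needs r1
    Instr("r3", (), 5),        # independent load
    Instr("r4", ("r3",), 3),   # multiply, needs r3
]
print(blocking_issue(program), tagged_issue(program))
```

     On this made-up sequence the blocking model finishes in 14 cycles and
     the tagged model in 10, because the second load no longer waits behind
     the stalled add.  The numbers say nothing about the 43% figure; they
     only show where the gain comes from.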

     Some of the pluses (+) and minuses (-) of dynamic scheduling are:

     (1/2)+ Compiler can be unburdened:

          This is not much of a plus, since in general you compile once
     and run the program a zillion times.  One can always turn off
     optimizations to go through the edit-compile-debug cycle faster.

     (1/2)+ Performance is not tied to the quality of the compiler:

          Again, not a real plus.  Maybe true of the PC world, where
     a $40 compiler versus a $200 one makes a difference.

     +/- Handling of dynamic dependencies that a compiler cannot resolve,
     via dynamic scheduling [DS]: valid case.
     For procedural languages (where most of the processing is done today),
     what % of overall data dependencies are dynamic and what % are
     statically determinable?  Does anybody have any numbers or references
     to prior work?  One possibility is that interpreted & OO languages
     could benefit more than compiled languages.  Also, how much value does
     DS add on top of good static scheduling techniques?  [The memory bank
     conflict problem is a good case for DS.]

     ? Handling of Changes in processor configurations: ??? (turning on & off
     functional units?)

     - Context switches & interrupts (embedded controllers?):
     It seems that exception handling and context switches
     (esp. in embedded controller applications,
     where some of the RISCs find their way) are likely candidates for
     PIA (Pain In ....).

     + Multiple Instruction streams & scheduling in multiprocessor systems.
       Again a good candidate but too early to predict if that will
       be a real advantage.

     - Increase in chip area.  True, but by how much?  A more critical
     issue is the time (and Si) spent on bookkeeping the dependencies in
     the instruction queue, and the overheads in handling interrupts &
     context switches.
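
     On the memory bank conflict point in the list above: whether a
     reference stream conflicts depends on addresses (e.g. a stride) that
     are often known only at run time, which is why the compiler cannot
     always schedule around it.  A minimal sketch, assuming a hypothetical
     interleaved memory with 4 banks, each busy for 4 cycles after an
     access (the function name and parameters are mine, not any real
     machine's):

```python
def bank_stalls(addresses, n_banks=4, busy_cycles=4):
    """Count stall cycles when one access is attempted per cycle and each
    bank stays busy for busy_cycles after it is accessed."""
    free_at = [0] * n_banks    # cycle at which each bank is free again
    cycle = 0
    stalls = 0
    for addr in addresses:
        bank = addr % n_banks  # low-order interleaving (an assumption)
        if free_at[bank] > cycle:
            stalls += free_at[bank] - cycle   # wait for the bank
            cycle = free_at[bank]
        free_at[bank] = cycle + busy_cycles
        cycle += 1
    return stalls

unit_stride = bank_stalls(range(0, 8))       # addresses 0, 1, 2, ... 7
bank_stride = bank_stalls(range(0, 32, 4))   # addresses 0, 4, 8, ... 28
print(unit_stride, bank_stride)
```

     With unit stride the eight accesses rotate through the banks and never
     stall; with a stride equal to the number of banks every access hits
     bank 0 and the same eight accesses stall for 21 cycles in this model.
     If the stride is a run-time value, only the hardware sees which case
     it is in.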

     A much later reference is Torng et al. [TOC Sep 86].  [Improvements
     ranging from 80% to 120% are reported from using a dispatch
     stack and associated bookkeeping HW.  Their reference base is not
     clear to me.]

     Is there any justification for dedicating a lot of HW to scoreboarding
     (it is probably too early to say, and many may not even
     talk about it), or can we trust the compiler to do it all?  It seems
     scoreboarding can be done efficiently by the compiler, except in the
     case where there are multiple instruction streams threading through
     different processors (if somebody has built that, then they have
     a data flow machine) or when there are lots of autonomously running
     functional units, making the code re-organization a hair-raising
     task.  Cydrome has addressed some of the tradeoffs with scheduling
     in their Cydra (see "Cydra 5 Directed Dataflow architecture" - Compcon 86).
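
     For the "let the compiler do it" side, static re-organization within
     a basic block can be sketched as a greedy list scheduler.  This is an
     illustration under my own assumptions, not any particular compiler's
     algorithm: one issue per cycle, latencies known at compile time,
     operands read at issue, and a made-up four-instruction block given as
     (dest, srcs, latency) tuples.

```python
def list_schedule(block):
    """Greedy list scheduling of one basic block.
    block: list of (dest, srcs, latency) in program order.
    Returns (order, issue): the new issue order as indices into block,
    and a dict mapping each index to its issue cycle."""
    n = len(block)
    # preds[j] holds (i, delay): j may issue no earlier than issue[i] + delay.
    preds = [[] for _ in range(n)]
    for j, (dest_j, srcs_j, _) in enumerate(block):
        for i in range(j):
            dest_i, srcs_i, lat_i = block[i]
            if dest_i in srcs_j or dest_i == dest_j:
                # true or output dependence: wait until i has finished
                preds[j].append((i, lat_i))
            elif dest_j in srcs_i:
                # anti dependence: j only has to stay behind i
                preds[j].append((i, 1))
    issue = {}
    order = []
    cycle = 0
    while len(order) < n:
        for j in range(n):
            if j in issue:
                continue
            if all(i in issue and issue[i] + d <= cycle for i, d in preds[j]):
                issue[j] = cycle       # first ready instruction in program order
                order.append(j)
                break
        cycle += 1
    return order, issue

block = [
    ("r1", (), 5),        # load
    ("r2", ("r1",), 2),   # add, needs r1
    ("r3", (), 5),        # independent load
    ("r4", ("r3",), 3),   # multiply, needs r3
]
order, issue = list_schedule(block)
print(order, issue)
```

     Here the scheduler hoists the second load above the add (order
     0, 2, 1, 3), so the two load latencies overlap without any
     scoreboard.  It works only because every latency and dependence in
     the block is known statically, which is exactly the condition that
     fails in the cases listed above.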

     "There is a trend to move software problems to HW and code
     scheduling seems to be a candidate"!!! - Weiss & Smith (Will it be?)
---------------------
   Renukanthan Raman				ARPA:ram@sun.com
   Sun Microsystems			UUCP:{ucbvax,seismo,hplabs}!sun!ram
   M/S 18-41, 2500 Garcia Avenue,
   Mt. View,  CA 94043