Xref: utzoo comp.misc:2100 comp.arch:3926
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!shukra!ram
From: ram%shukra@Sun.COM (Renu Raman, Taco-Bell Microsystems (A Bell Operating Co.))
Newsgroups: comp.misc,comp.arch
Subject: Re: Instruction Scheduling
Message-ID: <45292@sun.uucp>
Date: 14 Mar 88 07:22:28 GMT
References: <12513@sgi.SGI.COM> <12560@sgi.SGI.COM> <12678@sgi.SGI.COM>
Sender: news@sun.uucp
Reply-To: ram@sun.UUCP (Renu Raman, Taco-Bell Microsystems (A Bell Operating Co.))
Organization: Sun Microsystems, Mountain View
Lines: 106
Keywords: optimization pipeline constraints code re-organization

In article <12512@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>This isn't really an architectural issue, so probably doesn't
>really belong in comp.arch.  However, it does have interesting

[This IS an architectural issue.  Expanding the realm of processors
that you have mentioned, include vector processors and multiple
functional unit processors.]

>architectural impact.  (For example, a machine similar to the
>Multi-FLow VLIW (tm I think) machine could probably have been
>BUILT years ago, but without something like a trace-scheduler,
>
>Bron Nelson bron@sgi.com
>Don't blame my employers for my opinions neither mine.

[Hi Bron]

Great!  I have been wanting to write about scoreboarding, and here
comes an opportunity.

Ardent's new machine, as well as Motorola's (probably announced) and
Apollo's pre-announced machines, all have some HW dedicated to
scoreboarding.  (In Ardent's case 1/3 of the vector control gate count
is dedicated to scoreboarding.  I don't have any info on Apollo's &
Motorola's scoreboarding yet.)  In Ardent's case the scoreboarding is
dedicated to the vector ops (correct me - the info that I have is very
hazy), but in the case of Apollo & Moto it is for multiple scalar FUs.

HW support for dynamic scheduling of code first appeared in the
CDC 6600 (Thornton) and later in the IBM 360/?? - specifically for the
FP section.
(I think Tomasulo was involved or wrote about it.)  What is amazing is
that in the Cray-1, which had multiple functional units, the issue
scheme was at most 1/cycle (the gravest source of imbalance).
Weiss & Smith [1] (Astronautics Corp.) showed that by tagging the
register file on the Cray to do the dependency checks, the performance
could be improved by as much as 43% (bare bones, without any compiler
optimizations thru static re-organization).  [No estimates on the
likely increase in si area.]  In any case, Cray avoided any sort of
scoreboarding!!!

Some of the + & - in dynamic scheduling are:

(1/2)+  Compiler can be unburdened.  This does not count for much, as
        in general you compile once and run the program a zillion
        times.  One can always turn off optimizations to go thru the
        edit-compile-debug cycle faster.

(1/2)+  Performance is not tied to the quality of the compiler.  Again
        not a real plus.  Maybe true of the PC world, where the
        difference between a $40 and a $200 compiler matters.

+/-     Handling of dynamic dependencies [DS] that a compiler cannot
        resolve.  Valid case.  For procedural languages (where most of
        the processing is done today), what % of overall data
        dependencies are dynamic and what % are statically
        determinable?  Anybody have any numbers or references to prior
        work?  One possibility is that interpreted & OO languages
        could benefit more than compiled languages.  Also, how much
        value does DS add on top of good static scheduling techniques?
        [DS is a good case for the memory bank conflict problem.]

?       Handling of changes in processor configurations: ???
        (turning functional units on & off?)

-       Context switches & interrupts (embedded controllers?).  It
        seems that exception handling and context switches (esp. for
        embedded controller applications, where some of the RISCs find
        their way) are likely candidates for PIA (Pain In ....).

+       Multiple instruction streams & scheduling in multiprocessor
        systems.  Again a good candidate, but too early to predict if
        that will be a real advantage.

-       Increase in chip area.  True, but by how much?
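
To make the static-versus-dynamic question above a little more
concrete, here is a toy Python model.  It is only a sketch under my own
simplifying assumptions: it is not the Cray-1 issue logic, nor the
Weiss & Smith tagging scheme, nor any machine named in this article.
Instructions issue in order, at most one per cycle, and each register
carries a tag recording the cycle at which its pending result becomes
available.  The instruction format, register names and latencies are
all hypothetical.  The same four instructions are run in a naive order
and in a hand-reordered one, to show how much stalling static
re-organization alone can remove on such a blocking-issue machine.

```python
from typing import Dict, List, Tuple

# Hypothetical toy instruction: (destination, source registers, latency in cycles)
Instr = Tuple[str, Tuple[str, ...], int]


def issue_times(program: List[Instr]) -> List[int]:
    """Return the issue cycle of each instruction.

    Assumed model: in-order issue, at most one instruction per cycle,
    and a per-register tag giving the first cycle its value is usable.
    """
    ready_at: Dict[str, int] = {}  # register -> cycle its pending write completes
    cycle = 0
    times = []
    for dest, srcs, latency in program:
        # Stall until all sources (read-after-write) and the destination
        # (write-after-write) have no result still in flight.
        for reg in srcs + (dest,):
            cycle = max(cycle, ready_at.get(reg, 0))
        times.append(cycle)
        ready_at[dest] = cycle + latency  # tag the destination as busy
        cycle += 1                        # next instruction issues a cycle later at best
    return times


# Two independent load/use pairs, written back to back.
naive: List[Instr] = [
    ("r1", ("a",), 4),
    ("r2", ("r1",), 2),
    ("r3", ("b",), 4),
    ("r4", ("r3",), 2),
]

# Same instructions, with both long-latency operations hoisted to the front.
scheduled: List[Instr] = [
    ("r1", ("a",), 4),
    ("r3", ("b",), 4),
    ("r2", ("r1",), 2),
    ("r4", ("r3",), 2),
]

print(issue_times(naive))      # [0, 4, 5, 9]
print(issue_times(scheduled))  # [0, 1, 4, 5]
```

In this toy case the last instruction issues at cycle 9 in the naive
order and at cycle 5 after reordering, with no extra hardware.  What
the model cannot show is the dynamic case raised above: dependencies
(or memory bank conflicts) that are not known until run time.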

A more critical issue is the time (and silicon) spent in book-keeping
the dependencies in the instruction queue, and the overheads in the
handling of interrupts & context switches.

A much later reference is Torng et al. [TOC Sep 86].  [Improvements
ranging from 80-120% are reported using a dispatch stack and
associated book-keeping HW.  Their reference base is not clear to me.]

Is there any justification for dedicating a lot of HW to scoreboarding
(it is probably too early to talk about it, and many may not even talk
about it), or can we trust the compiler to do it all?  It seems
scoreboarding can be done efficiently by the compiler, except where
there are multiple instruction streams threading thru different
processors (if somebody has built that, then they have a data flow
machine) or where there are lots of autonomously running functional
units, making the code re-organization a hair-raising task.  Cydrome
has addressed some of the tradeoffs with scheduling in their Cydra
(see "Cydra 5 Directed Dataflow architecture" - Compcon 86).

"There is a trend to move software problems to HW and code scheduling
seems to be a candidate"!!!  - Weiss & Smith  (Will it be?)

---------------------
Renukanthan Raman                 ARPA:ram@sun.com
Sun Microsystems                  UUCP:{ucbvax,seismo,hplabs}!sun!ram
M/S 18-41, 2500 Garcia Avenue, Mt. View, CA 94043