Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-crg!lll-lcc!unisoft!mtxinu!ed
From: ed@mtxinu.UUCP (Ed Gould)
Newsgroups: net.arch
Subject: Re: Delayed Loads
Message-ID: <86@mtxinu.UUCP>
Date: Thu, 18-Sep-86 20:08:21 EDT
Article-I.D.: mtxinu.86
Posted: Thu Sep 18 20:08:21 1986
Date-Received: Sat, 20-Sep-86 00:15:31 EDT
References: <5100133@ccvaxa> <486@weitek.UUCP> <694@mips.UUCP>
Reply-To: ed@mtxinu.UUCP (Ed Gould)
Organization: mt Xinu, Berkeley, CA
Lines: 48

>e) The CDC 6600 probably falls in e), i.e., the FORTRAN compiler would
>rearrange code to help things go fast, but the hardware could handle
>all of the interlocks itself [I think.  Anybody know different?]

I suspect it's true of the 6600; it's definitely true of the 6400,
which is the low-end machine of the original 6000 series.  (CDC later
came out with the 6200, but it was really a slowed-down 6400.)

Even on this low-end machine, many of the same reorderings worked as
on the 6600, even though the 6400 had no parallelism.  These
optimizations had to do with where within the 60-bit word the
instructions - which were generally either 15 or 30 bits long -
landed.  Other optimizations - ones which took advantage of the
parallelism of the 6600 - were either meaningless on the 6400, or,
sometimes, a cycle slower than the obvious sequence!  The compilers
that did optimizations needed to know which member of the family the
code was for.

One example of this type of optimization that I remember involved
copying two X registers (the machine's accumulators, more or less;
they're 60-bit registers) into two other registers.  The obvious
sequence is

        BX1     X2+X2           bitwise "or" of X2 with X2 into X1
        BX3     X4+X4           likewise for X4 into X3

Redundant operands could be elided, so that the X2+X2 could be
abbreviated by just using X2.
The 6600 had separate functional units - operating in parallel, with
interlocks on using the results - including a "boolean" unit to do the
"B" instructions and a "logical" unit that did shifts - the "L"
instructions.  The optimized sequence put one copy in each unit:

        LX1     X2,B0           copy X2 into X1 using the "logical" unit
        BX3     X4+X4           copy X4 into X3 using the "boolean" unit

The LX1 instruction left-circular-shifts X2 by the number of bits
specified by the value of B0, which is a hard-wired 0, and leaves the
result in X1.  (The other seven B registers were real 18-bit
registers.  They are essentially index registers; addresses are 18
bits.)

The L unit was typically one cycle slower than the B unit, so the
above sequence was optimal on a 6600, where both instructions would
finish at the same time.  On a 6400, however, (if I remember
correctly) the L instructions were also one cycle slower than the B
instructions, and with no parallelism to hide the difference the
optimized sequence would be one cycle slower than the obvious
sequence.

-- 
Ed Gould                    mt Xinu, 2560 Ninth St., Berkeley, CA  94710  USA
{ucbvax,decvax}!mtxinu!ed   +1 415 644 0146

"A man of quality is not threatened by a woman of equality."
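
The timing argument in the post can be sketched as a toy model in Python. Everything numeric here is an assumption made for illustration: the post only says the L unit is one cycle slower than the B unit, so the latencies of 3 and 4 cycles are hypothetical, as are the rules that one instruction issues per cycle and that an instruction waits for its unit to become free. The function names are likewise invented for this sketch; it is not a model of the real 6600 scoreboard.

```python
# Hypothetical latencies in cycles; the post gives only "L is one slower than B".
LATENCY = {"B": 3, "L": 4}

OBVIOUS = ["B", "B"]    # BX1 X2+X2 ; BX3 X4+X4
OPTIMIZED = ["L", "B"]  # LX1 X2,B0 ; BX3 X4+X4


def serial_time(seq, latency=LATENCY):
    """6400-like machine: no overlap, each instruction runs to completion."""
    return sum(latency[unit] for unit in seq)


def parallel_time(seq, latency=LATENCY):
    """6600-like machine (assumed rules): one issue per cycle, and an
    instruction cannot start until its functional unit is free."""
    unit_free = {}   # cycle at which each unit next becomes free
    next_issue = 0   # earliest cycle the next instruction may issue
    finish = 0
    for unit in seq:
        start = max(next_issue, unit_free.get(unit, 0))
        done = start + latency[unit]
        unit_free[unit] = done
        finish = max(finish, done)
        next_issue = start + 1
    return finish


if __name__ == "__main__":
    for name, seq in (("obvious", OBVIOUS), ("optimized", OPTIMIZED)):
        print(name, "serial:", serial_time(seq), "parallel:", parallel_time(seq))
```

Under these assumed numbers the two-unit sequence wins on the parallel machine because the second B instruction no longer waits for the boolean unit, while on the serial machine it loses by exactly the one extra cycle of the L instruction.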