Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.BERKELEY.EDU
Path: utzoo!decvax!bellcore!ulysses!ucbvax!apollo
From: DAVID@MIT-MC.ARPA ("David M. Krowitz")
Newsgroups: mod.computers.apollo
Subject: (none)
Message-ID: <[MC.LCS.MIT.EDU].785921.860116.DAVID>
Date: Thu, 16-Jan-86 18:10:28 EST
Article-I.D.: <[MC.LCS.MIT.EDU].785921.860116.DAVID>
Posted: Thu Jan 16 18:10:28 1986
Date-Received: Sat, 18-Jan-86 11:17:07 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 180
Approved: apollo@yale-comix.arpa

Sorry about my last message ... I hit the wrong key and it went off half finished. Here is the complete message.

--------------------------------------------------------------------------

Our lab here at MIT does some fairly long calculations (e.g. 7 days of CPU time on a DN660), so we went through the exercise of looking at array processors (e.g. Numerix 432, CSPI 6420, FPS 164) and fast standalone computers (e.g. VAX 8600, Convex C-1) early last Spring and Summer. We had a number of constraints on our decision:

1) We needed fast performance on double precision (64 bit) floating point arithmetic. Some of our calculations are sensitive to round-off errors.

2) The machine had to be able to co-exist with our Apollo workstations. The interactive graphics of the Apollos is a feature we are not about to give up. The machine either had to be attached to an Apollo node (i.e. an attached array processor) or had to have ethernet access.

3) The performance advantage had to be at least a factor of 5 better than the DN660, otherwise it wasn't worth the trouble.

4) We only had $150,000 to start with. The minimum machine configuration including software had to be something which could be bought on the budget of a single professor. A machine which offered a growth path to higher performance levels (above 5 X DN660) was desirable.

Scripps Oceanographic Institute in La Jolla, CA has an Apollo ring similar to ours here at MIT and does similar work with it (in fact, the professor I work for and most of our post docs came from Scripps). They have been trying to attach a Numerix 432 array processor to one of their nodes for a while, and have run into the classical limitations of an AP:

1) There is a time delay while your data is converted to the internal floating point format of the AP and is loaded into the AP's memory.

2) The AP company has to develop special hardware and software to link their AP to the Apollo hardware and software. Most AP companies make their machines to hook onto VAX 11/780's, not Apollos.

3) AP's are good at doing vector and matrix arithmetic and poor at doing I/O and scalar code. Our programs have a mix of scalar and vector code. If an AP has an infinitely fast vector unit, but only a 1 MIPS scalar unit, and if your program is 80% vector code (a very high fraction), you will get only a 5 to 1 speed-up at best, because the remaining 20% of the run time is not accelerated at all (assuming that the DN660 is roughly 1 MIPS).

Scripps had problems getting the Numerix machine to work reliably with the Apollo (apparently software interface problems), and it is only a 32 bit floating point unit -- much slower when doing 64 bit arithmetic, so we wrote Numerix off.

We saw a presentation by CSPI on their 6420, which claimed a 5 MFLOP peak rate and which was not a vector machine. In addition, it had its own Fortran compiler which could run any Fortran program which did not include I/O. It still had the limitations of having to send your data to the AP memory, convert it into the AP format, run the program, and then reconvert the data and load it back into the host. Unfortunately, CSPI did not have the resources to build an interface for the Apollo. They could sell us a MicroVAX with the 6420 attached to it and an ethernet interface to the Apollo. This would require us to use 3 operating systems (the Apollo, VAX/VMS, and the AP's system).
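The 5 to 1 figure in AP limitation 3 above is what the standard formula usually called Amdahl's law gives. Here is a minimal sketch of that arithmetic in Python. The function name and the second example value (a vector unit 20 times faster) are made up for illustration, and the sketch assumes, as the text does, that the AP's scalar unit is no faster than the DN660.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speed-up when only the vector part of the run time gets faster.

    vector_fraction: share of the original run time spent in vector code (0 <= f < 1)
    vector_speedup:  how many times faster the vector unit runs that share
    """
    scalar_fraction = 1.0 - vector_fraction
    # The scalar part runs at the old speed; the vector part shrinks.
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


# The case from the text: 80% vector code, infinitely fast vector unit.
best_case = amdahl_speedup(0.80, float("inf"))
print(f"80% vector, infinite vector unit: {best_case:.1f}x")

# A hypothetical finite vector unit (20x) does noticeably worse.
finite_case = amdahl_speedup(0.80, 20.0)
print(f"80% vector, 20x vector unit:      {finite_case:.2f}x")
```

The point of the exercise is that the scalar 20% sets a hard ceiling: no vector unit, however fast, gets this program past a factor of 5.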
The CSPI system did do 64 bit arithmetic and cost only about $120,000, though.

The minimum FPS 164 configuration was more than $250,000 at the time we looked at it, so that was out of the question.

The Alliant Computer Systems Corp. let us get a look at their machines prior to their product announcement and were also more than willing to let us run benchmarks on their development machines (something which CSPI in particular discouraged us from doing). They have two machines: the FX/1, which costs about $130,000 in its minimum configuration with software, and the FX/8, which runs closer to $250,000 in its minimum configuration. Both machines use the same basic hardware modules and are object code compatible. The FX/8 is an upgradable system, whereas the FX/1 has a much smaller cabinet and has no space for an upgrade.

The FX machines have two sets of processors: interactive processors (IP's) and computational elements (CE's), which share a global memory and cache system. The IP's handle multiple users doing I/O, editing jobs, the Unix kernel, compiling and the like. Jobs from the timesharing queue are scheduled for the first available IP (i.e. multiprocessing of independent user jobs and system processes). The IP's are Motorola 68012 based processors and each IP can have its own I/O bus, so you can spread out the disk controllers and terminal controllers onto separate IP's to avoid bottlenecks in I/O. The FX/1 can have one or two IP's to handle interactive time sharing jobs and I/O, and the FX/8 can have from one to twelve of these processors.

The CE's are custom CMOS processors which implement the Motorola 68020 instruction set, on top of which is an IEEE standard floating point instruction set, on top of which is a vector instruction set, on top of which is a 'concurrency control' (parallel processing) instruction set.
The basic CE will run at a Whetstone rate of about 4 million (Alliant advertised 4.3 MIPS, and the Whetstone program they gave us gives that rate; our own program runs a little slower). The vector unit gives results that are 10 to 30 times the speed of a DN660 (for, say, a matrix multiply or matrix add). They advertise a peak vector rate of 11.7 MFLOPS, but that is an instantaneous peak rate on a register to register instruction (if I understand my salesman correctly). They (Alliant's salesmen that I have met) tend to stay away from the peak rates and to concentrate on their benchmark results (see below).

The Alliant Fortran compiler looks at a program's DO loops very carefully. If the loop can be done as a vector, then a vector instruction is generated to perform the loop. If the loop can't be vectorized, but each pass can be done in parallel with other passes of the loop, then the compiler sets up a hardware loop count and a concurrent processing init instruction. When the program (running in scalar mode on a single CE) hits the concurrency instruction, it grabs the value of the loop counter and begins to execute the first pass of the loop. If there is another CE in the machine, it will immediately check the loop counter to see if there is another pass of the loop to be done. If so, the second CE will increment the counter and begin executing the second pass of the loop. If there is a third CE, it will also take another pass of the loop. Up to 8 CE's will fit in the FX/8. The FX/1 will hold only one CE.

Programs are automatically scheduled to run on the CE complex or on one of the IP's as needed. You do not have to do any special subroutine calls to start the process or to load any data (both the IP's and the CE complex run out of the same global memory). The scheduler simply runs a job on the CE complex for as long as the job doesn't do any I/O or any Unix system calls.
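The loop-sharing scheme described above (each CE takes the next pass from a shared loop counter until none are left) can be sketched in a few lines of Python. This is only an illustration of the scheduling idea, not Alliant's implementation: threads stand in for CE's, a lock stands in for whatever the hardware does to update the counter atomically, and all of the names (run_concurrent_loop, ce, body) are made up. Python threads also will not show the real speed-up; the sketch only shows how the passes get handed out.

```python
import threading


def run_concurrent_loop(n_passes, n_ces, body):
    """Share the passes of one loop among n_ces workers via a common counter."""
    next_pass = 0
    lock = threading.Lock()  # stand-in for the hardware's atomic counter update
    taken_by = [[] for _ in range(n_ces)]  # which passes each "CE" executed

    def ce(ce_id):
        nonlocal next_pass
        while True:
            with lock:
                if next_pass >= n_passes:
                    return  # no passes left, this CE drops out of the loop
                i = next_pass
                next_pass += 1  # claim pass i so that no other CE takes it
            taken_by[ce_id].append(i)
            body(i)  # passes must not depend on each other's results

    workers = [threading.Thread(target=ce, args=(k,)) for k in range(n_ces)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return taken_by


# Example: a 16-pass loop whose passes are independent, run on 3 "CE's".
n = 16
result = [0] * n


def body(i):
    result[i] = i * i


taken_by = run_concurrent_loop(n, 3, body)
print(result)
```

Because each worker claims the next unclaimed pass rather than a fixed block of passes, the same code runs unchanged with one worker or with eight. That matches the description above, where a program runs on however many CE's the machine happens to have.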
When a job hits an I/O request or a system call, it is automatically rescheduled for the first available IP (or the IP which has the I/O device attached to it).

Benchmark results ...

The following numbers are published by Alliant in their handouts (all rates in Megaflops):

Livermore Loop              FX/8 (8 CE's)   Cray 1-S   VAX 11/780 with FPU
-------------------------   -------------   --------   -------------------
 1) Hydro Excerpt               30.3          100.0          0.39
 2) Inner Product               23.3           41.7          0.34
 3) Inner Product               25.6           33.3          0.30
 4) Banded Lin. Eq.              8.10          24.3          0.20
 5) Tri-Diag (above diag)        0.96           7.70         0.21
 6) Tri-Diag (below)             0.84           7.70         0.23
 7) Equation of State           36.2          120.0          0.51
 8) PDE Integration              5.16          55.4          0.40
 9) Integral Predictors         30.4           68.0          0.43
10) Difference Pred.             4.33          36.0          0.20
11) First Sum                    0.54           2.90         0.15
12) First Difference            13.2           25.0          0.17
13) Particle Pusher              0.85           4.0          0.09
14) 1-D Particle Pusher          1.28           5.60         0.18

Note that these numbers were published by Alliant (not us!) about 8 months ago. They may have changed. I also don't know if the VAX was a VMS system or a BSD 4.2 system. I have seen our Apollo salesman carrying around FX/1 and FX/8 literature, so you could probably get some from your local office if you press them for it. The Alliant sales people will probably be able to give you a better explanation of their machines, though. They should also have some newer results.

When we were still considering whether or not to buy our machine, they were quite willing to come out to MIT and give a seminar on their techniques and their hardware. Running benchmarks of our own programs was something they had no problem doing. We gave them a tape of VMS compatible Fortran programs and they called up and gave us the results (we actually ran some of the shorter ones ourselves).

Right now what Apollo can offer you is the Alliant machine running BSD 4.2 and its own Fortran compiler, and Apollo's ethernet gateway software (FTP for sending ASCII files and TELNET for remote logins).
We have written some programs which will reformat binary data files (files created by a Fortran WRITE statement not using a FORMAT statement) so that they can be sent between the machines (both Apollo and Alliant use 68020 integer arithmetic and IEEE floating point arithmetic, so only the record format needs to be twiddled). We have heard that the future will provide us with transparent file access and remote logins (i.e. FTP file transfers will be unnecessary; you will access files across the network the same way you do across the Apollo ringnet).

-- David Krowitz
   ( DAVID@MIT-MC.ARPA or KROWITZ@MIT-MARIE@MIT-MC.ARPA )
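For readers wondering what "twiddling the record format" of an unformatted Fortran file might involve, here is a minimal Python sketch. It is not the program mentioned in the message, and both record layouts are assumptions made for illustration: the source layout is taken to be a 4-byte big-endian length before and after each record (a framing used by many Unix Fortran compilers), and the target layout is a purely hypothetical 2-byte length before each record. The actual Apollo and Alliant layouts are not given in the message. The payload bytes are copied untouched, since the integers and IEEE floats are already compatible.

```python
import struct


def split_records(blob, marker=">I"):
    """Split bytes framed as <len><payload><len> into a list of payloads.

    The framing (4-byte big-endian length on both sides) is an assumption.
    """
    size = struct.calcsize(marker)
    records, pos = [], 0
    while pos < len(blob):
        (length,) = struct.unpack_from(marker, blob, pos)
        pos += size
        payload = blob[pos:pos + length]
        if len(payload) != length:
            raise ValueError("truncated record")
        pos += length
        (trailer,) = struct.unpack_from(marker, blob, pos)
        if trailer != length:
            raise ValueError("leading and trailing record lengths disagree")
        pos += size
        records.append(payload)
    return records


def join_records(records, marker=">H"):
    """Re-frame payloads as <len><payload> using a hypothetical 2-byte marker."""
    return b"".join(struct.pack(marker, len(r)) + r for r in records)


# Build a small two-record "file" in the assumed source layout, then convert it.
payloads = [struct.pack(">3d", 1.0, 2.5, -4.0), struct.pack(">2i", 7, 42)]
source = b"".join(
    struct.pack(">I", len(p)) + p + struct.pack(">I", len(p)) for p in payloads
)
records = split_records(source)
target = join_records(records)
print(len(source), "bytes in,", len(target), "bytes out")
```

Only the length markers change between source and target; the data inside each record is byte-for-byte identical, which is why no floating point conversion step is needed.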