Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site ucbvax.BERKELEY.EDU
Path: utzoo!decvax!bellcore!ulysses!ucbvax!apollo
From: DAVID@MIT-MC.ARPA ("David M. Krowitz")
Newsgroups: mod.computers.apollo
Subject: (none)
Message-ID: <[MC.LCS.MIT.EDU].785921.860116.DAVID>
Date: Thu, 16-Jan-86 18:10:28 EST
Article-I.D.: <[MC.LCS.MIT.EDU].785921.860116.DAVID>
Posted: Thu Jan 16 18:10:28 1986
Date-Received: Sat, 18-Jan-86 11:17:07 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 180
Approved: apollo@yale-comix.arpa

Sorry about my last message ... I hit the wrong key and it
went off half finished. Here is the complete message.
--------------------------------------------------------------------------


Our lab here at MIT does some fairly long (eg. 7 days of CPU time
on a DN660) calculations, so we went through the exercise of looking
at array processors (eg. Numerix 432, CSPI 6420, FPS 164) and fast
standalone computers (eg. VAX 8600, Convex C-1) early last Spring
and Summer. We had a number of constraints on our decision:

1) we needed fast performance on double precision (64 bit) floating
   point arithmetic. Some of our calculations are sensitive to
   round off errors. 
2) the machine had to be able to co-exist with our Apollo workstations.
   The interactive graphics of the Apollos is a feature we are not about
   to give up. The machine either had to be attached to an Apollo node
   (ie. an attached array processor) or had to have ethernet access.
3) the performance advantage had to be at least a factor of 5 better
   than the DN660, otherwise it wasn't worth the trouble.
4) we only had $150,000 to start with. The minimum machine configuration
   including software had to be something which could be bought on the
   budget of a single professor. A machine which offered a
   growth path to higher performance levels (above 5 X DN660) was
   desirable.

Scripps Institution of Oceanography in La Jolla, CA has an Apollo ring
similar to ours here at MIT and does similar work with it (in fact,
the professor I work for and most of our post docs came from Scripps).
They have been trying to attach a Numerix 432 array processor to one of
their nodes for a while, with the classical limitations of an AP:

1) there is a time delay while your data is converted to
   the internal floating point format of the AP and is loaded into
   the AP's memory.
2) the AP company has to develop special hardware and software to link
   their AP to the Apollo hardware and software. Most AP companies make
   their machines to hook onto VAX 11/780's, not Apollos.
3) AP's are good at doing vector and matrix arithmetic and poor at doing
   I/O and scalar code. Our programs have a mix of scalar and vector code.
   If an AP has an infinitely fast vector unit, but only a 1 MIPS
   scalar unit, and if your program is 80% vector code (a very high
   fraction), the 20% that is scalar code still runs at its old speed,
   so you will get only a 5 to 1 speedup at best (assuming that the
   DN660 is roughly 1 MIPS).
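
The 5 to 1 ceiling in point 3 is an instance of Amdahl's law. A minimal
Python sketch of that arithmetic follows. The function name and the 10x
figure in the second call are illustrative assumptions, not numbers from
any vendor; the 80% vector fraction and the equal-speed (1 MIPS) scalar
units are the figures used above.

```python
import math

def overall_speedup(vector_fraction, vector_speedup=math.inf):
    """Amdahl's-law speedup relative to running everything on the host.

    vector_fraction: share of the original run time spent in vector code.
    vector_speedup:  how much faster the AP runs that share
                     (math.inf models the "infinitely fast" vector unit).
    Assumes the AP's scalar unit is exactly as fast as the host.
    """
    scalar_time = 1.0 - vector_fraction             # not helped by the AP
    vector_time = vector_fraction / vector_speedup  # 0.0 when speedup is inf
    total = scalar_time + vector_time
    return math.inf if total == 0.0 else 1.0 / total

print(round(overall_speedup(0.80), 6))        # infinitely fast vector unit
print(round(overall_speedup(0.80, 10.0), 6))  # hypothetical 10x vector unit
```

The first call reproduces the 5 to 1 limit; any finite vector speedup
gives less than that.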

Scripps had problems getting the Numerix machine to work reliably with
the Apollo (apparently software interface problems), and it is only a
32 bit floating point unit -- much slower when doing 64 bit arithmetic,
so we wrote Numerix off.

We saw a presentation by CSPI on their 6420 which claimed a 5 MFLOP peak
rate, and which was not a vector machine. In addition, it had its own
Fortran compiler which could handle any Fortran program which did not
include I/O. It still had the limitations of having to send your data
to the AP memory, convert it into the AP format, run the program, and
then reconvert the data and load it back into the host.
Unfortunately, CSPI did not have the resources to build an interface
for the Apollo. They could sell us a MicroVAX with the 6420 attached
to it and an ethernet interface to the Apollo. This would require
us to use 3 operating systems (the Apollo's, VAX/VMS, and the AP's
system). The 6420 did do 64 bit arithmetic, though, and was only about
$120,000.

The minimum FPS 164 configuration was more than $250,000 at the
time we looked at it, so that was out of the question.

The Alliant Computer Systems Corp. let us get a look at their
machines prior to their product announcement and also were more
than willing to let us run benchmarks on their development machines
(something which CSPI in particular discouraged us from doing). They
have two machines: the FX/1, which costs about $130,000 in its
minimum configuration with software, and the FX/8, which runs
closer to $250,000 in its minimum configuration. Both machines use the
same basic hardware modules and are object code compatible.
The FX/8 is an upgradable system, whereas the FX/1 has a much smaller
cabinet and has no space for an upgrade.

The FX machines have two sets of processors: interactive processors (IP's)
and computational elements (CE's), which share a global memory and cache
system. The IP's handle multiple users doing I/O, editing jobs, the Unix
kernel,
compiling and the like. Jobs from the timesharing queue are scheduled for
the first available IP (ie. multiprocessing of independent user jobs and
system processes). The IP's are Motorola 68012 based processors and each IP
can have its own I/O bus, so you can spread out the disk controllers and
terminal controllers onto separate IP's to avoid bottlenecks in I/O.
The FX/1 can have one or two IP's to handle interactive time sharing
jobs and I/O and the FX/8 can have from one to twelve of these
processors.

The CE's are custom CMOS processors which implement the Motorola 68020
instruction set, on top of which is an IEEE standard floating point
instruction set, on top of which is a vector instruction set, on top of
which is a 'concurrency control' (parallel processing) instruction set.
The basic CE will run at a Whetstone rate of about 4 million (Alliant
advertised 4.3 MIPS, and the Whetstone program they gave us gives that
rate; our own program runs a little slower). The vector unit gives results
that are 10 to 30 times the speed of a DN660 (for, say, a matrix multiply
or matrix add). They advertise a peak vector rate of 11.7 MFLOPS, but that
is an instantaneous peak rate on a register to register instruction (if I
understand my salesman correctly). They (the Alliant salesmen that I have
met) tend to stay away from the peak rates and to concentrate on their
benchmark results
(see below).

The Alliant Fortran compiler looks at a program's DO loops very carefully.
If the loop can be done as a vector, then a vector instruction is generated
to perform the loop. If the loop can't be vectorized, but it can be done in
parallel with other passes of the loop, then the compiler sets up a
hardware loop count and a concurrent processing init instruction. When the
program (running in scalar mode on a single CE) hits the concurrency
instruction it grabs the value of the loop
single CE) hits the concurrency instruction it grabs the value of the loop
counter and begins to execute the first pass of the loop. If there is
another CE in the machine, it will immediately check the loop counter
to see if there is another pass of the loop to be done. If so, the second
CE will increment the counter and begin executing the second pass of the
loop. If there is a third CE, it will also take another pass of the loop.
Up to 8 CE's will fit in the FX/8. The FX/1 will hold only one CE.
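
The loop-counter scheme described above can be pictured with a small
Python sketch. This is only an illustration of the scheduling idea, not
Alliant's hardware or compiler output: the function name, the lock that
stands in for the hardware loop counter, and the squaring loop body are
all assumptions, and Python threads will not show a real speedup.

```python
import threading

def run_concurrent_loop(n_passes, n_ces, body):
    """Run body(i) for i in 0..n_passes-1, with each simulated CE
    claiming the next unclaimed pass from a shared counter."""
    counter = 0
    lock = threading.Lock()          # stand-in for the hardware loop counter
    claimed = [[] for _ in range(n_ces)]

    def ce(ce_id):
        nonlocal counter
        while True:
            with lock:               # check and increment as one step
                if counter >= n_passes:
                    return           # no passes left for this CE
                i = counter
                counter += 1
            body(i)                  # passes must be independent of each other
            claimed[ce_id].append(i)

    workers = [threading.Thread(target=ce, args=(k,)) for k in range(n_ces)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return claimed

# A 16-pass loop shared among 8 simulated CE's (the FX/8 maximum).
squares = [0] * 16

def body(i):
    squares[i] = i * i

claimed = run_concurrent_loop(16, 8, body)
print(squares)
```

Which CE ends up with which pass varies from run to run, but every pass
is executed exactly once, which is why the compiler can only use this
for loops whose passes do not depend on one another.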

Programs are automatically scheduled to run on the CE complex or on one of
the IP's as needed. You do not have to do any special subroutine calls
to start the process or to load any data (both the IP's and the CE complex
run out of the same global memory). The scheduler simply runs
a job on the CE complex for as long as the job doesn't do any I/O or any
Unix system calls. When a job hits an I/O request or a system call it is
automatically rescheduled for the first available IP (or the IP which has
the I/O device attached to it).

Benchmark results ...

The following numbers are published by Alliant in their handouts:
(Megaflops)

Livermore Loop		FX/8 (8CE's)	Cray 1-S	Vax11/780 with FPU
---------------		------------	--------	------------------

1)Hydro Excerpt		30.3		100.0		0.39
2)Inner Product		23.3		41.7		0.34
3)Inner Product		25.6		33.3		0.30
4)Banded Lin. Eq.	8.10		24.3		0.20
5)Tri-Diag(above diag)	0.96		7.70		0.21
6)Tri-Diag(below)	0.84		7.70		0.23
7)Equation of State	36.2		120.0		0.51
8)PDE Integration	5.16		55.4		0.40
9)Integral Predictors	30.4		68.0		0.43
10)Difference Pred.	4.33		36.0		0.20
11)First Sum		0.54		2.90		0.15
12)First Difference	13.2		25.0		0.17
13)Particle Pusher	0.85		4.0		0.09
14)1-D Particle Pusher	1.28		5.60		0.18

Note that these numbers were published by Alliant (not us!) about
8 months ago. They may have changed. I also don't know if the
VAX was a VMS system or a BSD 4.2 system.

I have seen our Apollo salesman carrying around FX/1 and FX/8
literature, so you could probably get some from your local office
if you press them for it. The Alliant sales people will probably
be able to give you a better explanation of their machines, though.
They should have some newer results, too.  When we were still
considering whether or not to buy our machine they were quite willing
to come out to MIT and give a seminar on their techniques and their
hardware. Running benchmarks of our own programs was something
they had no problem doing. We gave them a tape of VMS compatible
Fortran programs and they called up and gave us the results (we
actually ran some of the shorter ones ourselves).

Right now what Apollo can offer you is the Alliant machine running
BSD 4.2 and its own Fortran compiler, and Apollo's ethernet gateway
software (FTP for sending ASCII files and TELNET for remote logins).
We have written some programs which will reformat binary data files
(files created by a Fortran WRITE statement not using a FORMAT statement)
so that they can be sent between the machines (both Apollo and
Alliant use 68020 integer arithmetic and IEEE floating point 
arithmetic so only the record format needs to be twiddled). We have
heard that the future will provide us with transparent
file access and remote logins (ie. FTP file transfers will be 
unnecessary, you will access files across the network the same
way you do across the Apollo ringnet).
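
To give an idea of what "twiddling the record format" involves, here is
a minimal Python sketch. Neither machine's actual record layout is given
in this message, so both framings below are assumptions: the input is
taken to be the length/payload/length framing (4-byte big-endian counts)
used by many Unix Fortran compilers, and the output framing (length
prefix only) is purely hypothetical. The function names are made up for
the example. The payload bytes are copied untouched, since both machines
share the same integer and IEEE floating point formats.

```python
import struct

def read_records(data):
    """Split bytes framed as <len><payload><len> into a list of payloads.
    ASSUMED input framing: 4-byte big-endian length before and after."""
    records = []
    pos = 0
    while pos < len(data):
        (n,) = struct.unpack_from(">I", data, pos)
        payload = data[pos + 4 : pos + 4 + n]
        (trailer,) = struct.unpack_from(">I", data, pos + 4 + n)
        if trailer != n:
            raise ValueError("corrupt record at offset %d" % pos)
        records.append(payload)
        pos += 8 + n                 # two length words plus the payload
    return records

def write_records(records):
    """HYPOTHETICAL output framing: 4-byte big-endian length, then payload."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)

# Two records: one REAL*8 value and one INTEGER*4 value.
rec1 = struct.pack(">d", 1.5)
rec2 = struct.pack(">i", -7)
source = (struct.pack(">I", 8) + rec1 + struct.pack(">I", 8) +
          struct.pack(">I", 4) + rec2 + struct.pack(">I", 4))

converted = write_records(read_records(source))
print(len(source), len(converted))
```

Only the framing words change; the data inside each record never has to
be converted, which is what keeps such a program short.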

					-- David Krowitz
					 ( DAVID@MIT-MC.ARPA or
					   KROWITZ@MIT-MARIE@MIT-MC.ARPA )