Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!linus!decvax!cca!ima!pbear!peterb
From: peterb@pbear.UUCP
Newsgroups: net.arch
Subject: Re: Cube designs
Message-ID: <48@pbear.UUCP>
Date: Sun, 17-Feb-85 02:35:30 EST
Article-I.D.: pbear.48
Posted: Sun Feb 17 02:35:30 1985
Date-Received: Wed, 20-Feb-85 07:31:07 EST
Lines: 72
Nf-ID: #N:pbear:22800001:000:4098
Nf-From: pbear!peterb    Feb 16 01:44:00 1985


	I have not looked deeply into the design of the 'Cube', as people
have called it (i.e. taking n**3 CPU/memory pairs and hooking them up in
parallel to accomplish a given task), so bear with me.

	First of all, rating the speed of any highly parallel system is
difficult, to say the least. You have to take your benchmarks in stride. If
I have a matrix problem that can be decomposed into two semi-independent
processes, then a VAX-11/782 (a dual-processor 780) would execute that
program about twice as fast as a VAX-11/780. But on the other hand, if the
program to be benchmarked is highly sequential in nature (e.g. nth-order
numerical analysis of differential equations), then the 782 and the 780 are
going to run at about the same speed. This applies to any parallel
architecture.
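
	The 782-versus-780 comparison can be put in numbers with Amdahl's
law, which the post does not name. This is only a sketch: the function name
`speedup` and the sample parallel fractions are assumptions chosen for
illustration, not measurements of any real machine.

```python
def speedup(parallel_fraction, processors):
    """Amdahl's-law speedup over a single processor.

    parallel_fraction: share of the run time that can be split across
                       processors (0.0 = purely sequential, 1.0 = fully
                       parallel). Illustrative input, not measured data.
    processors:        number of CPUs working on the parallel share.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Two CPUs, as in a dual-processor machine compared with a uniprocessor.
    for fraction in (1.0, 0.9, 0.5, 0.0):
        print(f"parallel fraction {fraction:.1f}: "
              f"speedup on 2 CPUs = {speedup(fraction, 2):.2f}")
```

A fully decomposable problem gives a speedup of 2 on two CPUs, and a purely
sequential one gives 1, which is the "about the same speed" case.
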
	So a new standard of speed measurement is required. I think
that something along the lines of Data Flow Operations/Second (DFO's
-or- Doofoh's) would fit the bill to benchmark these types of machines.
Then if you take the Cube and the Cray and put them both on the same scale,
one that reflects the architectures, you can compare speeds; otherwise you
are comparing apples to oranges.

	Second, any type of parallel machine relies heavily upon the
distribution of data from one machine to another. This figures into the
overall speed of the machine, since it is the exchange of data between
computing devices that drives any type of parallel architecture. The
interconnect can be low speed, such as an ethernet (serial), high speed,
such as a backplane (parallel), or even a combination of the two (e.g. eight
processing units on a backplane with a serial line to connect to other
backplanes). This dependence was demonstrated by Cm*, built at CMU. Their
data showed a severe OVERALL degradation in system data transfer as the
amount of non-local I/O increased. This is obvious to almost everybody. Cm*
was limited only by the speed of its backplane.
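
	A back-of-the-envelope model shows why non-local traffic hurts so
much. This is a sketch under stated assumptions: the weighted-average cost
model, the function names, and the 10-to-1 remote/local cost ratio are all
hypothetical and are not taken from the Cm* measurements.

```python
def mean_access_time(nonlocal_fraction, t_local, t_remote):
    """Average cost of one transfer when a given share of them is non-local."""
    return (1.0 - nonlocal_fraction) * t_local + nonlocal_fraction * t_remote


def relative_throughput(nonlocal_fraction, t_local=1.0, t_remote=10.0):
    """Throughput relative to an all-local workload.

    The default t_remote/t_local ratio of 10 is an assumed figure for
    illustration only.
    """
    return t_local / mean_access_time(nonlocal_fraction, t_local, t_remote)


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.25, 0.5, 1.0):
        print(f"{fraction:4.0%} non-local -> "
              f"{relative_throughput(fraction):.2f} of all-local throughput")
```

Under these assumed costs, even a modest share of non-local transfers cuts
overall throughput sharply, because the slow path dominates the average.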

	Third, some type of control facility has to run the entire mess.
This can be slower than the other elements, since it does not require the
massive data throughput of a processing element, but it still must have a
clean, quick architecture that lends itself to controlling "devices".
Some form of overgrown/homebrew bit-slice seems optimal in this situation,
since its instructions have to be general enough for scheduling algorithms,
processing I/O, fielding interrupts, etc., but quick and clean enough to
service the resource requests of each data element. Whether this control
facility is distributed or singular is up for debate these days.  Different
groups have different ideas regarding this.

	The idea of a cube is nice, and I think that it is about the fastest
architecture around for what it is designed for, but in no way will it
compete with a Cray at sequential MIPS. To compete in parallel MIPS the cube
would have to be large, but the size would be manageable.

	In order to increase/control data throughput, I think that a bus
architecture that combines the best of serial and parallel is in order.  I
think that each processing unit should be hooked to three busses, one for the
X direction, one for the Y direction, and one for the Z direction. This would
require 3n**2 busses (n = size of cube on one side), each with n elements on
it (e.g. 8 data elements, a cube 2 on each side, require 4 busses in the
X direction, 4 busses in the Y direction and 4 busses in the Z direction,
giving a total of 12 data busses). There would be a total of 3n**3 bus
connections within the cube (24 in the 8-element case). But the advantage of
this is that data has to pass through at most two data elements on its way
from source to destination, one for each change of direction. Other paths can
be created to get the data from node to node, especially if each bus
connection had a fifo on it to queue up transfers. Also, a data element can
pass the information along from one bus to another with very little
overhead.
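
	The counts for this bus layout can be checked with a short sketch.
It assumes one bus per straight line of n elements in each direction and a
fixed X-then-Y-then-Z routing order; the routing order and all function names
are assumptions made for illustration, not part of the proposal.

```python
from itertools import product


def bus_count(n):
    """One bus per line of n elements, in each of the three directions."""
    return 3 * n * n


def connection_count(n):
    """Each of the n**3 elements taps three busses (X, Y and Z)."""
    return 3 * n ** 3


def route(src, dst):
    """Elements visited going from src to dst, correcting X, then Y, then Z.

    Each corrected coordinate is one transfer over one bus. The fixed
    X-Y-Z order is an assumption; other paths exist.
    """
    path = [src]
    current = list(src)
    for axis in range(3):
        if current[axis] != dst[axis]:
            current[axis] = dst[axis]
            path.append(tuple(current))
    return path


def max_intermediates(n):
    """Worst-case number of elements a transfer passes through in between."""
    nodes = list(product(range(n), repeat=3))
    return max(len(route(a, b)) - 2 for a in nodes for b in nodes if a != b)


if __name__ == "__main__":
    n = 2
    print(bus_count(n), "busses,", connection_count(n), "connections")
    print("route:", route((0, 0, 0), (1, 1, 1)))
    print("worst case intermediates:", max_intermediates(n))
```

For the 8-element cube this gives 12 busses and 24 connections. Two elements
that differ in all three coordinates need three bus transfers, so the data
passes through two elements on the way under this routing.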

	I know this is rough, but if the net kicks the idea around, we may
all one day (as a collective group) file for a patent (but I doubt it...)


						Peter Barada
						ima!pbear!peterb


PS      "it's a long day, and it ain't going any faster..."