Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Posting-Version: version B 2.10.2 9/17/84; site think.ARPA Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!ihnp4!think!bradley From: bradley@think.ARPA (Bradley C. Kuszmaul) Newsgroups: net.arch,net.math Subject: Re: topology of highly connected computer systems Message-ID: <1483@think.ARPA> Date: Tue, 30-Apr-85 10:01:13 EDT Article-I.D.: think.1483 Posted: Tue Apr 30 10:01:13 1985 Date-Received: Wed, 1-May-85 03:45:10 EDT References: <2132@sun.uucp> <1447@think.ARPA> <551@lll-crg.ARPA> Reply-To: bradley@think.UUCP (Bradley C. Kuszmaul) Distribution: net Organization: Thinking Machines, Cambridge, MA Lines: 205 Xref: watmath net.arch:1132 net.math:1966 to: bradley, brooks@lll-crg.ARPA Summary: I am going to disagree a lot with brooks@lll-crg.ARPA, but don't take it personally. It sounds like Mr. Brooks has given the topic a lot of thought, and most of our differences seem to come from different assumptions. Once we start with different assumptions, its hard to get the same conclusions. Some of my assumptions are: - Lots and lots and lots of small processors are better than fewer big processors. - Any particular connection topology will be not quite right for most applications, so we might be willing to sacrifice the best case behavior for more generality or better average case behavior (for example nearest neighbor communication in a 2 dimensional grid is very good, but not very general. A cube is well connected, but may use too many wires to achieve its bandwidth) In article <551@lll-crg.ARPA> brooks@lll-crg.ARPA (Eugene D. Brooks III) writes: >> out, but for something like the cosmic cube (64 processors?) we are only >> talking about 192 wires, (probably expensive coaxial wires), which is >> not too much of a pain to actually construct in 3-space. > >The 64 processor was wired in a single backplane with wire wrap wire and >not coaxial cable. The wiring was not even twisted pairs. Of course it >probably would have been a lot less flakey with twisted pairs and appropo >drivers. Sorry about that. I was under the impression from the discussion in this group that the cosmic cube's connections are implemented with ethernet controllers. My main point was that a 6-cube is easy to wire up, and it only gets easier by using smaller wires. >> I am much more interested in what happens as $n$ becomes much larger >> (e.g. $n=10^6$).> >A million processors not not a problem at all (Well perhaps I am stretching it >a bit. I don't actually advocate a million processors. I thing more along the >lines of a few hundred 20 MIP/MFLOP processors as being interesting.). I beg to differ. I think that a million processors is not nearly enough. I believe that our differences probably stem from different approaches to mapping applications onto the parallel processors. If I were doing something like electrical simulation of a VLSI circuit, I would like to have each processor simulate a single transistor. If I am doing finite element analysis, each processor might simulate one of the elements. While it is true that by speeding up the processors, and using fewer of them, we should be able to get the same performance on some applications, I tend to think that it is possible to get more MIPS per dollar by using smaller, cheaper processing elements. >> The wiring problem becomes very hard again. The >> Connection Machine with 64K processors in the prototype faces an >> engineering challenge on this front. >How to >construct a very large N cube can easily be worked out on the back of an >envelope. You accomplish the feat by building and wiring the machine in 3 >dimensions. 32k processors is a piece of cake (its only 32 processors on a >side) and one million processors is within the realm of possibility. I don't see how wiring a 3d cube with 32 processors on a side gives us a 15 cube. It is important to realize that if you bisect an N cube "through" one of its dimensions (i.e. pick some $0<=i> if some node or wire fails, the machine might be able to keep running. >> (fault tolerence is important for a million processor machine)). >> Redundancy is also nice because it means that if lots of messages want >> to go from A to B, then we may be able to send more than one at a time >> through the net. >Deadlock free routing requires that a specific path be used for each >sender/reciver pair. We probably are thinking of routing from different points of view. I don't believe that deadlock free routing requires a specific path be used for each pair. proof: Suppose we have a cube connected packet switched network with enough storage at each vertice to store on message from every other vertice (usenet in the year 2010?). Suppose that we further place the requirement that no pe (= vertice) is allowed to inject a message into the network unless its previously injected messages have already been delivered. (We might require that every processor which wants to send a message do it at the same time, and then wait for the network to empty out before we send any more messages). There will be no deadlock, since even if all the messages want to go to the same place, there will be enough storage for them. >> One bad thing about cube topologies (at least for cubes bigger than >> about 10 dimensions) is that they do not use their bandwidth very >> effectively. It has been claimed (with proof - references may be >> available on demand) that a 14 cube sending an "average" set of messages >> only uses about one fourth of the bandwidth because of collisions. >I would like to see the reference list, please send it to me via email. Ok, this may take a second... >By the same token you might send my your SNAIL MAIL address if you are >interested in a paper describing a solution to the bandwidth problem. >Request the paper titled "Shared Memory Hypercube". Please send the paper to Bradley C. Kuszmaul Thinking Machines Corporation 245 First Street Cambridge, MA 02142 >The switching technque gives NlogN bandwidth, tested by simulation to N=1024, >and it adaptively removes conflicts. Ah, but there are on the order of 2^N wires in the machine. Sounds like the wires are not being used very effectively. > My use of the network is to create a >shared memory server for a hypercube machine with CRAY style vector processors. >With simultaneous vector access by all processors the packet conflict problem >is as severe as it can be. The network provides FULL bandwidth to all >processors simultaneously. It sounds to me like the packet conflict problem is as nice as it could possibly be for that problem. (Presumably processor number $i$ is accessing memory $f(i)$ where $f$ is one-to-one. What happens if $f$ is not one-to-one (e.g. if $f(i)=c$ for some constant $c$.) If $f$ is always the same function, I would just use local memory and flush the topology. Thus, for your application, $f$ must be a different function on different memory accesses. How many different functions $f$ are there that you can get the bandwidth? I doubt that you can do it for every one-to-one function. Random thoughts and questions: Why do you have shared memory? Presumably this is so that processors can communicate: Why don't you just use the cube network to communicate directly? What is the start up time for your vectors (i.e. how big does a vector have to be before the vector processing part wins over the scalar processing part.) Typical vector processors are limited in their speed by lack of memory bandwidth (this is true for a single processor with high bandwidth memories (e.g. the CRAY uses a 16-way interleaved memory in order to try to increase the memory bandwidth). I don't see how putting a cube topology between the memory and the processor, and adding processors helps this situation. Is it the case that your vector processors are an order of magnitude slower than a CRAY (You mention 20MFLOPS earlier in the article) >> It has been claimed, in this discussion, that cube topologies are nice >> because they can simulate other topologies easily. Actually, this is >> not really a good argument, since most of the other seriously considered >> topologies are equally good at simulating each other >The cube topology is nice as it has other useful topologies IMBEDDED within it. >We are not talking about easy simulation here, efficient execution is the >target. The wires is there! Too many wires, and not very effectivley used. Suppose you spent the same amount of money on a different network which gave you twice the bandwidth for most of your applications. Would the N-cube still make sense? The nice thing about the TOTALLY connected topology is that EVERY other topology is embedded within it, but no one seriously considers wiring up a totally conneccted network with a million processors in it. (10^12 wires?) >> I suspect that the real reason that people are building N-cubes instead >> of skewed rings, minimum spanning trees, or some other topology, is that >> computer scientists have a bias towards binary representations of >> things, and the cube topology is nice and binary, and because routing is >> so easy. >Admittedly the routing is easy, this is one of the reasons for the cubes >attractiveness. The real reason for the interest in it however is it has >a genuine bandwidth edge over the other topologies. That is not really true. One of the topologies mentioned in this discussion (somewhere in an earlier posting) is the $k$-shuffle. This network achieves better bandwidth by decreasing the distance between nodes in the graph, by using the same degree of vertices, and using about the same number of wires as a cube. $k$-shuffles are just harder to understand: Write your address in base $k$. Now take this address and shift it left one digit (i.e. $\log k$ bits) Now add any number from $0$ to $k-1$ to the number, and your original processor is connected to that processor. If there are $n$ processors in the system, then the maximum distance between two processors is $log n / log k$. For 64K procesors in a 16-shuffle (which uses the same number of wires as a 64K 16-cube), the maximum distance between two processors is 4 jumps (compared to 16 jumps for the cube). This means that there are likey to be only a quarter as many conflicts, which increases the bandwidth. If you actually had some problem which only did nearest neighbor communication in an n-cube, we could run your program almost as quickly on the k-shuffle topology, because at worst there would be a four processor jump to perform your communication (instead of one jump). Very few programs actually use the n-cube directly (although there are a few (e.g. subcube induction to count the number of processors in a given state - [D.P. Christman, "Programming the Connection Machine", MIT masters thesis, January, 1984])) And for most programs the $k$-shuffle will be at least as good. Routing can be tricky for a $k$-shuffle. There are topologies with even better bandwidth which do have easy routing. Happy connections. -Bradley -- bradley@think.uucp (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley) bradley@THINK.ARPA