Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!gem.mps.ohio-state.edu!ctrsol!ginosko!usc!bufo.usc.edu!vorbrueg
From: vorbrueg@bufo.usc.edu (Jan Vorbrueggen)
Newsgroups: comp.arch
Subject: Re: parallel systems
Message-ID: <20764@usc.edu>
Date: 24 Oct 89 03:26:50 GMT
Sender: news@usc.edu
Reply-To: vorbrueg@bufo.usc.edu (Jan Vorbrueggen)
Organization: University of Southern California, Los Angeles, CA
Lines: 36

In article <36597@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:

>Given equivalent performance interconnect, which rarely occurs because the
>message passing machines tend to get short changed on the comm. hardware,
>I have found the "shared memory" systems to have much better communication
>performance.  This is because the communication between processors is
>directly supported in the memory management hardware.  In the message passing
>machines sending a message invokes a "kernel call" on both the sending and
>receiving ends.  This system call overhead is much greater than the hardware
>latency itself, amounting to a factor 5 or more.  One could try for complex
>hardware support of messaging, but a better solution is to just memory map it.
>
>Please note:  I am not talking about the really horrible interrupt handling
>of message forwarding here.  This only compounds a bad situation for kernel
>overhead.

Eugene, ever seen a transputer? The overhead for sending or receiving a
message is 19 cycles (about 630 ns for a 30 MHz part). The actual transfer
is done by a dedicated DMA machine at a maximum rate of 1.7 Mbyte/s
unidirectional or 2.4 Mbyte/s bidirectional per link. At 4 links/transputer
this gives 9.6 Mbyte/s, close to what most memory interfaces will
allow. Of course, very short messages will limit your transfer rate;
however, at 128 bytes/message you see about 80% of the maximum rate.
There is no system call involved - the compiler just generates the
necessary instruction.
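
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. The cycle count, clock rate, and link rates are the figures quoted
above. The saturating model efficiency = n / (n + n_half) is my own
assumption, fitted to the single data point "80% at 128 bytes"; it is not
a vendor formula, and all function names are made up for illustration.

```python
# Back-of-the-envelope check of the transputer figures quoted above.
CLOCK_HZ = 30e6          # 30 MHz part
OVERHEAD_CYCLES = 19     # cycles to initiate a send or receive
BIDIR_MBYTE_S = 2.4      # per-link rate, both directions in use
LINKS = 4                # links per transputer


def overhead_ns(cycles=OVERHEAD_CYCLES, clock_hz=CLOCK_HZ):
    """Message start-up overhead in nanoseconds."""
    return cycles / clock_hz * 1e9


def aggregate_mbyte_s(per_link=BIDIR_MBYTE_S, links=LINKS):
    """Total link bandwidth with every link running bidirectionally."""
    return per_link * links


def half_rate_length(msg_bytes=128, efficiency=0.80):
    """Assumed model: efficiency = n / (n + n_half); solve for n_half."""
    return msg_bytes * (1.0 - efficiency) / efficiency


def efficiency_at(msg_bytes, n_half):
    """Fraction of the peak link rate reached at a given message length."""
    return msg_bytes / (msg_bytes + n_half)


if __name__ == "__main__":
    n_half = half_rate_length()
    print(f"overhead:  {overhead_ns():.0f} ns")
    print(f"aggregate: {aggregate_mbyte_s():.1f} Mbyte/s")
    print(f"n_half:    {n_half:.0f} bytes (assumed model)")
    for n in (8, 32, 128, 1024):
        print(f"{n:5d} bytes -> {efficiency_at(n, n_half):.0%} of peak")
```

Under that assumed model the half-rate message length comes out at about
32 bytes, so messages much shorter than that would spend most of their
time on fixed per-message cost rather than on data.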

Message forwarding isn't so difficult either. I've read of a system
requiring less than 10 us overhead per through-route (this presumably
assumes the destination link is available). No interrupt handling is
involved here - that part is all handled in hardware.

The next generation (promised for the start of 1991) will have 100 Mbit/s
per link and the possibility of hardware routing (a la wormhole).
The CPU will be faster by a factor of 4 or so, with memory bandwidth to match.

-- Jan Vorbrueggen