Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!tut.cis.ohio-state.edu!ucbvax!biology.cambridge.ac.uk!ARCR1
From: ARCR1@biology.cambridge.ac.uk (Andy Raine)
Newsgroups: comp.sys.transputer
Subject: i860s and the like
Message-ID: <8439.9008081138@prg.oxford.ac.uk>
Date: 8 Aug 90 11:39:00 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 52

Dear netters,

After the exhibition at TA90 in Southampton, and in light of earlier 
discussion on the net, something has occurred to me that I offer as a topic 
for debate:

Meiko, Transtech, Microway etc. sell boards that have an Intel i860 
processor interfaced to one or two t800s.  INMOS have a TRAM which has a 
vector coprocessor attached to a t800.

All these boards offer what seems to be impressive cpu performance (albeit 
for calculations that can be expressed in terms of vector processing 
libraries, at the moment), but one thing bothers me: 

Consider:  Meiko claim that their board can do a 1024x32bit complex FFT in 
1.3 ms.  INMOS reckon their board will do the same in < 2.0 ms.

The data required for this calculation is 1024 points x 4 bytes x 2 (real 
and imaginary parts) = 8 kBytes, which would take about 8 ms to transfer from 
one transputer to another using Meiko's CStools communications (occam on a 
bare link would be faster, but would certainly still take at least 4 ms).  So 
if all 8 links of the Meiko board (it has 2 t800s) were saturated, and if 
communications could be overlapped completely with calculations, then the 
vector processor would just about be kept busy all the time.  For the boards 
with only 4 links, then, only about 50% utilisation of the coprocessor could 
be expected as a maximum.  In realistic cases, data just won't be available 
to the processor at these maximum rates.
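
The arithmetic above can be sketched in a few lines of Python.  This is only 
an illustrative model: the function name is made up, and it assumes that every 
link carries input data in parallel, that transfers overlap perfectly with 
computation, and that returning results costs nothing extra.  With the figures 
quoted it gives a bound of 100% for 8 links and roughly 65% for 4 links, the 
same ballpark as the 50% estimate.

```python
# Hypothetical helper, not part of any vendor library.
def max_utilisation(compute_ms, transfer_ms_per_link, links):
    """Upper bound on coprocessor utilisation for a stream of FFTs.

    compute_ms           -- time to compute one FFT on the coprocessor
    transfer_ms_per_link -- time to move one data set over a single link
    links                -- number of links feeding the board in parallel
    """
    # With all links saturated, a new data set arrives every feed_ms.
    feed_ms = transfer_ms_per_link / links
    # The coprocessor cannot be more than 100% busy.
    return min(1.0, compute_ms / feed_ms)


# 1024 complex points, 4 bytes per real or imaginary part.
DATA_BYTES = 1024 * 4 * 2

# Figures from the post: 1.3 ms per FFT, about 8 ms per 8 kByte transfer.
eight_links = max_utilisation(1.3, 8.0, 8)
four_links = max_utilisation(1.3, 8.0, 4)

print(DATA_BYTES, eight_links, four_links)
```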

So what should we conclude?  Well, many problems are parallelisable, but 
efficient algorithms depend on minimising the time spent in communicating 
data.  The t800 has been referred to as a 'medium grain' processor, meaning 
that often a few tens of processors can be brought to bear efficiently on a 
particular problem.  The new vector/i860 boards are then 'coarse grain' 
processors, and for the same problem, the maximum number that can be used
efficiently will be smaller.

I suggest that what is needed for a large number of scientific calculations 
is a 'fine grain' processor.  In other words, if the ratio of communication 
speed to compute speed of the t800 is taken to be 1:1, then what is needed 
is a processor where the ratio is 10:1.  The i860 & vector boards achieve a 
ratio of 1:10 (The wrong way!), and the H1 transputer maintains the t800 
ratio at 1:1.
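
The ratios above can be written down as a small calculation, normalised so 
that the t800 is 1.0.  This is a sketch only: the function name is 
hypothetical, and the factors of 10 are the round numbers used in this post, 
not measured values.

```python
# Hypothetical helper: communication-to-compute ratio relative to a t800.
def comm_to_compute_ratio(link_speedup, compute_speedup):
    """Both arguments are speed-up factors relative to a plain t800."""
    return link_speedup / compute_speedup


ratios = {
    # Baseline, defined as 1:1.
    "t800": comm_to_compute_ratio(1, 1),
    # Same links, roughly 10 times the arithmetic speed: 1:10.
    "i860/vector board": comm_to_compute_ratio(1, 10),
    # The board asked for here: t800 arithmetic, links 10 times faster: 10:1.
    "t800 + 10x links": comm_to_compute_ratio(10, 1),
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:g}")
```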

If a manufacturer produced boards with a t800 coupled with link driver 
hardware that ran at 10 times the speed of the t800's links, then I would 
be able to use ten times as many processors, and get 10 times the 
performance.  What about it?

OK, that's all.  I seem to have written quite a lot, but I don't want to 
appear to be trying to force my point of view down other people's throats, 
just to start a discussion.  So what do people think?

Andy Raine