Xref: utzoo comp.sys.super:365 comp.arch:23088 comp.parallel:2618
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!rphroy!caen!sol.ctr.columbia.edu!emory!hubcap!fpst
From: baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh)
Newsgroups: comp.sys.super,comp.arch,comp.parallel
Subject: Re: Massively Parallel LINPACK on the Intel Touchstone Delta machine
Message-ID: <1991Jun5.120653.7852@hubcap.clemson.edu>
Date: 4 Jun 91 22:23:51 GMT
References: <1991Jun3.130104.15667@hubcap.clemson.edu> <1991Jun3.233741.8570@elroy.jpl.nasa.gov>
Sender: news%ssd.intel.com@RELAY.CS.NET
Followup-To: comp.sys.super
Organization: Supercomputer Systems Division, Intel Corp.
Lines: 83
Approved: parallel@hubcap.clemson.edu
Nntp-Posting-Host: medusa


In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>[original post deleted ... Jerry Baugh]
>
>At first, I started to read this thinking "what the heck does LINPACK
>have to do with the performance of a parallel computer other than
>measuring the power of individual nodes?"  Then I started reading more
>closely, and it appears that there's more to it than that.
>
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?  

Since I started this, I guess I ought to take a stab at explaining this
as well.  The standard LINPACK benchmark is constrained to matrices of
order 100 and 1000.  These problem sizes are relatively small when
compared with the capabilities of today's supercomputers.  And they are
small compared to the size of the problems people are solving today (a
production code at one of our customer sites does a 20k x 20k matrix
multiply).  Recently,
Jack Dongarra extended his standard LINPACK benchmark suite to include
a new category, "Massively Parallel Computing".  This new category allows
the problem size to increase up to and including the limit of physical
memory.  Since the problem size can vary, it is important to note the
size (given as the order of the matrix) as well as the FLOP numbers presented.
One of the reasons we are proud of our numbers here at Intel is that
we got a high GFLOP number with a relatively small problem size (we can
still increase the problem size as we have not yet run out of physical
memory).
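
To make the link between matrix order, memory, and the reported rate
concrete, here is a rough sketch.  It assumes the conventional LINPACK
operation count of 2/3 n^3 + 2 n^2 for solving a dense system of order n,
and 8 bytes per matrix element (64-bit floating point).  The function
names are my own and purely illustrative; they are not part of the
benchmark.

```python
import math

def linpack_flops(n):
    """Nominal operation count credited for a dense solve of order n
    (conventional LINPACK count: 2/3 n^3 + 2 n^2)."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def matrix_bytes(n, bytes_per_element=8):
    """Memory needed to hold an n x n matrix (assumes 64-bit elements)."""
    return n * n * bytes_per_element

def largest_order(total_bytes, bytes_per_element=8):
    """Largest matrix order whose matrix fits in total_bytes of memory."""
    return math.isqrt(total_bytes // bytes_per_element)

def gflops(n, seconds):
    """Reported rate for a run of order n that took the given time."""
    return linpack_flops(n) / seconds / 1e9
```

Memory grows as n^2 while the work grows as n^3, which is why the order
of the matrix has to be quoted alongside any GFLOPS figure.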

>What considerations for communication are made in this benchmark?  

Both data decomposition and data communication are critical.  Whatever
method is chosen to parallelize a program must take both into account.
Selecting one method of data decomposition may dictate the
data communication strategy.  The strategy used in this case involved
a 2D data decomposition that does both column and row mapping.  This
takes advantage of the Touchstone Delta's mesh architecture by allowing
messages to flow to all nearest neighbors, which leads to
higher performance than a 1D decomposition scheme.
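
As a rough illustration of the difference between the two mappings, the
sketch below assigns matrix elements to nodes.  It assumes a wrapped
(cyclic) mapping in both dimensions, which is one common way to do a 2D
decomposition; I am not claiming it is the exact layout used in the
benchmark code.  All names and the block parameter are hypothetical.

```python
def owner_1d_columns(j, nodes, block=1):
    """1D column mapping: every element of column j lives on one node."""
    return (j // block) % nodes

def owner_2d(i, j, grid_rows, grid_cols, block=1):
    """2D mapping: element (i, j) lives on node (row, col) of the grid.
    Assumes a wrapped (cyclic) assignment in both dimensions."""
    return ((i // block) % grid_rows, (j // block) % grid_cols)

def nodes_holding_column(j, n, grid_rows, grid_cols, block=1):
    """Set of grid nodes that hold some part of column j of an n x n matrix."""
    return {owner_2d(i, j, grid_rows, grid_cols, block) for i in range(n)}
```

With the 1D mapping a whole column sits on a single node, so the work on
that column cannot be shared.  With the 2D mapping the same column is
spread over one column of the node grid, and the traffic it generates
goes along the rows and columns of the mesh.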

>Since LINPACK is normally used as a measure of number crunching,
>I'm curious how this benchmark translates to parallel computers.  

This particular version was designed for parallel computers.

>As we all know (or we all should know), the performance
>of a parallel computer is usually NOT the same as the performance of an
>individual node multiplied by the number of nodes in the computer
>(although we'd just love that to always be the case).
>The obvious misapplication of this kind of benchmark would
>be to multiply a single node's LINPACK performance by the number of nodes in
>the machine.  I notice that the numbers posted in the above-referenced
>article do not scale linearly with the number of nodes used, so there
>is some efficiency loss from the single node case.  I'm itching to
>find out what the source of this loss is.  
 
Since the data values are distributed across multiple nodes, data must be
passed from one node (where it resides) to another node (where it is needed
for the calculation).  This communication cost is a major part of the
difference you see between the single node numbers and the 512 node
numbers.
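
One simple way to put a number on that loss is parallel efficiency: the
measured aggregate rate divided by the single-node rate times the number
of nodes.  The sketch below is only that arithmetic; the function names
are hypothetical and no measured Delta figures are built in.

```python
def parallel_efficiency(measured_rate, single_node_rate, nodes):
    """Fraction of the ideal (linear) rate actually achieved.
    Both rates must be in the same units, e.g. MFLOPS."""
    return measured_rate / (single_node_rate * nodes)

def overhead_fraction(measured_rate, single_node_rate, nodes):
    """Share of the ideal rate lost to communication, load imbalance,
    and other parallel overheads (1 minus the efficiency)."""
    return 1.0 - parallel_efficiency(measured_rate, single_node_rate, nodes)
```

Note that the overhead figure lumps every source of loss together;
communication is a major part of it, but this calculation alone cannot
separate it from the rest.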
 
>This is of particular interest
>as I am currently porting some existing parallel code to the Delta, and I'd
>like to be able to handle the inevitable queries about "well, they say
>the Delta does such-and-such LINPACK GFLOPS...".

Personally, I have always said that performance on real codes is important.
The "Massively Parallel LINPACK" benchmark provides one measure that is
meaningful for some classes of codes.  For other classes of programs
(databases, for example), it is meaningless.
 
>Any explanation or references would be welcome.

Hope what I've written helps.  For more information on LINPACK and the
"Massively Parallel LINPACK", I suggest you get Jack Dongarra's paper
'Performance of Various Computers Using Standard Linear Equations Software'.
For an explanation of a parallel 1D decomposition, try 'LAPACK Block
Factorization Algorithms on the Intel iPSC/860', LAPACK Working Report 24,
University of Tennessee, Oct. 1990 by Dongarra and Ostrouchov.

>Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
>stevo@elroy.jpl.nasa.gov  {ames,usc}!elroy!stevo

Jerry Baugh
Intel SSD - baugh@SSD.intel.com