Xref: utzoo comp.sys.super:389 comp.arch:23237 comp.parallel:2656
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!sdd.hp.com!caen!sol.ctr.columbia.edu!emory!hubcap!fpst
From: rvdg@cs.utexas.edu (Robert van de Geijn)
Newsgroups: comp.sys.super,comp.arch,comp.parallel
Subject: Re: Massively Parallel LINPACK on the Intel Touchstone Delta machine
Summary: From the horse's mouth: the anonymous benchmarker speaks
Message-ID: <1991Jun13.203112.27776@hubcap.clemson.edu>
Date: 13 Jun 91 18:37:32 GMT
References: <1991Jun3.233741.8570@elroy.jpl.nasa.gov> <13301@pt.cs.cmu.edu> <1991Jun6.174129.25202@hubcap.clemson.edu>
Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
Followup-To: comp.sys.super
Organization: U. Texas CS Dept., Austin, Texas
Lines: 54
Approved: parallel@hubcap.clemson.edu


This article is in response to several articles posted recently to
comp.parallel concerning the massively parallel LINPACK benchmark on
the Intel Touchstone DELTA machine.  Since I was the anonymous
"benchmarker", I believe a few words from me may shed some light on
the situation.

I was contacted by Intel around the second week of May and asked to
implement the LINPACK Benchmark on the DELTA.  The initial goal was to
beat the "world record" 5.2 GFLOPS (double precision), reported by
Thinking Machines for the CM-2, by the unveiling of the DELTA, May 31.
While a version of the LU factorization had been developed at the
University of Tennessee by Jack Dongarra and Susan Ostrouchov, this
version assumes that the matrix is mapped to nodes by wrapping panels
onto an embedded ring.  Moreover, no satisfactory triangular solve had
been implemented, nor had all pieces been put together.  Finally, a
quick "back-of-the-envelope" calculation indicated that this version
did not scale well to a large machine like the DELTA.

Starting from scratch, I implemented a block-torus wrapped version of
the right-looking LAPACK LU factorization variant, as well as
torus-wrapped variants of the Li-Coleman implementation of the
triangular solve (also known as cyclic algorithms).  The development
was done on the 128-node Intel iPSC/i860 GAMMA machine at ORNL, where
the code achieved an impressive 1.92 GFLOPS.  Two weeks later, on
May 23, the code attained 7.844 GFLOPS on 512 nodes of the DELTA for a
12000 X 12000 problem (on the first attempt on that machine).
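
For readers unfamiliar with the terminology, the following is a minimal
Python sketch of the two data mappings mentioned above. It is an
illustration only, not the benchmark code: the function names, the block
size nb, and the grid dimensions are all hypothetical, and the sketch
assumes the nodes are viewed as a logical p_rows x p_cols grid.

```python
# Illustrative sketch only: names, block size, and grid shape are assumptions.

def ring_owner(j, nb, p):
    """Panel wrapping onto a ring: column panel j // nb goes to node (j // nb) % p."""
    return (j // nb) % p


def torus_owner(i, j, nb, p_rows, p_cols):
    """Block-torus wrapping: block (i // nb, j // nb) is wrapped in both
    dimensions onto a p_rows x p_cols logical grid of nodes."""
    return ((i // nb) % p_rows, (j // nb) % p_cols)


def blocks_per_node(n, nb, p_rows, p_cols):
    """Count how many nb x nb blocks of an n x n matrix each node owns
    under the block-torus mapping."""
    nblk = -(-n // nb)  # ceiling division: number of blocks per dimension
    counts = {}
    for bi in range(nblk):
        for bj in range(nblk):
            owner = (bi % p_rows, bj % p_cols)
            counts[owner] = counts.get(owner, 0) + 1
    return counts


if __name__ == "__main__":
    # Small hypothetical case: 24 x 24 matrix, 4 x 4 blocks, 2 x 3 grid.
    print(torus_owner(5, 13, 4, 2, 3))
    print(ring_owner(13, 4, 3))
    print(blocks_per_node(24, 4, 2, 3))
```

With panel wrapping, every node holds full-height column panels, so the
number of nodes involved in each panel operation does not grow with the
machine. Wrapping in both dimensions spreads each block row and block
column over a row or column of the grid instead.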

After adjusting storage buffer sizes, the same code produced the
numbers reported in article 2643 of comp.parallel (8.561 GFLOPS).
Next, I went on a much-needed vacation at Disneyland and Sea
World.  On June 5, Thinking Machines announced the CM-200, and a new
record, 9.03 GFLOPS for a 28672 X 28672 problem.  On June 6, my code
attained 10.2 GFLOPS for a 20000 X 20000 problem on the DELTA.

Note that the code being used was put together quickly, in about two
weeks.  There are many parameters that can
be optimized, and the memory has not yet been exhausted.  A 20K X 20K
problem is completed in only 522.7 seconds.  Moreover, there are still
many standard techniques that can be used to reduce the communication
overhead.  10.2 GFLOPS is only a start.
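
As a quick consistency check on the figures above, the short Python
sketch below converts the quoted time into a rate. It assumes the
conventional LINPACK operation count of 2/3 n^3 + 2 n^2 floating point
operations for solving an n x n system; the function names are
hypothetical.

```python
# Arithmetic check only; assumes the conventional LINPACK operation count.

def linpack_flops(n):
    """Nominal operation count for LU factorization plus triangular solves."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def gflops(n, seconds):
    """Rate in GFLOPS for an n x n problem solved in the given time."""
    return linpack_flops(n) / seconds / 1e9


if __name__ == "__main__":
    rate = gflops(20000, 522.7)
    print(f"{rate:.2f} GFLOPS")  # roughly 10.2, matching the reported figure
```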

I am working on a short report on this topic and I am in the process
of completing the code.  Intel has promised to release the code into
the public domain once it has been completed.

Robert van de Geijn
Assistant Professor
The University of Texas at Austin


-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell