Xref: utzoo comp.sys.super:389 comp.arch:23237 comp.parallel:2656
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!sdd.hp.com!caen!sol.ctr.columbia.edu!emory!hubcap!fpst
From: rvdg@cs.utexas.edu (Robert van de Geijn)
Newsgroups: comp.sys.super,comp.arch,comp.parallel
Subject: Re: Massively Parallel LINPACK on the Intel Touchstone Delta machine
Summary: From the horse's mouth: the anonymous benchmarker speaks
Message-ID: <1991Jun13.203112.27776@hubcap.clemson.edu>
Date: 13 Jun 91 18:37:32 GMT
References: <1991Jun3.233741.8570@elroy.jpl.nasa.gov> <13301@pt.cs.cmu.edu> <1991Jun6.174129.25202@hubcap.clemson.edu>
Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
Followup-To: comp.sys.super
Organization: U. Texas CS Dept., Austin, Texas
Lines: 54
Approved: parallel@hubcap.clemson.edu

This article is in response to several articles posted recently to
comp.parallel concerning the massively parallel LINPACK benchmark on
the Intel Touchstone DELTA machine. Since I was the anonymous
"benchmarker", I believe a few words from me may shed some light on
the situation.

I was contacted by Intel around the second week of May and asked to
implement the LINPACK Benchmark on the DELTA. The initial goal was to
beat the "world record" 5.2 GFLOPS (double precision), reported by
Thinking Machines for the CM-2, by the unveiling of the DELTA, May 31.

While a version of the LU factorization had been developed at the
University of Tennessee by Jack Dongarra and Susan Ostrouchov, this
version assumes that the matrix is mapped to nodes by wrapping panels
onto an embedded ring. Moreover, no satisfactory triangular solve had
been implemented, nor had all pieces been put together. Finally, a
quick "back-of-the-envelope" calculation indicated that this version
did not scale well to a large machine like the DELTA.
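
To make the mapping just described concrete, here is a minimal sketch,
not taken from the Tennessee code or from mine. It assumes that
"wrapping panels onto an embedded ring" means dealing column panels of
width nb round-robin to p nodes. The second function shows, for
contrast, a two-dimensional wrapped mapping onto a pr x pc grid, which
is one common reading of "torus wrapped". All names and parameters
(ring_panel_owner, torus_block_owner, nb, p, pr, pc) are hypothetical.

```python
def ring_panel_owner(j, nb, p):
    """Node (0..p-1) owning global column j when column panels of
    width nb are dealt round-robin around a ring of p nodes.
    Assumed interpretation of panel wrapping, for illustration only."""
    return (j // nb) % p


def torus_block_owner(i, j, nb, pr, pc):
    """Grid coordinates (row, col) of the node owning element (i, j)
    when nb x nb blocks are wrapped in both dimensions onto a
    pr x pc node grid. Assumed interpretation, for illustration only."""
    return ((i // nb) % pr, (j // nb) % pc)


# Columns 0..7 with panels of width 2 on a 3-node ring:
owners = [ring_panel_owner(j, nb=2, p=3) for j in range(8)]
print(owners)  # [0, 0, 1, 1, 2, 2, 0, 0]

# Element (5, 9) with 4 x 4 blocks on a 2 x 3 grid:
print(torus_block_owner(5, 9, nb=4, pr=2, pc=3))  # (1, 2)
```

In the ring mapping every node holds whole columns, so each panel
belongs to a single node; in the two-dimensional mapping each panel is
spread over a column of the node grid.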

Starting from scratch, I implemented a block-torus wrapped version of
the right-looking LAPACK LU factorization variant, as well as
torus-wrapped variants of the Li-Coleman implementation of the
triangular solve (also known as cyclic algorithms). The development
was done on the ORNL 128-node Intel iPSC/i860 GAMMA machine, achieving
an impressive 1.92 GFLOPS. Two weeks later, on May 23, the code
attained 7.844 GFLOPS on 512 nodes of the DELTA, for a 12000 X 12000
problem, during the first attempt on that machine. After adjusting
storage buffer sizes, the same code produced the numbers reported in
article 2643 of comp.parallel (8.561 GFLOPS).

Next, the author went on a much-needed vacation at Disneyland and Sea
World.

On June 5, Thinking Machines announced the CM-200, and a new record,
9.03 GFLOPS for a 28672 X 28672 problem. On June 6, my code attained
10.2 GFLOPS for a 20000 X 20000 problem on the DELTA.

It should be noted that the code that is being used was quickly put
together in a matter of 2 weeks. There are many parameters that can be
optimized, and the memory has not yet been exhausted. A 20K X 20K
problem is completed in only 522.7 seconds. Moreover, there are still
many standard techniques that can be used to reduce the communication
overhead. 10.2 GFLOPS is only a start.

I am working on a short report on this topic and I am in the process
of completing the code. Intel has promised to release the code to the
public domain, once it has been completed.

Robert van de Geijn
Assistant Professor
The University of Texas at Austin

-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell
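
As a rough check on the figures quoted in the article above, the
sketch below recomputes the rate for the 20000 X 20000 run from its
522.7 second completion time. It assumes the conventional LINPACK
operation count of 2/3 n^3 + 2 n^2 floating point operations; the
article does not say which count was used, so treat this as an
assumption. The function names are hypothetical.

```python
def linpack_flops(n):
    """Conventional LINPACK operation count for solving an n x n
    dense system: 2/3 n^3 + 2 n^2 (assumed, not stated in the post)."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def gflops(n, seconds):
    """Rate in GFLOPS for an order-n solve finished in `seconds`."""
    return linpack_flops(n) / seconds / 1e9


# The 20000 X 20000 run reported at 522.7 seconds.
rate = gflops(20000, 522.7)
print(f"{rate:.1f} GFLOPS")
```

Under that assumption the result comes out at about 10.2 GFLOPS,
which agrees with the number reported for June 6.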