Xref: utzoo comp.sys.super:365 comp.arch:23088 comp.parallel:2618
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!rphroy!caen!sol.ctr.columbia.edu!emory!hubcap!fpst
From: baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh)
Newsgroups: comp.sys.super,comp.arch,comp.parallel
Subject: Re: Massively Parallel LINPACK on the Intel Touchstone Delta machine
Message-ID: <1991Jun5.120653.7852@hubcap.clemson.edu>
Date: 4 Jun 91 22:23:51 GMT
References: <1991Jun3.130104.15667@hubcap.clemson.edu> <1991Jun3.233741.8570@elroy.jpl.nasa.gov>
Sender: news%ssd.intel.com@RELAY.CS.NET
Followup-To: comp.sys.super
Organization: Supercomputer Systems Division, Intel Corp.
Lines: 83
Approved: parallel@hubcap.clemson.edu
Nntp-Posting-Host: medusa

In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>[original post deleted ... Jerry Baugh]
>
>At first, I started to read this thinking "what the heck does LINPACK
>have to do with the performance of a parallel computer other than
>measuring the power of individual nodes?" Then I started reading more
>closely, and it appears that there's more to it than that.
>
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?

Since I started this, I guess I ought to take a stab at explaining this
as well. The standard LINPACK benchmark is constrained to 100x100 and
1000x1000 matrices. These problem sizes are relatively small when
compared with the capabilities of today's supercomputers. They are also
small compared to the size of the problems people are solving today (a
production code at one of our customer sites does a 20k x 20k matrix
multiply). Recently, Jack Dongarra extended his standard LINPACK
benchmark suite to include a new category, "Massively Parallel
Computing". This new category allows the problem size to increase up to
and including the limit of physical memory.
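
To give a feel for how quickly the problem grows with the order of the
matrix, here is a rough sketch in Python. It assumes 64-bit (8-byte)
matrix entries and the conventional LINPACK operation count of
2/3 n^3 + 2 n^2; the function names are made up for illustration and are
not part of any benchmark code.

```python
def linpack_flops(n: int) -> float:
    """Conventional LINPACK operation count for solving an order-n system."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def matrix_bytes(n: int, word_bytes: int = 8) -> int:
    """Memory needed to hold one dense n x n matrix (assumes 8-byte words)."""
    return n * n * word_bytes


def gflops(n: int, seconds: float) -> float:
    """Rate implied by solving an order-n system in the given wall time."""
    return linpack_flops(n) / seconds / 1e9


if __name__ == "__main__":
    # Orders 100 and 1000 are the standard sizes; 20000 echoes the
    # 20k x 20k production matrix mentioned above.
    for n in (100, 1000, 20000):
        print(n, matrix_bytes(n), linpack_flops(n))
```

An order-20000 matrix alone takes 3.2e9 bytes under these assumptions,
so it cannot fit on a single node of modest memory, which is why the
limit in the new category is total physical memory.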
Since the problem size can vary, it is important to note the size
(presented as the order of the matrix) as well as the FLOP numbers
presented. One of the reasons we are proud of our numbers here at Intel
is that we got a high GFLOP number with a relatively small problem size
(we can still increase the problem size, as we have not yet run out of
physical memory).

>What considerations for communication are made in this benchmark?

Both data decomposition and data communication are critical. The
parallelization method chosen must take both into account, since
selecting one method of data decomposition may dictate the data
communication strategy. The strategy used in this case involved a 2D
data decomposition that does both column and row mapping. This takes
advantage of the Touchstone Delta's mesh architecture by allowing
messages to flow to all nearest neighbors, and it leads to higher
performance than a 1D decomposition scheme.

>Since LINPACK is normally used as a measure of number crunching,
>I'm curious how this benchmark translates to parallel computers.

This particular version was designed for parallel computers.

>As we all know (or we all should know), the performance
>of a parallel computer is usually NOT the same as the performance of an
>individual node multiplied by the number of nodes in the computer
>(although we'd just love that to always be the case).
>The obvious misapplication of this kind of benchmark would
>be to multiply a single node's LINPACK performance by the number of nodes in
>the machine. I notice that the numbers posted in the above-referenced
>article do not scale linearly with the number of nodes used, so there
>is some efficiency loss from the single node case. I'm itching to
>find out what the source of this loss is.

Since the data values are distributed across multiple nodes, data must
be passed from one node (where it resides) to another node (where it is
needed for the calculation).
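
As a back-of-the-envelope illustration of why the decomposition matters
for that data movement, here is a simplified sketch in Python. It is not
the benchmark code. It assumes a textbook elimination step on an active
m x m submatrix, counts only the matrix entries each node needs from
elsewhere, and ignores latency, pivoting, and blocking. The 16 x 32
logical grid for 512 nodes and the value of m are assumptions picked for
the example.

```python
from math import ceil


def per_node_words_1d(m: int) -> int:
    """Column (1D) mapping: every node needs the whole pivot column."""
    return m


def per_node_words_2d(m: int, grid_rows: int, grid_cols: int) -> int:
    """2D mapping: a node needs only the piece of the pivot column that
    matches its rows, plus the piece of the pivot row that matches its
    columns."""
    return ceil(m / grid_rows) + ceil(m / grid_cols)


if __name__ == "__main__":
    m = 8192                        # hypothetical active submatrix order
    grid_rows, grid_cols = 16, 32   # assumed logical grid, 512 nodes
    print("1D:", per_node_words_1d(m))
    print("2D:", per_node_words_2d(m, grid_rows, grid_cols))
```

Under these assumptions each node takes in 8192 entries per step with
the 1D mapping and 768 with the 2D mapping, but neither figure is zero.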
This communication cost is a major part of the difference you see
between the single-node numbers and the 512-node numbers.

>This is of particular interest
>as I am currently porting some existing parallel code to the Delta, and I'd
>like to be able to handle the inevitable queries about "well, they say
>the Delta does such-and-such LINPACK GFLOPS...".

Personally, I have always said that performance on real codes is
important. The "Massively Parallel LINPACK" benchmark provides one
measure that is meaningful for some classes of codes. For other classes
of programs (databases, for example), it is meaningless.

>Any explanation or references would be welcome.

Hope what I've written helps. For more information on LINPACK and the
"Massively Parallel LINPACK", I suggest you get Jack Dongarra's paper
'Performance of Various Computers Using Standard Linear Equations
Software'. For an explanation of a parallel 1D decomposition, try
'LAPACK Block Factorization Algorithms on the Intel iPSC/860', LAPACK
Working Report 24, University of Tennessee, Oct. 1990 by Dongarra and
Ostrouchov.

>Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
>stevo@elroy.jpl.nasa.gov {ames,usc}!elroy!stevo

Jerry Baugh
Intel SSD - baugh@SSD.intel.com