Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!batcomputer!kahn
From: kahn@batcomputer.tn.cornell.edu (Shahin Kahn)
Newsgroups: comp.arch
Subject: Re: 1000000x1000000 Matrix (was: linpack)
Message-ID: <9168@batcomputer.tn.cornell.edu>
Date: 27 Oct 89 22:37:39 GMT
References: <9089@batcomputer.tn.cornell.edu> <46500082@uxe.cso.uiuc.edu> <9118@batcomputer.tn.cornell.edu> <36553@lll-winken.LLNL.GOV>
Reply-To: kahn@tcgould.tn.cornell.edu (Shahin Kahn)
Organization: Cornell Theory Center, Cornell University, Ithaca NY
Lines: 102


Here is a posting on the 1Mx1M matrix problem that I got from 
my friend Brad Carlile at FPS.  It is a nice analysis and
I have permission to post it for him.  

As users (i.e. Customers!) move to more realistic, 3D analysis,
finer mesh sizes, more raelistic models, etc..  I think we will
start seeing more problems like this.  A super still has to do
small problems fast, but its competitiveness shows better for
these "real"istic problems!
---------

 From fpssun!@tektronix.tek.com:!bradc Fri Oct 27 16:34:25 1989
 To: kahn@tcgould.TN.CORNELL.EDU
 Subject: Re: 1000000x1000000 Matrix (was: li
 In-Reply-To: your article <9125@batcomputer.tn.cornell.edu>
 Status: R

 Solving a 1M x 1M matrix, a very good problem indeed.

 First, lets consider storage.

 Storage of a real symmetric matrix = N(N+1)/2, where N is the size of
 the matrix.  If the matrix was general than a storage of N*N is
 required.  The below table will consider the number of bytes required
 to store the (1M x 1M) matrix in 32-bit and floating-point numbers.
 We will look at both Real Symmetric (RS) and Real
 non-symmetric (General).

 Storage(in bytes)     RS        General
 -----------------  -------      -------
 32-bit             2*10^12      4*10^12
 64-bit             4*10^12      8*10^12
 64-bit             4*10^12      8*10^12
 
 At this point one can consider the number of address bits required for
 this storage.  
 
 Address bits needed     RS        General
 -------------------  -------      -------
 32-bit                  41          42
 64-bit                  42          43
 
 At this size of matrix we are left with two options.  First, we are
 outside the standard 32-bit address space of most computers.  In the
 future the industry should consider using more address bits.  Our second
 option is to solve the problem in an out-of-core manner.  This option
 requires us to store most of the matrix on disk and bring in blocks of
 data to operate on (these blocks must be small enough to fit in the
 32-bit address space, leaving room for the program, os, etc.).  Even
 though it sounds like this may be a I/O limited problem, it is not.  The
 reuse in the sub-matrix partitioning is enough to double buffer and hide
 most of the I/O to disk.
 
 Now lets turn to the computation.
 
 The problem has two phases: factor and solve.  One thing not mentioned,
 or I missed it, was the number of vectors one is solving for in this
 problem.  
 
 Flops needed            RS        General 
 -------------------  -------      -------
 Factor               (N^3)/3     (2/3)*N^3 
 Solve                2*X*N^2      2*X*N^2 
 
 Where X is the number of vectors that one is solving for.
 
 Flops needed (1M*1M)    RS        General 
 -------------------  -------      -------
 Factor               3*10^17      6*10^17
 Solve               X*2*10^12    X*2*10^12   
 
 Factoring is the major computational user.  The below table will
 estimate the time required to solve the problem on different speeds of
 machines.
 
 Machine                 RS
 ----------           --------
 100 MFlops           96.1 years
 1 GFlops             9.52 years
 10 GFlops            348 days
 100 GFlops           34.7 days
 1 TFlops             3.47 days
 
 I like problems like this they are fun.
 
 A couple of other notes.
 
 When you solve large matrices you are working with matrix multiply based
 kernels.  The adder and the multiplier operations are balanced and this
 is a compute bound problem.  It is also not memory bound like the
 100*100 LINPACK benchmark.  [Do not attempt this size of problem using 
 the LINPACK 100 code :), I've always wanted to use of those ":)"]
 The real-symmetric case does not require pivoting for numerical
 stability.  I would suggest doing this calculation in 64-bit, or
 larger, floating-point numbers. 
 
 I hope I haven't made any errors in my calculations, I am doing this
 from memory.
 
 Brad Carlile