Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!cornell!batcomputer!kahn From: kahn@batcomputer.tn.cornell.edu (Shahin Kahn) Newsgroups: comp.arch Subject: Re: 1000000x1000000 Matrix (was: linpack) Message-ID: <9168@batcomputer.tn.cornell.edu> Date: 27 Oct 89 22:37:39 GMT References: <9089@batcomputer.tn.cornell.edu> <46500082@uxe.cso.uiuc.edu> <9118@batcomputer.tn.cornell.edu> <36553@lll-winken.LLNL.GOV> Reply-To: kahn@tcgould.tn.cornell.edu (Shahin Kahn) Organization: Cornell Theory Center, Cornell University, Ithaca NY Lines: 102 Here is a posting on the 1Mx1M matrix problem that I got from my friend Brad Carlile at FPS. It is a nice analysis and I have permission to post it for him. As users (i.e. Customers!) move to more realistic, 3D analysis, finer mesh sizes, more raelistic models, etc.. I think we will start seeing more problems like this. A super still has to do small problems fast, but its competitiveness shows better for these "real"istic problems! --------- From fpssun!@tektronix.tek.com:!bradc Fri Oct 27 16:34:25 1989 To: kahn@tcgould.TN.CORNELL.EDU Subject: Re: 1000000x1000000 Matrix (was: li In-Reply-To: your article <9125@batcomputer.tn.cornell.edu> Status: R Solving a 1M x 1M matrix, a very good problem indeed. First, lets consider storage. Storage of a real symmetric matrix = N(N+1)/2, where N is the size of the matrix. If the matrix was general than a storage of N*N is required. The below table will consider the number of bytes required to store the (1M x 1M) matrix in 32-bit and floating-point numbers. We will look at both Real Symmetric (RS) and Real non-symmetric (General). Storage(in bytes) RS General ----------------- ------- ------- 32-bit 2*10^12 4*10^12 64-bit 4*10^12 8*10^12 64-bit 4*10^12 8*10^12 At this point one can consider the number of address bits required for this storage. Address bits needed RS General ------------------- ------- ------- 32-bit 41 42 64-bit 42 43 At this size of matrix we are left with two options. First, we are outside the standard 32-bit address space of most computers. In the future the industry should consider using more address bits. Our second option is to solve the problem in an out-of-core manner. This option requires us to store most of the matrix on disk and bring in blocks of data to operate on (these blocks must be small enough to fit in the 32-bit address space, leaving room for the program, os, etc.). Even though it sounds like this may be a I/O limited problem, it is not. The reuse in the sub-matrix partitioning is enough to double buffer and hide most of the I/O to disk. Now lets turn to the computation. The problem has two phases: factor and solve. One thing not mentioned, or I missed it, was the number of vectors one is solving for in this problem. Flops needed RS General ------------------- ------- ------- Factor (N^3)/3 (2/3)*N^3 Solve 2*X*N^2 2*X*N^2 Where X is the number of vectors that one is solving for. Flops needed (1M*1M) RS General ------------------- ------- ------- Factor 3*10^17 6*10^17 Solve X*2*10^12 X*2*10^12 Factoring is the major computational user. The below table will estimate the time required to solve the problem on different speeds of machines. Machine RS ---------- -------- 100 MFlops 96.1 years 1 GFlops 9.52 years 10 GFlops 348 days 100 GFlops 34.7 days 1 TFlops 3.47 days I like problems like this they are fun. A couple of other notes. When you solve large matrices you are working with matrix multiply based kernels. The adder and the multiplier operations are balanced and this is a compute bound problem. It is also not memory bound like the 100*100 LINPACK benchmark. [Do not attempt this size of problem using the LINPACK 100 code :), I've always wanted to use of those ":)"] The real-symmetric case does not require pivoting for numerical stability. I would suggest doing this calculation in 64-bit, or larger, floating-point numbers. I hope I haven't made any errors in my calculations, I am doing this from memory. Brad Carlile