Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!sdd.hp.com!caen!math.lsa.umich.edu!sharkey!amara!mcdaniel From: mcdaniel@adi.com (Tim McDaniel) Newsgroups: comp.lang.c Subject: Re: Column-wise data storage for matrices in C, any advantage(s)? Message-ID: Date: 4 Feb 91 19:32:10 GMT References: <1991Feb1.214342.4982@portia.Stanford.EDU> Sender: news@adi.COM Distribution: na Organization: Applied Dynamics International, Inc.; Ann Arbor, Michigan, USA Lines: 43 In-reply-to: fangchin@elaine46.stanford.edu's message of 1 Feb 91 21:43:42 GMT In article <1991Feb1.214342.4982@portia.Stanford.EDU> fangchin@elaine46.stanford.edu (Chin Fang) writes: as is well know among numerical computing people that matrix algorithms have to be "gaxpy" (defined below) rich to gain high performance: Well, talk to the Center for Supercomputing Research and Development at the University of Illinois, for one (Kuck, Lawrie, Padua, et al). Or Ken Kennedy at Rice. Or Ron Cytron at IBM-Yorktown Heights. They would probably dispute this "well known" statement. gaxpy: given x in Rn, y in Rn and A in Rmxn, the following algorithm computes z=y+Ax. Where I came from, the convention was that an m*n matrix had m rows and n columns, and if x is in Rn and A is in Rm*n, then A*x is in Rm. In that case, "z=y+Ax" above makes no sense. (Also, I've seen the problem referred to as SAXPY, for the single-precision BLAS routine.) Now, note that this algorithm [omitted] requires most rapid changes in indexing occurring in COLUMNS, not rows! Then don't use THIS algorithm; use ANOTHER algorithm. Carefully distinguish "a problem" from "an algorithm". The problem is a goal; the algorithm is one means to that end. You're trying to solve a problem; any particular algorithm is simply one approach to the problem, to be discarded when conditions change. The first algorithm proposed is suited for FORTRAN; devise one more suited for C. Just optimizing matrix multiply for one given machine is hard. For a thorough job, I would suppose you should consider vector register length, number of registers, total cache size, cache line length, page size, memory latency, and function unit overlap, just for starters. No doubt there are other factors I haven't thought of. A very-high-performance supercomputer SAXPY should be in a hand-tuned library; the average expert is not familiar enough with the problem to code an excellent solution. -- Tim McDaniel Applied Dynamics Int'l.; Ann Arbor, Michigan, USA Work phone: +1 313 973 1300 Home phone: +1 313 677 4386 Internet: mcdaniel@adi.com UUCP: {uunet,sharkey}!amara!mcdaniel