Xref: utzoo comp.lang.fortran:4216 comp.lang.c:34376
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!psuvax1!rutgers!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.lang.fortran,comp.lang.c
Subject: Re: Fortran vs. C for numerical work (SUMMARY)
Message-ID: <6690:Nov3006:15:3890@kramden.acf.nyu.edu>
Date: 30 Nov 90 06:15:38 GMT
References: <9458:Nov2721:51:5590@kramden.acf.nyu.edu> <2392:Nov2902:59:0590@kramden.acf.nyu.edu> <7339@lanl.gov>
Organization: IR
Lines: 36

Several of you have been missing the crucial point. Say there's a
300 to 1 ratio of steps through a matrix to random jumps. On a Convex
or Cray or similar vector computer, those 300 steps will run 20 times
faster. Suddenly it's just a 15 to 1 ratio, and a slow instruction
outside the loop begins to compete in total runtime with a fast
floating-point multiplication inside the loop.

Anyone who doesn't think shaving a day or two off a two-week
computation is worthwhile shouldn't be talking about efficiency.

In article <7339@lanl.gov> ttw@lanl.gov (Tony Warnock) writes:
> Model     Multiplication Time    Memory Latency
> YMP       5 clock periods        18 clock periods
> XMP       4 clock periods        14 clock periods
> CRAY-1    6 clock periods        11 clock periods

Um, I don't believe those numbers. Floating-point multiplications and
24-bit multiplications might run that fast, but 32-bit multiplications?
Do all your matrices really fit in 16MB?

> Compaq    25 clock periods       4 clock periods

Well, that is a little extreme; I was talking about real computers.

> For an LU
> decomposition with partial pivoting, one does roughly N/3 constant
> stride memory accesses for each "random" access. For small N, say
> 100 by 100 size matrices or so, one would do about 30
> strength-reduced operations for each memory access. For medium
> (1000 by 1000) problems, the ratio is about 300 and for large
> (10000 by 10000) it is about 30000.

And divide those ratios by 20 for vectorization: 1.5, 15, and 150. Hmm.

---Dan