Xref: utzoo comp.lang.fortran:4216 comp.lang.c:34376
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!psuvax1!rutgers!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.lang.fortran,comp.lang.c
Subject: Re: Fortran vs. C for numerical work (SUMMARY)
Message-ID: <6690:Nov3006:15:3890@kramden.acf.nyu.edu>
Date: 30 Nov 90 06:15:38 GMT
References: <9458:Nov2721:51:5590@kramden.acf.nyu.edu> <2392:Nov2902:59:0590@kramden.acf.nyu.edu> <7339@lanl.gov>
Organization: IR
Lines: 36

Several of you have been missing the crucial point.

Say there's a 300 to 1 ratio of steps through a matrix to random jumps.
On a Convex or Cray or similar vector computer, those 300 steps will run
20 times faster. Suddenly it's just a 15 to 1 ratio, and a slow instruction
outside the loop begins to compete in total runtime with a fast
floating-point multiplication inside the loop.
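
To put rough numbers on that, here is a minimal sketch of the arithmetic.
The function names and the per-operation costs (1 time unit per in-loop
step, 2 units per random jump) are made up for illustration; only the
300 to 1 ratio and the factor of 20 come from the paragraph above.

```python
def effective_ratio(steps_per_jump, vector_speedup):
    """Steps-per-jump ratio measured in time once the steps are vectorized."""
    return steps_per_jump / vector_speedup


def jump_share(steps_per_jump, step_cost, jump_cost, vector_speedup=1.0):
    """Fraction of total runtime spent on the out-of-loop jump.

    step_cost and jump_cost are hypothetical, in arbitrary time units.
    """
    loop_time = steps_per_jump * step_cost / vector_speedup
    return jump_cost / (loop_time + jump_cost)


ratio = effective_ratio(300, 20)               # 300 steps, 20x vector speedup
scalar_share = jump_share(300, 1.0, 2.0)       # no vectorization
vector_share = jump_share(300, 1.0, 2.0, 20.0) # loop runs 20x faster

print(ratio, scalar_share, vector_share)
```

With those assumed costs the jump is well under 1% of the runtime on a
scalar machine and roughly 12% once the loop is vectorized. Different
costs give different shares, but the shift goes the same way.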

Anyone who doesn't think shaving a day or two off a two-week computation
is worthwhile shouldn't be talking about efficiency.

In article <7339@lanl.gov> ttw@lanl.gov (Tony Warnock) writes:
>       Model        Multiplication Time     Memory Latency
>       YMP          5  clock periods         18 clock periods
>       XMP          4  clock periods         14 clock periods
>       CRAY-1       6  clock periods         11 clock periods

Um, I don't believe those numbers. Floating-point multiplications and
24-bit multiplications might run that fast, but 32-bit multiplications?
Do all your matrices really fit in 16MB?

>       Compaq       25 clock periods         4  clock periods

Well, that is a little extreme; I was talking about real computers.

> For an LU
>     decomposition with partial pivoting, one does roughly N/3 constant
>     stride memory accesses for each "random" access. For small N, say
>     100 by 100 size matrices or so, one would do about 30
>     strength-reduced operations for each memory access. For medium
>     (1000 by 1000) problems, the ratio is about 300 and for large
>     (10000 by 10000) it is about 3000.

And divide those ratios by 20 for vectorization. 1.5, 15, and 150. Hmmm.
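
The same division can be sketched straight from the quoted N/3 rule of
thumb instead of the rounded figures. The function name is hypothetical,
and the factor of 20 is the vector speedup assumed earlier in this post.

```python
def vectorized_ratios(sizes, vector_speedup=20.0):
    """Map matrix order N to (N/3) / vector_speedup.

    N/3 is the quoted count of constant-stride accesses per "random"
    access for LU decomposition with partial pivoting.
    """
    return {n: (n / 3.0) / vector_speedup for n in sizes}


ratios = vectorized_ratios([100, 1000, 10000])
for n, r in ratios.items():
    print(f"N = {n:>5}: about {r:.1f} vectorized steps per random access")
```

Unrounded, the results come out slightly above 1.5, 15, and 150, since
N/3 is a bit more than 30, 300, and 3000.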

---Dan