Xref: utzoo comp.lang.fortran:4234 comp.lang.c:34419
Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!uakari.primate.wisc.edu!aplcen!boingo.med.jhu.edu!haven!adm!cmcl2!lanl!ttw
From: ttw@lanl.gov (Tony Warnock)
Newsgroups: comp.lang.fortran,comp.lang.c
Subject: Re: Fortran vs. C for numerical work (SUMMARY)
Message-ID: <7445@lanl.gov>
Date: 30 Nov 90 15:21:09 GMT
References: <1990Nov29.040910.7400@kithrup.COM> <7318@lanl.gov> <6493:Nov3006:03:3790@kramden.acf.nyu.edu>
Organization: Los Alamos Natl Lab, Los Alamos, N.M.
Lines: 55
>From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
>
>In article <7318@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>> The Crays have an integer multiply unit for addresses. This mult
>> takes 4 clocks.
>
>But isn't that only for the 24-bit integer? If you want to multiply full
>words you have to (internally) convert to floating point, multiply, and
>convert back.
>
>I have dozens of machines that can handle a 16MB computation; I'm not
>gonig to bother with a Cray for those. The biggest advantage of the Cray
>line (particularly the Cray-2) is its huge address space.
>
>So what's the actual time for multiplying integers?
>---Dan
The time for multiplying 32-bit integers on the YMP is 5 clock
periods. Normally YMP addresses refer to 64-bit words, not to
bytes. On the previous models of CRAYs, 24 bits are used to
address 16 Mwords, not Mbytes. (This saves 3 wires per address data
path.) As most work on CRAYs is done on words (numerical) or
packed-character strings, multiplication of longer integers is not
provided for in the hardware.
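
To make the word-addressing arithmetic concrete, here is a small C
sketch. The macro and function names are mine, purely for
illustration, and reading the "3 wires" as log2 of the 8 bytes in a
64-bit word is my interpretation of the remark above.

```c
#include <stdint.h>

#define ADDRESS_BITS 24u  /* address width on the earlier models, per the text */
#define WORD_BYTES   8u   /* one 64-bit word */

/* Number of words reachable with a 24-bit word address. */
static uint64_t addressable_words(void)
{
    return (uint64_t)1 << ADDRESS_BITS;
}

/* The same memory expressed in bytes. */
static uint64_t addressable_bytes(void)
{
    return addressable_words() * WORD_BYTES;
}

/* Extra address bits that byte addressing would need to span the
   same memory: log2(WORD_BYTES). */
static unsigned extra_bits_for_byte_addressing(void)
{
    unsigned bits = 0;
    for (unsigned w = WORD_BYTES; w > 1; w >>= 1)
        bits++;
    return bits;
}
```

So 24 bits of word address cover 16 Mwords, which is 128 Mbytes;
covering the same memory byte by byte would take 27 address bits.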
Personally I would like to have long integer support. The CRAY
architecture supports a somewhat strange multiplication method
which will yield a 48-bit product if the input words have a total
length of no more than 48 bits. That is, one can multiply two 24-bit
quantities, a 16-bit and a 32-bit quantity, a 13-bit and a 35-bit
quantity, or shorter things. This operation takes two shifts and
one multiply. The shifts may be overlapped so the time is 3 clocks
for the two shifts and 7 clocks for the multiply if the shifts are
known; or 4 clocks for the shifts and 7 clocks for the multiply if
the shifts are variable. It's a bit of a pain to program, but the
compiler does it for us. Another form of integer multiplication is
used sometimes: the integers are converted to floating, then
multiplied, and the result converted back to integer. This method
fails if an intermediate value exceeds 46 bits of significance.
The time is 2 clocks for producing a "magic" constant, 3 clocks
each for two integer adds (reduces to 4 total because of
pipelining), 6 clocks each for two floating adds (reduces to 6
because of pipelining overlap with the integer add), 7 clocks for
the floating multiply, 6 clocks for another floating add, and 6
clocks for another integer multiply. Total is 29 clocks if no
other operations may be pipelined with these operations. If the
quantities being multiplied are addresses, some of the above is
eliminated, bringing the result down to 20 clocks. Still, this is
not as good as the floating-point performance. All of the above
may be vectorized, which would result in 3 clocks per result in
vector mode.
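
For readers without a CRAY at hand, the two restrictions described
above can be modelled in portable C. This is only a sketch of the
limits, not of the actual shift-and-multiply or "magic" constant
sequences, which I have not tried to reproduce. The names mul48 and
mul_via_double are made up for the example. The second routine
assumes IEEE 754 doubles, where the exactness limit is 53 bits rather
than the 46 bits quoted for the CRAY.

```c
#include <stdint.h>

/* Number of significant bits in x (0 for x == 0). */
static int bit_length(uint64_t x)
{
    int n = 0;
    while (x != 0) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Model of the operand-width rule: if the two operand widths sum to
   at most 48 bits, the product is guaranteed to fit in 48 bits.
   Returns 1 and stores a*b on success, 0 if the rule is violated. */
static int mul48(uint64_t a, uint64_t b, uint64_t *product)
{
    if (bit_length(a) + bit_length(b) > 48)
        return 0;
    *product = a * b;
    return 1;
}

/* Model of the convert-multiply-convert route. ASSUMPTION: double is
   IEEE 754 binary64, so integers below 2^53 are represented exactly.
   Returns 1 and stores the exact product when it is below 2^53,
   0 when the product would lose low-order bits. */
static int mul_via_double(uint64_t a, uint64_t b, uint64_t *product)
{
    /* volatile forces the result to be rounded to a true double */
    volatile double p = (double)a * (double)b;

    if (p >= 9007199254740992.0)  /* 2^53 */
        return 0;
    *product = (uint64_t)p;
    return 1;
}
```

For instance, mul48 accepts two 24-bit operands or a 16-bit and a
32-bit operand, and rejects a pair of 25-bit operands.
mul_via_double rejects 134217729 * 134217729 (that is, 2^27 + 1
squared), since the product needs 55 bits.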