Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: more bc babble
Message-ID: <MCCALPIN.90Dec16090418@pereland.cms.udel.edu>
Date: 16 Dec 90 14:04:18 GMT
References: <1990Dec11.163826.5439@eagle.lerc.nasa.gov>
	<OLES.90Dec13213301@kelvin.uio.no>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 53
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: oles@kelvin.uio.no's message of 13 Dec 90 20:33:01 GMT

>>>>> On 13 Dec 90 20:33:01 GMT, oles@kelvin.uio.no (Ole Swang) said:

Ole> Another easy-to-memorize benchmark is the computation of the sum
Ole> of the first 10 million terms in the harmonic series.
	[... code deleted ...]
Ole> This one is obviously testing floating-point performance only. The
Ole> emphasis on divisions might give biased results. It vectorizes
Ole> fully on the vectorizing compilers I've tested it on (Cray and Convex).
Ole> It has the advantage over the bc benchmark that it's the same code
Ole> every time.

Unfortunately, unless one's applications really spend all of their
time doing divides, this benchmark is going to have fairly limited
predictive capability.  The timing for the divide instruction is
rather variable between machines in ways that are not obviously
related to the timings for the add/subtract and multiply instructions.
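
For reference, here is a minimal sketch of the kind of loop under
discussion. It is not Ole's deleted code; the function name and the use
of Python are my own assumptions for illustration. It does show why the
benchmark is divide-bound: each term costs exactly one divide and one
add, and nothing else.

```python
def harmonic_sum(n_terms):
    """Return 1/1 + 1/2 + ... + 1/n_terms.

    Hypothetical helper, not the original benchmark source.
    Each loop iteration performs one floating-point divide and one add.
    """
    total = 0.0
    for i in range(1, n_terms + 1):
        total += 1.0 / i  # the divide dominates the cost of each term
    return total
```

Calling harmonic_sum(10_000_000) corresponds to the 10 million terms
mentioned in the quoted article.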


Off the top of my head, here are some examples.  These are asymptotic
peak rates for vector operations, given as total cycles for N results,
for the divide
		a(i) = b(i)/c(i)
and for the corresponding multiply.

Machine            Divide cycles   Multiply cycles   Ratio
----------------------------------------------------------------
Cray X/MP               3N               N             3
Cray 2                  4N               N             4
ETA-10/Cyber 205        6N               N             6
IBM 3090/VF            13N              3N ?          (4)
IBM RS/6000            20N              3N             7 *
----------------------------------------------------------------

I don't recall any other numbers right now, and I certainly won't
guarantee that the above numbers are precisely correct, but they do
give you some idea of the trouble.
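
To make the trouble concrete, the sketch below recomputes the ratio
column from the cycle counts in the table and applies a very crude cost
model. The model (a linear blend of divide and multiply cycles,
weighted by the fraction of results that come from divides) and all of
the names are my own assumptions, and it ignores clock rates entirely.

```python
# (divide, multiply) cycles per result, copied from the table above.
# The numbers are from memory and are not guaranteed to be exact.
CYCLES = {
    "Cray X/MP": (3, 1),
    "Cray 2": (4, 1),
    "ETA-10/Cyber 205": (6, 1),
    "IBM 3090/VF": (13, 3),
    "IBM RS/6000": (20, 3),
}


def divide_multiply_ratio(machine):
    """Divide cycles over multiply cycles for one machine."""
    div, mul = CYCLES[machine]
    return div / mul


def cycles_per_result(machine, divide_fraction):
    """Hypothetical model: average cycles per result for a workload in
    which divide_fraction of the results are divides and the rest are
    multiplies."""
    div, mul = CYCLES[machine]
    return divide_fraction * div + (1.0 - divide_fraction) * mul


if __name__ == "__main__":
    for name in CYCLES:
        print(f"{name:18s} ratio {divide_multiply_ratio(name):5.2f}")
```

Under this model an all-divide loop puts the RS/6000 at 20/3, or about
6.7 times the cycles of the X/MP, while an all-multiply loop puts it at
only 3 times.  A divide-only benchmark therefore overstates the gap by
more than a factor of two for code that rarely divides.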

? It is too early in the morning for me to remember the details of
what is overlappable on the 3090/VF.  Here I assume that the multiply
can be overlapped with one of the loads.  Since there is only one
load-store unit, that leaves two more cycles for the other load and the
store.

* Note that the RS/6000 would only require 2N cycles for the equivalent
multiplies except for the need to store a(i), which cannot be
overlapped with either of the loads or the multiply.
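
Both footnotes come down to counting memory operations.  As a rough
sketch, assume one load or store can be issued per cycle and that the
multiply is completely hidden behind the memory traffic.  That is my
simplification of the two notes above, not a documented property of
either machine, and the function name is made up for illustration.

```python
def multiply_cycles_per_result(loads, stores, mem_ops_per_cycle=1):
    """Hypothetical memory-bound estimate: cycles per result when the
    arithmetic is fully overlapped and only loads and stores cost time."""
    return (loads + stores) / mem_ops_per_cycle


# a(i) = b(i)*c(i): load b(i), load c(i), store a(i)
with_store = multiply_cycles_per_result(loads=2, stores=1)     # 3.0
without_store = multiply_cycles_per_result(loads=2, stores=0)  # 2.0
```

This reproduces the 3N multiply entries in the table, and the 2N figure
for the RS/6000 if the store did not have to be counted.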

It would be especially interesting to add Intel i860 numbers to that
table, since the i860 does not have full FP divide hardware and must
iterate to get an IEEE-compliant result.
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET