Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!udel!nigel.ee.udel.edu!mccalpin
From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
Newsgroups: comp.benchmarks
Subject: Re: more bc babble
Message-ID:
Date: 16 Dec 90 14:04:18 GMT
References: <1990Dec11.163826.5439@eagle.lerc.nasa.gov>
Sender: usenet@ee.udel.edu
Organization: College of Marine Studies, U. Del.
Lines: 53
Nntp-Posting-Host: perelandra.cms.udel.edu
In-reply-to: oles@kelvin.uio.no's message of 13 Dec 90 20:33:01 GMT

>>>>> On 13 Dec 90 20:33:01 GMT, oles@kelvin.uio.no (Ole Swang) said:

Ole> Another easy-to-memorize benchmark is the computation of the sum
Ole> of the first 10 million terms in the harmonic series.
[... code deleted ...]
Ole> This one is obviously testing floating-point performance only. The
Ole> emphasis on divisions might give biased results. It vectorizes
Ole> fully on the vectorizing compilers I've tested it on (Cray and Convex).
Ole> It has the advantage over the bc benchmark that it's the same code
Ole> every time.

Unfortunately, unless one's applications really spend all of their
time doing divides, this benchmark is going to have fairly limited
predictive capability. The timing for the divide instruction varies
considerably between machines, in ways that are not obviously related
to the timings for the add/subtract and multiply instructions.

Off the top of my head, here are some examples. These are asymptotic
peak rates for vector operations, in cycles per result, for the
operation:
	a(i) = b(i)/c(i)

Machine             Divide cycles    Multiply cycles    ratio
----------------------------------------------------------------
Cray X/MP                3N                N              3
Cray 2                   4N                N              4
ETA-10/Cyber 205         6N                N              6
IBM 3090/VF             13N               3N ?           (4)
IBM RS/6000             20N               3N              7 *
----------------------------------------------------------------

I don't recall any other numbers right now, and I certainly won't
guarantee that the above numbers are precisely correct, but they do
give you some idea of the trouble.

? It is too early in the morning for me to remember the details of
  what is overlappable on the 3090/VF. Here I assume that the
  multiply can be overlapped with one of the loads. Since there is
  only one load-store unit, that leaves two more cycles for the other
  load and the store.

* Note that the RS/6000 would only require 2N cycles for the
  equivalent multiplies except for the need to store a(i), which
  cannot be overlapped with either of the loads or the multiply.

It would be especially interesting to add Intel i860 numbers to that
table, since the i860 does not have full FP divide hardware and must
iterate to get an IEEE-compliant result.
--
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET
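
The point of the table can be made concrete with a little arithmetic.
The sketch below is an illustration only: it copies the cycle counts
from the table (themselves quoted from memory), and the function names
and the 5% divide fraction are assumptions chosen for the example, not
measurements. It compares cycles per result, not wall-clock time,
since clock periods differ between machines.

```python
# Cycles per result as multiples of N, copied from the table in the post.
# Each entry is (divide cycles, multiply cycles); the numbers are not guaranteed.
CYCLES = {
    "Cray X/MP": (3, 1),
    "Cray 2": (4, 1),
    "ETA-10/Cyber 205": (6, 1),
    "IBM 3090/VF": (13, 3),
    "IBM RS/6000": (20, 3),
}


def ratio(machine):
    """Divide cost relative to multiply cost on one machine."""
    divide, multiply = CYCLES[machine]
    return divide / multiply


def cycles_per_result(machine, divide_fraction):
    """Average cycles per result for a hypothetical workload in which
    divide_fraction of the results are divides and the rest multiplies."""
    divide, multiply = CYCLES[machine]
    return divide_fraction * divide + (1.0 - divide_fraction) * multiply


def relative_cost(machine, reference, divide_fraction):
    """Cycles per result on machine divided by cycles per result on reference."""
    return (cycles_per_result(machine, divide_fraction)
            / cycles_per_result(reference, divide_fraction))


if __name__ == "__main__":
    for name in CYCLES:
        print(f"{name:18s} divide/multiply ratio {ratio(name):4.1f}")
    # Hypothetical mixes: an all-divide benchmark versus a code with 5% divides.
    for fraction in (1.0, 0.05):
        cost = relative_cost("ETA-10/Cyber 205", "Cray X/MP", fraction)
        print(f"divide fraction {fraction:4.2f}: ETA-10 vs X/MP = {cost:.2f}")
```

With these numbers, the ETA-10 needs twice as many cycles per result
as the X/MP on an all-divide loop, but only about 14% more when 5% of
the results are divides. That gap is why a divide-only benchmark says
little about a code dominated by multiplies.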