Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!zaphod.mps.ohio-state.edu!mips!winchester!mash
From: mash@mips.COM (John Mashey)
Newsgroups: comp.arch
Subject: Re: RISC vs CISC simple load benchmark; amazing ! [Not really]
Message-ID: <39319@mips.mips.COM>
Date: 12 Jun 90 03:35:06 GMT
References: <8019@mirsa.inria.fr>
Sender: news@mips.COM
Reply-To: mash@mips.COM (John Mashey)
Organization: MIPS Computer Systems
Lines: 132

In article <8019@mirsa.inria.fr> jlf@mirsa.inria.fr (Jean-Louis Faraut) writes:
>Therefore, we are looking at new RISC technology and how big CPU
>manufacturers announced performances are is a matter of wondering for
>me :-?

>I try it with different commands to test CPU, I/O etc ... but for the
>sake of brevity, I'll only present here results obtained from a
>slightly modified version of the famous Bill Joy's test program, where
>RISC are supposed to be better than CISC .
>
>Here is my version of the Joy's program :
>========================================
>#!/bin/sh
>echo 99k65536vvvvvp8op | dc
>========================================

Well, unfortunately, there is an unfortunate bug in this benchmark, in that
the behavior of this program in no way resembles most code typically
run on general-purpose UNIX systems, and you absolutely do NOT want to use
it to help choose computers unless your workload happens to be
multiple-precision arithmetic doing lots of multiplies and divides.

Specifically, most RISC designers, after studying many programs,
decided that integer multiply (and especially divide) were used less
frequently than many other operations, and there is substantial data
that backs this up from many vendors.  RISC designers, depending on
the benchmarks used, and amount of silicon available, allocated various
amounts of silicon to support these operations, from zero up.
The SPARC designers included a Multiply-Step, and no Divide-Step (i.e.,
divides are done by a fair-sized hunk of code); HP PA included M-S and D-S;
MIPS & 88K included both integer mult & divide in hardware, etc.
However, for example, a typical integer divide on a MIPS takes about
35 cycles.... and probably about the same on a typical CISC.

IF YOU WANT TO PROVE THAT A RISC IS NOT VERY MUCH BETTER THAN A CISC,
OR EVEN WORSE, AT CPU PERFORMANCE:
    Use a program consisting of integer divides of 2 variables that
    the optimizers can't get rid of.
    a) This will show a MIPS or 88K at their least advantage.
    b) This will prove that a SPARC is the slowest thing in existence
       (well almost).

While we (MIPS) thought divide was good to have, I must defend the
SPARCers as not being irrational in leaving it out, given the constraints
of the first implementation.

Attached to the end of this is a brief sample of the MIPS prof/pixstats,
which shows that on a MIPS-based machine (i.e., including DEC 5810),
multiply/divide accounts for 1/3 of the cycles.  This is NOT as bad as
the even more classic "2^4096" dc benchmark, but it's still very high.

(I tried email to jlf, but it bounced for some reason)
jlf: if you want some solid CPU (single-user) data on realistic benchmarks,
call the MIPS Paris office (33-1-42-04-0311) and ask for
"SPEC Data Helps Normalize Vendor Mips-Ratings for Sensible Comparison
OR Your Mileage May Vary, But If Your Car Were A Computer, It Would Vary More"
Issue 2.0, May 1990.  This is a giant analysis of all the published
SPECmark data, and includes plenty of RISCs & CISCs.....
It doesn't tell you about multi-user stuff...

PROF & PIXSTATS: stop here if you don't want gory details.
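
For anyone who does want the gory details, the headline pixstats numbers
in the listing can be cross-checked with a few lines of C.  This is only a
sketch and is not part of prof or pixstats.  The three counts and the
25 MHz clock are copied from the pixstats output for dc; the macro and
function names (DC_*, interlock_fraction, run_seconds, print_dc_summary)
are made up here for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Figures copied from the pixstats output for dc. */
#define DC_INSTRUCTIONS      18830106L  /* instructions executed          */
#define DC_INTERLOCK_CYCLES   8770695L  /* multiply/divide interlock      */
#define DC_TOTAL_CYCLES      27600801L  /* instruction cycles incl stalls */
#define DC_CLOCK_HZ          25000000.0 /* 25.0MHz, as printed            */

/* Fraction of all cycles spent waiting on the multiply/divide unit. */
double interlock_fraction(long interlock_cycles, long total_cycles)
{
    assert(total_cycles > 0);
    return (double)interlock_cycles / (double)total_cycles;
}

/* Run time in seconds implied by a cycle count at a given clock rate. */
double run_seconds(long cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}

/* Hypothetical helper: print the cross-check for the dc run. */
void print_dc_summary(void)
{
    /* One cycle per instruction plus the interlock stalls. */
    long rebuilt = DC_INSTRUCTIONS + DC_INTERLOCK_CYCLES;

    printf("instructions + interlocks = %ld (pixstats says %ld)\n",
           rebuilt, DC_TOTAL_CYCLES);
    printf("interlock share of cycles = %.3f\n",
           interlock_fraction(DC_INTERLOCK_CYCLES, DC_TOTAL_CYCLES));
    printf("run time at 25 MHz        = %.2f s\n",
           run_seconds(DC_TOTAL_CYCLES, DC_CLOCK_HZ));
}
```

Calling print_dc_summary() from a main() shows that instructions plus
interlock cycles add up exactly to the reported cycle total, and that the
interlock share comes out a little under 0.32, i.e. the "1/3 of the
cycles" figure.  Note that the (0.466) printed by pixstats is relative to
the instruction count, not to total cycles.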
Profile listing generated Mon Jun 11 19:49:24 1990 with:
   prof -pixie dc

*  -p[rocedures] using basic-block counts;                                 *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *

18830106 cycles **INSTRUCTION CYCLES, NO CACHE MISSES, STALLS**

    cycles %cycles  cum %     cycles  bytes procedure (file)
                               /call  /line

  15732582   83.55  83.55      53695     37 div (dc.c)
   1359698    7.22  90.77       5938     36 mult (dc.c)
    342478    1.82  92.59       3142     32 getdec (dc.c)
    307083    1.63  94.22         32     20 seekc (dc.c)
    294895    1.57  95.79       3597     43 add (dc.c)
**ALL REMAINING FUNCTIONS USED LESS THAN 1% OF THE INSTRUCTION CYCLES**

*  -h[eavy] using basic-block counts;                                      *
*  sorted in descending order by the number of cycles executed in each     *
*  line; unexecuted lines are excluded                                     *

procedure (file)           line bytes     cycles      %  cum %

div (dc.c)                  665   144    3260586  17.32  17.32
div (dc.c)                  657   124    2781234  14.77  32.09
div (dc.c)                  671    92    1911378  10.15  42.24
div (dc.c)                  655    72    1648096   8.75  50.99
div (dc.c)                  664    52    1124340   5.97  56.96
div (dc.c)                  672    40     805894   4.28  61.24
div (dc.c)                  658    40     739898   3.93  65.17
div (dc.c)                  656    16     412024   2.19  67.36
div (dc.c)                  667    12     337302   1.79  69.15
div (dc.c)                  659   116     245129   1.30  70.45
mult (dc.c)                1097   100     233586   1.24  71.69
**ALL REMAINING STATEMENTS USED LESS THAN 1% OF THE INSTRUCTION CYCLES.

pixstats dc:
 27600801 (1.466) cycles (1.1s @ 25.0MHz)  **INSTR CYCLES, INCL /* STALLS
 18830106 (1.000) instructions  ** INSTRUCTIONS **
  6807625 (0.362) loads
  1994843 (0.106) stores
  8802468 (0.467) loads+stores
  8802468 (0.467) data bus use
  8770695 (0.466) multiply/divide interlock cycles (12/35 cycles)
**EXCEPTIONALLY HIGH: VERY FEW REAL PROGRAMS LOOK THIS WAY: 1/3 cycles
IS WAITING FOR MUL OR DIV TO COMPLETE!
**
    0.448 load nops per load
**ALSO VERY HIGH: more typical is .25-.30 lnops/load**

opcode distribution:
       lw    6350450  33.72%
     lnop    3051497  16.21%
       sw    1686506   8.96%
     bnop    1237018   6.57%
      bne    1162741   6.17%
     addu     893980   4.75%
    addiu     799847   4.25%
     beqz     719618   3.82%
       lb     449888   2.39%
       li     326381   1.73%
       sb     307835   1.63%
     sltu     303554   1.61%
     subu     256783   1.36%
     mflo     238028   1.26%
      div     238028   1.26%  **YEP, RIGHT UP THERE**
    bcond     139925   0.74%
     bgez     129581   0.69%
     mfhi     114251   0.61%
    multu     114251   0.61%  **AND THERE'S THE MULTIPLY**
--
-john mashey    DISCLAIMER:
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
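
A footnote on the opcode counts above: multiplying the div and multu
counts by their latencies gives a rough ceiling on how many cycles could
be lost to the multiply/divide unit, which can be compared with the
interlock figure pixstats reports.  This is a sketch under assumptions,
not how pixstats itself computes anything.  The 35-cycle divide is the
figure quoted in the post; the 12-cycle multiply is an ASSUMPTION read
off the "(12/35 cycles)" note in the pixstats output.  All macro and
function names are made up for illustration.

```c
#include <assert.h>

/* Dynamic opcode counts copied from the pixstats opcode distribution. */
#define DIV_COUNT           238028L
#define MULTU_COUNT         114251L

/* Interlock cycles copied from the pixstats summary. */
#define REPORTED_INTERLOCK 8770695L

/* Latencies in cycles: 35 for divide is stated in the post; 12 for
 * multiply is an assumption taken from the "(12/35 cycles)" note. */
#define DIV_LATENCY   35L
#define MULT_LATENCY  12L

/* Ceiling on stall cycles if every multiply and divide waited out its
 * full latency with no useful work overlapped. */
long worst_case_stall(long divs, long div_latency,
                      long mults, long mult_latency)
{
    assert(divs >= 0 && mults >= 0);
    assert(div_latency >= 0 && mult_latency >= 0);
    return divs * div_latency + mults * mult_latency;
}

/* Gap between that ceiling and the interlock cycles actually reported. */
long cycles_below_ceiling(long ceiling, long reported)
{
    assert(reported <= ceiling);
    return ceiling - reported;
}
```

With the numbers above the ceiling is 9701992 cycles against 8770695
reported, so the reported interlock count is consistent with nearly every
multiply and divide stalling for most of its latency.  The gap of 931297
cycles is presumably time in which other instructions ran before the
result was fetched with mflo or mfhi, but that reading is a guess.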