Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!pioneer!lamaster
From: lamaster@pioneer.UUCP
Newsgroups: comp.sys.m68k
Subject: Re: Recent Motorola ad seen in Byte
Message-ID: <1274@ames.UUCP>
Date: Wed, 15-Apr-87 11:54:41 EST
Article-I.D.: ames.1274
Posted: Wed Apr 15 11:54:41 1987
Date-Received: Fri, 17-Apr-87 02:12:45 EST
References: <362@sbcs.UUCP> <1466@ncr-sd.SanDiego.NCR.COM> <580@plx.UUCP> <513@omen.UUCP> <285@winchester.mips.UUCP> <518@omen.UUCP>
Sender: usenet@ames.UUCP
Reply-To: lamaster@pioneer.UUCP (Hugh LaMaster)
Distribution: comp
Organization: NASA Ames Research Center, Moffett Field, Calif.
Lines: 69
Keywords: Sieve, 68020 vs. 80386, benchmarks, caches

In article <518@omen.UUCP> caf@.UUCP (PUT YOUR NAME HERE) writes:
>In article <285@winchester.mips.UUCP> djl@mips.UUCP (Dan Levin) writes:
>:If what you care about is performance of a processor on very small
>:integer compute loops, then use sieve and its ilk.  If what you care
>:about is performance under actual application conditions, you must
>:use benchmarks that more accurately reproduce those types of environments.
>
>I thought many programs spend time looping in fairly localized loops,
>especially on machines that lack high powered string instructions that
>are useful to C.  What are "for" and "while" statements for?  Note that
>
>While it is conceivable that Motorola put the 256 byte cache in the 68020
>just to help certain benchmarks,  it is more likely that the cache
>actually improves performance rather inexpensively.

I agree.  Two points:

1)  To see whether a small cache is really going to help performance, you need
to look at the memory reference pattern of generated code carefully.  I assume
that Motorola did this.  A problem with small caches that are shared between
code and data is that data references can "take over" the cache and force
unnecessary memory references for instruction fetches, holding up instruction
issue on a pipelined machine.  A separate instruction cache is usually
indicated in my opinion, with the additional benefit of doubling the "cache
bandwidth" without complicated logic.  The Seymour Cray/Control Data/ETA/Neil
Lincoln lineage of machines has always had small instruction caches (e.g. in
the range of 32-256 instructions) and NO data caches (but sometimes hundreds
of registers).  These semi-RISC load/store machines, with pipelining that
extends to memory references, demonstrated very good performance on a wide
variety of code.  They did especially well on number crunching: the many loops
that fit in cache (the instruction "stack" on older machines - almost a
cache ...) often showed a scalar speedup of a factor of two or more.  A small
cache can do a great deal of good on a pipelined machine if the net effect is
to speed up instruction issue.
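
As a rough illustration of the interference described in point 1, here is a
minimal sketch in C of a toy direct-mapped cache.  Everything in it is an
assumption made for illustration: the geometry (64 lines of 4 bytes, chosen
only to total 256 bytes), the addresses, the 128-byte loop body, and the 16
data words streamed per iteration.  It models no particular chip.  It counts
instruction-fetch misses when data references share the cache and when they
bypass it.

```c
#include <stddef.h>

#define LINE_BYTES 4UL   /* assumed line size: one 32-bit word */
#define NUM_LINES  64UL  /* 64 lines x 4 bytes = 256 bytes total */

typedef struct {
    unsigned long tag[NUM_LINES];
    int valid[NUM_LINES];
} cache_t;

static void cache_reset(cache_t *c)
{
    for (size_t i = 0; i < NUM_LINES; i++)
        c->valid[i] = 0;
}

/* Direct-mapped lookup: returns 1 on a miss (and installs the line),
 * 0 on a hit. */
static int cache_access(cache_t *c, unsigned long addr)
{
    unsigned long line = addr / LINE_BYTES;
    size_t slot = (size_t)(line % NUM_LINES);

    if (c->valid[slot] && c->tag[slot] == line)
        return 0;
    c->valid[slot] = 1;
    c->tag[slot] = line;
    return 1;
}

/* Count instruction-fetch misses for a loop that re-executes the same
 * 128 bytes of code and streams through a large array.  If `unified`
 * is nonzero the data references go through the same cache; otherwise
 * they bypass it (instruction cache only). */
unsigned long ifetch_misses(int unified, unsigned long iterations)
{
    const unsigned long code_base = 0x1000UL;    /* hypothetical address */
    const unsigned long code_bytes = 128UL;      /* loop body, 32 lines */
    const unsigned long data_base = 0x100000UL;  /* hypothetical address */
    const unsigned long words_per_iter = 16UL;   /* data words per pass */
    cache_t cache;
    unsigned long misses = 0;

    cache_reset(&cache);
    for (unsigned long it = 0; it < iterations; it++) {
        for (unsigned long pc = code_base; pc < code_base + code_bytes;
             pc += LINE_BYTES)
            misses += (unsigned long)cache_access(&cache, pc);

        if (unified) {
            for (unsigned long k = 0; k < words_per_iter; k++)
                (void)cache_access(&cache,
                    data_base + (it * words_per_iter + k) * LINE_BYTES);
        }
    }
    return misses;
}
```

With the instruction-only arrangement the loop misses only on its first pass
(32 lines, so 32 misses however many iterations follow).  With the shared
arrangement the streaming data keeps evicting loop code, so the miss count
grows with the iteration count.  Comparing ifetch_misses(0, n) against
ifetch_misses(1, n) shows the gap.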

The effect of data caches is much less pronounced or predictable.  I believe
that a majority of engineering/scientific codes and also system code have
widely scattered memory reference patterns.  So, I am much more sceptical of
the effect of data caches on real world problems.  However, a small data cache
is sometimes a useful substitute for registers (enter RISC debate). Data
caches do have an exaggerated effect on many of the popular small benchmarks
like Dhrystone, unfortunately.  The writers of these benchmarks usually take
the easy way out because it is hard to duplicate average code in a concise
benchmark.  As a result, systems with data caches sometimes appear faster than
they are in the real world.  There are engineering/scientific benchmarks that
suffer less from this problem (e.g. the Dongarra Linpack benchmark, with the
dimensions appropriately scaled).
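
One way to see how much a given data cache depends on locality is to do the
same work with two different reference patterns.  The following is only a
sketch: the function name and the stride idea are mine, not part of any
existing benchmark.  It sums an array while stepping through it with a chosen
stride.  When the stride and the array length have no common factor, every
element is visited exactly once, so the arithmetic is identical and only the
memory reference pattern changes.

```c
#include <stddef.h>

/* Sum all n elements of a[], visiting them in steps of `stride`
 * (mod n).  If gcd(stride, n) == 1 each element is touched exactly
 * once, so the result equals the plain sequential sum. */
long sum_with_stride(const int *a, size_t n, size_t stride)
{
    long total = 0;
    size_t idx = 0;

    for (size_t i = 0; i < n; i++) {
        total += a[idx];
        idx = (idx + stride) % n;
    }
    return total;
}
```

Timing sum_with_stride(a, n, 1) against a call with a large stride coprime to
n, on an array much bigger than the cache, gives a feel for the difference
between a cache-friendly pattern and a scattered one.  The timing harness is
left out because the right clock to use is system dependent.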

There does seem to be a place out there for a better scalar/system code
benchmark.  I think a tree sort of a large random array is probably a better
overall benchmark (of the inherent CPU speed) than Dhrystone, but it does not
provide as good coverage of the types of code that are usually encountered.
Byte addressing, extraction of fields from records, string searches, etc. are
usually considered desirable in a benchmark to ferret out architectural or
compiler weaknesses, but the correct way to test performance on these types of
codes is a matter of debate :-)
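
For concreteness, a tree sort of a random array might look like the sketch
below.  This is one possible reading of the idea, not an established
benchmark: the unbalanced binary tree, the index-based node pool, the simple
linear congruential generator, and all the names are assumptions made for
illustration.

```c
#include <stddef.h>
#include <stdlib.h>

#define NONE ((size_t)-1)   /* marks a missing child */

typedef struct {
    int key;
    size_t left, right;     /* indices into the node pool */
} node_t;

/* Fill a[] with pseudo-random values from a small LCG so that runs
 * are repeatable (illustrative generator, not a quality one). */
void fill_random(int *a, size_t n, unsigned int seed)
{
    for (size_t i = 0; i < n; i++) {
        seed = seed * 1103515245u + 12345u;
        a[i] = (int)((seed >> 16) & 0x7fffu);
    }
}

/* Sort a[0..n-1] ascending by inserting every element into an
 * unbalanced binary search tree and reading it back in order.
 * Returns 0 on success, -1 if memory could not be allocated. */
int tree_sort(int *a, size_t n)
{
    if (n == 0)
        return 0;

    node_t *pool = malloc(n * sizeof *pool);
    size_t *stack = malloc(n * sizeof *stack);
    if (pool == NULL || stack == NULL) {
        free(pool);
        free(stack);
        return -1;
    }

    /* Build the tree; node i holds a[i], node 0 is the root. */
    for (size_t i = 0; i < n; i++) {
        pool[i].key = a[i];
        pool[i].left = pool[i].right = NONE;
        if (i == 0)
            continue;
        size_t cur = 0;
        for (;;) {
            size_t *child = (a[i] < pool[cur].key) ? &pool[cur].left
                                                   : &pool[cur].right;
            if (*child == NONE) {
                *child = i;
                break;
            }
            cur = *child;
        }
    }

    /* Iterative in-order traversal writes the keys back sorted. */
    size_t top = 0, out = 0, cur = 0;
    while (cur != NONE || top > 0) {
        while (cur != NONE) {
            stack[top++] = cur;
            cur = pool[cur].left;
        }
        cur = stack[--top];
        a[out++] = pool[cur].key;
        cur = pool[cur].right;
    }

    free(pool);
    free(stack);
    return 0;
}

/* Returns 1 if a[0..n-1] is in non-decreasing order, else 0. */
int is_sorted(const int *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (a[i - 1] > a[i])
            return 0;
    return 1;
}
```

Each insertion walks a chain of nodes scattered through the pool, with a
data-dependent branch at every step, so a small data cache gets little help
from locality once the array is much larger than the cache.  Note that
already-sorted input degrades this unbalanced tree to a linked list, which is
one reason the array should be random.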


  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")