Path: utzoo!attcan!uunet!lll-winken!gauss.llnl.gov!casey
From: casey@gauss.llnl.gov (Casey Leedom)
Newsgroups: comp.arch
Subject: Re: RISC vs CISC simple load benchmark; amazing ! [Not really]
Message-ID: <61780@lll-winken.LLNL.GOV>
Date: 15 Jun 90 19:20:32 GMT
References: <8019@mirsa.inria.fr> <39319@mips.mips.COM> <675@sibyl.eleceng.ua.OZ> <39397@mips.mips.COM>
Sender: usenet@lll-winken.LLNL.GOV
Reply-To: casey@gauss.llnl.gov (Casey Leedom)
Organization: Lawrence Livermore National Laboratory
Lines: 36

| From: mash@mips.COM (John Mashey)
|
| The worst case performance is not all that interesting: for two cached
| machines with different cache organization, you can usually "prove"
| different ratios of relative performance by careful selection of the
| most relevant cache-busting code.
|
| [A good example] on a direct-mapped, virtual cache machine, is
| to copy, 1 byte at a time, between two areas that collide in the cache.
|
| (i.e., if you want to artificially show off a SPARC 490 at its worst,
| you can probably prove it's slower than a 68020 with such a benchmark).
| Of course, any given machine can be done in this way.

  While I agree with you that one can always come up with cache-busting
code, I think you picked a particularly bad example, because the cache
design in question is simply brain dead.  If you design a cache, it
should have at least two buckets for each cache line index (i.e., it
should be at least two-way set associative).  Working linearly through
two different arrays is so common that a direct-mapped cache is bound to
run into the problem you mention.

  (As proof, I was called in on a problem with a Sun 3/280 that was
bought for image processing.  Part of the processing involved,
essentially, copying a 1/4 Mb array 30 times a second.  The group had
justified buying the 280 on the grounds that a 180 just wouldn't be fast
enough.  Imagine their horror when they ran their code on their brand
new 280 and found that it ran 3 times slower than on a 180!  The problem
turned out to be that the two arrays were an exact multiple of 64 Kb
apart -- the size of the 280's cache.  Eventually I was able to bring
the 280 up to the speed of a 180 by offsetting the arrays by 24 bytes
beyond that multiple of 64 Kb.  (There were actually a number of offsets
that worked well, but you get the idea.))

  A cache shouldn't break on common operations.

Casey
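
P.S.  For anyone who wants to play with this, here's a rough sketch in C
(mine, not anything the image processing group actually ran) of the
failure mode.  The 64 Kb cache size and the 24 byte fudge come straight
from the 3/280 anecdote above; on a machine with a set-associative or
physically-indexed cache you probably won't see any difference between
the two timings, but on a direct-mapped virtual cache the "colliding"
copy thrashes on every byte.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define CACHE_SIZE  (64 * 1024)     /* 3/280 cache size */
    #define ARRAY_SIZE  (256 * 1024)    /* the ~1/4 Mb image buffer */

    /* 1 byte at a time, like the cache-busting benchmark described above */
    static void copy_bytes(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* time `reps` copies of ARRAY_SIZE bytes from src to dst */
    static double time_copy(char *dst, char *src, int reps)
    {
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            copy_bytes(dst, src, ARRAY_SIZE);
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        /* One big allocation so we control the distance between the arrays. */
        char *pool = malloc(2 * ARRAY_SIZE + CACHE_SIZE);
        if (pool == NULL)
            return 1;

        char *src = pool;

        /* Bad case: dst is an exact multiple of CACHE_SIZE away from src,
           so every source line and its destination line map to the same
           slot of a direct-mapped cache and evict each other. */
        char *dst_bad = src + ARRAY_SIZE;        /* 256 Kb = 4 * 64 Kb */

        /* Fixed case: same distance plus a 24 byte offset, so the two
           arrays no longer collide line for line. */
        char *dst_ok = src + ARRAY_SIZE + 24;

        printf("colliding arrays: %.3f s\n", time_copy(dst_bad, src, 30));
        printf("offset arrays:    %.3f s\n", time_copy(dst_ok,  src, 30));

        free(pool);
        return 0;
    }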