Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!mcsun!ukc!ox-prg!prg.ox.ac.uk!as
From: as@prg.ox.ac.uk (Andrew Stevens)
Newsgroups: comp.sys.acorn
Subject: Re: My ARM2's faster than an ARM3. Waaaa.
Message-ID: <1308@culhua.prg.ox.ac.uk>
Date: 19 Feb 91 11:19:38 GMT
Sender: news@prg.ox.ac.uk
Reply-To: as@prg.ox.ac.uk (Andrew Stevens)
Organization: Oxford University Computing Laboratory, UK
Lines: 78


hughesmp@vax1.tcd.ie writes ...

>Also, how difficult would it be to clock the _entire_ system at 30MHz, ...
Very *expensive*.  At those kinds of clock rates, propagating clocks and
signals (etc.) over any kind of distance becomes soooo much harder.  Current
affordable RAMs top out at around 16MHz (give or take some).

> Alternatively, how difficult would it
> be to get a decent sized cache, say 64k or 256k or something?
Difficult enough to make it rather costly.  You would need a nice big, fat,
hacky custom chip plus the necessary static RAMs.  Caches on PCs are
affordable because standard chippery already exists for the job...

>For my case, the bits of the demo that use in-line code,
>used it as an obvious implementation of
>the problems, ie 200 pixel diameter sphere-wrapping (on the new version),
>because it just looks like the fastest implementation.

Here lies the heart of your problem...
This is the old gotcha: naive in-lining does not necessarily improve
performance on processors using caches.  If you make your
loops big enough, you blow away the cache, and lose in a major
way.  Although inlining cuts the run time to a fraction

	loop body / (loop body + loop overhead)

of what it was, it also multiplies it by a factor of roughly

	cache degradation * cache clock / ram clock

plus, probably, a significant bit extra, since caches typically
refill multiple words at a time, may need to sync with external clocks,
etc. etc.

Thus, since on modern CPUs loop overhead is usually small and
(cache clock / ram clock) is large, the technique is rarely worth
pursuing on machines with small caches.
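
To put rough numbers on that trade-off, here is a back-of-envelope sketch
in Python.  It is only my reading of the two factors above: I take "cache
degradation" to mean the fraction of fetches that now miss, I charge a hit
one cache clock and a miss one RAM clock, and I ignore line refills and
clock syncing.  The function names and the figures in the example are made
up for illustration, not measurements of any real machine.

```python
def inlining_gain(body_cycles, overhead_cycles):
    """Run-time ratio (inlined / looped) if every fetch still hits the cache."""
    return body_cycles / (body_cycles + overhead_cycles)


def cache_penalty(miss_fraction, cache_clock_mhz, ram_clock_mhz):
    """Average fetch-time ratio once some fetches have to go to RAM.

    Assumption: a hit costs one cache clock and a miss costs one RAM clock;
    line refills and clock resynchronisation are ignored.
    """
    slowdown_per_miss = cache_clock_mhz / ram_clock_mhz
    return (1.0 - miss_fraction) + miss_fraction * slowdown_per_miss


def net_factor(body_cycles, overhead_cycles, miss_fraction,
               cache_clock_mhz, ram_clock_mhz):
    """Below 1.0 means inlining wins; above 1.0 means it loses."""
    return (inlining_gain(body_cycles, overhead_cycles)
            * cache_penalty(miss_fraction, cache_clock_mhz, ram_clock_mhz))


if __name__ == "__main__":
    # Made-up figures: 20-cycle body, 4 cycles of loop overhead,
    # cache clocked at 24MHz, RAM at 8MHz.
    for miss_fraction in (0.0, 0.25, 0.5, 1.0):
        print(miss_fraction, round(net_factor(20, 4, miss_fraction, 24, 8), 2))
```

With those made-up figures the inlined code stops paying for itself once
about one fetch in ten misses the cache, which is not much headroom.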

>Even putting this into
>a loop, the amount of data it operates on, in the order of 250k per frame sync
>or something, the code would be overwritten as the data is loaded in,
>and so it would need to be cached again, causing similar problems.

I am surprised by this - did you try the loopy version?  Even
assuming the ARM-3 has a very naive direct mapped cache, you would still
expect the cache to retain a good proportion of the loop code, unless
the loop is big enough to be a good part of the size of the cache.  Data
access should not blow it away that badly; the innermost loop, surely,
cannot access more than a few hundred bytes per iteration?  If data
cache flushing is a problem, you might find that being a bit cunning about
how you lay out / traverse the data in memory helps a lot.  Furthermore, I
*think* the ARM-3's cache is associative, so several entries whose
addresses differ only in their high bits can sit in the cache at once.
Data really shouldn't mash it *that* badly.  Even TeX and Prolog cache
o.k.ish and they're pretty pathological.
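
If you want to convince yourself, a toy simulation along these lines
might help.  It models the worst case I described: a naive direct mapped
cache shared by code and data, a small loop, and data streaming through
memory.  All the sizes (4K cache, 16-byte lines, a 256-byte loop, 256
bytes of data per iteration) and the addresses are assumptions picked for
illustration, not the ARM-3's real parameters.

```python
def instruction_hit_rate(cache_bytes=4096, line_bytes=16, code_bytes=256,
                         data_bytes_per_iter=256, iterations=1000):
    """Fraction of instruction fetches that hit in a toy direct mapped cache."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines            # one resident block per cache line

    def access(addr):
        """Return True on a hit; on a miss, refill the line and return False."""
        block = addr // line_bytes
        index = block % n_lines        # direct mapped: low bits pick the line
        if tags[index] == block:
            return True
        tags[index] = block
        return False

    code_base = 0x000000               # hypothetical address of the loop
    data_ptr = 0x100000                # hypothetical start of the data
    hits = fetches = 0
    for _ in range(iterations):
        # Fetch every instruction word of the loop body.
        for addr in range(code_base, code_base + code_bytes, 4):
            fetches += 1
            hits += access(addr)
        # Then stream through fresh data, one word at a time.
        for _ in range(data_bytes_per_iter // 4):
            access(data_ptr)
            data_ptr += 4
    return hits / fetches


if __name__ == "__main__":
    print("instruction hit rate:", instruction_hit_rate())
```

In this model the data only tramples the loop code when the stream wraps
round onto the same cache lines, so the instruction hit rate stays well
above 90%.  A real program will not be this tidy, but it gives the flavour.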

> ... And if I can't, why does the processor put the data in a
>pseudo-random location en-cache?
>Surely sequential locations would be more logical, because it would
>take longer for code to be overwritten by accessed data, is it a) VLSI have 
>their reasons, or b) it isn't more logical?

I take it what you mean is: why doesn't the cache flush on a purely
last-in first-out basis?  The short answer is that despite being a
sort of o.k. strategy (most of the time) it would be hopelessly slow and
area guzzling to implement in silicon.  Better overall performance
is achieved by using a ``stupider'' cache that can be made usefully large
and fast.  Furthermore, if you don't try to randomize cache flushing a bit
it is very easy to run into disastrous pathological cases.  E.g.
a loop that is the size of the cache would get no benefit at all under
a strictly last-in first-out strategy.
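
The pathological case is easy to reproduce in a toy model.  The sketch
below assumes a fully associative cache of 64 lines and a loop whose
working set is 65 lines (the loop plus a little data, say), and compares
two ways of picking the line to overwrite.  I model the "sequential
locations" idea from the question as plain round-robin replacement, which
is my reading of it; the sizes and names are made up for illustration.

```python
import random


def hit_rate(policy, cache_lines=64, loop_lines=65, passes=200, seed=1):
    """Hit rate when cycling over loop_lines blocks in a cache_lines cache."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    cache = [None] * cache_lines       # which block each cache line holds
    next_victim = 0                    # round-robin pointer
    hits = accesses = 0
    for _ in range(passes):
        for block in range(loop_lines):
            accesses += 1
            if block in cache:
                hits += 1
                continue
            # Miss: choose the line to overwrite.
            if policy == "sequential":
                victim = next_victim
                next_victim = (next_victim + 1) % cache_lines
            else:                      # "random"
                victim = rng.randrange(cache_lines)
            cache[victim] = block
    return hits / accesses


if __name__ == "__main__":
    for policy in ("sequential", "random"):
        print(policy, hit_rate(policy))
```

Under the sequential policy each block is thrown out just before the loop
comes back round to it, so nothing ever hits.  Random replacement throws
out the wrong thing now and then, but most of the loop survives each pass.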

        Andrew Stevens                  
      Programming Research Group        JANET: Andrew.Stevens@uk.ac.oxford.prg
 Oxford University Computing Laboratory INTERNET: Andrew.Stevens@prg.ox.ac.uk
     11 Keble Road, Oxford, England     UUCP:  ...!uunet!mcvax!ukc!ox-prg!as
     OX1 3QD