Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!mcsun!ukc!ox-prg!prg.ox.ac.uk!as
From: as@prg.ox.ac.uk (Andrew Stevens)
Newsgroups: comp.sys.acorn
Subject: Re: My ARM2's faster than an ARM3. Waaaa.
Message-ID: <1308@culhua.prg.ox.ac.uk>
Date: 19 Feb 91 11:19:38 GMT
Sender: news@prg.ox.ac.uk
Reply-To: as@prg.ox.ac.uk (Andrew Stevens)
Organization: Oxford University Computing Laboratory, UK
Lines: 78

hughesmp@vax1.tcd.ie writes ...
>Also, how difficult would it be to clock the _entire_ system at 30MHz, ...

Very *expensive*. At those kinds of clock rates, propagating clocks and
signals (etc.) over any kind of distance becomes soooo much harder. Current
affordable RAMs top out at around 16MHz (give or take some).

> Alternatively, how difficult would it
> be to get a decent sized cache, say 64k or 256k or something?

Difficult enough to make it rather costly: a nice big, fat, hacky custom
chip plus the necessary static RAMs. Caches on PCs are affordable because
standard chippery already exists for the job...

>For my case, the bits of the demo that use in-line code,
>used it as an obvious implementation of
>the problems, ie 200 pixel diameter sphere-wrapping (on the new version),
>because it just looks like the fastest implementation.

Here lies the heart of your problem... This is the old gotcha that naive
in-lining does not necessarily improve performance on processors using
caches. If you make your loops big enough, you blow away the cache, and
lose in a major way. Although inlining cuts the run time by a factor

	loop body / (loop body + loop overhead)

it also stretches it by a factor

	cache degradation * cache clock / ram clock

plus, probably, a significant bit extra, since caches typically refill
multiple words at a time, may need to sync with external clocks, etc. etc.

Thus, since on modern CPUs loop overhead is usually small and
(cache clock / ram clock) is large, the technique is rarely worth pursuing
on machines with small caches.
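
A rough sketch of the trade-off above, in Python. The function name, the
cycle counts and the way "cache degradation" is modelled (as the fraction of
fetches that now miss and run at RAM speed) are all assumptions made for
illustration; the 30MHz and 16MHz figures are simply the ones mentioned
earlier in this post, not measurements of any real machine.

```python
def inlined_over_looped(body_cycles, overhead_cycles, miss_fraction,
                        cache_mhz, ram_mhz):
    """Estimated run time of the in-lined code divided by the looped code.

    Below 1.0 the in-lined version wins; above 1.0 it loses.
    """
    # Gain from removing the loop overhead (the first factor in the text).
    gain = body_cycles / (body_cycles + overhead_cycles)
    # Assumed penalty: a fetch that misses the cache runs at RAM speed,
    # so it takes cache_mhz / ram_mhz times as long as a cached fetch.
    penalty = 1.0 + miss_fraction * (cache_mhz / ram_mhz - 1.0)
    return gain * penalty


# Hypothetical loop: 10 cycles of real work, 2 cycles of loop overhead.
still_cached = inlined_over_looped(10, 2, 0.0, 30, 16)
cache_blown = inlined_over_looped(10, 2, 1.0, 30, 16)
print(f"in-lined code still fits in the cache: {still_cached:.2f}")
print(f"in-lined code misses on every fetch:   {cache_blown:.2f}")
```

With these made-up numbers the in-lined code runs in about 0.83 of the
looped time while it still fits in the cache, and about 1.56 times the
looped time once every fetch misses. The model leaves out the "bit extra"
for multi-word refills and clock syncing, so the real penalty would likely
be worse.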

>Even putting this into
>a loop, the amount of data it operates on, in the order of 250k per frame sync
>or something, the code would be overwritten as the data is loaded in,
>and so it would need to be cached again, causing similar problems.

I am surprised by this - did you try the loopy version? Even assuming the
ARM-3 has a very naive direct-mapped cache, you would still expect the cache
to retain a good proportion of the loop code, unless the loop is big enough
to be a good part of the size of the cache. Data access should not blow it
away that badly: the innermost loop, surely, cannot access more than a few
hundred bytes per iteration? If data cache flushing is a problem, you might
find that being a bit cunning about how you lay out / traverse the data in
memory helps a lot.

Furthermore, I *think* the ARM-3's cache is associative: several entries
whose addresses differ only in the high bits can be held at the same time.
Data really shouldn't mash it *that* badly. Even TeX and Prolog cache
o.k.ish, and they're pretty pathological.

> ... And if I can't, why does the processor put the data in a
>pseudo-random location en-cache?
>Surely sequential locations would be more logical, because it would
>take longer for code to be overwritten by accessed data, is it a) VLSI have
>their reasons, or b) it isn't more logical?

I take it what you mean is: why doesn't the cache flush on a strict
oldest-entry-out basis (first-in first-out, or least-recently-used)? The
short answer is that, despite being a sort of o.k. strategy (most of the
time), keeping track of exact ages would be hopelessly slow and area
guzzling to implement in silicon. Better overall performance is achieved by
using a ``stupider'' cache that can be made usefully large and fast.

Furthermore, if you don't try to randomize cache flushing a bit, it is very
easy to run into disastrous pathological cases. E.g. a loop that is only
slightly bigger than the cache would get no benefit at all under a strict
oldest-entry-out strategy: each line is thrown out just before the loop
comes round to it again.
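
The pathological case is easy to reproduce with a toy simulation. This is
only a sketch: it assumes a fully associative cache of 64 lines and a loop
touching 65 lines in order (both sizes are made up for illustration and are
not the ARM-3's real cache geometry), and compares strict first-in first-out
replacement, which is what filling sequential locations amounts to, with
random replacement.

```python
import random


def hit_rate(cache_lines, loop_lines, passes, policy, seed=0):
    """Fraction of accesses that hit when a loop of `loop_lines` lines is
    run `passes` times through a fully associative cache of `cache_lines`.
    """
    rng = random.Random(seed)   # fixed seed so runs are repeatable
    slots = []                  # line addresses currently in the cache
    hits = accesses = 0
    for _ in range(passes):
        for addr in range(loop_lines):
            accesses += 1
            if addr in slots:
                hits += 1
                continue
            if len(slots) < cache_lines:
                slots.append(addr)              # cache not yet full
            elif policy == "fifo":
                slots.pop(0)                    # evict the oldest line
                slots.append(addr)
            else:
                # Random replacement: overwrite an arbitrary slot.
                slots[rng.randrange(cache_lines)] = addr
    return hits / accesses


fifo = hit_rate(64, 65, 50, "fifo")
rand = hit_rate(64, 65, 50, "random")
print(f"FIFO hit rate:   {fifo:.2f}")
print(f"random hit rate: {rand:.2f}")
```

Under first-in first-out the line about to be fetched is always the one
that was just evicted, so the hit rate is zero however many times the loop
runs. Random replacement throws out the wrong line now and then, but keeps
most of the loop resident.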

Andrew Stevens                          Programming Research Group
JANET: Andrew.Stevens@uk.ac.oxford.prg  Oxford University Computing Laboratory
INTERNET: Andrew.Stevens@prg.ox.ac.uk   11 Keble Road, Oxford, England
UUCP: ...!uunet!mcvax!ukc!ox-prg!as     OX1 3QD