Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!pcg
From: pcg@aber-cs.UUCP (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Black magic, IBM RIOS.
Message-ID: <1726@aber-cs.UUCP>
Date: 12 Apr 90 13:14:33 GMT
Reply-To: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Organization: Dept of CS, UCW Aberystwyth
	(Disclaimer: my statements are purely personal)
Lines: 54

In article <6438@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:

    In article <1990Apr4.140713.8996@specialix.co.uk> jpp@specialix.co.uk
    (John Pettitt) writes:
    >pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
    >> register: 16M*7 instructions in 1.3 s
    >> volatile static: 16M*10 instructions in 10.4 s
    >Same test on a mips 3240 (25Mhz R3000)
    > register: 1.0 s
    > volatile static: 8.9 s
    >and on an ALR 486 with 128K cache
    > register: 4.8 s
    > volatile static: 9.5 s
    >Hmmm interesting.

    Why are these numbers interesting or magic?

The RIOS numbers are interesting because we are looking at a 20 MHz 530 doing around 70 MIPS peak. It drops to about 15 MIPS if the same program is rerun with its variables in memory. This means that we not only pay the cost of cache transactions, we also lose superscalarity.

The MIPS tests show that the meaning of this test has not been understood; the crucial inner loop has been unrolled, and thus the test has become one of language speed, not one of CPU/memory architecture/speed. Since the program is meaningless, looking at just how fast it runs, without looking at the generated code, is pointless.

The 486 figures show that the 486 has remarkable performance. The typical 20 MHz RISC chip has a register time of just over 3 seconds, usually with 5-6 instructions in the inner loop. The 486 is not as quick, but its penalty for keeping the variables in memory is much smaller. This means that, as expected, RISCs, being load-store architectures, get bogged down by memory accesses.
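
The test program itself is not quoted in this thread. As a rough illustration only, a minimal sketch of the kind of loop pair being compared might look like the C below. The loop body and the names loop_register, loop_volatile and seconds are assumptions made for this sketch, not the original test.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical reconstruction: the original benchmark source is not shown
 * in the thread, so the loop body here is only a stand-in. */

/* volatile static: every read and write must go to memory. */
static volatile long mem_i, mem_sum;

/* Variant 1: loop counter and accumulator are candidates for registers. */
long loop_register(long n)
{
    register long i, sum = 0;

    for (i = 0; i < n; i++)
        sum += i & 3;
    return sum;
}

/* Variant 2: same work, but all variables are forced into memory. */
long loop_volatile(long n)
{
    mem_sum = 0;
    for (mem_i = 0; mem_i < n; mem_i++)
        mem_sum += mem_i & 3;
    return mem_sum;
}

/* CPU time, in seconds, taken by one variant; the loop result is stored
 * through 'result' so the call cannot be discarded as dead code. */
double seconds(long (*loop)(long), long n, long *result)
{
    clock_t start;

    assert(loop != 0 && result != 0);
    start = clock();
    *result = loop(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling seconds(loop_register, 1L << 24, &r) and seconds(loop_volatile, 1L << 24, &r) and dividing the second time by the first gives a memory/register factor of the kind quoted above (1L << 24 is an assumed reading of "16M"). An optimizer is free to unroll or even eliminate the register loop, so the figure means little unless the generated code is inspected as well.
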

    On very fast machines, keeping important values in registers is a big
    savings. You execute less instructions, and you spend much less time
    waiting on memory.

This is true but uninteresting. It is the *magnitude* of the effect that is most interesting. When the same loop (which mimics many hot spots in real world programs) on a superscalar machine exhibits a factor of 8 in running time depending on whether the variables are in registers or in memory (for CISCs the factor tends to be 2, for RISCs it tends to be 5), you start believing religiously in the Von Neumann bottleneck, at least until we coax the chip guys into delivering faster memory, not just larger.

When you look at transistor counts as well, you wonder even more whether a low level of NUMA external parallelism (2-6 CPUs) is dearer or cheaper, faster or slower, than a low level of internal parallelism (superscalar).

However, for a better discussion of these issues, please wait for my forthcoming table of figures for a couple dozen CPU types.

-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk