Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!pcg
From: pcg@aber-cs.UUCP (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Black magic, IBM RIOS.
Message-ID: <1726@aber-cs.UUCP>
Date: 12 Apr 90 13:14:33 GMT
Reply-To: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Organization: Dept of CS, UCW Aberystwyth
	(Disclaimer: my statements are purely personal)
Lines: 54

In article <6438@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
  In article <1990Apr4.140713.8996@specialix.co.uk> jpp@specialix.co.uk (John Pettitt) writes:
  >pcg@odin.cs.aber.ac.uk (Piercarlo Grandi) writes:
  >>	register:		16M*7 instructions in 1.3 s
  >>	volatile static:	16M*10 instructions in 10.4 s
  >Same test on a mips 3240 (25Mhz R3000)
  >	register:		1.0 s
  >	volatile static:	8.9 s
  >and on an ALR 486 with 128K cache
  >	register:		4.8 s
  >	volatile static:	9.5 s
  >Hmmm interesting.
  
  Why are these numbers interesting or magic?

The RIOS numbers are interesting because we are looking at a 20MHz 530
doing around 70 MIPS peak. It drops to about 15 MIPS if the same program
is rerun with variables in memory. This means that not only do we pay the
cost of cache transactions, we also lose superscalarity.
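
For readers who have not seen the test, here is a minimal sketch of the kind
of loop pair being compared. The benchmark source is not reproduced in this
article, so the function names, the loop body and the variable names below
are assumptions for illustration only; the per-iteration instruction counts
quoted above will not match this body.

```c
/* Hypothetical sketch, not the original benchmark. */

/* Variant 1: loop state is eligible to live in registers. */
unsigned long loop_register(unsigned long n)
{
    register unsigned long i, sum = 0;

    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Variant 2: 'volatile static' forces every read and write of the
 * loop state to go to memory on each iteration. */
unsigned long loop_memory(unsigned long n)
{
    static volatile unsigned long i, sum;

    sum = 0;
    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}
```

Both functions return the same value; only where the loop state lives
differs. As always, check the generated code before trusting the timings,
since a compiler may unroll or otherwise transform the register variant.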

The MIPS tests show that the meaning of this test has not been understood;
the crucial inner loop has been unrolled, and thus the test has become one
of language speed, not one of CPU/memory architecture/speed. Since the
program is meaningless, looking at just how fast it runs, without looking at
the generated code, is pointless.

The 486 figures show that the 486 has remarkable performance. The typical
20MHz RISC chip has a register time of just over 3 seconds, with usually 5-6
instructions in the inner loop. The 486 is not as quick, but the difference
for the variables in memory case is much smaller. This means that, as
expected, RISCs, being load-store architectures, get bogged down by memory
accesses.
  
  On very fast machines, keeping important values in registers is a big
  savings.  You execute less instructions, and you spend much less time
  waiting on memory.

This is true but uninteresting. It is the *magnitude* of the effect that
is most interesting. When the same loop (which mimics many hot spots in
real world programs) on a superscalar exhibits a factor of 8 in running time
depending on whether the variables are in registers or memory (for CISCs the
factor tends to be 2, for RISCs it tends to be 5), you start believing
religiously in the Von Neumann bottleneck, at least until we coax the chip
guys to deliver faster memory, not just larger.
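
The factors can be checked against the timings quoted at the top of this
article. The sketch below just divides the memory time by the register time
for each machine; the struct and function names are made up for illustration,
and the figures are copied from the quoted postings, not remeasured.

```c
#include <assert.h>

/* Hypothetical helper type: one row per machine, times in seconds. */
struct timing {
    const char *machine;
    double register_s;   /* loop variables in registers */
    double memory_s;     /* loop variables 'volatile static' */
};

/* Figures as quoted in this thread. The MIPS row is from a run whose
 * inner loop was unrolled, so it is not strictly comparable. */
static const struct timing quoted[] = {
    { "IBM RIOS 530 (20 MHz)",    1.3, 10.4 },
    { "MIPS 3240 (25 MHz R3000)", 1.0,  8.9 },
    { "ALR 486 (128K cache)",     4.8,  9.5 },
};

/* Slowdown factor when the variables move from registers to memory. */
double slowdown(const struct timing *t)
{
    assert(t->register_s > 0.0);
    return t->memory_s / t->register_s;
}
```

Calling slowdown() on the three rows gives about 8 for the RIOS, about 8.9
for the MIPS box and just under 2 for the 486.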

When you look at transistor counts as well, you wonder even more whether a
low level of NUMA external parallelism (2-6 CPUs) is dearer or cheaper, and
faster or slower, than a low level of internal parallelism (superscalar).

However, for a better discussion of these issues, please wait for my
forthcoming table of figures for a couple dozen CPU types.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk