Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uunet!mcsun!ukc!dcl-cs!aber-cs!pcg
From: pcg@aber-cs.UUCP (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Black magic, IBM RIOS.
Summary: beware of loop unrolling...
  
  I tested a 20MHz S/6000 with a backlevel compiler and managed to [ only ]
  produce 2.1 seconds for the register loop.  I don't know if it was 7
  instructions, but assuming it was yields
  
  	4096 x 4096 x 7 / 2.1 = 55,9
Message-ID: <1719@aber-cs.UUCP>
Date: 11 Apr 90 21:13:59 GMT
Reply-To: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Organization: Dept of CS, UCW Aberystwyth
	(Disclaimer: my statements are purely personal)
Lines: 34

The simple test I have produced is meaningless without actual instruction
counting, and analysis of the inner loop. It is a CPU/memory level test, and
not a system level test; the goal is not to look at the cleverest optimization
to shorten the time, but *how* the time is spent.

J. Pettitt wrote me that recent MIPS compilers unroll the inner loop; in
theory it can be replaced by 4096 additions of 277 executed once, or even by
4096/N additions of N*277 (where N*277 < MAX_INT).
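
As a rough illustration of that equivalence, here is a minimal C sketch. The
function names are hypothetical and the loop body is only a stand-in for the
benchmark's inner loop, which is not reproduced in this post; it assumes N
divides 4096 and that N*277 fits in an int.

```c
#include <limits.h>

/* Hypothetical stand-in for the inner loop: 4096 additions of 277. */
long inner_simple(void)
{
    long sum = 0;
    int i;

    for (i = 0; i < 4096; i++)
        sum += 277;
    return sum;
}

/*
 * The same result from 4096/n additions of n*277.
 * Assumes n divides 4096 and n*277 <= INT_MAX; returns -1 otherwise.
 */
long inner_folded(int n)
{
    long sum = 0;
    int step, i;

    if (n <= 0 || 4096 % n != 0 || n > INT_MAX / 277)
        return -1;
    step = n * 277;
    for (i = 0; i < 4096 / n; i++)
        sum += step;
    return sum;
}
```

With n = 4096 the folded version degenerates into a single addition, which
is why the raw time says little until the generated code has been examined.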

In any case the MIPS compilers I have seen on DEC machines get a
5-instruction inner loop, which in the best case (a 5840) is executed in
around 3 seconds. The IBM machines have 7 instructions in the inner loop
because the compiler does *very* hairy things with scheduling.
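
The arithmetic in the estimate quoted at the top (iterations times
instructions per iteration, divided by elapsed time) can be applied to these
figures too. A small sketch, with a hypothetical helper name, assuming the
whole run time is spent in the inner loop:

```c
/*
 * Estimated instructions per second for a doubly nested loop, assuming
 * every inner iteration costs instrs_per_iter instructions and that all
 * of the measured time is spent in the inner loop.
 */
double instr_rate(long outer, long inner, int instrs_per_iter, double seconds)
{
    return (double)outer * (double)inner * (double)instrs_per_iter / seconds;
}
```

For the 5-instruction loop at around 3 seconds, instr_rate(4096, 4096, 5, 3.0)
comes to roughly 28 million instructions per second; the quoted S/6000
estimate corresponds to instr_rate(4096, 4096, 7, 2.1).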

Let me repeat: my simple benchmark is not for *system* performance analysis;
by themselves the numbers are almost meaningless; analysis of generated code
is vital.  It is only a tool with which to study the CPU and memory subsystem
architectures, just like the other, very interesting, cache busting
benchmarks recently discussed.

To me, the most interesting information in my benchmark results is the range
of variation between the times for the various storage classes. When, as
in the IBM RIOS example, declaring the loop variables differently changes
the time by a factor of nearly 8, that's interesting. It also means that
actual *system* performance levels will be incredibly dependent on memory or
register access patterns even at a very microscopic level.  The RIOS cache
busting benchmarks recently posted also reveal an incredible range of
variation, at a less microscopic level.
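
For readers who want to try this kind of comparison, here is a minimal
sketch in C. It is not the benchmark itself: the function names, the loop
body and the choice of two storage classes are assumptions made for
illustration, and a modern optimizer may well fold either loop away, so the
generated code still has to be inspected.

```c
#include <time.h>

/* Loop variables with static storage: nominally kept in memory. */
static int si, sj;
static unsigned long ssum;

unsigned long loop_static(void)
{
    ssum = 0;
    for (si = 0; si < 4096; si++)
        for (sj = 0; sj < 4096; sj++)
            ssum += 277;        /* unsigned arithmetic wraps, which is fine */
    return ssum;
}

/* The same loop with register variables. */
unsigned long loop_register(void)
{
    register int i, j;
    register unsigned long sum = 0;

    for (i = 0; i < 4096; i++)
        for (j = 0; j < 4096; j++)
            sum += 277;
    return sum;
}

/* CPU seconds used by one call of fn, as reported by clock(). */
double time_loop(unsigned long (*fn)(void))
{
    volatile unsigned long sink;    /* keeps the call from being dropped */
    clock_t start = clock();

    sink = fn();
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing time_loop(loop_static) with time_loop(loop_register) gives the
ratio between the two storage classes; the ratio, not either absolute time,
is the figure of interest here.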

This of course has profound consequences. 
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk