Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uunet!mcsun!ukc!dcl-cs!aber-cs!pcg
From: pcg@aber-cs.UUCP (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Black magic, IBM RIOS.
Summary: beware of loop unrolling...
    I tested a 20MHz S/6000 with a backlevel compiler and managed to
    [ only ] produce 2.1 seconds for the register loop. I don't know if
    it was 7 instructions, but assuming it was yields
    4096 x 4096 x 7 / 2.1 = 55,9
Message-ID: <1719@aber-cs.UUCP>
Date: 11 Apr 90 21:13:59 GMT
Reply-To: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Organization: Dept of CS, UCW Aberystwyth
    (Disclaimer: my statements are purely personal)
Lines: 34

The simple test I have produced is meaningless without actual
instruction counting and analysis of the inner loop. It is a CPU/memory
level test, not a system level test; the goal is not to find the
cleverest optimization that shortens the time, but to see *how* the time
is spent.

J. Pettitt wrote me that recent MIPS compilers unroll the inner loop; in
theory it can be replaced by 4096 additions of 277 executed once, or
even by 1/N as many additions, each of N*277 (where N*277 < MAX_INT). In
any case the MIPS compilers I have seen on DEC machines get a 5
instruction inner loop, which in the best case (a 5840) is executed in
around 3 seconds. The IBM machines have 7 instructions in the inner loop
because the compiler does *very* hairy things with scheduling.

Let me repeat: my simple benchmark is not for *system* performance
analysis; by themselves the numbers are almost meaningless; analysis of
the generated code is vital. It is only a tool with which to study the
CPU and memory subsystem architectures, just like the other, very
interesting, cache busting benchmarks recently discussed.

To me, the most interesting information in my benchmark results is the
range of variation between the times for the various storage classes.
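
For illustration only, here is a minimal sketch of the kind of test being discussed. The actual benchmark source is not reproduced in this post, so everything below is an assumption: the function names, the 64-bit accumulator, the choice of three variants (register, auto, static), and the idea that the inner loop simply adds 277 on each of 4096 x 4096 iterations. Note also that a modern optimizing compiler may ignore `register` or collapse the loops entirely, which is exactly why the generated code has to be inspected.

```c
#include <assert.h>
#include <time.h>

#define LOOP_N 4096   /* loop bound taken from the post */
#define STEP   277    /* increment taken from the post */
#define EXPECTED_SUM ((unsigned long long)LOOP_N * LOOP_N * STEP)

/* Variant 1 (assumed): counters and accumulator declared `register`. */
unsigned long long run_register(void)
{
    register unsigned long long sum = 0;
    register int i, j;
    for (i = 0; i < LOOP_N; i++)
        for (j = 0; j < LOOP_N; j++)
            sum += STEP;
    return sum;
}

/* Variant 2 (assumed): plain automatic (stack) variables. */
unsigned long long run_auto(void)
{
    unsigned long long sum = 0;
    int i, j;
    for (i = 0; i < LOOP_N; i++)
        for (j = 0; j < LOOP_N; j++)
            sum += STEP;
    return sum;
}

/* Variant 3 (assumed): file-scope static variables, living in memory. */
static unsigned long long s_sum;
static int s_i, s_j;

unsigned long long run_static(void)
{
    s_sum = 0;
    for (s_i = 0; s_i < LOOP_N; s_i++)
        for (s_j = 0; s_j < LOOP_N; s_j++)
            s_sum += STEP;
    return s_sum;
}

/* CPU seconds used by one variant; the result is checked so the
 * call cannot be silently discarded. */
double time_variant(unsigned long long (*fn)(void))
{
    clock_t start = clock();
    unsigned long long sum = fn();
    clock_t end = clock();
    assert(sum == EXPECTED_SUM);
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Instructions per second, given the inner-loop instruction count
 * (read from the generated assembly) and the measured time. */
double instr_per_second(double instr_per_iter, double seconds)
{
    return (double)LOOP_N * LOOP_N * instr_per_iter / seconds;
}
```

Calling `time_variant(run_register)`, `time_variant(run_auto)` and `time_variant(run_static)` and comparing the three times gives the spread between storage classes. With the figures quoted in the Summary, `instr_per_second(7.0, 2.1)` works out to roughly 55.9 million instructions per second, but only if the inner loop really is 7 instructions long.
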
When, as in the IBM RIOS example, declaring the loop variables
differently results in a nearly eightfold variation in time, that's
interesting. It also means that actual *system* performance levels will
be incredibly dependent on memory or register access patterns, even at a
very microscopic level. The RIOS cache busting benchmarks recently
posted also reveal an incredible range of variation, at a less
microscopic level.

This of course has profound consequences.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk