Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!hpcc05!hpyhde4!hpycla!hpcuhc!hpcupt3!daryl
From: daryl@hpcupt3.cup.hp.com (Daryl Odnert)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Message-ID: <45780003@hpcupt3.cup.hp.com>
Date: 6 Apr 91 20:38:43 GMT
References: <7834@idunno.Princeton.EDU>
Organization: Hewlett Packard, Cupertino
Lines: 64

Here are the key things to consider in formulating an answer to
Steve's question:

   o  One instruction is issued to either the CPU or the floating-point
      coprocessor in each cycle.

   o  The central processor and coprocessor are both running at the same
      speed (either 50MHz or 66MHz, depending on the model.)

   o  The CPU can execute instructions in parallel with multi-cycle
      floating-point operations.

   o  Floating-point add and multiply operations have three-cycle
      latencies, regardless of precision (double or single).

   o  There are independent ALU and MPY functional units within the
      floating-point coprocessor which can operate in parallel.

   o  Because they are pipelined, the ALU and MPY units can
      start a flop every other cycle as long as no data dependencies
      are present.

   o  Assuming no cache misses, loads take 2 cycles to complete.
      No interlocks will occur as long as the instruction executed
      immediately after the load does not reference the load target
      register.

   o  The pipeline penalty for stores is zero, one, or two cycles,
      depending on the distance between the store instruction and
      the next memory reference.  In other words, a store followed
      immediately by another memory reference will suffer a two-cycle
      stall.

   o  The FMPYADD instruction allows *independent* multiplication
      and addition operations to be dispatched in a single cycle.
      This is NOT a multiply-and-accumulate operation.

If there are no memory references and you alternate multiply and add
operations, the coprocessor has a peak performance rate of 66 megaflops
(one flop per cycle on the 66MHz model).

The DAXPY loop consists of 5 operations per vector element: 2 double-precision
loads, 1 multiply, 1 add, and 1 double-precision store.  If the loop can be
scheduled such that there are no interlocks, a non-superscalar PA-RISC machine
would hit a peak rate of 2 flops every 5 cycles in the inner loop.
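
For readers who want the operation count spelled out, here is a minimal
sketch of the DAXPY inner loop. It is written in Python purely for
illustration (it is not the FORTRAN source the compiler sees), and the
function name and the flop counter are my own additions. Each comment
marks one of the five operations counted above.

```python
def daxpy(a, x, y):
    """Compute y[i] = a * x[i] + y[i] in place and return the flop count."""
    flops = 0
    for i in range(len(x)):
        xi = x[i]      # operation 1: double-precision load of x[i]
        yi = y[i]      # operation 2: double-precision load of y[i]
        t = a * xi     # operation 3: multiply (1 flop)
        r = t + yi     # operation 4: add (1 flop)
        y[i] = r       # operation 5: double-precision store of y[i]
        flops += 2     # two of the five operations are flops
    return flops


x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
n_flops = daxpy(2.0, x, y)
print(y, n_flops)
```

With one instruction issued per cycle and no interlocks, those five
operations are where the "2 flops every 5 cycles" figure comes from.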

Thus, peak performance for the DAXPY loop on the 66MHz Snakes box is:

66 million instructions per second * (2 flops / 5 instructions) = 26.4 MFLOPS
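
The same arithmetic as a short Python sketch, in case anyone wants to
plug in other numbers. The function name is hypothetical, and the 50MHz
result is only my extrapolation of the formula above to the slower
clock; it is not a figure quoted in this post.

```python
def peak_mflops(clock_mhz, flops_per_element, instrs_per_element):
    """Peak MFLOPS assuming one instruction issued per cycle, no interlocks."""
    return clock_mhz * flops_per_element / instrs_per_element


# DAXPY: 2 flops out of 5 instructions per vector element.
peak_66 = peak_mflops(66, 2, 5)
peak_50 = peak_mflops(50, 2, 5)   # extrapolation, not stated in the post
print(peak_66, peak_50)
```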

The FORTRAN compiler available at the first release of the HP 9000/700 would
automatically unroll the inner loop of DAXPY four times.  The optimizer is
able to use the FMPYADD instructions and schedule the loop in such a way
that each unrolled iteration executes in 22 cycles and performs
8 flops.  This results in a performance rating of 66 * (8 / 22) = 24 MFLOPS
in the inner loop.  Thus at the present time, the compilers are achieving
about 90% of the peak performance potential on this particular loop (assuming
all data fits in the cache).  I expect that future releases of the compiler
will achieve 100% of the potential.
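
A small Python sketch of the comparison, using only the numbers given
above (66MHz, 8 flops in 22 cycles, and the 2-flops-per-5-cycles peak).
The helper name is hypothetical. The 20-cycle target is simply what the
peak figure implies for 8 flops; it is not a measured result.

```python
def loop_mflops(clock_mhz, flops, cycles):
    """MFLOPS for a loop body that performs `flops` flops in `cycles` cycles."""
    return clock_mhz * flops / cycles


achieved = loop_mflops(66, 8, 22)   # unrolled loop as currently scheduled
peak = loop_mflops(66, 2, 5)        # 2 flops every 5 cycles
efficiency = achieved / peak        # fraction of peak reached

# Cycles the 8-flop unrolled body would need in order to run at peak.
target_cycles = 8 * 5 / 2

print(achieved, peak, round(100 * efficiency, 1), target_cycles)
```

In other words, the scheduler would have to trim two cycles from the
22-cycle unrolled body to reach the peak rate.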

Regards,
Daryl Odnert       daryl@hpcllla.cup.hp.com
Hewlett-Packard
California Language Lab
Cupertino, California