Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!hpcc05!hpyhde4!hpycla!hpcuhc!hpcupt3!daryl
From: daryl@hpcupt3.cup.hp.com (Daryl Odnert)
Newsgroups: comp.benchmarks
Subject: Re: A question about flat out Snake speed.
Message-ID: <45780003@hpcupt3.cup.hp.com>
Date: 6 Apr 91 20:38:43 GMT
References: <7834@idunno.Princeton.EDU>
Organization: Hewlett Packard, Cupertino
Lines: 64

Here are the key things to consider in formulating an answer to
Steve's question:

o  One instruction is issued to either the CPU or the floating-point
   coprocessor in each cycle.

o  The central processor and coprocessor both run at the same speed
   (either 50MHz or 66MHz, depending on the model).

o  The CPU can execute instructions in parallel with multi-cycle
   floating-point operations.

o  Floating-point add and multiply operations have a three-cycle
   latency, regardless of precision (double or single).

o  There are independent ALU and MPY functional units within the
   floating-point coprocessor which can operate in parallel.

o  Because they are pipelined, the ALU and MPY units can each start
   a flop every other cycle as long as no data dependencies are
   present.

o  Assuming no cache misses, loads take 2 cycles to complete.  No
   interlocks will occur as long as the instruction executed
   immediately after the load does not reference the load target
   register.

o  The pipeline penalty for stores is zero, one, or two cycles,
   depending on the distance between the store instruction and the
   next memory reference.  In other words, a store followed
   immediately by another memory reference will suffer a two-cycle
   stall.

o  The FMPYADD instruction allows *independent* multiplication and
   addition operations to be dispatched in a single cycle.  This is
   NOT a multiply-and-accumulate operation.

If there are no memory references and you alternate multiply and add
operations, the coprocessor has a peak performance rate of 66
megaflops on the 66MHz model.
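The issue rules above can be turned into a small back-of-the-envelope
model.  What follows is only a sketch, not anything from HP's
documentation: the function names are made up for illustration, and
the store-stall rule assumes the penalty drops by one cycle for each
instruction placed between the store and the next memory reference,
which is one reading of "zero, one, or two cycles".

```python
# Toy model of the issue rules listed above.  Names are hypothetical.

def peak_coprocessor_mflops(clock_mhz, units=2, cycles_between_starts=2):
    """Peak flop rate when every FP unit starts a flop as often as it can."""
    # Each of the `units` functional units (ALU and MPY) starts one flop
    # every `cycles_between_starts` cycles, with no memory references.
    return clock_mhz * units / cycles_between_starts


def store_stall_cycles(instructions_between):
    """Assumed stall for a store, given how many instructions separate it
    from the next memory reference (0 means back to back)."""
    # Assumption: the two-cycle penalty shrinks by one cycle per
    # intervening instruction and never goes below zero.
    return max(0, 2 - instructions_between)


if __name__ == "__main__":
    for mhz in (50, 66):
        print(mhz, "MHz:", peak_coprocessor_mflops(mhz), "MFLOPS peak")
    for gap in range(4):
        print("gap", gap, "-> stall", store_stall_cycles(gap))
```

With two units each starting a flop every other cycle, the model gives
one flop per cycle, which is where the 66 megaflops figure for the
66MHz model comes from.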

The DAXPY loop consists of 5 operations per vector element:
2 double-precision loads, 1 multiply, 1 add, and 1 double-precision
store.  If the loop can be scheduled such that there are no
interlocks, a non-superscalar PA-RISC machine would hit a peak rate
of 2 flops every 5 cycles in the inner loop.  Thus, peak performance
for the DAXPY loop on the 66MHz Snakes box is:

    66 million instructions per second * (2 flops / 5 instructions)
        = 26.4 MFLOPS

The FORTRAN compiler available at the first release of the
HP 9000/700 automatically unrolls the inner loop of DAXPY four
times.  The optimizer is able to use the FMPYADD instruction and
schedule the loop in such a way that each iteration of the unrolled
loop executes in 22 cycles.  Each such iteration executes 8 flops.
This results in a performance rating of

    66 * (8 / 22) = 24 MFLOPS

in the inner loop.  Thus, at the present time, the compilers are
achieving about 90% of the peak performance potential on this
particular loop (assuming all data fits in the cache).  I expect
that future releases of the compiler will achieve 100% of the
potential.

Regards,
Daryl Odnert    daryl@hpcllla.cup.hp.com
Hewlett-Packard California Language Lab
Cupertino, California