Newsgroups: comp.benchmarks
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!van-bc!ubc-cs!uw-beaver!rice!ariel.rice.edu!preston
From: preston@ariel.rice.edu (Preston Briggs)
Subject: Re: A benchmark
Message-ID: <1991May3.053053.29174@rice.edu>
Keywords: snake performance
Sender: news@rice.edu (News)
Organization: Rice University, Houston
References: <1991May3.023705.5616@marlin.jcu.edu.au>
Date: Fri, 3 May 91 05:30:53 GMT

csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:
> CPU times for the Geophysical Fluid Dynamics Model MOM_1.0 (Fortran).
> The code is well vectorized, and consists mostly of floating point
> multiplies/additions.

It's important to remember that code that is ideal for a vector
machine is not necessarily ideal for a scalar (or super-scalar) machine.
Yes, the IBM and HP machines can cook on vector code, but often the code
can be rearranged for even better performance.

For example, on a vector machine, we don't like to see recurrences in the
inner loop.  On scalar machines, these are desirable.
A simple (contrived) example:

This is ok for vector machines

    DO j = 1, n
      DO i = 1, n
        A(i) = A(i) + B(j)
      ENDDO
    ENDDO

but this is better for scalar machines (and terrible for vector
machines, because of the recurrence on A(i))

    DO i = 1, n
      DO j = 1, n
        A(i) = A(i) + B(j)
      ENDDO
    ENDDO

Why?  In the first case, we'll hold the inner-loop invariant B(j)
in a register.  Therefore, we'll require 1 load and 1 store of A(i)
for each flop.  In the 2nd case, we'll hold A(i) in a register across
the inner loop, requiring only one load (of B(j)) per flop, with no
stores in the inner loop.

We can further munch the second example, by unrolling the outer loop
and jamming the resulting inner loop bodies together (assuming, for
simplicity, that n is a multiple of 4)

    DO i = 1, n, 4
      DO j = 1, n
        A(i+0) = A(i+0) + B(j)
        A(i+1) = A(i+1) + B(j)
        A(i+2) = A(i+2) + B(j)
        A(i+3) = A(i+3) + B(j)
      ENDDO
    ENDDO

In this case, we'll hold 4 elements of A in registers, and require only
one load of B for every 4 flops.
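
To make the register bookkeeping concrete, here is a minimal sketch of the
three loop nests with the register-held values written out as scalar
temporaries.  The program name, the temporaries bj, t and t0..t3, the choice
n = 8, and the test data are all assumptions made for illustration; a real
compiler would do this allocation itself, and n is assumed to be a multiple
of 4.

```fortran
program unroll_and_jam_sketch
  implicit none
  integer, parameter :: n = 8        ! assumed to be a multiple of 4
  real :: a1(n), a2(n), a3(n), b(n)
  real :: bj, t, t0, t1, t2, t3      ! stand-ins for registers
  integer :: i, j

  ! Small integer-valued data, so every sum is exact in any order.
  do i = 1, n
    b(i) = real(i)
  end do
  a1 = 1.0
  a2 = 1.0
  a3 = 1.0

  ! Vector-friendly order: B(j) is held in a register (bj),
  ! so each flop costs one load and one store of A(i).
  do j = 1, n
    bj = b(j)
    do i = 1, n
      a1(i) = a1(i) + bj
    end do
  end do

  ! Scalar-friendly order: A(i) is held in a register (t),
  ! so each flop costs one load of B(j) and no store.
  do i = 1, n
    t = a2(i)
    do j = 1, n
      t = t + b(j)
    end do
    a2(i) = t
  end do

  ! Unrolled and jammed: four elements of A are held in registers,
  ! and one load of B(j) feeds four flops.
  do i = 1, n, 4
    t0 = a3(i)
    t1 = a3(i+1)
    t2 = a3(i+2)
    t3 = a3(i+3)
    do j = 1, n
      bj = b(j)
      t0 = t0 + bj
      t1 = t1 + bj
      t2 = t2 + bj
      t3 = t3 + bj
    end do
    a3(i)   = t0
    a3(i+1) = t1
    a3(i+2) = t2
    a3(i+3) = t3
  end do

  ! All three orderings must compute the same A.
  if (any(a1 /= a2) .or. any(a1 /= a3)) error stop 'loop orders disagree'
  print *, 'all three loop nests agree'
end program unroll_and_jam_sketch
```

Built with a compiler that accepts free-form source and ERROR STOP (Fortran
2008), the program should stop with an error only if the three orderings
disagree.  The last nest is the unrolled-and-jammed form, where a single load
of B(j) feeds four independent adds.
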
Unrolling and jamming also helps get better scheduling for the pipelines.

So, the point is that the results of measuring "well vectorized" code
will tend to favor vector machines.  By reworking the code (a lot?),
a la the Perfect Club, you should be able to achieve even better
performance on the scalar machines.

Preston Briggs