Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!jarthur!nntp-server.caltech.edu!nntp-server.caltech.edu!ph
From: ph@ama-1.ama.caltech.edu (Paul Hardy)
Newsgroups: comp.sys.mips
Subject: MIPS assembler question
Message-ID: <PH.90Nov30111139@ama-1.ama.caltech.edu>
Date: 30 Nov 90 19:11:39 GMT
Sender: news@nntp-server.caltech.edu
Distribution: comp
Organization: California Institute of Technology
Lines: 49
Nntp-Posting-Host: ama.caltech.edu


I've just started programming in MIPS assembler, and I've written a routine
to perform fast matrix multiplies.  I am sustaining a rate of approximately
6.7 Mflops on a DECstation 5000, an IRIS 4D (using only one processor), and
an ESV 10.  All use the 25 MHz MIPS R3000.  This is a higher number of Mflops
than the vendors claim their machines can do, so I guess I should be pretty
happy.  However, I'm wondering why it's not going faster.  This is probably
a question for comp.arch.mips, but there's no such newsgroup.

The main body of the multiply is a triplet of instructions: simultaneously,
a load, add, and multiply are being performed on different registers.  Since
they're not using each other's registers, they should all execute together.
According to the MIPS book, a single-precision floating-point multiply takes
6 cycles, but during the last two cycles another multiply can begin, so
effectively it takes four cycles if many multiplies occur back-to-back.
In reality, about 7 cycles elapse between multiplies.  The code looks
something like (where A, B*, C, D, E, F are single-precision floating-point
registers, and offset is a hard-coded constant):

                   mul.s    A, A, B1
                   lwc1     C, offset($BASE)
                   add.s    E, E, D
                   mul.s    C, C, B2    ## 1 cycle stall if load takes 2 cycles
                   etc.
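
The cycle counts above can be checked against the measured rate with a
little arithmetic.  What follows is only a rough sketch, not a measurement:
it assumes each load/add/multiply triplet contributes 2 flops (one mul.s
and one add.s) and that the spacing between multiplies sets the pace of the
loop.  The function names are made up for illustration.

```python
# Back-of-the-envelope throughput check (a sketch, not a measurement).
# Assumption: each load/add/multiply triplet contributes 2 flops
# (one mul.s and one add.s), and the spacing between multiplies
# sets the pace of the loop.

CLOCK_MHZ = 25.0          # R3000 clock rate quoted in the post
FLOPS_PER_TRIPLET = 2     # assumed: one multiply plus one add


def mflops(cycles_per_triplet):
    """Mflops sustained if one triplet completes every `cycles_per_triplet` cycles."""
    return CLOCK_MHZ * FLOPS_PER_TRIPLET / cycles_per_triplet


def cycles_per_triplet(mflops_rate):
    """Cycles per triplet implied by a measured Mflops rate."""
    return CLOCK_MHZ * FLOPS_PER_TRIPLET / mflops_rate


if __name__ == "__main__":
    print("multiplies 4 cycles apart: %.2f Mflops" % mflops(4))
    print("multiplies 7 cycles apart: %.2f Mflops" % mflops(7))
    print("6.7 Mflops implies %.2f cycles per triplet" % cycles_per_triplet(6.7))
```

Under these assumptions, back-to-back multiplies 4 cycles apart would give
12.5 Mflops, 7 cycles apart gives about 7.1 Mflops, and the measured
6.7 Mflops corresponds to roughly 7.5 cycles per triplet.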

A stalled load will hold up the following multiply if it takes more than
three cycles to perform.  Stalling the add shouldn't affect speed at all,
since it's working on other data.  Sticking nops above all the mul.s
instructions didn't make any difference, so I took them out again.  It would
seem that loads are taking a long, long time.  This is unfortunate, because
all data is in cache.  The only machine that page faulted during 100,000
iterations of the loop was the E&S machine: 9 times -- fairly insignificant.
This is a trial with 10 x 10 matrices, so all of the data fits in one 1k page.
All loads in the loop of the operation occur from sequential memory locations.
This was done in the hope of decreasing access time on subsequent lookups from
the same bank in a cache RAM.  I write results in integer registers; they
don't get written back into the cache until I'm out of registers (I hold about
20 values, so I perform one write every 380 floating-point operations for a
10 x 10 matrix).
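
The figure of 380 can be reproduced as follows.  This is a sketch under one
assumption: each element of an n x n product is an n-term dot product
costing n multiplies and n - 1 adds.  The function names are hypothetical.

```python
# Flops between write-backs (a sketch under a stated assumption).
# Assumption: each result element of an n x n product is an n-term
# dot product: n multiplies plus (n - 1) adds.

def flops_per_element(n):
    """Floating-point operations needed for one element of the product."""
    return n + (n - 1)


def flops_between_writes(n, held_results):
    """Flops performed while `held_results` results are kept in registers."""
    return held_results * flops_per_element(n)


if __name__ == "__main__":
    # 10 x 10 matrices, about 20 results held before writing back
    print(flops_between_writes(10, 20))
```

For n = 10 and 20 held results this gives 20 * 19 = 380.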

Does anyone have any experience with this?  Where are the extra 3 cycles going?
How long does it _really_ take to load a value from cache?  If it does take a
lot more than 2 cycles, then I could relax and make the subroutine a lot more
flexible.

By the way, this is a very nice assembler language to program in!


                                  --Paul