Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!jarthur!nntp-server.caltech.edu!nntp-server.caltech.edu!ph
From: ph@ama-1.ama.caltech.edu (Paul Hardy)
Newsgroups: comp.sys.mips
Subject: MIPS assembler question
Message-ID:
Date: 30 Nov 90 19:11:39 GMT
Sender: news@nntp-server.caltech.edu
Distribution: comp
Organization: California Institute of Technology
Lines: 49
Nntp-Posting-Host: ama.caltech.edu

I've just started programming in MIPS assembler, and I've written a
routine to perform fast matrix multiplies. I am sustaining a rate of
approximately 6.7 Mflops on a DECstation 5000, an IRIS 4D (using only
one processor), and an ESV 10. All use the 25 MHz MIPS R3000. This is
a higher number of Mflops than the vendors claim their machines can do,
so I guess I should be pretty happy. However, I'm wondering why it's
not going faster. This is probably a question for comp.arch.mips, but
there's no such newsgroup.

The main body of the multiply is a triplet of instructions:
simultaneously, a load, add, and multiply are being performed on
different registers. Since they're not using each other's registers,
they should all execute together. According to the MIPS book, a
single-precision floating-point multiply takes 6 cycles, but during the
last two cycles another multiply can begin, so effectively it takes
four cycles if many multiplies occur back-to-back. In reality, about
7 cycles elapse between multiplies.

The code looks something like this (where A, B*, C, D, E, F are
single-precision floating-point registers, and offset is a hard-coded
constant):

        mul.s   A, A, B1
        lwc1    C, offset($BASE)
        add.s   E, E, D
        mul.s   C, C, B2        ## 1 cycle stall if load takes 2 cycles

        etc.

A stalled load will hold up the following multiply if it takes more
than three cycles to perform. Stalling the add shouldn't affect speed
at all, since it's working on other data. Sticking nops above all the
mul.s instructions didn't make any difference, so I took them out again.

It would seem that loads are taking a long, long time. This is
unfortunate, because all data is in cache. The only machine that page
faulted during 100,000 iterations of the loop was the E&S machine:
9 times -- fairly insignificant. This is a trial with 10 x 10 matrices,
so all of the data fits in one 1k page.

All loads in the loop of the operation occur from sequential memory
locations. This was done with hopes of decreasing access time on
subsequent lookups from the same bank in a cache RAM. I write results
in integer registers; they don't get written back into the cache until
I'm out of registers (I hold about 20 values, so I perform one write
every 380 floating-point operations for a 10 x 10 matrix).

Does anyone have any experience with this? Where are the extra 3 cycles
going? How long does it _really_ take to load a value from cache? If it
does take a lot more than 2 cycles, then I could relax and make the
subroutine a lot more flexible.

By the way, this is a very nice assembler language to program in!

--Paul