Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!zaphod.mps.ohio-state.edu!mips!flash!rowen
From: rowen@mips.COM (Chris Rowen)
Newsgroups: comp.sys.mips
Subject: Re: MIPS assembler question
Summary: Suggested reordering of floating point instructions to exploit MIPS R3010 pipeline
Message-ID: <43786@mips.mips.COM>
Date: 4 Dec 90 18:10:36 GMT
References: <PH.90Nov30111139@ama-1.ama.caltech.edu>
Sender: news@mips.COM
Distribution: comp
Lines: 55

Paul Hardy (ph@ama-1.ama.caltech.edu) writes: 
>The main body of the multiply is a triplet of instructions: simultaneously,
>a load, add, and multiply are being performed on different registers.  Since
>they're not using each others' registers, they should all execute together.
>According to the MIPS book, a single-precision floating-point multiply takes
>6 cycles, but during the last two cycles another multiply can begin, so
>effectively it takes four cycles if many multiplies occur back-to-back.
>In reality, about 7 cycles elapse between multiplies.  The code looks
>something like (where A, B*, C, D, E, F are single-precision floating point
>registers, and offset is a hard-coded constant):
>
>                   mul.s    A, A, B1
>                   lwc1     C, offset($BASE)
>                   add.s    E, E, D
>                   mul.s    C, C, B2    ## 1 cycle stall if load takes 2 cycles
>                   etc.
>
>Does anyone have any experience with this?  Where are the extra 3 cycles going?
>How long does it _really_ take to load a value from cache?  If it does take a
>lot more than 2 cycles, then I could relax make the subroutine a lot more
>flexible.

As I recall, the relevant pipelining rules of the R3010 are the following:

1) An ADD cannot start or finish in a cycle in which a MUL starts or finishes
2) Only one instruction can start in any cycle
3) A load can finish in any cycle

This means that, as coded, the add cannot start until the cycle after the
multiply has completed.
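
Rule 1 can be checked mechanically. What follows is only a rough sketch in
Python, not anything taken from MIPS documentation. The latencies are
assumptions read off the pipeline diagrams in this post (mul.s gives its
RESULT three cycles after START, add.s one cycle after START), and the
function name earliest_add_start is made up for illustration.

```python
# Assumed latencies, read off the diagrams in this post (not from a manual).
MUL_LATENCY = 3   # mul.s: START in cycle n, RESULT in cycle n + 3
ADD_LATENCY = 1   # add.s: START in cycle n, RESULT in cycle n + 1

def earliest_add_start(mul_start, not_before):
    """Earliest cycle >= not_before at which an add.s may start under rule 1."""
    # Cycles in which the multiply starts or finishes.
    mul_cycles = {mul_start, mul_start + MUL_LATENCY}
    cycle = not_before
    # The add may neither start nor finish in one of those cycles.
    while cycle in mul_cycles or cycle + ADD_LATENCY in mul_cycles:
        cycle += 1
    return cycle

# mul.s starts in cycle 1 and lwc1 occupies cycle 2, so the add is ready in cycle 3.
print(earliest_add_start(mul_start=1, not_before=3))   # 5
# Had the add been issued directly behind the multiply, it would be ready in cycle 2.
print(earliest_add_start(mul_start=1, not_before=2))   # 2
```

With the add ready in cycle 3 it would finish in cycle 4 alongside the
multiply, and starting in cycle 4 collides as well, so cycle 5 is the first
legal slot.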

Pipelining of instructions as coded:

CYCLE   1      2      3      4      5      6      7      8      9      10     11
mul.s   START  ------ ------ RESULT
lwc1           START  RESULT
add.s                               START  RESULT
mul.s                                             START  ------ ------ RESULT
lwc1                                                     START  RESULT
add.s                                                                         START

This is six cycles per triple.

If you can reorder the code a little, it should get faster:

CYCLE   1      2      3      4      5      6      7      8      9      10     11
mul.s   START  ------ ------ RESULT
add.s          START  RESULT
lwc1                  START  RESULT
mul.s                        START  ------ ------ RESULT
add.s                               START  RESULT
lwc1                                       START  RESULT

This is three cycles per triple.
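
For anyone who wants to try other orderings, the three rules can be turned
into a small in-order issue model. This is a hedged sketch, not a description
of the real R3010: the latencies are assumptions read off the diagrams above,
data dependences between registers are ignored, and the names schedule and
cycles_per_triple are invented for this example.

```python
# Cycles from START to RESULT, read off the diagrams above (assumed values).
LATENCY = {"mul.s": 3, "add.s": 1, "lwc1": 1}

def schedule(ops):
    """Issue ops strictly in order and return the START cycle of each one."""
    starts = []
    # Cycles in which some mul.s / add.s starts or finishes.
    busy = {"mul.s": set(), "add.s": set()}
    cycle = 0
    for op in ops:
        cycle += 1  # rule 2: at most one instruction starts per cycle
        other = {"mul.s": "add.s", "add.s": "mul.s"}.get(op)
        if other is not None:
            # rule 1: an add and a mul may not share a start or finish cycle
            while cycle in busy[other] or cycle + LATENCY[op] in busy[other]:
                cycle += 1
            busy[op].update((cycle, cycle + LATENCY[op]))
        # rule 3: a load can finish in any cycle, so lwc1 never waits here
        starts.append(cycle)
    return starts

def cycles_per_triple(triple, repeats=4):
    """Steady-state distance between the first ops of the last two triples."""
    starts = schedule(triple * repeats)
    return starts[-3] - starts[-6]

as_coded = ["mul.s", "lwc1", "add.s"]
reordered = ["mul.s", "add.s", "lwc1"]
print(schedule(as_coded * 2))        # [1, 2, 5, 7, 8, 11]
print(cycles_per_triple(as_coded))   # 6
print(cycles_per_triple(reordered))  # 3
```

Under these assumptions the model reproduces the START cycles in both
diagrams. It says nothing about cache misses or load-use delays, which would
have to be added separately.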

Chris Rowen