Xref: utzoo rec.games.programmer:3408 comp.os.msdos.programmer:4626
Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!batcomputer!cornell!rochester!pt.cs.cmu.edu!o.gp.cs.cmu.edu!netnews
From: Ralf.Brown@B.GP.CS.CMU.EDU
Newsgroups: rec.games.programmer,comp.os.msdos.programmer
Subject: Re: 3D int/float optimizations stuff
Message-ID: <280706ef@ralf>
Date: 13 Apr 91 13:26:07 GMT
Sender: netnews@cs.cmu.edu (USENET News Group Software)
Organization: Carnegie Mellon University School of Computer Science
Lines: 123
In-Reply-To: <28002@uflorida.cis.ufl.EDU>

In article <28002@uflorida.cis.ufl.EDU>, jdb@reef.cis.ufl.edu (Brian K. W. Hook) wrote:
}Thanks to everyone who helped with the optimizations.  For those
}interested, I am posting the results of each optimization followed by the
}final source code.
}
}Summary:  WOW!
}
}First pass:     21.86 seconds
}Last pass:      12.20 seconds
}
}I am sure that a couple of optimizations can still be done, most obviously
}those bit shifts (although I really doubt they matter much).  I am not sure
}how accurate these calculations are, but I do know that they don't distort

I don't remember which compiler you said you are using, but if it is a 16-bit
compiler (such as MSC, Zortech, or Turbo), then both multiplies and shifts on
longs are done by calls to the runtime library.  As I recall, you said the
function originally used 90% of the execution time; with the following
assembler version of the function, you should get your execution time down to
under six seconds.

A few notes on the code:  I've rearranged the order of the calculations
somewhat.  It could be optimized further by using SI and DI as temporaries to
avoid memory accesses.  You will have to supply the necessary wrapper for
calling it from your C code.

You can also get better precision by scaling the sine and cosine factors by
16384 (14 bits) instead of 1024 (10 bits), since they only need a range of
-1..+1 (which becomes -16384..16384 after scaling); in that case, change all
the 1024s to 16384.
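
To make the scaling concrete, here is a rough C sketch of one rotation term
with 14-bit factors.  It is an illustration only, not code from the original
program: the names to_fix14() and rot_sub() are hypothetical, and it assumes
a compiler that provides <stdint.h>.  The 32-bit accumulator plays the role
of the DX:AX and CX:BX register pairs in the assembler.

```c
#include <stdint.h>

#define FIX_SHIFT 14
#define FIX_ONE   (1 << FIX_SHIFT)      /* 16384 */

/* Hypothetical helper: convert a factor in -1..+1 to 14-bit fixed point,
   rounding to nearest.  +1.0 maps to 16384, which still fits in 16 bits. */
static int16_t to_fix14(double v)
{
    return (int16_t)(v * FIX_ONE + (v < 0 ? -0.5 : 0.5));
}

/* Hypothetical helper: (a*fa - b*fb) / 16384.  Both products are kept in
   32 bits before the single divide, as the IMUL/SUB/SBB/IDIV runs do. */
static int16_t rot_sub(int16_t a, int16_t fa, int16_t b, int16_t fb)
{
    int32_t acc = (int32_t)a * fa - (int32_t)b * fb;
    return (int16_t)(acc / FIX_ONE);
}
```

For example, rot_sub(100, to_fix14(0.5), 50, to_fix14(0.5)) gives 25.  Note
that C's / truncates toward zero, as IDIV does, whereas an arithmetic right
shift rounds toward minus infinity, so negative results can differ by one
from a shift-based version.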


xa	dw   ?	  ; note that these are ints instead of longs!
ya	dw   ?
za	dw   ?

	; za = (yawSinFactor*WX - yawCosFactor*WZ) / 1024, with WX negated first
	neg   WX
	mov   ax,yawCosFactor
	imul  WZ
	mov   cx,dx
	mov   bx,ax
	mov   ax,yawSinFactor
	imul  WX
	sub   ax,bx
	sbb   dx,cx
	mov   cx,1024
	idiv  cx	      ; DX:AX / 1024; faster than a shift loop!
	mov   za,ax
	mov   ax,yawSinFactor
	imul  WZ
	mov   cx,dx
	mov   bx,ax
	mov   ax,yawCosFactor
	imul  WX
	sub   ax,bx
	sbb   dx,cx
	mov   cx,1024
	idiv  cx	      ; DX:AX / 1024; faster than a shift loop!
	mov   xa,ax
	imul  rollCosFactor    ; AX still holds xa from the IDIV above
	mov   cx,dx
	mov   bx,ax
	mov   ax,rollSinFactor
	imul  WY
	add   ax,bx
	adc   dx,cx
	mov   cx,1024
	idiv  cx
	add   ax,MX
	mov   WX,ax
	mov   ax,xa
	imul  rollSinFactor
	mov   bx,ax
	mov   cx,dx
	mov   ax,WY
	imul  rollCosFactor    ; ya = (WY*rollCos - xa*rollSin) / 1024
	sub   ax,bx
	sbb   dx,cx
	mov   cx,1024
	idiv  cx
	mov   ya,ax
	mov   ax,za
	imul  pitchSinFactor
	mov   cx,dx
	mov   bx,ax
	mov   ax,ya
	imul  pitchCosFactor
	add   ax,bx
	adc   dx,cx
	mov   cx,1024
	idiv  cx
	add   ax,MY
	mov   WY,ax
	mov   ax,ya
	imul  pitchSinFactor
	mov   cx,dx
	mov   bx,ax
	mov   ax,za
	imul  pitchCosFactor
	sub   ax,bx
	sbb   dx,cx
	mov   cx,1024
	idiv  cx
	add   ax,MZ
	jnz   l_1
	dec   ax
l_1:
	mov   cx,ax	       ; WZ doesn't need to be stored in memory
	mov   ax,word ptr AngularPerspFactor
	mov   dx,word ptr AngularPerspFactor+2
	idiv  cx	       ; APF / WZ
	mov   cx,ax	       ; store a copy for later
	imul  WX
	add   ax,400	       ; tmp*WX+400
	mov   _DX,ax
	mov   ax,WY
	mul   cx
	add   ax,300	       ; tmp*WY+300
	mov   _DY,ax
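
For checking the assembler against C, the final perspective step can be
sketched roughly as below.  This is an assumption-laden illustration, not the
original source: project() is a hypothetical name, the parameters simply
mirror WX, WY, WZ and AngularPerspFactor, and 400 and 300 are the offsets
used in the listing.

```c
#include <stdint.h>

/* Hypothetical C equivalent of the last dozen instructions:
   tmp = APF / WZ, then DX = tmp*WX + 400 and DY = tmp*WY + 300. */
static void project(int32_t apf, int16_t wx, int16_t wy, int16_t wz,
                    int16_t *dx, int16_t *dy)
{
    int16_t tmp;

    if (wz == 0)
        wz = -1;                 /* same guard as JNZ l_1 / DEC AX */
    /* IDIV raises a divide error if the quotient does not fit in 16 bits;
       this cast just truncates, so keep APF / WZ within range. */
    tmp = (int16_t)(apf / wz);
    *dx = (int16_t)(tmp * wx + 400);   /* only the low 16 bits are kept */
    *dy = (int16_t)(tmp * wy + 300);
}
```

With apf = 25600, wz = 256, wx = 10 and wy = -5, tmp is 100 and the result is
DX = 1400, DY = -200.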


--
{backbone}!cs.cmu.edu!ralf  ARPA: RALF@CS.CMU.EDU   FIDO: Ralf Brown 1:129/3.1
BITnet: RALF%CS.CMU.EDU@CMUCCVMA   AT&Tnet: (412)268-3053 (school)   FAX: ask
DISCLAIMER?  Did  | It isn't what we don't know that gives us trouble, it's
I claim something?| what we know that ain't so.  --Will Rogers