Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!umich!samsung!cs.utexas.edu!uunet!willett!ForthNet
From: ForthNet@willett.UUCP (ForthNet articles from GEnie)
Newsgroups: comp.lang.forth
Subject: Optimization
Message-ID: <246.UUL1.3#5129@willett.UUCP>
Date: 11 Jan 90 01:28:53 GMT
Organization: Latest Link in ForthNet Chain (Pittsburgh, PA)
Lines: 48


 Date: 01-09-90 (09:52)              Number: 1702 (Echo)
   To: MARK SMILEY                   Refer#: 1701
 From: PETE KOZIAR                     Read: NO
 Subj: INSTRUCTION TIMINGS           Status: PUBLIC MESSAGE

 I don't know if this has been said already, but you must be careful 
 about instruction timings on the 80x86/88 family.  The instruction 
 timings given assume that the pre-fetch queue is full. 

 Let me explain.  The 80x86 family has a queue of instructions waiting 
 to be executed.  It fills this queue during instructions that require a 
 lot of cycles to "calculate" (like multiply/divide, etc.).  When that 
 happens, fetching the following instructions from memory is "free," 
 since it is done in spare time. 

 Unfortunately, I believe that almost every member of the 80x86 family 
 must purge this queue when a branch, jump, or call occurs (the 80386 
 may not; I'm not sure and don't feel like digging in the manual). 

 Now, let's think about FORTH: lots of nice, small subroutines, i.e., 
 lots of jumps and calls.  Bottom line: we tend to run with the 
 prefetch queue empty a large proportion of the time, so we need to add 
 in the fetch time for each instruction. 

 A corollary to this is that you may wind up with faster code if you use 
 a few slower instructions rather than many fast ones! 
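
 To make that corollary concrete, here is a rough sketch in Python.  The 
 cycle counts and instruction lengths are made-up numbers for 
 illustration, not figures from any Intel manual; the only value taken 
 from this message is the 8088's 4 cycles per byte fetched (given 
 further down).  The function name total_cycles is likewise just 
 something I picked for the example. 

```python
# Hypothetical model: when the prefetch queue is empty, every instruction
# byte must be fetched before the instruction can run.
FETCH_CYCLES_PER_BYTE = 4  # 8088 figure quoted later in this message


def total_cycles(instructions, queue_empty):
    """Sum the listed cycles, adding fetch time per byte if the queue is empty.

    `instructions` is a list of (listed_cycles, length_in_bytes) pairs.
    """
    total = 0
    for listed_cycles, length_bytes in instructions:
        total += listed_cycles
        if queue_empty:
            total += length_bytes * FETCH_CYCLES_PER_BYTE
    return total


# Made-up sequences that are assumed to do the same job.
many_fast = [(2, 2)] * 4   # four 2-cycle, 2-byte instructions
few_slow = [(12, 2)]       # one 12-cycle, 2-byte instruction

for name, seq in (("many fast", many_fast), ("few slow", few_slow)):
    print(name,
          total_cycles(seq, queue_empty=False),
          total_cycles(seq, queue_empty=True))
# prints:
# many fast 8 40
# few slow 12 20
```

 With the queue full, the four fast instructions win (8 cycles against 
 12).  With the queue empty, the single slow instruction wins (20 
 against 40), because it drags in far fewer bytes. 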

 On the 8088 (i.e., XT-class machines), each fetch of each BYTE adds 4 
 whole cycles to the instruction time.  That ain't hay!  The '286 and 
 '386 do better, taking only 2 cycles per fetch, and that being 16 or 32 
 bits at a time, respectively, which may even represent multiple 
 instructions. 

 Remember, that's with no wait states; each wait state adds another 
 cycle to every fetch. 
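
 The fetch figures above can be turned into a small calculator.  This is 
 only a sketch: the bus widths and cycles per fetch are the ones stated 
 in this message, while the assumptions that code is aligned to the bus 
 width and that each wait state costs one extra cycle per fetch are 
 mine.  The names BUS and fetch_cycles are invented for the example. 

```python
import math

# cpu -> (bytes per fetch, cycles per fetch with zero wait states),
# as stated in this message.
BUS = {
    "8088": (1, 4),
    "80286": (2, 2),
    "80386": (4, 2),
}


def fetch_cycles(cpu, code_bytes, wait_states=0):
    """Cycles spent fetching `code_bytes` of code with an empty queue.

    Assumes aligned code and one extra cycle per wait state per fetch.
    """
    width, cycles = BUS[cpu]
    fetches = math.ceil(code_bytes / width)
    return fetches * (cycles + wait_states)


# Fetch cost of 6 bytes of code, with 0 and then 1 wait state.
for cpu in BUS:
    print(cpu, fetch_cycles(cpu, 6), fetch_cycles(cpu, 6, wait_states=1))
# prints:
# 8088 24 30
# 80286 6 9
# 80386 4 6
```

 These cycles come on top of the listed instruction timings whenever the 
 queue has just been purged. 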

 Even worse, if you have a cache, you need to worry about whether the 
 instructions executed in a tight loop all fit in the cache.  In a 
 cached system, the smaller the loop the better. 

 This is why Motorola hedges on their instruction timings for the 68020; 
 you almost need a computer program to figure out instruction timings, 
 and benchmarks are easier anyway. 
 ---
  * Via Qwikmail 2.01  The Baltimore Sun 
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'