Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!umich!samsung!cs.utexas.edu!uunet!willett!ForthNet
From: ForthNet@willett.UUCP (ForthNet articles from GEnie)
Newsgroups: comp.lang.forth
Subject: Optimization
Message-ID: <246.UUL1.3#5129@willett.UUCP>
Date: 11 Jan 90 01:28:53 GMT
Organization: Latest Link in ForthNet Chain (Pittsburgh, PA)
Lines: 48

 Date: 01-09-90 (09:52)              Number: 1702 (Echo)
   To: MARK SMILEY                   Refer#: 1701
 From: PETE KOZIAR                     Read: NO
 Subj: INSTRUCTION TIMINGS           Status: PUBLIC MESSAGE

I don't know if this has been said already, but you must be careful about instruction timings on the 80x86/88 family. The published instruction timings assume that the pre-fetch queue is full. Let me explain.

The 80x86 family keeps a queue of prefetched instruction bytes waiting to be executed. It fills this queue during instructions that take a lot of cycles to "calculate" (like multiply/divide, etc.). When that happens, fetching the instructions from memory is "free," since it is done in spare time. Unfortunately, I believe that almost all members of the 80x86 family must purge this queue when a branch, jump, or call occurs (the 80386 may not; I'm not sure and don't feel like digging in the manual).

Now, let's think about FORTH: lots of nice, small subroutines, i.e., lots of jumps and calls. Bottom line: we tend to run with the prefetch queue empty a large proportion of the time, so we need to add in the cost of the fetches for each instruction. A corollary to this is that you may wind up with faster code if you use a few slower instructions rather than many fast ones!

On the 8088 (i.e., XT-class machines), each fetch of each BYTE adds 4 whole cycles to the instruction time. That ain't hay! The '286 and '386 do better, taking only 2 cycles per fetch, with each fetch bringing in 16 or 32 bits, respectively, which may even represent multiple instructions. Remember, that's with no wait states; each wait state adds a cycle to every fetch.
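To make the arithmetic above concrete, here is a rough sketch in Python of the "empty queue" estimate. The bus figures (bytes per fetch, cycles per fetch) are the ones quoted above; everything else is an assumption for illustration. In particular, the two instruction sequences use made-up cycle counts and lengths, not values from any Intel manual, and the model assumes contiguous, aligned code and a queue that is empty at the start of the sequence and never gets ahead.

```python
from math import ceil

# Bus figures as quoted in the post: (bytes per fetch, cycles per fetch).
BUS = {
    "8088": (1, 4),
    "80286": (2, 2),
    "80386": (4, 2),
}


def fetch_cycles(cpu, n_bytes, wait_states=0):
    """Cycles spent fetching n_bytes of code with an empty prefetch queue.

    Assumes the code is contiguous and aligned, so the number of fetches
    is simply the byte count divided by the bus width, rounded up.
    """
    width, cycles = BUS[cpu]
    fetches = ceil(n_bytes / width)
    return fetches * (cycles + wait_states)


def effective_cycles(cpu, instructions, wait_states=0):
    """Listed cycles plus fetch cost for a sequence of instructions.

    instructions is a list of (listed_cycles, length_in_bytes) pairs.
    """
    listed = sum(c for c, _ in instructions)
    length = sum(n for _, n in instructions)
    return listed + fetch_cycles(cpu, length, wait_states)


# Hypothetical sequences (illustrative numbers only):
many_fast = [(2, 3), (2, 3), (2, 3)]  # 6 listed cycles, 9 bytes
one_slow = [(12, 2)]                  # 12 listed cycles, 2 bytes

if __name__ == "__main__":
    for cpu in BUS:
        print(cpu,
              effective_cycles(cpu, many_fast),
              effective_cycles(cpu, one_slow))
```

Under these assumptions the 8088 comes out at 42 cycles for the three "fast" instructions against 20 for the single "slow" one, even though the listed timings say 6 against 12. On the wider buses the gap narrows (16 against 14 on the '286) and then reverses (12 against 14 on the '386), so the corollary depends heavily on the machine.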

Even worse, if you have a cache, you need to worry about whether the instructions executed in a tight loop all fit in it. In a cached system, the smaller the loop the better. This is why Motorola hedges on their instruction timings for the 68020; you almost need a computer program to figure out instruction timings, and benchmarks are easier anyway.
---
 * Via Qwikmail 2.01  The Baltimore Sun
-----
This message came from GEnie via willett through a semi-automated process.
Report problems to: 'uunet!willett!dwp' or 'willett!dwp@gateway.sei.cmu.edu'