Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site u1100a.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!gamma!pyuxww!u1100a!joec
From: joec@u1100a.UUCP (Joe Carfagno)
Newsgroups: net.lang.c
Subject: Re: Unrolling string copy loops
Message-ID: <799@u1100a.UUCP>
Date: Thu, 4-Apr-85 08:43:29 EST
Article-I.D.: u1100a.799
Posted: Thu Apr 4 08:43:29 1985
Date-Received: Fri, 5-Apr-85 04:20:13 EST
Organization: Bell Communications Research, Piscataway, NJ
Lines: 30

Having noticed a discussion of the benefit of loop unrolling on string copy (and other functions), I thought I'd share a similar experience here, as it gave us BIG gains.

The Sperry 1100 mainframe, on which a version of the UNIX(tm) system has been running since 1979, is a WORD ADDRESSABLE machine (and the words are 36 bit 1's complement). Needless to say, implementing a C compiler is somewhat interesting, especially in the area of char pointer dereferencing. At run time, the 20 bit pseudo-byte pointer is split into its word and "byte" components, and then the proper partial word is loaded from memory. This multi-instruction sequence is much more expensive than a char dereference on your usual machine.

Enter loop unrolling. Our large project (>1 million lines of C code) was profiled and found to spend lots of time in the str*() functions. Since the str*() functions process their arguments sequentially (char 0, then 1, ..., then n), you can determine the starting partial word (1st 9 bits, 2nd, 3rd, or 4th) once and then predict which 9 bits you will need next (2nd, 3rd, 4th, or 1st from the next word). For strcpy, you create a 4 by 4 table of entry points (source starting position by destination starting position) and away you go.

Moral of the story - this technique cut the CPU cost of the str*() functions (which had been quite expensive) by 90%, and they were never seen again on our CPU profiles.

Loop unrolling will work on other normal machines also, since you process *cp, *(cp+1), *(cp+2), etc., at the cost of a few extra words of memory (because you're duplicating the load/store sequence with different offsets from your original cp pointer, which you put in a register beforehand).
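
For the "normal machine" case, here is a minimal sketch of what I mean, in portable C. It is not our 1100 library code: the name unrolled_strcpy and the unroll factor of 4 are made up for illustration, and it ignores the partial-word entry-point table entirely. It only shows the load/store sequence duplicated at fixed offsets from pointers that the compiler can keep in registers.

```c
/*
 * Illustrative sketch only: unrolled_strcpy is a hypothetical name, and
 * the unroll factor of 4 is an arbitrary choice for this example.
 *
 * Copies src, including its terminating NUL, into dst and returns dst.
 * Each pass through the loop handles up to four characters at fixed
 * offsets 0..3, so the pointers are advanced once per four characters
 * instead of once per character.
 */
char *unrolled_strcpy(char *dst, const char *src)
{
    char *d = dst;          /* working copies, candidates for registers */
    const char *s = src;

    for (;;) {
        /* Same load/store/test sequence, repeated with different offsets.
           Each test stops at the NUL, so nothing past it is ever read. */
        if ((d[0] = s[0]) == '\0') break;
        if ((d[1] = s[1]) == '\0') break;
        if ((d[2] = s[2]) == '\0') break;
        if ((d[3] = s[3]) == '\0') break;
        d += 4;
        s += 4;
    }
    return dst;
}
```

Call it exactly as you would strcpy, e.g. unrolled_strcpy(buf, name). Whether it beats your library's strcpy depends on the machine and compiler, so profile before and after.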