Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!decwrl!sun-barr!newstop!sun!amdahl!key!sjc From: sjc@key.COM (Steve Correll) Newsgroups: comp.arch Subject: Re: End-of-buffer interrupt instruction (long) Message-ID: <2123@key.COM> Date: 13 Sep 90 00:51:27 GMT References: <2516@l.cc.purdue.edu> <6838.26e7f109@vax1.tcd.ie> Organization: Key Computer Labs, Fremont, CA Lines: 192 Herman Rubin writes: >There are quite a few situations in transaction processing which could use a >few intelligent instructions, like reading from a buffer with an automatic >interrupt when the buffer becomes empty. > In article <4043@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes: > "Reading from a buffer" in what sense? Is this just an in-memory buffer > being read by a transaction processing application - in which case I > don't see how an *interrupt* would help, as a dumb old conditional > branch testing whether the number of characters left in the buffer was > zero would probably be *faster* than an interrupt (no tons of context so > save, etc.), or is it something else? In article , stephen@estragon.uchicago.edu (Stephen P Spackman) writes: > Er. You have an 8k buffer. Each branch takes 1 cycle to test, 1 cycle > if not taken. There goes 16k cycles. You're telling me that servicing > a trap is going to take 16k cycles? First, let's be sure everybody understands what the proposed instruction is meant to do. I assume we're filling, emptying, or copying a buffer using a loop that deals with one character at a time, and we want a magic load or store which will interrupt us when we try to load or store past the end of the buffer. If that's wrong, please ignore the following and post a clarifying definition. I decided to perform a couple of crude experiments. First, to figure out how long it takes to visit the Unix kernel, activate a user signal handler, and return: #include static int handler() { (void) signal(SIGTERM, handler); } int main(argc, argv) int argc; char **argv; { int i, mypid = getpid(); (void) signal(SIGTERM, handler); for (i = 0; i < 100000; i++) /* loop for better timing */ (void) kill(mypid, SIGTERM); return 0; } Then to figure out how long it takes to copy an 8K buffer using comparison and conditional branch: # define BUFLEN 8192 char buf0[BUFLEN], buf1[BUFLEN]; void bar() { char *limit = buf0 + BUFLEN; char *p0 = buf0; char *p1 = buf1; while (p0 < limit) *p0++ = *p1++; } int main() { int i; for (i=1; i < 10000; i++) /* loop for better timing */ (void) bar(); return 0; } Then to figure out how much the proposed instruction would save us by eliminating the comparisons, I unrolled the loop: # define BUFLEN 8192 char buf0[BUFLEN], buf1[BUFLEN]; void bar() { char *limit = buf0 + BUFLEN; char *p0 = buf0; char *p1 = buf1; while (p0 < limit) { *p0++ = *p1++; *p0++ = *p1++; *p0++ = *p1++; *p0++ = *p1++; } } int main() { int i; for (i=1; i < 10000; i++) /* loop for better timing */ (void) bar(); return 0; } Now it's tricky to discern from the second and third programs the savings due to the proposed instruction. I compared the inner loops on a Sun 4: LY1: ! [internal] ldsb [%o4],%g1 stb %g1,[%o5] inc %o5 cmp %o5,%o3 bcs LY1 inc %o4 LY1: ! [internal] ldsb [%o4],%g1 stb %g1,[%o5] inc %o4 ldsb [%o4],%g2 inc %o5 stb %g2,[%o5] inc %o4 ldsb [%o4],%g3 inc %o5 stb %g3,[%o5] inc %o4 ldsb [%o4],%g4 inc %o5 stb %g4,[%o5] inc %o5 cmp %o5,%o3 bcs LY1 inc %o4 I believe that we want to figure out the cost of the "cmp" and not the "bcs", because even with the new instruction, you have to retain a branch or you don't have a loop; and on all but the last iteration, the branch is taken and the instruction in the delay slot is useful. The difference in cost between the second and third programs is (4 * (cost of branches + cost of compares)) - (1 * (cost of branches + cost of compares)), or 3 * (cost of branches + cost of compares). Assuming the cost of the branch equals the cost of the compare, then the cost of the remaining compares in program 3 is (1/3) * (1/2) = 1/6 the observed difference between program 2 and program 3. (Note that I started my experiments on a MipsCo machine, but changed to SPARC partly because the MipsCo architecture didn't require a separate compare instruction at all when the comparision was for equality or inequality!) I used cc -O on the Sun 4, measuring the sum of user and system times. The average of three trials for each was: Program 1: 31.5 sec Program 2: 46.2 sec Program 3: 41.4 sec Program 2 - Program 3: 46.2 - 41.4 = 4.8 sec Cost of remaining compares in Program 3: 4.8 / 6 = 0.8 sec Cost of trips through signal handler: 2.8 sec [note 1] Cost of Program 3 with "new instruction": 41.4 - 0.8 + 2.8 = 43.4 sec Decrease in execution time versus Program 2: 6% [note 2] [Note 1]: An additional experiment showed that the cost of the timing loop, kill(), and initialization code in program 1 is about 10% of the total, and we must divide by 10 because program 1 loops 10 times as often as the others, so the cost of trips through the signal handler is (0.9 * 31.5) / 10 = 2.8 sec. [Note 2]: Given the crudity of the experiment, I think one significant digit is the very most I can present without shame! I didn't try to eliminate the cost of the timing loop and initialization code in programs 2 and 3 because, since it's the same for both programs, it roughly cancels. Readers can draw their own conclusions. One might contend that buffers are usually bigger, magnifying the improvement. Or one might contend that realistic examples perform more processing than just moving characters from place to place, diminishing the percentage improvement. And someone who understands implementations would have to tell us the costs of adding such an instruction. My conclusions are (1) it's easy to spend quite a lot of time on a scenic tour through the kernel and (2) it's interesting to measure actual seconds rather than conceptual cycles. -- sjc@key.com or ...{sun,pyramid}!pacbell!key!sjc Steve Correll