Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!decwrl!sun-barr!newstop!sun!amdahl!key!sjc
From: sjc@key.COM (Steve Correll)
Newsgroups: comp.arch
Subject: Re: End-of-buffer interrupt instruction (long)
Message-ID: <2123@key.COM>
Date: 13 Sep 90 00:51:27 GMT
References: <2516@l.cc.purdue.edu> <6838.26e7f109@vax1.tcd.ie> <STEPHEN.90Sep10191125@estragon.uchicago.edu>
Organization: Key Computer Labs, Fremont, CA
Lines: 192

Herman Rubin writes:
>There are quite a few situations in transaction processing which could use a
>few intelligent instructions, like reading from a buffer with an automatic
>interrupt when the buffer becomes empty.

> In article <4043@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>    "Reading from a buffer" in what sense?  Is this just an in-memory buffer
>    being read by a transaction processing application - in which case I
>    don't see how an *interrupt* would help, as a dumb old conditional
>    branch testing whether the number of characters left in the buffer was
>    zero would probably be *faster* than an interrupt (no tons of context so
>    save, etc.), or is it something else?

In article <STEPHEN.90Sep10191125@estragon.uchicago.edu>, stephen@estragon.uchicago.edu (Stephen P Spackman) writes:
> Er. You have an 8k buffer. Each branch takes 1 cycle to test, 1 cycle
> if not taken. There goes 16k cycles. You're telling me that servicing
> a trap is going to take 16k cycles?

First, let's be sure everybody understands what the proposed instruction is
meant to do. I assume we're filling, emptying, or copying a buffer using a
loop that deals with one character at a time, and we want a magic load or
store which will interrupt us when we try to load or store past the end of
the buffer. If that's wrong, please ignore the following and post a clarifying
definition.

I decided to perform a couple of crude experiments. First, to figure out how
long it takes to visit the Unix kernel, activate a user signal handler, and
return:

  #include <signal.h>
 
  static int
  handler()
    {
    (void) signal(SIGTERM, handler);
    }

  int
  main(argc, argv)
    int argc;
    char **argv;
    {
    int i, mypid = getpid();

    (void) signal(SIGTERM, handler);

    for (i = 0; i < 100000; i++)	/* loop for better timing */
      (void) kill(mypid, SIGTERM);
    return 0;
    }

Then to figure out how long it takes to copy an 8K buffer using comparison
and conditional branch:

  # define BUFLEN 8192
  char buf0[BUFLEN], buf1[BUFLEN];

  void
  bar()
    {
    char *limit = buf0 + BUFLEN;
    char *p0 = buf0;
    char *p1 = buf1;
    while (p0 < limit)
      *p0++ = *p1++;
    }

  int
  main()
    {
    int i;

    for (i=1; i < 10000; i++)		/* loop for better timing */
      (void) bar();
    return 0;
    }

Then to figure out how much the proposed instruction would save us by
eliminating the comparisons, I unrolled the loop:

  # define BUFLEN 8192
  char buf0[BUFLEN], buf1[BUFLEN];

  void
  bar()
    {
    char *limit = buf0 + BUFLEN;
    char *p0 = buf0;
    char *p1 = buf1;
    while (p0 < limit)
      {
      *p0++ = *p1++;
      *p0++ = *p1++;
      *p0++ = *p1++;
      *p0++ = *p1++;
      }
    }

  int
  main()
    {
    int i;

    for (i=1; i < 10000; i++)		/* loop for better timing */
      (void) bar();
    return 0;
    }

Now it's tricky to discern from the second and third programs the savings due
to the proposed instruction. I compared the inner loops on a Sun 4:

LY1:					! [internal]
	ldsb	[%o4],%g1
	stb	%g1,[%o5]
	inc	%o5
	cmp	%o5,%o3
	bcs	LY1
	inc	%o4


LY1:					! [internal]
	ldsb	[%o4],%g1
	stb	%g1,[%o5]
	inc	%o4
	ldsb	[%o4],%g2
	inc	%o5
	stb	%g2,[%o5]
	inc	%o4
	ldsb	[%o4],%g3
	inc	%o5
	stb	%g3,[%o5]
	inc	%o4
	ldsb	[%o4],%g4
	inc	%o5
	stb	%g4,[%o5]
	inc	%o5
	cmp	%o5,%o3
	bcs	LY1
	inc	%o4

I believe that we want to figure out the cost of the "cmp" and not the "bcs",
because even with the new instruction, you have to retain a branch or you
don't have a loop; and on all but the last iteration, the branch is taken and
the instruction in the delay slot is useful. The difference in cost between
the second and third programs is (4 * (cost of branches + cost of compares))
- (1 * (cost of branches + cost of compares)), or 3 * (cost of branches +
cost of compares).

Assuming the cost of the branch equals the cost of the compare, then the
cost of the remaining compares in program 3 is (1/3) * (1/2) = 1/6 the
observed difference between program 2 and program 3.

(Note that I started my experiments on a MipsCo machine, but changed to SPARC
partly because the MipsCo architecture didn't require a separate compare
instruction at all when the comparision was for equality or inequality!)

I used cc -O on the Sun 4, measuring the sum of user and system times. The
average of three trials for each was:

Program 1:					31.5 sec
Program 2:					46.2 sec
Program 3:					41.4 sec

Program 2 - Program 3:				46.2 - 41.4 = 4.8 sec
Cost of remaining compares in Program 3:	4.8 / 6 = 0.8 sec
Cost of trips through signal handler:		2.8 sec	[note 1]
Cost of Program 3 with "new instruction":	41.4 - 0.8 + 2.8 = 43.4 sec

Decrease in execution time versus Program 2:	6%	[note 2]

[Note 1]: An additional experiment showed that the cost of the timing loop,
  kill(), and initialization code in program 1 is about 10% of the total, and
  we must divide by 10 because program 1 loops 10 times as often as the others,
  so the cost of trips through the signal handler is (0.9 * 31.5) / 10 = 2.8
  sec.

[Note 2]: Given the crudity of the experiment, I think one significant digit
  is the very most I can present without shame! I didn't try to eliminate the
  cost of the timing loop and initialization code in programs 2 and 3 because,
  since it's the same for both programs, it roughly cancels.

Readers can draw their own conclusions. One might contend that buffers are
usually bigger, magnifying the improvement. Or one might contend that realistic
examples perform more processing than just moving characters from place to
place, diminishing the percentage improvement. And someone who understands
implementations would have to tell us the costs of adding such an instruction.

My conclusions are (1) it's easy to spend quite a lot of time on a scenic tour
through the kernel and (2) it's interesting to measure actual seconds rather
than conceptual cycles.
-- 
sjc@key.com or ...{sun,pyramid}!pacbell!key!sjc 		Steve Correll