Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!swrinde!ucsd!ucbvax!agate!darkstar!central.cis.upenn.edu
From: jms@central.cis.upenn.edu (Jonathan M. Smith)
Newsgroups: comp.os.research
Subject: Toy Experiment with Shared Memory Programming
Message-ID: <8754@darkstar.ucsc.edu>
Date: 9 Nov 90 13:49:26 GMT
Sender: usenet@darkstar.ucsc.edu
Organization: University of Pennsylvania
Lines: 352
Approved: comp-os-research@jupiter.ucsc.edu

Some Notes on Experiences programming shared memory

1. Purpose
I, and others, have claimed that using shared memory support for
programming can reduce the copying required in IPC. Being a "doubting
Thomas", I decided to play a little bit and try to apply shared memory
to a task for which other mechanisms (highly optimized) exist. Specifically,
on a System V UNIX system, there exist both pipes and access to shared
memory facilities. In my experiments, I employed a HP9000 Series 370 with
16 megabytes of main memory, running HPUX version 6.5 (deans.cis.upenn.edu).
The system runs diskless except for a 70meg swap device, and my home
directory is on "central.cis.upenn.edu". 

2. Experimental Description
The experiment was as follows: 
	1. Write a two-process program using pipes to communicate
	between reader and writer.
	2. Rewrite the program using shared memory for the same
	communication.
	3. Compare writing effort.
	4. Measure the performance of the two implementations.

3. Pipes
"piper.c" follows:
#include <stdio.h>

#ifndef PIPESIZ
#define PIPESIZ 8192
#endif

main()
{
	int pfd[2], pid, fork(), nread, nwrite;
	char buf[PIPESIZ];

	if( pipe( pfd ) < 0 )
	{
		fprintf( stderr, "Can't open pipe. Exiting.\n" );
		exit( 1 );
	}
	
	switch( (pid = fork()) )
	{
	case -1:
		fprintf( stderr, "Can't fork. Exiting.\n" );
		exit( 2 );
		break;

	case 0: /* child */
		close( pfd[1] );
		while( (nread = read( pfd[0], buf, PIPESIZ ) ) > 0 )
		{
			nwrite = write( 1, buf, nread );
			if( nread != nwrite )
			{
				fprintf( stderr, "IO error. Exiting.\n" );
				exit( 3 );
			}
		}

		break;

	default: /* parent */
		close( pfd[0] );
		while( (nread = read( 0, buf, PIPESIZ ) ) > 0 )
		{
			nwrite = write( pfd[1], buf, nread );
			if( nread != nwrite )
			{
				fprintf( stderr, "IO error. Exiting.\n" );
				exit( 4 );
			}
		}

	}

	exit( 0 );
}

The program was written as it is displayed, and was only changed to
make the buffer size PIPESIZ so that the hidden buffering of pipe data
provided by UNIX did not distort measurements. The size had originally been
set to BUFSIZ. The program took about 15 minutes to type in and remove errors
(mainly lexical). I am an experienced UNIX programmer, and IPC by pipe is the
idiom, so this is to be expected. The program (if you're counting) is about
55 lines of source text.

4. Shared Memory
I tried to keep the same model, that of a parent process reading from the
standard input file, dumping data to a child which writes it to the standard
output. Thus, in their IPC relationship, the parent is the "WRITER" and the
child is the "READER". The shared data structure was a buffer+counter, and
was set up to be the same size as in "piper.c". I realized that I needed 
mutual exclusion, and had to give it some thought. At first, I envisioned
using the memory itself, but one problem seemed to be that C provides no
access to an atomic "test-and-set" style instruction, and with preemptive
multiprocessing, this is obviously necessary. Having used semaphores while
writing code for my thesis, I was attracted to this mechanism for mutual
exclusion, and the HPUX kernel provides semaphores. So I had to add code to
	A. allocate the shared memory for the buffer
	B. allocate the semaphore
	C. initialize the semaphore
	D. perform mutual exclusion when accessing the shared buffer
	E. explicitly deallocate the semaphore and shared memory on
	   completion
I was able to subtract code to
	A. create the pipe
	B. read/write from/to the pipe

In some initial trials, I found that the shared memory program ("mem")
performed far worse than "piper". This surprised me, so I profiled the code
to find out where the time was going, and it was being eaten up by the
calls to "semop()", which are used to implement P() and V(). But the cost
of the calls was not the problem; it was the *number* of calls! For a file
(my "mbox") with 184 8192-byte blocks, about 30-40 K calls to semop were
made! This didn't sem possible from examining the code; and I must admit to
a lot of hacking trying to "fix" the problem, which I was sure was a BUG.

The problem was that pipes are well-integrated with process synchronization
and scheduling. While executing a "V()" operation freed the semaphore, it
did not give up the processor, so that the process which had just executed
the "V" could now execute a "P" again immediately. After I saw this, I quickly
rewrote the program with two semaphores so that better control over process
control (such as in "sleep" and "wakeup") and response to events could be
achieved.

The correct (and reasonably fast) "mem.c" is given below:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>

#ifndef PIPESIZ
#define PIPESIZ 8192
#endif

struct shared_buf {
	char buf[PIPESIZ];
	int count;
};

#define READ 0
#define WRITE 1
#define NSEM 2

int Semid;
struct sembuf Sops[1];

get_sema()
{
	if( (Semid = semget( IPC_PRIVATE, NSEM, 0600 )) < 0 )
	{
		fprintf( stderr, "get of sema failed. Exiting.\n" );
		exit( 1 );
	}
}

set_sema()
{
	int i;

	for( i = 0; i < NSEM; i += 1 )
	{
		if( semctl( Semid, i, SETVAL, 1 ) < 0 )
		{
			fprintf( stderr, "set of sema failed. Exiting.\n" );
			exit( 1 );
		}
	}
	
	return;
}

P( sema )
int sema;
{ 
	Sops[0].sem_flg = 0;
	Sops[0].sem_num = sema;
  	Sops[0].sem_op = -1;
	semop( Semid, Sops, 1 );

	return;
}

V( sema )
int sema;
{
	Sops[0].sem_flg = 0;
	Sops[0].sem_num = sema;
	Sops[0].sem_op = 1;
	semop( Semid, Sops, 1 );

	return;
}

main()
{
	int shmid, pid, fork(), shmget();
	struct shared_buf *sbp;
	char *shmat();

	if( (shmid = shmget(IPC_PRIVATE, sizeof( *sbp ), 0600) ) < 0 )
	{
		fprintf( stderr, "Can't get shared mem. Exiting.\n" );
		exit( 1 );
	}
	sbp = (struct shared_buf *) shmat( shmid, (char *) 0, 0 );
	if( sbp == (struct shared_buf *) -1 )
 	{
		fprintf( stderr, "Can't attach shared mem. Exiting.\n" );
		exit( 1 );
	} 
	get_sema();
	set_sema();
	P( READ );
	
	switch( (pid = fork()) )
	{
	case -1:
		fprintf( stderr, "Can't fork. Exiting.\n" );
		exit( 2 );
		break;

	case 0: /* child */
		do
		{
			P( READ );
			write( 1, sbp->buf, sbp->count );
			V( WRITE );
		} while( sbp->count > 0 );

		/* clean-up */
		shmdt( sbp );
		shmctl( shmid, IPC_RMID, 0 );
		semctl( Semid, 0, IPC_RMID, 0 );
		
		break;

	default: /* parent */
		do
		{
			P( WRITE );
			sbp->count = read( 0, sbp->buf, PIPESIZ );
			V( READ );
		} while( sbp->count > 0 );
	}

	exit( 0 );
}

This program takes about 130 lines of code, although the length of code 
inside main() is about the same. The main processing loops are much more
elegant, but the setup is uglier. In order to force the WRITER (parent) to
go first, the child is blocked by calling P( READ ) in the parent, thus
blocking the child until the writer calls V( READ ) after it has read data
into the buffer. It took me quite a long time (4-7 hours, depending on how
you count, spread over several days) to get the program working, although I
am now sure I could apply the paradigm effectively in a much shorter period
of time, since I understand the details. Note that a program with multiple
readers and writers would require a semaphore for mutual exclusion as well as
mechanism for scheduling. (I originally included such a semaphore, MUTEX,
but discarded it after I concluded it was unnecessary here).

5. Comparative Measurements
The programs take the same time!
My "mbox" file is a 1.5M text file.
The programs were compiled with optimization turned on,
and run using the file as input. Output was /dev/null, as is shown below:

$ time ./piper <mbox >/dev/null

real    0m4.04s
user    0m0.00s
sys     0m0.78s
$ time ./piper <mbox >/dev/null

real    0m3.96s
user    0m0.02s
sys     0m0.70s
$ time ./mem <mbox >/dev/null  

real    0m3.50s
user    0m0.08s
sys     0m0.62s
$ time ./mem <mbox >/dev/null

real    0m3.46s
user    0m0.00s
sys     0m0.66s

The "mem" programs seem to be slightly faster, but there is no substantial
difference. Once again, I profiled, (using PROFDIR=. in the environment, so
that I could examine the profiles from the child and parent separately); here
are the top few lines for the parent:

$ prof -m2765.mem mem
 %Time Seconds Cumsecs  #Calls   msec/call  Name
  78.8    0.52    0.52     184      2.83    _read
  15.2    0.10    0.62     369      0.27    _semop
   3.0    0.02    0.64       1     20.      _main
   3.0    0.02    0.66       1     20.      _fork
   0.0    0.00    0.66     185      0.00    _P

and the child:

$ prof -m2766.mem mem
 %Time Seconds Cumsecs  #Calls   msec/call  Name
  72.0    0.16    0.16     369      0.43    _semop
  18.0    0.06    0.22     185      0.32    _write
   0.0    0.00    0.22       1      0.      _main
   0.0    0.00    0.22     185      0.00    _P

(there was a bogus timing for fork() here, and
there are of course many lines of calls taking no time edited out
for the reader's sanity....)

For piper, the parent's profile was:
$ prof -m2779.piper piper
 %Time Seconds Cumsecs  #Calls   msec/call  Name
  58.1    0.50    0.50     184      2.72    _read
  37.2    0.32    0.82     184      1.74    _write
   2.3    0.02    0.84       1     20.      _main
   2.3    0.02    0.86       1     20.      _pipe

and the child's:
$ prof -m2780.piper piper
 %Time Seconds Cumsecs  #Calls   msec/call  Name
  82.0    0.36    0.36     184      1.96    _read
   9.0    0.04    0.40     184      0.22    _write
   4.5    0.02    0.42       1     20.      _main
   4.5    0.02    0.44       1     20.      _pipe


Thus, we see that it's the I/O (read+write=.58) and semaphores (.26)
that are eating up the time; the shared memory version looks faster
(.84 versus 1.30), and IPC is definitely faster; subtracting I/O due to
external sources, gives .30 versus .72, or more than a factor of two.
I will try to cook up an experiment a little later
to get timings which are based only on the IPC time.

6. Summary (so far)
But it is helpful to observe the following facts:
1. IPC may not be the dominant cost in any case (here, it's external I/O)
2. There was no major difference in the response times of "mem" and "piper"
3. This may be a best case comparison for pipes - pipes are highly optimized
on UNIX and pipe flow control is tightly integrated with scheduling and
process control.

In fact, I suspect that fact 3 is the most reasonable explanation for fact 2,
given the profile results.