Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!att!linac!pacific.mps.ohio-state.edu!zaphod.mps.ohio-state.edu!swrinde!ucsd!ucbvax!agate!darkstar!central.cis.upenn.edu From: jms@central.cis.upenn.edu (Jonathan M. Smith) Newsgroups: comp.os.research Subject: Toy Experiment with Shared Memory Programming Message-ID: <8754@darkstar.ucsc.edu> Date: 9 Nov 90 13:49:26 GMT Sender: usenet@darkstar.ucsc.edu Organization: University of Pennsylvania Lines: 352 Approved: comp-os-research@jupiter.ucsc.edu Some Notes on Experiences programming shared memory 1. Purpose I, and others, have claimed that using shared memory support for programming can reduce the copying required in IPC. Being a "doubting Thomas", I decided to play a little bit and try to apply shared memory to a task for which other mechanisms (highly optimized) exist. Specifically, on a System V UNIX system, there exist both pipes and access to shared memory facilities. In my experiments, I employed a HP9000 Series 370 with 16 megabytes of main memory, running HPUX version 6.5 (deans.cis.upenn.edu). The system runs diskless except for a 70meg swap device, and my home directory is on "central.cis.upenn.edu". 2. Experimental Description The experiment was as follows: 1. Write a two-process program using pipes to communicate between reader and writer. 2. Rewrite the program using shared memory for the same communication. 3. Compare writing effort. 4. Measure the performance of the two implementations. 3. Pipes "piper.c" follows: #include #ifndef PIPESIZ #define PIPESIZ 8192 #endif main() { int pfd[2], pid, fork(), nread, nwrite; char buf[PIPESIZ]; if( pipe( pfd ) < 0 ) { fprintf( stderr, "Can't open pipe. Exiting.\n" ); exit( 1 ); } switch( (pid = fork()) ) { case -1: fprintf( stderr, "Can't fork. Exiting.\n" ); exit( 2 ); break; case 0: /* child */ close( pfd[1] ); while( (nread = read( pfd[0], buf, PIPESIZ ) ) > 0 ) { nwrite = write( 1, buf, nread ); if( nread != nwrite ) { fprintf( stderr, "IO error. Exiting.\n" ); exit( 3 ); } } break; default: /* parent */ close( pfd[0] ); while( (nread = read( 0, buf, PIPESIZ ) ) > 0 ) { nwrite = write( pfd[1], buf, nread ); if( nread != nwrite ) { fprintf( stderr, "IO error. Exiting.\n" ); exit( 4 ); } } } exit( 0 ); } The program was written as it is displayed, and was only changed to make the buffer size PIPESIZ so that the hidden buffering of pipe data provided by UNIX did not distort measurements. The size had originally been set to BUFSIZ. The program took about 15 minutes to type in and remove errors (mainly lexical). I am an experienced UNIX programmer, and IPC by pipe is the idiom, so this is to be expected. The program (if you're counting) is about 55 lines of source text. 4. Shared Memory I tried to keep the same model, that of a parent process reading from the standard input file, dumping data to a child which writes it to the standard output. Thus, in their IPC relationship, the parent is the "WRITER" and the child is the "READER". The shared data structure was a buffer+counter, and was set up to be the same size as in "piper.c". I realized that I needed mutual exclusion, and had to give it some thought. At first, I envisioned using the memory itself, but one problem seemed to be that C provides no access to an atomic "test-and-set" style instruction, and with preemptive multiprocessing, this is obviously necessary. Having used semaphores while writing code for my thesis, I was attracted to this mechanism for mutual exclusion, and the HPUX kernel provides semaphores. So I had to add code to A. allocate the shared memory for the buffer B. allocate the semaphore C. initialize the semaphore D. perform mutual exclusion when accessing the shared buffer E. explicitly deallocate the semaphore and shared memory on completion I was able to subtract code to A. create the pipe B. read/write from/to the pipe In some initial trials, I found that the shared memory program ("mem") performed far worse than "piper". This surprised me, so I profiled the code to find out where the time was going, and it was being eaten up by the calls to "semop()", which are used to implement P() and V(). But the cost of the calls was not the problem; it was the *number* of calls! For a file (my "mbox") with 184 8192-byte blocks, about 30-40 K calls to semop were made! This didn't sem possible from examining the code; and I must admit to a lot of hacking trying to "fix" the problem, which I was sure was a BUG. The problem was that pipes are well-integrated with process synchronization and scheduling. While executing a "V()" operation freed the semaphore, it did not give up the processor, so that the process which had just executed the "V" could now execute a "P" again immediately. After I saw this, I quickly rewrote the program with two semaphores so that better control over process control (such as in "sleep" and "wakeup") and response to events could be achieved. The correct (and reasonably fast) "mem.c" is given below: #include #include #include #include #include #ifndef PIPESIZ #define PIPESIZ 8192 #endif struct shared_buf { char buf[PIPESIZ]; int count; }; #define READ 0 #define WRITE 1 #define NSEM 2 int Semid; struct sembuf Sops[1]; get_sema() { if( (Semid = semget( IPC_PRIVATE, NSEM, 0600 )) < 0 ) { fprintf( stderr, "get of sema failed. Exiting.\n" ); exit( 1 ); } } set_sema() { int i; for( i = 0; i < NSEM; i += 1 ) { if( semctl( Semid, i, SETVAL, 1 ) < 0 ) { fprintf( stderr, "set of sema failed. Exiting.\n" ); exit( 1 ); } } return; } P( sema ) int sema; { Sops[0].sem_flg = 0; Sops[0].sem_num = sema; Sops[0].sem_op = -1; semop( Semid, Sops, 1 ); return; } V( sema ) int sema; { Sops[0].sem_flg = 0; Sops[0].sem_num = sema; Sops[0].sem_op = 1; semop( Semid, Sops, 1 ); return; } main() { int shmid, pid, fork(), shmget(); struct shared_buf *sbp; char *shmat(); if( (shmid = shmget(IPC_PRIVATE, sizeof( *sbp ), 0600) ) < 0 ) { fprintf( stderr, "Can't get shared mem. Exiting.\n" ); exit( 1 ); } sbp = (struct shared_buf *) shmat( shmid, (char *) 0, 0 ); if( sbp == (struct shared_buf *) -1 ) { fprintf( stderr, "Can't attach shared mem. Exiting.\n" ); exit( 1 ); } get_sema(); set_sema(); P( READ ); switch( (pid = fork()) ) { case -1: fprintf( stderr, "Can't fork. Exiting.\n" ); exit( 2 ); break; case 0: /* child */ do { P( READ ); write( 1, sbp->buf, sbp->count ); V( WRITE ); } while( sbp->count > 0 ); /* clean-up */ shmdt( sbp ); shmctl( shmid, IPC_RMID, 0 ); semctl( Semid, 0, IPC_RMID, 0 ); break; default: /* parent */ do { P( WRITE ); sbp->count = read( 0, sbp->buf, PIPESIZ ); V( READ ); } while( sbp->count > 0 ); } exit( 0 ); } This program takes about 130 lines of code, although the length of code inside main() is about the same. The main processing loops are much more elegant, but the setup is uglier. In order to force the WRITER (parent) to go first, the child is blocked by calling P( READ ) in the parent, thus blocking the child until the writer calls V( READ ) after it has read data into the buffer. It took me quite a long time (4-7 hours, depending on how you count, spread over several days) to get the program working, although I am now sure I could apply the paradigm effectively in a much shorter period of time, since I understand the details. Note that a program with multiple readers and writers would require a semaphore for mutual exclusion as well as mechanism for scheduling. (I originally included such a semaphore, MUTEX, but discarded it after I concluded it was unnecessary here). 5. Comparative Measurements The programs take the same time! My "mbox" file is a 1.5M text file. The programs were compiled with optimization turned on, and run using the file as input. Output was /dev/null, as is shown below: $ time ./piper /dev/null real 0m4.04s user 0m0.00s sys 0m0.78s $ time ./piper /dev/null real 0m3.96s user 0m0.02s sys 0m0.70s $ time ./mem /dev/null real 0m3.50s user 0m0.08s sys 0m0.62s $ time ./mem /dev/null real 0m3.46s user 0m0.00s sys 0m0.66s The "mem" programs seem to be slightly faster, but there is no substantial difference. Once again, I profiled, (using PROFDIR=. in the environment, so that I could examine the profiles from the child and parent separately); here are the top few lines for the parent: $ prof -m2765.mem mem %Time Seconds Cumsecs #Calls msec/call Name 78.8 0.52 0.52 184 2.83 _read 15.2 0.10 0.62 369 0.27 _semop 3.0 0.02 0.64 1 20. _main 3.0 0.02 0.66 1 20. _fork 0.0 0.00 0.66 185 0.00 _P and the child: $ prof -m2766.mem mem %Time Seconds Cumsecs #Calls msec/call Name 72.0 0.16 0.16 369 0.43 _semop 18.0 0.06 0.22 185 0.32 _write 0.0 0.00 0.22 1 0. _main 0.0 0.00 0.22 185 0.00 _P (there was a bogus timing for fork() here, and there are of course many lines of calls taking no time edited out for the reader's sanity....) For piper, the parent's profile was: $ prof -m2779.piper piper %Time Seconds Cumsecs #Calls msec/call Name 58.1 0.50 0.50 184 2.72 _read 37.2 0.32 0.82 184 1.74 _write 2.3 0.02 0.84 1 20. _main 2.3 0.02 0.86 1 20. _pipe and the child's: $ prof -m2780.piper piper %Time Seconds Cumsecs #Calls msec/call Name 82.0 0.36 0.36 184 1.96 _read 9.0 0.04 0.40 184 0.22 _write 4.5 0.02 0.42 1 20. _main 4.5 0.02 0.44 1 20. _pipe Thus, we see that it's the I/O (read+write=.58) and semaphores (.26) that are eating up the time; the shared memory version looks faster (.84 versus 1.30), and IPC is definitely faster; subtracting I/O due to external sources, gives .30 versus .72, or more than a factor of two. I will try to cook up an experiment a little later to get timings which are based only on the IPC time. 6. Summary (so far) But it is helpful to observe the following facts: 1. IPC may not be the dominant cost in any case (here, it's external I/O) 2. There was no major difference in the response times of "mem" and "piper" 3. This may be a best case comparison for pipes - pipes are highly optimized on UNIX and pipe flow control is tightly integrated with scheduling and process control. In fact, I suspect that fact 3 is the most reasonable explanation for fact 2, given the profile results.