Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!cmcl2!lanl!jlg
From: jlg@lanl.ARPA (Jim Giles)
Newsgroups: net.arch
Subject: Re: Reasons For Large Main Memories
Message-ID: <7541@lanl.ARPA>
Date: Mon, 15-Sep-86 19:21:10 EDT
Article-I.D.: lanl.7541
Posted: Mon Sep 15 19:21:10 1986
Date-Received: Mon, 15-Sep-86 22:13:05 EDT
References: <1161@bu-cs.bu-cs.BU.EDU> <8529@duke.duke.UUCP> <672@ur-tut.UUCP> <7418@lanl.ARPA> <684@ur-tut.UUCP>
Reply-To: jlg@a.UUCP (Jim Giles)
Organization: Los Alamos National Laboratory
Lines: 58
Keywords: virtual memory appropriate

In article <684@ur-tut.UUCP> tuba@ur-tut.UUCP (Jon Krueger) writes:
>...
>	3. applications which can trivially manage their own
>	   spaces, with acceptable development and maintainence
>	   costs, which achieve higher levels of performance
>	   than their virtual memory equivalents
>...
>I invite anyone to submit examples of type 3.  I haven't got any, but I
>suspect there are a few.

Yes, quite a few!  In my opinion a large share of all scientific
computing falls into this category, including several of the codes I have
written (or written parts of) or used.

I wrote an EMP (electromagnetic pulse) simulation code, for example, which
was essentially a large 3-d grid with a satellite model inside: you hit the
satellite with x-rays and iterate Maxwell's equations on the grid.  Maxwell's
equations require you to keep 9 numbers for each position in the grid: 3
electric field components, 3 magnetic field components, and 3 current
density components (x-rays knock free charges off the body of the satellite,
and the drift of these charges through the EM field causes a current).

Now take a rectangular grid that's 100 cells on a side, or 1 million cells
(100^3).  That's 9 million floating-point numbers for the grid.  This
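The storage arithmetic can be checked in a few lines.  This is only a
back-of-the-envelope sketch.  It assumes one 60-bit CDC 7600 word per
component value and converts to 8-bit bytes purely for comparison with
present-day machines.

```python
# Storage estimate for the EMP grid (assumption: one 60-bit word per value).
N = 100            # cells along each edge of the cubic grid
COMPONENTS = 9     # 3 electric field + 3 magnetic field + 3 current density
WORD_BITS = 60     # CDC 7600 word size

cells = N ** 3                        # total grid cells
words = cells * COMPONENTS            # words for the whole grid
plane_words = N * N * COMPONENTS      # words for one plane, all components

print(cells)                          # 1000000
print(words)                          # 9000000
print(plane_words)                    # 90000
print(words * WORD_BITS // 8)         # 67500000 (bytes)
```

One plane is therefore only 1% of the grid, which is what makes plane-sized
"pages" practical.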
clearly didn't fit in the CDC 7600 I was using at the time (1975), so
I had to 'page' the grid.  This was made fairly easy by the cyclical
nature of the required accesses: as soon as I was done with a plane
(that is, once I had updated a plane of the grid and all its neighbors), I
could flush it out to disk and start reading in the farthest-ahead plane
that would now fit (so that the input MIGHT complete before I needed the
data).
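The plane-at-a-time scheme can be sketched roughly as follows.  This is a
hypothetical reconstruction, not the original code.  `sweep`, `read_plane`,
`write_plane`, `update` and `lookahead` are illustrative names, and a
one-thread pool stands in for asynchronous disk reads.  Unlike an in-place
update, it reads old planes from one store and writes new planes to another,
so that every update sees unmodified neighbors.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def sweep(read_plane, write_plane, nz, update, lookahead=2):
    """One out-of-core pass: new[z] = update(old[z-1], old[z], old[z+1]).

    Memory holds only prev, cur, nxt plus at most `lookahead` prefetched
    planes.  prev and nxt are None at the two boundary planes.
    """
    if lookahead < 1:
        raise ValueError("lookahead must be at least 1")
    if nz == 0:
        return
    with ThreadPoolExecutor(max_workers=1) as io:
        ahead = deque()   # futures for planes already requested, in z order
        next_z = 0        # next plane index to request from "disk"

        def prefetch(upto):
            nonlocal next_z
            while next_z < nz and next_z <= upto:
                ahead.append(io.submit(read_plane, next_z))
                next_z += 1

        prefetch(lookahead)                       # start reading planes 0..lookahead
        prev, cur = None, ahead.popleft().result()
        for z in range(nz):
            # Wait for plane z+1; ideally its read completed long ago.
            nxt = ahead.popleft().result() if z + 1 < nz else None
            write_plane(z, update(prev, cur, nxt))   # flush the finished plane
            prev, cur = cur, nxt                     # the old prev is dropped
            prefetch(z + 1 + lookahead)              # read the farthest plane that now fits


# Usage: Python lists stand in for the disk files.
disk_in = [[float(z)] * 4 for z in range(6)]
disk_out = [None] * 6

def read(z):
    return disk_in[z]

def write(z, plane):
    disk_out[z] = plane

def average(prev, cur, nxt):
    planes = [p for p in (prev, cur, nxt) if p is not None]
    return [sum(v) / len(planes) for v in zip(*planes)]

sweep(read, write, 6, average)
print(disk_out[3])   # [3.0, 3.0, 3.0, 3.0]
```

All of the I/O bookkeeping lives in the outer loop over z; `update` itself
never knows that the grid is not resident.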

Note that the 'paging' activity took place at a natural dividing point
of the code: outside the 'plane' loop.  As a result, the inner
2 loops of the code (a 3-d loop, remember) were optimized as if the whole
problem fit in memory.  Smaller grids were handled similarly, except that the
'paging' might be done on several planes of the grid at a time instead of one.
On a large grid, the 'paging' scheme could be moved into the next
lower loop, and the 'pages' would have been half-planes or even strips of
the grid.  For a very small grid (one that almost fits in memory), the paging
scheme could be modified to swap only a few planes (the actual technique
was to keep the whole grid for each field component that would fit
and page only the remaining components).
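The choice among these variants can be written down as a rough decision
rule.  The sketch below is an illustrative heuristic only, under assumptions
not in the original: memory is counted in words, every component has the same
size, and the stencil needs a window of three pages resident at once.
`plan_paging` and its return labels are hypothetical names.

```python
def plan_paging(n, components, memory_words, window=3):
    """Pick a paging granularity for an n^3 grid (hypothetical heuristic)."""
    plane = n * n      # words in one plane of one component
    grid = n ** 3      # words in the whole grid of one component

    # Everything fits: no paging at all.
    if components * grid <= memory_words:
        return ("in_core", components)

    # Grid "almost fits": keep k whole components resident, stream the rest
    # plane by plane with a window of `window` planes.
    for k in range(components - 1, 0, -1):
        if k * grid + (components - k) * window * plane <= memory_words:
            return ("resident_components", k)

    # Page every component; `window` pages must fit at once.
    per_plane = components * plane
    planes_per_page = memory_words // (window * per_plane)
    if planes_per_page >= 1:
        return ("planes", planes_per_page)

    # Large grid: move paging into the next lower loop.
    for name, page_words in (("half_planes", per_plane // 2),
                             ("strips", components * n)):
        if window * page_words <= memory_words:
            return (name, 1)
    raise ValueError("not even a window of strips fits in memory")
```

For the 100^3, 9-component grid with 1,200,000 words available, for
instance, it returns ('planes', 4): three pages of four planes each
(1,080,000 words) fit, but keeping even one whole component resident does not.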

Since all the paging activity was isolated in the outer loops of the code,
and since there was no additional address-translation circuitry in the memory
interface, I claim that this code ran faster than it would
have on a VM system.  I had no VM computer to benchmark against at the
time, so I don't have numbers to prove this claim (even if you were
inclined to believe benchmark results that you couldn't test yourself).

Very many scientific codes have a structure similar to the one described
above: all finite-difference codes and, of course, finite-element codes,
etc.  Since this is the case, I claim that your class 3 set is well
occupied by real-world production programs.

J. Giles
Los Alamos