Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!gatech!hubcap!eos!eugene
From: eos!eugene@eos.arc.nasa.gov (Eugene Miya)
Newsgroups: comp.parallel
Subject: Re: Very Large Data Sets - what are they?
Keywords: your experience -- parallel I/O
Message-ID: <4793@hubcap.UUCP>
Date: 15 Mar 89 18:44:19 GMT
Sender: fpst@hubcap.UUCP
Lines: 48
Approved: parallel@hubcap.clemson.edu

I have discussed some of this with David by mail, but some of the comments
might be useful for network discussion.

Well, really large data sets won't fit on a disk.  Not even on existing
disk systems.  Some examples, from my own experience and from watching
others talk about this, include planetary imaging data (Landsat, Voyager, Seasat).
Some of this is 2-d, so you make rows (scan lines) and "splice" them
into a tape (frequently 90 MB per frequency, 4-7 bandwidths, etc.).
Some of the data is strictly linear and time varying, such as some seismic
data, again on tapes.  Voyager on a planetary encounter, as an example,
generates way over 10 thousand magnetic tapes.  Such images are typically
1K pixels on a side and higher resolutions are planned (10^18 bits for
a future mission isn't uncommon).  The physical manifestation of a DB is
a warehouse.  Most of it, of course, never gets looked at (remember the
very end of Raiders of the Lost Ark? ->;-)  Want to know how the ozone
hole was overlooked?).  You don't want to look at the data yourself; you
get a machine to do this.
In fact most raster systems can't hold complete images; you only view
sections.  So you always end up looking at portions.  Now the numbers
I mention (90 MB, etc.) might all sound small.  These can fit;
that's not the point.  Okay, roughly 10 images in a GB.  But that's
only bandwidth.  And that's only images.  And it is TOO easy for
computer people to understand.
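
For scale, the figures quoted above can be turned into a quick
back-of-envelope calculation.  This is only a sketch: it assumes decimal
units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and treats the quoted 90 MB
as the size of one image.  The variable names are made up for
illustration.

```python
# Back-of-envelope sizes using only the figures quoted in the post.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 GB = 10**9 bytes).
MB = 10**6
GB = 10**9

# Assumption: the quoted "90 MB per frequency" is treated as one image.
image_bytes = 90 * MB
images_per_gb = GB // image_bytes        # whole images that fit in one GB

# "10^18 bits for a future mission", converted to bytes and GB.
mission_bits = 10**18
mission_bytes = mission_bits // 8
mission_gb = mission_bytes // GB

# How many 90 MB units that many bytes would correspond to.
mission_images = mission_bytes // image_bytes

print("images per GB:", images_per_gb)
print("mission size in GB:", mission_gb)
print("mission size in 90 MB units:", mission_images)
```

The single image fits comfortably; the mission total is what fills the
warehouse.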

Deceptively easy.  Some signal processing involves high
dimension FFT (Fast Fourier Transform) processing.  You might
want to look at the frequency space.  Or non-linear dynamics looking at
the phase space.  FFTs have been proposed which would involve a 1K
by 4-D FFT.  Now, to get that in, most users just resort to simple linear
sweeps (varying one index).  Just big arrays.  It's not completely
clear how proposals like RAIDs will help with this (nor with images, for
that matter).  Grids and meshes, same thing.  Try to visualize 4-space,
5, 6, 7 (the problems exist), and we have developed crude mechanisms
to look at simple cases (Tufte).
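
To put rough numbers on the FFT case, here is a small sketch.  It
assumes "1K by 4-D" means 1024 points along each of four axes, that the
array is stored densely in row-major order, and that each sample is an
8-byte complex value.  None of that is stated above, and the helper names
are hypothetical.  It shows the total size and how far apart consecutive
elements sit when you sweep along one index.

```python
def row_major_strides(shape):
    """Element strides for a dense row-major array of the given shape."""
    strides = [1] * len(shape)
    # Walk from the second-to-last axis back to the first.
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return strides


def sweep_offsets(shape, axis, index):
    """Linear offsets visited when only `axis` varies.

    `index` fixes the other coordinates; its entry at `axis` is ignored.
    """
    strides = row_major_strides(shape)
    base = sum(i * s for k, (i, s) in enumerate(zip(index, strides))
               if k != axis)
    return [base + j * strides[axis] for j in range(shape[axis])]


# Assumption: 1024 points per axis, 4 axes, 8 bytes per complex sample.
shape = (1024, 1024, 1024, 1024)
bytes_per_sample = 8

total_elements = 1
for n in shape:
    total_elements *= n
total_bytes = total_elements * bytes_per_sample

print("elements:", total_elements)
print("bytes:", total_bytes)
print("strides (in elements):", row_major_strides(shape))
```

Under these assumptions, sweeping the last index reads neighbouring
elements, while sweeping the first index jumps 2^30 elements per step.
That gap is why the storage layout matters once the array lives on disk
or tape rather than in memory.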

Oh yes, some data compression can help.  Must be done carefully.

A different way of organizing the data might be like the airlines
(SABRE).  There's a CACM paper and a very conference on this topic.
There are some meetings (IEEE) and others (one in Oregon at this moment)
which talk about some of these issues.  Distributed databases.
Transaction processing.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  Domains, the zip codes of networks.