Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!gatech!hubcap!eos!eugene
From: eos!eugene@eos.arc.nasa.gov (Eugene Miya)
Newsgroups: comp.parallel
Subject: Re: Very Large Data Sets - what are they?
Keywords: your experience -- parallel I/O
Message-ID: <4793@hubcap.UUCP>
Date: 15 Mar 89 18:44:19 GMT
Sender: fpst@hubcap.UUCP
Lines: 48
Approved: parallel@hubcap.clemson.edu

I have discussed some of this with David by mail, but some of the
comments might be useful for network discussion.

Well, really large data sets won't fit on a disk, not even on existing
disk systems. Some examples from my experience, and from watching
others talk about this, include planetary imaging data (Landsat,
Voyager, Seasat). Some of this is 2-D, so you make rows (scan lines)
and "splice" them onto a tape (frequently 90 MB per frequency, 4-7
bandwidths, etc.). Some of the data is strictly linear and time
varying, such as some seismic data; again, tapes. Voyager, as an
example, generates way over 10 thousand magnetic tapes on a planetary
encounter. Such images are typically 1K pixels on a side, and higher
resolutions are planned (10^18 bits for a future mission isn't
uncommon).

The physical manifestation of a DB is a warehouse. Most of it, of
course, never gets looked at (remember the very end of Raiders of the
Lost Ark? ->;-) Want to know how the ozone hole was overlooked?). You
don't want to look at the data yourself; you get a machine to do this.
In fact, most raster systems can't hold complete images, so you only
view sections. You end up always looking at portions.

Now the numbers I mention (90 MB, etc.) might all sound small. These
can fit, that's not the point: okay, 10 images in a GB, roughly. But
that's only bandwidth. And that's only images. And it is TOO easy for
computer people to understand. Deceptively easy.

Some signal processing involves high-dimension FFT (Fast Fourier
Transform) processing. You might want to look at the frequency space.
Or non-linear dynamics, looking at the phase space.

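For scale, the figures quoted above can be run through a quick
back-of-envelope calculation. This is only a rough sketch: it assumes
decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), and the helper
names are made up for illustration. Only the 90 MB and 10^18 bit
figures come from the text above.

```python
# Back-of-envelope sizing. Decimal units are an assumption.
MB = 10**6
GB = 10**9


def whole_units_per_gb(unit_bytes):
    """How many whole units of unit_bytes fit in one (decimal) GB."""
    return GB // unit_bytes


def bits_to_bytes(bits):
    """Convert a bit count to bytes, assuming 8-bit bytes."""
    return bits // 8


if __name__ == "__main__":
    unit = 90 * MB          # "90 MB per frequency"
    mission_bits = 10**18   # "10^18 bits for a future mission"
    mission_bytes = bits_to_bytes(mission_bits)

    print(whole_units_per_gb(unit))   # 11, i.e. "10 images in a GB roughly"
    print(mission_bytes // GB)        # 125000000 GB
    print(mission_bytes // unit)      # 1388888888 units of 90 MB
```

So a 10^18 bit mission is on the order of 10^8 GB, or more than a
billion 90 MB units. That is the warehouse.
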
FFTs have been proposed which would involve a 1K by 4-D FFT. Now, to
get that in, most users just resort to simple linear sweeps (varying
one index). Just big arrays. It's not completely clear how proposals
like RAIDs will help this out (nor images, for that matter). Grids and
meshes, same thing. Try to visualize 4-space, 5, 6, 7 (the problems
exist); we have developed only crude mechanisms to look at simple
cases (Tufte).

Oh yes, some data compression can help. It must be done carefully.

A different way of organizing the data might be like the airlines
(SABRE). There's a CACM paper and a very conference on this topic.
There are some meetings (IEEE) and others (one in Oregon at this
moment) which talk about some of these issues. Distributed databases.
Transaction processing.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  Domains, the zip codes of networks.