Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!mnetor!uunet!seismo!rutgers!im4u!woton!riddle From: riddle@woton.UUCP (Prentiss Riddle ) Newsgroups: comp.databases Subject: Database problem of the week Message-ID: <923@woton.UUCP> Date: Tue, 11-Aug-87 12:24:59 EDT Article-I.D.: woton.923 Posted: Tue Aug 11 12:24:59 1987 Date-Received: Fri, 14-Aug-87 01:31:54 EDT Organization: Shriners Burns Institute, Galveston Lines: 71 Quandary: "Bob" [Someone was complaining that there wasn't enough traffic in this newsgroup, so I'll take that as a green light to raise my own little application design quandary. Please post replies of general or theoretical interest to comp.databases, but send anything that's not of general interest directly to me. Thanks.] I've inherited responsibility for a medium-large and growing database (say ~400 Kbytes, not including index files etc.) of experimental data which needs to be kept around and subjected to statistical analysis. Currently the data is stored in a single large Informix file in which each record consists of a key and 52 floating point lab values. The key is a composite of a study group number, a subject number and a time. +-----------+-------------+------+-------+-------+- -+--------+ | group no. | subject no. | time | val 1 | val 2 | ... | val 52 | +-----------+-------------+------+-------+-------+- -+--------+ So far so good. What makes me wonder if this is the best way to store the data is that over half of the lab values are nulls (stored as an impossibly large real number). In some of the fields, as a matter of fact, over 95% of the records have no data. One reason for this is that half of the fields contain what are called "extra" variables: since different study groups are subjected to different experiments, the meanings of these "extra" fields differ from study group to study group (which is *very* confusing!) and most study groups don't need them all. But even in the fields with fixed meanings, measurements are not always recorded for every subject at every time point. The database is still growing and is becoming large enough to be unwieldy. It seems to me that there ought to be a logical way to take advantage of the sparseness of the database by representing it in some other form. I've had three lines of thought on this, none of which has yet borne fruit: PLAN A: Instead of using very wide records with 52 lab values, why not use short records each containing a key and a single lab value? Each record might look like this: +-----------+-------------+------+-------+-------+ | group no. | subject no. | time | label | value | +-----------+-------------+------+-------+-------+ The "label" would be a short (3-4 byte) string explaining the meaning of the "value" field. Obviously most of our storage here goes for keys (and indexes!), but it *might* still take less storage than the current scheme. It would also remove confusion over the meaning of the "extra" experimental values and allow the choice of experimental variables recorded to be tailored precisely to each study. PLAN B: I'm told that the rigid structure of this problem would lend itself well to a hierarchical database. Unfortunately I don't know much about what that means, and it doesn't help us much if all we have is Informix. Is there a simple public-domain hierarchical database out there somewhere? PLAN C: This may not really be a problem for a DBMS. The kinds of queries we tend to do are mostly simple, and we'd be perfectly happy to separate the study groups into distinct files where they wouldn't interact at all. What I sometimes wish I had was a simple program for manipulating a sparse 3-D matrix with subject on one axis, time on another and variable on a third. It would need a user-friendly screen-oriented interface for entering and correcting data, and the ability to dump one- and two-dimensional slices of the matrix into ascii files to be fed to statistics programs. That sounds a lot like a 3-D spreadsheet to me, but I don't know where to find such a beast for little or no money. Also I'm not very familiar with the internals of spreadsheets. What sorts of data structures do they use -- hashing, perhaps? Any comments, advice or discussion would be welcome. --- Prentiss Riddle ("Aprendiz de todo, maestro de nada.") --- Opinions expressed are not necessarily those of Shriners Burns Institute. --- riddle@woton.UUCP {ihnp4,harvard,seismo}!ut-sally!im4u!woton!riddle