Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rutgers!im4u!woton!riddle
From: riddle@woton.UUCP (Prentiss Riddle )
Newsgroups: comp.databases
Subject: Database problem of the week
Message-ID: <923@woton.UUCP>
Date: Tue, 11-Aug-87 12:24:59 EDT
Article-I.D.: woton.923
Posted: Tue Aug 11 12:24:59 1987
Date-Received: Fri, 14-Aug-87 01:31:54 EDT
Organization: Shriners Burns Institute, Galveston
Lines: 71
Quandary: "Bob"


[Someone was complaining that there wasn't enough traffic in this newsgroup,
so I'll take that as a green light to raise my own little application
design quandary.  Please post replies of general or theoretical interest to
comp.databases, but send anything that's not of general interest directly
to me.  Thanks.]

I've inherited responsibility for a medium-large and growing database (say
~400 Kbytes, not including index files etc.) of experimental data which needs
to be kept around and subjected to statistical analysis. 

Currently the data is stored in a single large Informix file in which each
record consists of a key and 52 floating point lab values.  The key is a
composite of a study group number, a subject number and a time. 

	+-----------+-------------+------+-------+-------+-   -+--------+
	| group no. | subject no. | time | val 1 | val 2 | ... | val 52 |
	+-----------+-------------+------+-------+-------+-   -+--------+

So far so good.  What makes me wonder if this is the best way to store the
data is that over half of the lab values are nulls (stored as an impossibly
large real number).  In some of the fields, as a matter of fact, over 95% of
the records have no data.  One reason for this is that half of the fields
contain what are called "extra" variables: since different study groups are
subjected to different experiments, the meanings of these "extra" fields
differ from study group to study group (which is *very* confusing!) and most
study groups don't need them all.  But even in the fields with fixed
meanings, measurements are not always recorded for every subject at every
time point. 

The database is still growing and is becoming large enough to be unwieldy.
It seems to me that there ought to be a logical way to take advantage of
the sparseness of the database by representing it in some other form.  I've
had three lines of thought on this, none of which has yet borne fruit:

PLAN A:  Instead of using very wide records with 52 lab values, why not use
short records each containing a key and a single lab value?  Each record
might look like this:

	+-----------+-------------+------+-------+-------+
	| group no. | subject no. | time | label | value |
	+-----------+-------------+------+-------+-------+

The "label" would be a short (3-4 byte) string explaining the meaning of the
"value" field.  Obviously most of our storage here goes for keys (and
indexes!), but it *might* still take less storage than the current scheme. 
It would also remove confusion over the meaning of the "extra" experimental
values and allow the choice of experimental variables recorded to be tailored
precisely to each study. 

PLAN B: I'm told that the rigid structure of this problem would lend itself
well to a hierarchical database.  Unfortunately I don't know much about what
that means, and it doesn't help us much if all we have is Informix.  Is there
a simple public-domain hierarchical database out there somewhere? 

PLAN C: This may not really be a problem for a DBMS.  The kinds of queries we
tend to do are mostly simple, and we'd be perfectly happy to separate the
study groups into distinct files where they wouldn't interact at all.  What I
sometimes wish I had was a simple program for manipulating a sparse 3-D
matrix with subject on one axis, time on another and variable on a third.  It
would need a user-friendly screen-oriented interface for entering and
correcting data, and the ability to dump one- and two-dimensional slices of
the matrix into ascii files to be fed to statistics programs.  That sounds a
lot like a 3-D spreadsheet to me, but I don't know where to find such a beast
for little or no money.  Also I'm not very familiar with the internals of
spreadsheets.  What sorts of data structures do they use -- hashing, perhaps?

Any comments, advice or discussion would be welcome.

--- Prentiss Riddle ("Aprendiz de todo, maestro de nada.")
--- Opinions expressed are not necessarily those of Shriners Burns Institute.
--- riddle@woton.UUCP  {ihnp4,harvard,seismo}!ut-sally!im4u!woton!riddle