Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!dciem!nrcaer!cognos!garyp
From: garyp@cognos.uucp (Gary Puckering)
Newsgroups: comp.databases
Subject: Re: Database problem of the week
Message-ID: <1304@smokey.UUCP>
Date: Fri, 21-Aug-87 10:54:19 EDT
Article-I.D.: smokey.1304
Posted: Fri Aug 21 10:54:19 1987
Date-Received: Sun, 23-Aug-87 10:12:39 EDT
References: <923@woton.UUCP>
Reply-To: garyp@cognos.UUCP (Gary Puckering)
Organization: Cognos Inc., Ottawa, Canada
Lines: 66

In article <923@woton.UUCP> riddle@woton.UUCP (Prentiss Riddle ) writes:
>Currently the data is stored in a single large Informix file in which each
>record consists of a key and 52 floating point lab values.  The key is a
>composite of a study group number, a subject number and a time. 
>
>	+-----------+-------------+------+-------+-------+-   -+--------+
>	| group no. | subject no. | time | val 1 | val 2 | ... | val 52 |
>	+-----------+-------------+------+-------+-------+-   -+--------+
>
>So far so good.  What makes me wonder if this is the best way to store the
>data is that over half of the lab values are nulls (stored as an impossibly
>large real number).  In some of the fields, as a matter of fact, over 95% of
>the records have no data.  One reason for this is that half of the fields
>contain what are called "extra" variables: since different study groups are
>subjected to different experiments, the meanings of these "extra" fields
>differ from study group to study group (which is *very* confusing!) and most
>study groups don't need them all.  But even in the fields with fixed
>meanings, measurements are not always recorded for every subject at every
>time point. 

One technique you could use is to try to classify the study groups into
different subclasses each with its own unique set of properties.  For
example:

  experiment(expId,groupNo,subjectNo,time) 

  animalExperiment(animalExpId,whiteCellCount,respirationRate,basalTemp)

  birdExperiment(birdExpId,wingBeatRate)

  plantExperiment(plantExpId,transpirationRate,soil_ph)

In this example, Ive introduced a surrogate key, expId, to substitute for
the compound key (groupNo,subjectNo,time).  This surrogate can simply
be a unique number generated  whenever a new experiment is inserted.
In addition, you will probably want to define a unique index on the 
columns (groupNo,subjectNo,time).

The other relations are specializations of the experiment relation.
In fact, their primary keys (animalExpId, birdExpId and plantExpId)
are defined on the same domain as expId.  If an experiment is an
animal experiment, then you will insert a row into both the experiment
relation and the animalExperiment relation.  The expId and the
animalExpId will have the same value.  If it is a bird
experiment (a bird is an animal) then you'll insert a row into the
experiment relation, the animalExperiment relation and the
birdExperiment relation.  Again, the id's will have the same value.

To access your, data, you can join these relations in various ways.
For example:

  select *
  from planExperiment p, experiment e
  where p.plantExpId = e.expId

will get you all the plant experiment information.

Using this technique, you can probably eliminate the use of nulls to
represent inapplicable properties (when you have this case, you
probably have a new subclass).  This leaves the null feature free to
be used to represent an unknown or missing value.
-- 

Gary Puckering        3755 Riverside Dr.
Cognos Incorporated   Ottawa, Ontario    {allegra,decvax,ihnp4,linus,pyramid}
(613) 738-1440        CANADA  K1G 3N3    !utzoo!dciem!nrcaer!cognos!garyp