Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!dimacs.rutgers.edu!mips!prls!pyramid!infmx!randall
From: randall@informix.com (Randall Rhea)
Newsgroups: comp.databases
Subject: Re: Balancing Normalization with Performance
Message-ID: <1990Nov20.190315.13505@informix.com>
Date: 20 Nov 90 19:03:15 GMT
References: <1092@cortex.med.jhu.edu>
Sender: news@informix.com (Usenet News)
Distribution: usa
Organization: Informix Software, Inc.
Lines: 70

In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes:
>Hello database gurus.
>
>I am building my first database, and am trying to understand how to balance
>the benefits of normalization with performance requirements.  I am 
>interested in any heuristics or literate on deciding when to scrap 
>normalization to get a performance kick.  

This may be the biggest question in database design.  There is no simple
answer for it.  Whether or not to normalize is a question that can 
only be answered on a case-by-case basis.

In general, I have had difficulty with over-normalized database designs.
There are analysts who have read a lot of database books, but have little
experience in actually developing real applications ... these are the folks
that tend to over-normalize.  I find myself needing to write a fairly simple
report, but it turns into a complicated program because data is all over
the place.  As you correctly point out, one gets tired of writing multiple
sub-queries.

I could go on and on about database design theory, but you could find one
of those books that discuss it better than I could.  I will, however,
refer to your specific example:

>An example issue I am trying to answer:
>  I have a domain that can have one of a given set of character string values.
>  I am thinking of creating a relation that equates each string with
>  an integer ID.  The benefits would be less memory used in the relation with
>  that domain (since an integer would be stored for each tuple rather than
>  an n byte length string), faster comparisons in a qualification (since 
>  comparisons on that domain would be between integers and not strings), and
>  less chance of typo's in the database (since could not enter a string that
>  cannot be translated to an ID by a subquery into the ID relation).  The 
>  cost, of course, is the subquery required whenever you want to access the
>  domain.  Does the answer depend on the length of the strings, etc ...

A great example!  It shows how tricky the whole thing can be.  If there were
only 10 possible values for your string, then I would certainly
normalize and use the integer ID that you describe.  However, if there were
500 possible string values, or 5000, then I may just stick the string in
the same table as everything else and shine the integer ID.  If there
were 5000 possible values for your string, then it may not be worth the
extra table because:

1. Any programs that need to look at the string would have to perform
a sub-query.  In some report writer languages, this can be quite difficult.

2. You could increase the number of CPU and I/O resources required by
the application.   This means that your application could perform slower.

3. Your string table would have to be maintained by somebody.  Keeping
track of 5000 different string values is no trivial task.

If there were only 10 possible string values, I WOULD normalize because:

1. It could cut down on disk space requirements.

2. The string reference table would enforce standards for the strings, e.g.
spellings.

3. The system would be easier to use, since a data entry person would not be
required to type in a string; he could simply choose one from an approved list.

In database design, as in life, there are tradeoffs.

-- 

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Randall Rhea                                          Informix Software, Inc. 
Senior Programmer/Analyst, MIS                    uunet!pyramid!infmx!randall