Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!cs.utexas.edu!rice!uw-beaver!sumax!polari!6sceng!6sigma!zuker
From: zuker@6sigma.UUCP (Hunter Zuker)
Newsgroups: comp.databases
Subject: Re: Balancing Normalization with Performance
Message-ID: <354@6sigma.UUCP>
Date: 21 Nov 90 23:18:09 GMT
References: <1092@cortex.med.jhu.edu> <1990Nov20.190315.13505@informix.com>
Distribution: usa
Organization: Six Sigma CASE, Inc.
Lines: 99

>(Randall Rhea) writes:
>>(Jim Rosenberg) writes:

>>I am building my first database, and am trying to understand how to balance
>>the benefits of normalization with performance requirements.

>This may be the biggest question in database design.  There is no simple
>answer for it.  Whether or not to normalize is a question that can 
>only be answered on a case-by-case basis.

Agreed.  When and where to de-normalize does not always have a clear
answer.

The first step in designing a good database is to model the database.
You can do this with something like Entity-Relationship diagrams or
I prefer the Extended Relational Analysis model.  From there you can
normalize. (I like third normal form, fourth normal form is generally
considered overkill.)

Once you have a normalized model, then is the time to consider what
might be intellegiently de-normalized.

Without normalization the possibility for data loss or data conflicts
increases dramatically.  Other reasons to normalize are to save disk space
and increase *overall* performance. (Integrity checks on un-normalized
databases can really slow things down.)

Of course if you normalize it also increases the likely-hood that
you've correctly thought through your database.  Which can only help
to make a better database.

>In general, I have had difficulty with over-normalized database designs.

Agreed.  But I would rather have an over-normalized database than one
that wasn't normalized or one that was not well de-normalized.

>>An example issue I am trying to answer:
>>  I am thinking of creating a relation that equates each string with
>>  an integer ID.

>A great example!  It shows how tricky the whole thing can be.  If there were
>only 10 possible values for your string, then I would certainly
>normalize and use the integer ID that you describe.  However, if there were
>500 possible string values, or 5000, then I may just stick the string in
>the same table as everything else and shine the integer ID.  If there
>were 5000 possible values for your string, then it may not be worth the
>extra table because:

>1. Any programs that need to look at the string would have to perform
>a sub-query.  In some report writer languages, this can be quite difficult.

The first sentence is the logical consequence of creating a string table.  
The second sentence is a potential limitation of the DBMS you are using.
Remember not all databases that call themselves relational are relational.

>2. You could increase the number of CPU and I/O resources required by
>the application.   This means that your application could perform slower.

The time it takes to do the sub-query needs to be weighed against any
integrity checks that are needed to validate an entry.

>3. Your string table would have to be maintained by somebody.  Keeping
>track of 5000 different string values is no trivial task.

True.  It depends on how static the string table is.  The more dynamic
the table, the less desirable it is.  Someone has to enter the 
table initially in either case.

>If there were only 10 possible string values, I WOULD normalize because:

>1. It could cut down on disk space requirements.

This is true in either case.  If you have a couple of million records,
then a string table of 5000 might not be unreasonable.  It could save
you a few MB or so.  If you only have 100, then a string table of 5000 
would be silly.

>2. The string reference table would enforce standards for the strings, e.g.
>spellings.

>3. The system would be easier to use, since a data entry person would not be
>required to type in a string; he could simply choose one from an approved list.

Both 2 and 3 are true regardless of the size of the string table.

Whether to include a zipcode table is equivalent to the above example.
A zipcode uniquely identifies a city and state.  If you have a million
or so addresses in your database, it might make sense to have a zipcode
table.  If you only have a thousand (like we do), then it is probably 
better to include city, state and zip in the address record.

>In database design, as in life, there are tradeoffs.

True.

Hunter Zuker
-- 
Hunter Zuker		Six Sigma CASE, Inc.	 13456 SE 27, Suite 210
zuker@6sigma.UUCP       (206) 643-6911           Bellevue, WA 98005-4211