Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!usc!elroy.jpl.nasa.gov!sdd.hp.com!zaphod.mps.ohio-state.edu!rpi!crdgw1!uunet!pmafire!mica.inel.gov!gem-hy!cdm
From: cdm@gem-hy.Berkeley.EDU (Dale Cook)
Newsgroups: comp.databases
Subject: Re: Balancing Normalization with Performance
Message-ID: <1990Nov21.185907.10061@inel.gov>
Date: 21 Nov 90 18:59:07 GMT
References: <1092@cortex.med.jhu.edu> <1990Nov20.190315.13505@informix.com>
Sender: news@inel.gov
Reply-To: cdm@gem-hy.Berkeley.EDU (Dale Cook)
Distribution: usa
Organization: Idaho National Engineering Laboratory, Idaho Falls, Idaho
Lines: 134

In article <1990Nov20.190315.13505@informix.com>, randall@informix.com
(Randall Rhea) writes:
|> In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim
Rosenberg) writes:
|> >Hello database gurus.
|> >
|> >I am building my first database, and am trying to understand how to balance
|> >the benefits of normalization with performance requirements.  I am 
|> >interested in any heuristics or literate on deciding when to scrap 
|> >normalization to get a performance kick.  
|> 
|> This may be the biggest question in database design.  There is no simple
|> answer for it.  Whether or not to normalize is a question that can 
|> only be answered on a case-by-case basis.
|> 
|> In general, I have had difficulty with over-normalized database designs.
|> There are analysts who have read a lot of database books, but have little
|> experience in actually developing real applications ... these are the folks
|> that tend to over-normalize.  I find myself needing to write a fairly simple
|> report, but it turns into a complicated program because data is all over
|> the place.  As you correctly point out, one gets tired of writing multiple
|> sub-queries.
|> 
I think you are confusing design with implementation.  Database _designs_,
IMHO, should always be normalized.  You then examine the application
requirements to determine if it is best to denormalize specific aspects
of your design during implementation.  Typically, these are _performance_
requirements, *not* programmer convenience.  (However, end-user convenience 
is a legitimate requirement).  Remember that the goal of the 
design is to understand your data and represent the data structure *as it
logically exists* so that others can understand it as well, as well as the
obvious goals of minimizing the impact of future changes and maximizing
data sharability.

While I sympathize with the problems of "data all over the place", it does
not in and of itself justify denormalization (especially for programmers).
I've seen many a database where exactly that has been done, which has made
changing that database a maintenance nightmare.

|> I could go on and on about database design theory, but you could find one
|> of those books that discuss it better than I could.  I will, however,
|> refer to your specific example:
|> 
|> >An example issue I am trying to answer:
|> >  I have a domain that can have one of a given set of character
string values.
|> >  I am thinking of creating a relation that equates each string with
|> >  an integer ID.  The benefits would be less memory used in the
relation with
|> >  that domain (since an integer would be stored for each tuple rather than
|> >  an n byte length string), faster comparisons in a qualification (since 
|> >  comparisons on that domain would be between integers and not
strings), and
|> >  less chance of typo's in the database (since could not enter a
string that
|> >  cannot be translated to an ID by a subquery into the ID relation).  The 
|> >  cost, of course, is the subquery required whenever you want to access the
|> >  domain.  Does the answer depend on the length of the strings, etc ...
|> 
|> A great example!  It shows how tricky the whole thing can be.  If there were
|> only 10 possible values for your string, then I would certainly
|> normalize and use the integer ID that you describe.  However, if there were
|> 500 possible string values, or 5000, then I may just stick the string in
|> the same table as everything else and shine the integer ID.  If there
|> were 5000 possible values for your string, then it may not be worth the
|> extra table because:
|> 
|> 1. Any programs that need to look at the string would have to perform
|> a sub-query.  In some report writer languages, this can be quite difficult.
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
So what?  That's why you get the big bucks, isn't it?
|> 
|> 2. You could increase the number of CPU and I/O resources required by
|> the application.   This means that your application could perform slower.

What type of application?  Queries may be slower, but updates may be 
significantly faster!  You have to consider the whole system, not the
piece of it you're working on today.  If your data is static, it may
be worth it.  If it's volatile, you'll pay dearly for denormalization.
|> 
|> 3. Your string table would have to be maintained by somebody.  Keeping
|> track of 5000 different string values is no trivial task.

No kidding.  But if you embed these strings in one table, what happens
to the poor slob needs those strings in another table???  Is (s)he supposed
to replicate this mess in the new table?  So much for disk savings.  And
suppose someone changes the value of one of these strings, but neglects
to change all of the other like values?  Your data is now inconsistent.
Your boss gets one value on his report, and his /her boss gets another on
theirs.  Your boss will want to know why (s)he looked like an a-hole in
the meeting this morning.
|> 
|> If there were only 10 possible string values, I WOULD normalize because:
|> 
|> 1. It could cut down on disk space requirements.

How do we know this?  The number of distinct values is of less importance
than other factors such as the number of tables related to this one and 
data volatility.
|> 
|> 2. The string reference table would enforce standards for the strings, e.g.
|> spellings.

Why would one care about spellings if one had 10 possible values, but not
care if there were 5000?
|> 
|> 3. The system would be easier to use, since a data entry person would not be
|> required to type in a string; he could simply choose one from an
approved list.
|> 
|> In database design, as in life, there are tradeoffs.
|> 
There are no tradeoffs in design, only implementation.  Programmer 
convenience is not a legitimate tradeoff, or at least pretty damn low
on the list.  Balance query time against update time.  Consider data
volatility when considering storing summary information.  Try to look ahead in
time.  Will the data be used by other applications?  Will it change?
My point is that denormalization should not be undertaken lightly.  You
should understand the consequences of it, both near and long term.
The benefits to the entire system should be clearly identified.

--- Dale Cook   cdm@inel.gov
========== long legal disclaimer follows, press n to skip ===========
^L
Neither the United States Government or the Idaho National Engineering
Laboratory or any of their employees, makes any warranty, whatsoever,
implied, or assumes any legal liability or responsibility regarding any
information, disclosed, or represents that its use would not infringe
privately owned rights.  No specific reference constitutes or implies
endorsement, recommendation, or favoring by the United States
Government or the Idaho National Engineering Laboratory.  The views and
opinions expressed herein do not necessarily reflect those of the
United States Government or the Idaho National Engineering Laboratory,
and shall not be used for advertising or product endorsement purposes.