Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!usc!elroy.jpl.nasa.gov!sdd.hp.com!zaphod.mps.ohio-state.edu!rpi!crdgw1!uunet!pmafire!mica.inel.gov!gem-hy!cdm From: cdm@gem-hy.Berkeley.EDU (Dale Cook) Newsgroups: comp.databases Subject: Re: Balancing Normalization with Performance Message-ID: <1990Nov21.185907.10061@inel.gov> Date: 21 Nov 90 18:59:07 GMT References: <1092@cortex.med.jhu.edu> <1990Nov20.190315.13505@informix.com> Sender: news@inel.gov Reply-To: cdm@gem-hy.Berkeley.EDU (Dale Cook) Distribution: usa Organization: Idaho National Engineering Laboratory, Idaho Falls, Idaho Lines: 134 In article <1990Nov20.190315.13505@informix.com>, randall@informix.com (Randall Rhea) writes: |> In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes: |> >Hello database gurus. |> > |> >I am building my first database, and am trying to understand how to balance |> >the benefits of normalization with performance requirements. I am |> >interested in any heuristics or literate on deciding when to scrap |> >normalization to get a performance kick. |> |> This may be the biggest question in database design. There is no simple |> answer for it. Whether or not to normalize is a question that can |> only be answered on a case-by-case basis. |> |> In general, I have had difficulty with over-normalized database designs. |> There are analysts who have read a lot of database books, but have little |> experience in actually developing real applications ... these are the folks |> that tend to over-normalize. I find myself needing to write a fairly simple |> report, but it turns into a complicated program because data is all over |> the place. As you correctly point out, one gets tired of writing multiple |> sub-queries. |> I think you are confusing design with implementation. Database _designs_, IMHO, should always be normalized. You then examine the application requirements to determine if it is best to denormalize specific aspects of your design during implementation. Typically, these are _performance_ requirements, *not* programmer convenience. (However, end-user convenience is a legitimate requirement). Remember that the goal of the design is to understand your data and represent the data structure *as it logically exists* so that others can understand it as well, as well as the obvious goals of minimizing the impact of future changes and maximizing data sharability. While I sympathize with the problems of "data all over the place", it does not in and of itself justify denormalization (especially for programmers). I've seen many a database where exactly that has been done, which has made changing that database a maintenance nightmare. |> I could go on and on about database design theory, but you could find one |> of those books that discuss it better than I could. I will, however, |> refer to your specific example: |> |> >An example issue I am trying to answer: |> > I have a domain that can have one of a given set of character string values. |> > I am thinking of creating a relation that equates each string with |> > an integer ID. The benefits would be less memory used in the relation with |> > that domain (since an integer would be stored for each tuple rather than |> > an n byte length string), faster comparisons in a qualification (since |> > comparisons on that domain would be between integers and not strings), and |> > less chance of typo's in the database (since could not enter a string that |> > cannot be translated to an ID by a subquery into the ID relation). The |> > cost, of course, is the subquery required whenever you want to access the |> > domain. Does the answer depend on the length of the strings, etc ... |> |> A great example! It shows how tricky the whole thing can be. If there were |> only 10 possible values for your string, then I would certainly |> normalize and use the integer ID that you describe. However, if there were |> 500 possible string values, or 5000, then I may just stick the string in |> the same table as everything else and shine the integer ID. If there |> were 5000 possible values for your string, then it may not be worth the |> extra table because: |> |> 1. Any programs that need to look at the string would have to perform |> a sub-query. In some report writer languages, this can be quite difficult. ^^^^^^^^^^^^^^^^^^^^^^^^^^^ So what? That's why you get the big bucks, isn't it? |> |> 2. You could increase the number of CPU and I/O resources required by |> the application. This means that your application could perform slower. What type of application? Queries may be slower, but updates may be significantly faster! You have to consider the whole system, not the piece of it you're working on today. If your data is static, it may be worth it. If it's volatile, you'll pay dearly for denormalization. |> |> 3. Your string table would have to be maintained by somebody. Keeping |> track of 5000 different string values is no trivial task. No kidding. But if you embed these strings in one table, what happens to the poor slob needs those strings in another table??? Is (s)he supposed to replicate this mess in the new table? So much for disk savings. And suppose someone changes the value of one of these strings, but neglects to change all of the other like values? Your data is now inconsistent. Your boss gets one value on his report, and his /her boss gets another on theirs. Your boss will want to know why (s)he looked like an a-hole in the meeting this morning. |> |> If there were only 10 possible string values, I WOULD normalize because: |> |> 1. It could cut down on disk space requirements. How do we know this? The number of distinct values is of less importance than other factors such as the number of tables related to this one and data volatility. |> |> 2. The string reference table would enforce standards for the strings, e.g. |> spellings. Why would one care about spellings if one had 10 possible values, but not care if there were 5000? |> |> 3. The system would be easier to use, since a data entry person would not be |> required to type in a string; he could simply choose one from an approved list. |> |> In database design, as in life, there are tradeoffs. |> There are no tradeoffs in design, only implementation. Programmer convenience is not a legitimate tradeoff, or at least pretty damn low on the list. Balance query time against update time. Consider data volatility when considering storing summary information. Try to look ahead in time. Will the data be used by other applications? Will it change? My point is that denormalization should not be undertaken lightly. You should understand the consequences of it, both near and long term. The benefits to the entire system should be clearly identified. --- Dale Cook cdm@inel.gov ========== long legal disclaimer follows, press n to skip =========== ^L Neither the United States Government or the Idaho National Engineering Laboratory or any of their employees, makes any warranty, whatsoever, implied, or assumes any legal liability or responsibility regarding any information, disclosed, or represents that its use would not infringe privately owned rights. No specific reference constitutes or implies endorsement, recommendation, or favoring by the United States Government or the Idaho National Engineering Laboratory. The views and opinions expressed herein do not necessarily reflect those of the United States Government or the Idaho National Engineering Laboratory, and shall not be used for advertising or product endorsement purposes.