Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!dimacs.rutgers.edu!mips!prls!pyramid!infmx!randall From: randall@informix.com (Randall Rhea) Newsgroups: comp.databases Subject: Re: Balancing Normalization with Performance Message-ID: <1990Nov20.190315.13505@informix.com> Date: 20 Nov 90 19:03:15 GMT References: <1092@cortex.med.jhu.edu> Sender: news@informix.com (Usenet News) Distribution: usa Organization: Informix Software, Inc. Lines: 70 In article <1092@cortex.med.jhu.edu> jamesr@cortex.med.jhu.edu (Jim Rosenberg) writes: >Hello database gurus. > >I am building my first database, and am trying to understand how to balance >the benefits of normalization with performance requirements. I am >interested in any heuristics or literate on deciding when to scrap >normalization to get a performance kick. This may be the biggest question in database design. There is no simple answer for it. Whether or not to normalize is a question that can only be answered on a case-by-case basis. In general, I have had difficulty with over-normalized database designs. There are analysts who have read a lot of database books, but have little experience in actually developing real applications ... these are the folks that tend to over-normalize. I find myself needing to write a fairly simple report, but it turns into a complicated program because data is all over the place. As you correctly point out, one gets tired of writing multiple sub-queries. I could go on and on about database design theory, but you could find one of those books that discuss it better than I could. I will, however, refer to your specific example: >An example issue I am trying to answer: > I have a domain that can have one of a given set of character string values. > I am thinking of creating a relation that equates each string with > an integer ID. The benefits would be less memory used in the relation with > that domain (since an integer would be stored for each tuple rather than > an n byte length string), faster comparisons in a qualification (since > comparisons on that domain would be between integers and not strings), and > less chance of typo's in the database (since could not enter a string that > cannot be translated to an ID by a subquery into the ID relation). The > cost, of course, is the subquery required whenever you want to access the > domain. Does the answer depend on the length of the strings, etc ... A great example! It shows how tricky the whole thing can be. If there were only 10 possible values for your string, then I would certainly normalize and use the integer ID that you describe. However, if there were 500 possible string values, or 5000, then I may just stick the string in the same table as everything else and shine the integer ID. If there were 5000 possible values for your string, then it may not be worth the extra table because: 1. Any programs that need to look at the string would have to perform a sub-query. In some report writer languages, this can be quite difficult. 2. You could increase the number of CPU and I/O resources required by the application. This means that your application could perform slower. 3. Your string table would have to be maintained by somebody. Keeping track of 5000 different string values is no trivial task. If there were only 10 possible string values, I WOULD normalize because: 1. It could cut down on disk space requirements. 2. The string reference table would enforce standards for the strings, e.g. spellings. 3. The system would be easier to use, since a data entry person would not be required to type in a string; he could simply choose one from an approved list. In database design, as in life, there are tradeoffs. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Randall Rhea Informix Software, Inc. Senior Programmer/Analyst, MIS uunet!pyramid!infmx!randall