Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!sri-spam!ames!amdahl!rtech!jeff
From: jeff@rtech.UUCP
Newsgroups: comp.databases
Subject: Re: Normalized Databases
Message-ID: <739@rtech.UUCP>
Date: Thu, 2-Apr-87 01:45:58 EST
Article-I.D.: rtech.739
Posted: Thu Apr  2 01:45:58 1987
Date-Received: Sat, 4-Apr-87 13:30:51 EST
References: <43239@beno.seismo.CSS.GOV>
Distribution: na
Organization: Relational Technology, Alameda CA
Lines: 114
Keywords: normal form, relational

in article <43239@beno.seismo.CSS.GOV>, tiberio@seismo.CSS.GOV (Mike Tiberio) says:
> 
> At our facility we have a certain large investment in a relational database
> design that is not in the highest normal form. We have some folks proposing
> that we redesign our database (and no doubt out database interface software)
> to a greater or highest normal form. They state we will enjoy the following
> qualities when this is done.
> 
> 1) The amount of space required to store the data is lowest.

This is usually true, but not necessarily so.

> 2) The ability to update the database is greatest.
> 
> 3) Explaining the database is simplest.

Sometimes you can run into update problems in unnormalized databases,
because a single relationship can be stored in more than one place.
For example, suppose you are keeping track of parts, suppliers of
parts, and orders from those suppliers, and one of your tables looks like this:

	Supplier | City | Part | Color | Quantity
	-----------------------------------------
	   J1    |  NY  |  p1  | red   |   100
	   J1    |  NY  |  p2  | white |   200
	   S6    |  LA  |  p2  | white |    50
	   L3    |  DC  |  p2  | white |   300
	   L3    |  DC  |  p3  | blue  |   100

Now suppose the supplier J1 moves from NY to SF.  You will have to
update J1's city in two places.

Suppose you want to keep track of a new supplier, R2, who doesn't
currently supply you with anything.  There will have to be a row in
the table in which part, color, and quantity are blank.

Suppose supplier S6 stops supplying the part p2.  To delete that row from
the table would erase all information about S6, which you don't necessarily
want to do; instead, you must recognize the case where there is only one
row left for this supplier, and blank out the part, color, and quantity
for that row.

The problem here is normalization.  Without going into any detail on what
it means, here is the above information in third normal form:

	Supplier Table		Part Table

	Supplier | City		Part | Color
	---------------		------------
	   J1    |  NY		 p1  | red
	   S6    |  LA		 p2  | white
	   L3    |  DC		 p3  | white

	On-order Table

	Supplier | Part | Quantity
	--------------------------
	   J1    |  p1  |    100
	   J1    |  p2  |    200
	   S6    |  p2  |     50
	   L3    |  p2  |    300
	   L3    |  p3  |    100

Normalization is sometimes explained with a bunch of fancy jargon, but what
it boils down to is this:  Every table should stand for a single thing.  In
the example, the problem with the original table is that it stands for three
things: suppliers, parts, and orders.  Jamming these together into one table
causes problems when we want to update some aspect of one of the things.
A normalized database is easier to understand because the individual tables
are simpler: each one stands for only one thing.  The only thing that may
become harder is figuring out how to join the various tables together;
the main way to fight against this is to make sure that each of your tables
stands for some real thing in the problem you're trying to solve (if one is
not careful, one can come up with tables that don't model the problem, but
instead model the underlying algorithm; it's a good idea to design your
data first, and then figure out your algorithms).

This makes it sound easier than it really is.  Figuring out what your "things"
are can be difficult.

> What I am asking you folks in netland is for some guidance. I like the
> intellectual concept of normal form relational databases, but I need to
> know if there are any good reasons not to go that route.

The main argument against normalization is performance.  Normalization
involves splitting unnormalized tables up into smaller tables.  But this
means you will have to join the tables to recreate the relationships in
the original, unnormalized form.  Joins are slower than fetches from single
tables.

In the above example, suppose that you often need to know which parts
are coming from which cities.  In the normalized database, you would have to
join the supplier and the on-order tables.  If this took too much time,
you might decide to unnormalize your database and merge the two tables
into one:

	Supplier-Order Table

	Supplier | City | Part | Quantity
	---------------------------------
	   J1    |  NY  |  p1  |    100
	   J1    |  NY  |  p2  |    200
	   S6    |  LA  |  p2  |     50
	   L3    |  DC  |  p2  |    300
	   L3    |  DC  |  p3  |    100

Some of the update and understanding problems come back, but it will run
faster for the operation we're interested in.
-- 
Jeff Lichtman at rtech (Relational Technology, Inc.)
"Saints should always be judged guilty until they are proved innocent..."

{amdahl, sun}!rtech!jeff
{ucbvax, decvax}!mtxinu!rtech!jeff