Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!van-bc!tfic.bc.ca!clh
From: clh@tfic.bc.ca (Chris Hermansen)
Newsgroups: comp.databases
Subject: Re: Outerjoin implementation?
Message-ID: <1990Sep28.225442.4995@tfic.bc.ca>
Date: 28 Sep 90 22:54:42 GMT
References: <1422@meaddata.meaddata.com>
Reply-To: clh@tacitus.UUCP (Chris Hermansen)
Distribution: comp
Organization: Timberline Forest Inventory Consultants
Lines: 136

In article <1422@meaddata.meaddata.com> meaddata!gordon@uunet.uu.net writes:
>
>In article <1990Sep20.161204.13014@tfic.bc.ca>, clh@tfic.bc.ca (Chris
>Hermansen) writes:
>|> In article <CIMSHOP!DAVIDM.90Sep19100915@uunet.UU.NET>
>cimshop!davidm@uunet.UU.NET (David S. Masterson) writes:
>|> >In article <PCG.90Sep18192004@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk
>|> >(Piercarlo Grandi) writes:

[stuff deleted regarding use of nulls, particularly in outer joins]

>Absolutely, as a matter of fact, I think the NULL is completely necessary for
>the accurate representation of data.  If, for example, you have a integer
>attribute called status, and you don't currently know the value, what
>do you assign to status?  I guess you could look through the possible domain
>of status and then pick a value outside that domain to represent unknown,
>but then you have to heavily document this so maintenance people know what
>you did, don't you think?

OK, I'll bite... suppose I have a variable called status.  Why is NULL better than
a value of "unknown" (assuming it's a character value)?  I would claim that
in fact "unknown" is superior, as it is at least somewhat self documenting (NULL,
by definition, is NOT); furthermore, I can easily spot the "unknown" rows in any
printout or report; finally, I can easily count the rows of particular status
in the table with one select statement (whereas I need two with NULL values allowed,
as rows with NULL values do not group together).

[more stuff deleted]

>Well the above is really placing two distinct constraints on the update.
>Constraint 1:  qqsv cannot have a known value of 0.
>Constraint 2:  qqsv cannot have an unknown value.
>
>If qqsv was guaranteed to be known, you can always declare it as *not null*.

This may relate to the Informix SQL comment below.

>Then, Constraint 2 would become unecessary.  If you suggest an arbitrary value
>for unknowns, what would the statement be?
>
>	ubdate foo set bar = 37 where qqsv != 0 and qqsv != -99999999
>
>This is easier than using null?  I don't think so.

I don't think it's any more difficult!  Maybe this isn't the best example, but
I guess it depends on why the NULLs are in qqsv in the first place.  If they
got there because of a low level of quality control in the data input phase,
I might prefer to see the data input stuff (screen forms or whatever) cleaned
up; if they got there because someone genuinely wanted to show that qqsv didn't
have a value entered into it, I guess it's really a tossup as to whether one
uses NULL or some other value (that could potentially have more `local' meaning
to the application).

As you say above, outer joins make a strong case for the existence of a
"well defined" undefined value.

>
>|> 
>|> [text deleted...]
>|> 
>|> Note that I use Informix SQL, so your lack of mileage may vary on the above
>
>
>????? The above statement makes no sense. ?????

Aww, c'mon!  All I'm saying is that I'm basing my comments on the RDBMS I use,
not on the ANSI standard, nor on Quel/Ingres which was mentioned earlier on.

>Maybe you have difficulty undersanding the proper use of null.  It means, "I
>don't know what this value is."  I'll use an example from Date, "... a 
>shipment record might contain a null quantity value (we know that the 
>shipment exists, but we do not know the quantity shipped)..."

I'd be the first to admit that I may have difficulty understanding the use of
NULL.  But think about your example; why should the input screen have let
someone enter the fact that a shipment was made without knowing the number
of items shipped?  How would you propose to send the customer a bill, if
you don't know whether you sent her/him 3 SPARCStations or 3000? Now suppose
that you're the boss of the shipping department, and you find out that your
programmer decided to allow NULLs in the amount_shipped field by casually
noting that your warehouse contained no more SPARCStations because someone
on the loading dock didn't bother to fill in the waybill properly and the
fast-knuckled keypuncher blithely skipped over the amount_shipped field,
so not only are you going to have to call 200 customers to find out how
many you sent them, but you're going to have to call head office and
tell them that your order for more stock should have had more than zero
items on it.

And in any case, I would claim that a zero would do as well as a NULL here, since
most shipping departments would hardly bother to ship zero units (let alone
generate a shipping report with zero units on it).


If you're doing an outer join, a better(?) example of NULLs cropping up might be
orders entered but not yet shipped...

>
>If you find that you have lots of nulls, then maybe your schema is not
>properly 
>normalized.  Try checking for transitive dependencies, I have found that they
>can contribute to an excessive amount of nulls.

Naw, all my schemae are perfectly normalized :-) :-) :-)

>Null is a very important semantic concept.  There have been many times I would
>have loved to have null when writing 3GL applications (I had to rely on -999).
>As for additional constraints, anytime you deal with an uknown you have take
>extra measures to account for it.
>
>What do you think?

Well, back to the status = "unknown" vs status is null; I don't feel that there
is any innate superiority of the null value here, as long as one has control over
the value of the field (ie, it's not the result of an outer join).  Seems to me
that one is semantically using two fields in a lot of cases: qqsv and qqsv_status.

Finally, if I have a table with 6,000,000 records in it, with status NULL in 5,999,999
and status = "complete" in the other, and I use ISQL to execute the following:

	output to printer
	select status,count(*) from foo
	group by status
	order by status;

I will end up with 100,000 PAGES of output on my laser printer.  On the other hand,
if those 5,999,999 records have status = "unknown", I'll get one page.

Again, I (feel I) should emphasize that this particular behaviour may not occur in
ALL SQLs, or even be predicted by the ANSI standard.

Chris Hermansen                         Timberline Forest Inventory Consultants
Voice: 1 604 733 0731                   302 - 958 West 8th Avenue
FAX:   1 604 733 0634                   Vancouver B.C. CANADA
clh@tfic.bc.ca                          V5Z 1E5

C'est ma facon de parler.