Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!caip!sri-spam!parcvax!hplabs!pyramid!decwrl!amdcad!amdimage!prls!philabs!linus!raybed2!applicon!hdsvx1!hoffman
From: hoffman@hdsvx1.UUCP (Richard Hoffman)
Newsgroups: net.database
Subject: Re: Problem with relational theory
Message-ID: <452@hdsvx1.UUCP>
Date: Mon, 25-Aug-86 19:26:37 EDT
Article-I.D.: hdsvx1.452
Posted: Mon Aug 25 19:26:37 1986
Date-Received: Sun, 31-Aug-86 01:41:57 EDT
References: <595@ur-tut.UUCP>
Organization: Schlumberger-HDS, Houston TX
Lines: 83

Eric Carleen at Univ. of Rochester Medical Center writes [edited slightly]:
> Suppose that I have 3 tables of names, call them t1, t2, and t3.
> The names in tables 2 or 3 may or may not be in t1. 
> I would like to delete all those records in t1 that are already listed
> in the other two tables.
> 
> If I give the command:
> 
> 	delete t1 where t1.name = t2.name or t1.name = t3.name
> 
> then I get what I was after: entries that match EITHER t2 OR t3
> are deleted.
> 
> unless... t3 is empty.  If t3 is empty, then nothing is deleted
> from t1, regardless of whether or not there are any matching entries
> in t2.  (The work-around is obvious, but I thought unnecessary.)
> 
> Technical support at Relational Technologies insists that this is the
> correct way for a deletion to occur.

And they are, unfortunately, right.  This is because of the way that the
relational theory (upon which Ingres is based) works.

Relational theory starts with the assumption of sets, called domains,
which contain all the legal values for a given attribute.  For instance,
the domain of sex = {male, female}.  The domain of primary colors =
{red, blue, yellow}.  The domain of the national debt is probably the
set containing all positive real numbers taken to two decimal places.
Any set may be a domain.

A table is defined as the subset of a cartesian product of domain sets.
So consider a family whose members are listed in the domain F = {Sara,
David, Jon}. Each of these members can be associated with a sex chosen
from S = {M, F}.  The Cartesian product of FxS is {(Sara, M), (Sara, F),
(David, M), (David, F), (Jon, M), (Jon, F)}.  However, a table which
gave the sexes of the members of these families would be a subset of
FxS, namely {(Sara, F), (David, M), (Jon, M)}.

This formulation sheds light on a number of practices common in relational
data bases, such as the use of domains for guaranteeing integrity (this
should be required, but is often allowed to default to meaninglessly broad
domains such as "all integers", "all strings", etc.) and the refusal of
most RDBs to safeguard order inside a table (the table is just a set, and
order is unimportant in a set).  It also shows you why the Ingres way of
handling your query is correct.

What you are really doing when you run your query is forming a cartesian
product on the sets represented by t1.name, t2.name and t3.name.  Then
you are removing the subset of this set in which t1.name = t2.name or
t1.name = t3.name.  If t3 is empty, it is an EMPTY SET!  The cartestian
product of any number of sets with an empty set MUST BE EMPTY!  So there
is nothing to remove.  

It's not that Ingres doesn't process your query
correctly, it's just that it abandons processing as soon as it spots the
empty set in the cartesian product.  Sort of like a compiler whose optimizer
was clever enough to spot that hundreds of operations on x followed by
x = 0 meant that the hundreds of operations could be omitted.  But the
compiler would probably flag this, and RTI probably errs slightly by not
pointing out joins involving empty sets.

I'm sure you know the two work-arounds: either make t3 non-empty,
with a dummy value, or make two queries, so that the emptiness of 
either t2 or t3 does not effect the deletions based on
the other table.  This last procedure will probably be more efficient 
anyway (even for non-empty sets) for sufficiently large
tables, because the single query method will require on the order
of S1xS2xS3 operations (where Sn is the size in rows of table n),
while the double query method will only require on the order of
S1x(S2+S3) operations, plus a couple of extraneous file opens and
closes.  

I hear that RTI has really cleaned up their join optimizer 
in their most recent release, so the performance difference
between the two queries may be much smaller than I report.  
Also, primary and secondary indexes may make a big difference. Still,
if the tables are large enough to obviate the extra file opens
and closes, you're always safe using the two-query method.
-- 
 Richard Hoffman                | "Oh life is a wonderful cycle of song,
 Schlumberger Well Services     |     A medley of extemporanea.
 hoffman%hdsvx1@slb-doll.csnet  | And Love is a thing that can never go wrong
 PO Box 2175, Houston, TX 77252 | ... And I am Marie of Roumania." --D. PARKER