Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!watmath!clyde!caip!sri-spam!parcvax!hplabs!pyramid!decwrl!amdcad!amdimage!prls!philabs!linus!raybed2!applicon!hdsvx1!hoffman From: hoffman@hdsvx1.UUCP (Richard Hoffman) Newsgroups: net.database Subject: Re: Problem with relational theory Message-ID: <452@hdsvx1.UUCP> Date: Mon, 25-Aug-86 19:26:37 EDT Article-I.D.: hdsvx1.452 Posted: Mon Aug 25 19:26:37 1986 Date-Received: Sun, 31-Aug-86 01:41:57 EDT References: <595@ur-tut.UUCP> Organization: Schlumberger-HDS, Houston TX Lines: 83 Eric Carleen at Univ. of Rochester Medical Center writes [edited slightly]: > Suppose that I have 3 tables of names, call them t1, t2, and t3. > The names in tables 2 or 3 may or may not be in t1. > I would like to delete all those records in t1 that are already listed > in the other two tables. > > If I give the command: > > delete t1 where t1.name = t2.name or t1.name = t3.name > > then I get what I was after: entries that match EITHER t2 OR t3 > are deleted. > > unless... t3 is empty. If t3 is empty, then nothing is deleted > from t1, regardless of whether or not there are any matching entries > in t2. (The work-around is obvious, but I thought unnecessary.) > > Technical support at Relational Technologies insists that this is the > correct way for a deletion to occur. And they are, unfortunately, right. This is because of the way that the relational theory (upon which Ingres is based) works. Relational theory starts with the assumption of sets, called domains, which contain all the legal values for a given attribute. For instance, the domain of sex = {male, female}. The domain of primary colors = {red, blue, yellow}. The domain of the national debt is probably the set containing all positive real numbers taken to two decimal places. Any set may be a domain. A table is defined as the subset of a cartesian product of domain sets. So consider a family whose members are listed in the domain F = {Sara, David, Jon}. Each of these members can be associated with a sex chosen from S = {M, F}. The Cartesian product of FxS is {(Sara, M), (Sara, F), (David, M), (David, F), (Jon, M), (Jon, F)}. However, a table which gave the sexes of the members of these families would be a subset of FxS, namely {(Sara, F), (David, M), (Jon, M)}. This formulation sheds light on a number of practices common in relational data bases, such as the use of domains for guaranteeing integrity (this should be required, but is often allowed to default to meaninglessly broad domains such as "all integers", "all strings", etc.) and the refusal of most RDBs to safeguard order inside a table (the table is just a set, and order is unimportant in a set). It also shows you why the Ingres way of handling your query is correct. What you are really doing when you run your query is forming a cartesian product on the sets represented by t1.name, t2.name and t3.name. Then you are removing the subset of this set in which t1.name = t2.name or t1.name = t3.name. If t3 is empty, it is an EMPTY SET! The cartestian product of any number of sets with an empty set MUST BE EMPTY! So there is nothing to remove. It's not that Ingres doesn't process your query correctly, it's just that it abandons processing as soon as it spots the empty set in the cartesian product. Sort of like a compiler whose optimizer was clever enough to spot that hundreds of operations on x followed by x = 0 meant that the hundreds of operations could be omitted. But the compiler would probably flag this, and RTI probably errs slightly by not pointing out joins involving empty sets. I'm sure you know the two work-arounds: either make t3 non-empty, with a dummy value, or make two queries, so that the emptiness of either t2 or t3 does not effect the deletions based on the other table. This last procedure will probably be more efficient anyway (even for non-empty sets) for sufficiently large tables, because the single query method will require on the order of S1xS2xS3 operations (where Sn is the size in rows of table n), while the double query method will only require on the order of S1x(S2+S3) operations, plus a couple of extraneous file opens and closes. I hear that RTI has really cleaned up their join optimizer in their most recent release, so the performance difference between the two queries may be much smaller than I report. Also, primary and secondary indexes may make a big difference. Still, if the tables are large enough to obviate the extra file opens and closes, you're always safe using the two-query method. -- Richard Hoffman | "Oh life is a wonderful cycle of song, Schlumberger Well Services | A medley of extemporanea. hoffman%hdsvx1@slb-doll.csnet | And Love is a thing that can never go wrong PO Box 2175, Houston, TX 77252 | ... And I am Marie of Roumania." --D. PARKER