Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!shadooby!samsung!uunet!dev!dgis!jkrueger
From: jkrueger@dgis.dtic.dla.mil (Jon)
Newsgroups: comp.databases
Subject: Re: Simple SQL question
Keywords: SQL
Message-ID: <670@dgis.dtic.dla.mil>
Date: 23 Nov 89 04:51:38 GMT
References: <98813@ti-csl.csc.ti.com> <1989Nov19.200834.266@mrspoc.Transact.COM> <1217@hpuviea.UUCP> <926@anasaz.UUCP>
Organization: Defense Technical Information Center (DTIC), Alexandria VA
Lines: 72

john@anasaz.UUCP (John Moore) writes:

[For a db like
	+------+
	| Frod | Location   Name   Birthdate
	+------+----------+------+------------+
               | Atlanta  |  Doe | 1 Sep 1960 |
               |  Boston  |  Ray | 2 Sep 1960 |
	       +----------+------+------------+
]
>Query: Give me the first ten names by date of birth where
>  location is one of a set of locations which is a small subset
>  of all the locations.

select * from frod where location = "Boston" or location = "Hartford"
	order by birthdate;

Alternatively:

select * from frod f, locations l where f.location = l.location
	order by birthdate;

No cursors necessary or desirable.  No 3GL necessary.  The problem
is trivial.  This is exactly what DBMS are designed to do.  Most
are quite good at it.

>Does someone have an optimizer which would generate a GOOD strategy for:
>  SELECT Name, Location 
>    FROM Frod 
>    WHERE Location in ( "Phoenix", "Atlanta", "Denver", "Seattle" )
>     AND  Day_of_Birth >= 12/31/47
>    ORDER BY Day_of_Birth ;

Yes.  Most of us do.  This is a simple single-table query.  The
optimization possibilities are small and amount to deciding whether to
select first on location or birth date.  If exactly one of them has an
index, it's a good candidate.  All commercial products can do this
much.

If both or neither have indices, or as an additional check, we could
examine data distributions in each column and predict which would
return fewer rows.  In this example, if locations were evenly
distributed among cities but only 10% of the birth dates were after
12/31/47, birth date would be a better candidate.  To my knowledge,
only INGRES's query optimizer does this.

After that, we could check integrities defined on the columns.  If
location draws from a list of valid locations, and "Phoenix" isn't on
the list, we can skip checking for it.  To my knowledge, no commercial
product does this.  Probably none supports integrities in a
sufficiently well-defined manner to make this optimization safe
anyway.  For instance, consider integrities maintained by triggers.

In summary, three optimizations is about it, and they all boil down to
deciding which of two columns to select on first.  After that, execution
speed is a near-linear function of disk-memory bandwidth.  If the table
size is small with respect to memory size, caching can change this to
a very-near-linear function of processor speed.

>Definition of good: performance >= what could be achieved by doing it
>by hand using low level operations.

Will you add in the labor costs of maintaining the low level
optimizations?  If not, long will your hand crafted solution compete on
existing hardware with the portable solution as it continues to run on
faster hardware?

-- Jon
-- 
Jonathan Krueger    jkrueger@dtic.dla.mil   uunet!dgis!jkrueger
Isn't it interesting that the first thing you do with your
color bitmapped window system on a network is emulate an ASR33?