Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!cwjcc!gatech!ncsuvx!lll-winken!lll-lcc!pyramid!infmx!aland
From: aland@infmx.UUCP (Dr. Scump)
Newsgroups: comp.databases
Subject: Re: need HELP with index problem
Summary: index strategy for indexing fields with many duplicates
Keywords: index informix
Message-ID: <715@infmx.UUCP>
Date: 28 Dec 88 05:55:01 GMT
References: <8973@cit-vax.Caltech.Edu>
Organization: Informix Software Inc., Menlo Park, CA.
Lines: 104

In article <8973@cit-vax.Caltech.Edu>, citron@cit-vax.Caltech.Edu (Mark C. Citron) writes:
> Hope someone can understand what is going wrong here. I have a
> table in a database that has three indices: 
> a composite unique index made up of date, name1 and slotnumber
> an index of name2
> an index of unitnumber
>
> there are about 50,000 rows in the table. In about 90% of the rows the
> unitnumber is 0. In time these will be changed to other numbers (the
> slots will be "filled"). 

I am assuming that you query on this field often, looking for the 
"exceptions" (e.g. those rows with "filled" slots).  If you do *not*
search on this field often, then having an index on it is counter-
productive.

> Now here is the problem: When I update the unitnumber field
> and that field already contains a number OTHER THAN 0 
> the update goes very fast.  GREAT!

Well, that's nice to hear :-]

> Updating the unitnumber field to 0 (independent of what it was) also
> goes very fast (less than a sec).

Now I'm confused.  Well, maybe I'm not.  Lessee... in other words, 
as long as the field's value *before the update* was other than 0, it 
goes fast.  OK, makes sense...

> HOWEVER, when I update the unitnumber field
> and that field already contains a 0, the update takes forever (about 30
> sec). There is a lot of disk accessing.

Yikes!  30 seconds is forever?  I'm older than I thought!  :-]
Seriously, though, this makes sense.  What happens when there is a lot
of duplication within a field is that those "identical" index entries
(since you had just the one field in that index) must be traversed
sequentially, since it can't identify whether or not other WHERE 
clause criteria (if any) are satisfied just by looking at that 
particular index.

When you run your update, the proper row(s) must be identified first.
*That* is what takes so much time; since you have an index on that
exact field already, the optimizer will use that index first to 
qualify the needed row(s).  The high degree of disk usage that you see
is the traversal of these identical entries in the B+ index tree,
checking the other search criteria for each of these rows which
have unitnumber = 0.

> This is an Informix database. I might guess that the problem is
> that although unitnumber is indexed, there are so many rows with
> the same unit number that Informix has trouble locating the
> record to update. Is this correct? How do I get around this?
> This performance is awful!

Exactly.  Not so much that there is "trouble locating the record" 
as "we have to check every record qualified by this value in the
index."  Same effect.  

To avoid this duplicate search, the best bet is to make that index
a composite index also.  I would guess that this update you run
probably uses the same criteria each time, just with different 
values; you probably search for something like:
    .. WHERE unitnumber = 0  AND slotnumber = 32768
or 
    .. WHERE unitnumber = 0  AND date = today
or whatever.  The solution to this is simple: create that index for
unitnumber as a *composite* index, tacking whatever other field is
your most frequent "additional" search criteria after the unitnumber.
For example, if the first criterion above is what you use, create
a composite of unitnumber and slotnumber, thus:
    CREATE INDEX unitnslot ON tablename(unitnumber, slotnumber)

or, if the second criterion above is what you use, create
a composite of unitnumber and date:
    CREATE INDEX unitndate ON tablename(unitnumber, curdate)

*instead of* the singleton index on unitnumber.  Now, there will be
very few duplicates (or none at all) in the index.  Also, for a compound
condition search (e.g. WHERE unitnumber = 0 AND slotnumber = 32768),
the composite index will be used directly to find what should be only
a few rows (and therefore find them quickly).  If a search is performed
on unitnumber alone, it will still be fast, since the optimizer can use
composite indexes for searches on the first column in the index as
easily as if it was a singleton index on that column.  (Of course, if
you search on unitnumber=0 alone, it will be slower because you will 
be qualifying some 45,000 rows).

By the way, there is a limit of 64K occurrences of the same value
in a C-ISAM index (in the current and past releases, anyway).
If you had a column with, say, the value 0 in 70,000 of the rows and
had a singleton index on that column, you would have problems with
that index.  (BCHECK and CHECK TABLE would complain).

> Thanks for any (ANY!) suggestions. Write or post.
> citron@csvax.caltech.edu
> citron@cit-vax.UUCP

You're welcome.

-- 
 Alan S. Denney  |  Informix Software, Inc.  |  {pyramid|uunet}!infmx!aland
 Disclaimer: These opinions are mine alone.  If I am caught or killed,
             the secretary will disavow any knowledge of my actions.
 Santos' 4th Law: "Anything worth fighting for is worth fighting *dirty* for"