Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!decwrl!labrea!rutgers!im4u!ut-sally!utah-cs!hollaar
From: hollaar@utah-cs.UUCP (Lee Hollaar)
Newsgroups: comp.arch
Subject: Re: Serial Search Machines
Message-ID: <5090@utah-cs.UUCP>
Date: Thu, 29-Oct-87 15:39:56 EST
Article-I.D.: utah-cs.5090
Posted: Thu Oct 29 15:39:56 1987
Date-Received: Wed, 4-Nov-87 02:29:46 EST
References: <2667@uunet.UU.NET>
Reply-To: hollaar@cs.utah.edu.UUCP (Lee Hollaar)
Organization: University of Utah CS Dept
Lines: 52
Keywords: Text search machines, information retrieval, inverted files
Summary: Search machines and indexing mix well

In article <2667@uunet.UU.NET> mo@uunet.UU.NET (Mike O'Dell) writes:
    Sequential text search machines have been around for a long
    time, but recently they have started to become affordable
    for many applications. ... They all suffer a few interesting
    problems however.  Like everything highly specialized, they
    can win very big and lose very big.
    
        ... their strength can become a great 
    weakness because they must read through all the data for
    each and every search.  If the collection is say one to two
    gigabytes, which is rather modest, dumping that much data
    off the filesystem and through the box for each query quickly
    gets pretty painful.

In fact, there is no reason why a reasonably contructed text search machine
can't use some form of index to reduce the amount of searching that must be
done.  (The text search machine and its accompanying system, which we have had
running for a number of years with satisfied users, does exactly that.)  In
fact, they work well together.  The ability to rapidly search selected areas
on a disk means that the index does not need to be as complete, and therefore
takes less disk space (compared to the commercial systems, which have an index
at least as large as the database size, a real drag for gigabyte databases!).
We use a partially inverted file index, where the index indicates which
documents contain a particular term, but not its location.  Phrases are
handled by determining all documents containing the constituent terms, and
then searching them to see if the actual phrase occurs.  The index overhead is
about 15% - 20%, and the index eliminates 95% - 99% of the database from the
search, resulting in search times of under ten seconds.

We also have the search machine connected to its own disk, which stores the
text information in a contiguous file system.  This eliminates the bandwidth
requirements on the normal file system, speeds up searching considerably
because a seek is not necessary before reading each block, and there is a
separate searcher for each disk, so that the natural parallelism in a big
database stored on many disks is exploited.  For rapid searching of other than
the complete database (as is the case where an index is used), the seek time
of the disk plays a major role in overall system performance.

For a description of what we have done, see:
L A Hollaar, "A Testbed for Information Retrieval Research: The Utah Retrieval
        System Architecture", Proceedings of SIGIR-85, June 1985, pp 227-232.

R L Haskin and L A Hollaar, "Operational Characteristics of a Hardware-based
        Pattern Matcher", ACM Transactions on Database Systems, Volume 8
        Number 1, March 1983.

L A Hollaar, "The Utah Text Search Engine: Implementation Experiences and
        Future Plans", Proceedings of the Fourth International Workshop on
        Database Machines, March 1985.

L A Hollaar and R L Haskin, U. S. Patent 4,450,520, "Method and System for
        Matching Encoded Characters", May 22, 1984.