Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!decwrl!labrea!rutgers!im4u!ut-sally!utah-cs!hollaar From: hollaar@utah-cs.UUCP (Lee Hollaar) Newsgroups: comp.arch Subject: Re: Serial Search Machines Message-ID: <5090@utah-cs.UUCP> Date: Thu, 29-Oct-87 15:39:56 EST Article-I.D.: utah-cs.5090 Posted: Thu Oct 29 15:39:56 1987 Date-Received: Wed, 4-Nov-87 02:29:46 EST References: <2667@uunet.UU.NET> Reply-To: hollaar@cs.utah.edu.UUCP (Lee Hollaar) Organization: University of Utah CS Dept Lines: 52 Keywords: Text search machines, information retrieval, inverted files Summary: Search machines and indexing mix well In article <2667@uunet.UU.NET> mo@uunet.UU.NET (Mike O'Dell) writes: Sequential text search machines have been around for a long time, but recently they have started to become affordable for many applications. ... They all suffer a few interesting problems however. Like everything highly specialized, they can win very big and lose very big. ... their strength can become a great weakness because they must read through all the data for each and every search. If the collection is say one to two gigabytes, which is rather modest, dumping that much data off the filesystem and through the box for each query quickly gets pretty painful. In fact, there is no reason why a reasonably contructed text search machine can't use some form of index to reduce the amount of searching that must be done. (The text search machine and its accompanying system, which we have had running for a number of years with satisfied users, does exactly that.) In fact, they work well together. The ability to rapidly search selected areas on a disk means that the index does not need to be as complete, and therefore takes less disk space (compared to the commercial systems, which have an index at least as large as the database size, a real drag for gigabyte databases!). We use a partially inverted file index, where the index indicates which documents contain a particular term, but not its location. Phrases are handled by determining all documents containing the constituent terms, and then searching them to see if the actual phrase occurs. The index overhead is about 15% - 20%, and the index eliminates 95% - 99% of the database from the search, resulting in search times of under ten seconds. We also have the search machine connected to its own disk, which stores the text information in a contiguous file system. This eliminates the bandwidth requirements on the normal file system, speeds up searching considerably because a seek is not necessary before reading each block, and there is a separate searcher for each disk, so that the natural parallelism in a big database stored on many disks is exploited. For rapid searching of other than the complete database (as is the case where an index is used), the seek time of the disk plays a major role in overall system performance. For a description of what we have done, see: L A Hollaar, "A Testbed for Information Retrieval Research: The Utah Retrieval System Architecture", Proceedings of SIGIR-85, June 1985, pp 227-232. R L Haskin and L A Hollaar, "Operational Characteristics of a Hardware-based Pattern Matcher", ACM Transactions on Database Systems, Volume 8 Number 1, March 1983. L A Hollaar, "The Utah Text Search Engine: Implementation Experiences and Future Plans", Proceedings of the Fourth International Workshop on Database Machines, March 1985. L A Hollaar and R L Haskin, U. S. Patent 4,450,520, "Method and System for Matching Encoded Characters", May 22, 1984.