Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ima.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!bellcore!decvax!ima!johnl
From: johnl@ima.UUCP (John R. Levine)
Newsgroups: net.database
Subject: Re: full text database systems
Message-ID: <144@ima.UUCP>
Date: Thu, 6-Mar-86 13:05:11 EST
Article-I.D.: ima.144
Posted: Thu Mar  6 13:05:11 1986
Date-Received: Sat, 8-Mar-86 04:33:50 EST
References: <362@isis.UUCP>
Reply-To: johnl@ima.UUCP (John R. Levine)
Distribution: net
Organization: Javelin Software Corp.
Lines: 48
Keywords: database text
Summary: there's no typical full-text database

In article <362@isis.UUCP> jay@isis.UUCP (Jay) writes:
>	I am interested in the methods used to create full text retrieval
>databases - e.g. select articles based on words in an article.  Specifically,
>is there some general place I can go to get info on design/implementation of
>such a system?  Detailed questions such as article storage vs. concordance
>storage vs. query processing vs. on and on.

It is my impression that every full text data base yet built is a special
hand-crafted job.  The most popular one appears to be LEXIS/NEXIS, which keeps
on line the full text of legal decisions, newspapers, encyclopedias, and such.
They keep complete indices of all of the words in every document, leaving out 
only words like "the" which are too common to have much indexing use.  The 
documents are organized into libraries, e.g.  Vermont superior court decisions 
for 1973, but every query seems to go through the indices.  A Lexis
search usually takes the better part of a minute (although they're clever
about sending stuff to your screen to keep you distracted in the meantime).
Complete indexing only works because updates to the data base are applied very
infrequently relative to the number of searches, so they add new text and
remake the indices in the middle of the night.
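
To make the index-everything-except-common-words idea concrete, here is a
minimal sketch in Python.  It is not a description of how LEXIS/NEXIS is
actually built; the stop list, the function names, and the sample documents
are all made up for illustration.  The index is rebuilt from scratch in one
pass, the way a nightly batch job would do it.

```python
import re
from collections import defaultdict

# Hypothetical stop list; a real system would pick its own.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(documents):
    """Rebuild the whole index from scratch, as a nightly batch run would.

    documents maps a document id to its full text.  The result maps each
    indexed word to the set of ids of the documents that contain it.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return dict(index)

def search(index, query):
    """Return the ids of documents containing every non-stop word in query."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up sample documents.
docs = {
    "vt-1973-001": "The court finds for the plaintiff.",
    "vt-1973-002": "The plaintiff appeals the decision of the court.",
    "news-0001": "Grapefruit prices rose sharply in Florida.",
}
index = build_index(docs)
print(sorted(search(index, "the plaintiff court")))
```

Searches only ever read the index, which is why pushing all the updates into
a separate rebuild step is tolerable when searches greatly outnumber updates.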

There have also been some attempts at making hardware engines that stream data 
from a disk as fast as the disk can provide it with full-track reads, and scan 
the text as it goes by.  None of them seem to have been very successful, 
probably because reading a whole disk, even at full speed, takes a long time 
if the disk is at all large.  The Britton-Lee IDM has a similar device which 
is used to speed up relational queries; it seems to work well but only because 
it is embedded in a data base system which structures and organizes data so 
that the speed-up board is not looking at whole disks.  
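
A software analogue of the scan-it-as-it-streams-by approach, as a rough
sketch in Python.  This only illustrates the technique, not any particular
hardware engine; the function name, the chunk size, and the sample data are
assumptions of mine.  The point it shows is that every byte has to be read
on every query, so the cost grows with the size of the data no matter how
few hits there are.

```python
import io

def stream_scan(stream, pattern, chunk_size=64 * 1024):
    """Yield the byte offset of every occurrence of pattern in stream.

    The stream is read once, front to back, in fixed-size chunks, so the
    work is proportional to the total amount of data scanned.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    keep = len(pattern) - 1   # bytes carried over from one chunk to the next
    tail = b""                # the carried-over bytes
    base = 0                  # stream offset of the first byte of tail
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf = tail + chunk
        pos = buf.find(pattern)
        while pos != -1:
            yield base + pos
            pos = buf.find(pattern, pos + 1)
        # Keep just enough bytes to catch a match split across two chunks.
        tail = buf[-keep:] if keep else b""
        base += len(buf) - len(tail)

# Made-up sample data; the tiny chunk size forces matches to straddle chunks.
data = b"abgrapefruitcdgrapefruit"
print(list(stream_scan(io.BytesIO(data), b"grapefruit", chunk_size=4)))
```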

There are also systems that are hybrids between the Lexis approach and a 
conventional data base.  One I've seen from BRS divides each document into 
sections such as sender, recipient, and separate paragraphs.  This works well 
if your documents are fairly stylized, as business correspondence usually is, 
and lets you ask for "documents from Smith, to Jones, dated in 1978, 
containing a reference to 'grapefruit.'" 
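
A minimal sketch in Python of what such a hybrid query might look like.  It
is not the BRS design, which I have not seen the inside of; the record
layout, the field names, and the sample letters are all hypothetical.  The
structured fields are matched exactly, as in a conventional data base, and
only the paragraphs are searched as free text.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Letter:
    """One stylized document, split into fields plus free-text paragraphs."""
    sender: str
    recipient: str
    year: int
    paragraphs: list = field(default_factory=list)

def contains_word(letter, word):
    """True if any paragraph of the letter contains word as a whole word."""
    needle = word.lower()
    return any(needle in re.findall(r"[a-z']+", p.lower())
               for p in letter.paragraphs)

def find_letters(letters, sender=None, recipient=None, year=None, word=None):
    """Return the letters that match every criterion that was supplied."""
    hits = []
    for letter in letters:
        if sender is not None and letter.sender != sender:
            continue
        if recipient is not None and letter.recipient != recipient:
            continue
        if year is not None and letter.year != year:
            continue
        if word is not None and not contains_word(letter, word):
            continue
        hits.append(letter)
    return hits

# Made-up sample correspondence.
letters = [
    Letter("Smith", "Jones", 1978, ["Thank you for the grapefruit.",
                                    "Please send more."]),
    Letter("Smith", "Jones", 1979, ["The grapefruit arrived late."]),
    Letter("Brown", "Jones", 1978, ["No grapefruit this year."]),
    Letter("Smith", "Jones", 1978, ["The invoice is enclosed."]),
]
hits = find_letters(letters, sender="Smith", recipient="Jones",
                    year=1978, word="grapefruit")
print(len(hits))
```

A real system would presumably index the fields and the text rather than
loop over every letter, but the shape of the query is the same.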

I'd love to hear about more technically interesting text databases.  Note that 
technologies like CD-ROMs in a sense only make the problem worse, since they 
allow very large amounts of data with relatively slow access to any part of 
it.  There has to be some good way to organize it, and the problem will soon 
be upon us.  
-- 

John Levine, Javelin Software, Cambridge MA 617-494-1400
{ decvax | harvard | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA

The opinions above are solely those of a 12 year old hacker who has broken
into my account, and not those of my employer or any other organization.