Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ima.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!bellcore!decvax!ima!johnl
From: johnl@ima.UUCP (John R. Levine)
Newsgroups: net.database
Subject: Re: full text database systems
Message-ID: <144@ima.UUCP>
Date: Thu, 6-Mar-86 13:05:11 EST
Article-I.D.: ima.144
Posted: Thu Mar 6 13:05:11 1986
Date-Received: Sat, 8-Mar-86 04:33:50 EST
References: <362@isis.UUCP>
Reply-To: johnl@ima.UUCP (John R. Levine)
Distribution: net
Organization: Javelin Software Corp.
Lines: 48
Keywords: database text
Summary: there's no typical full-text database

In article <362@isis.UUCP> jay@isis.UUCP (Jay) writes:
> I am interested in the methods used to create full text retrieval
>databases - e.g. select articles based on words in an article. Specifically,
>is there some general place I can go to get info on design/implementation of
>such a system? Detailed questions such as article storage vs. concordance
>storage vs. query processing vs. on and on.

It is my impression that every full text data base yet built is a special
hand crafted job. The most popular one appears to be LEXIS/NEXIS, which
keeps on line the full text of legal decisions, newspapers, encyclopedias,
and such. They keep complete indices of all of the words in every document,
leaving out only words like "the" which are too common to have much
indexing use. The documents are organized into libraries, e.g. Vermont
superior court decisions for 1973, but they seem to swoop through the
indices to do anything. A Lexis search usually takes the better part of a
minute (although they're clever about sending stuff to your screen to keep
you distracted in the meantime.) This only works because updates to the
data base are applied very infrequently relative to the number of searches,
so they add new text and remake the indices in the middle of the night.
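
To make the word-index idea above concrete, here is a minimal sketch in
Python. It is not how Lexis is actually implemented (I don't know their
internals); it only illustrates indexing every word except a stop list and
answering a query by intersecting the per-word document sets. The stop
list, the sample documents, and the names build_index and search are all
made up for illustration.

```python
import re
from collections import defaultdict

# Assumed stop list; the only stop word actually named above is "the".
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each non-stop word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

def search(index, query):
    """Return sorted ids of documents containing every non-stop query word."""
    words = [w for w in tokenize(query) if w not in STOP_WORDS]
    if not words:
        return []  # a query made only of stop words matches nothing
    postings = [index.get(w, set()) for w in words]
    return sorted(set.intersection(*postings))

# Hypothetical sample documents, keyed by document id.
documents = {
    1: "The court held that the contract was void.",
    2: "Grapefruit prices rose in Vermont in 1973.",
    3: "The Vermont superior court reviewed the contract.",
}

# Rebuilt from scratch, in the spirit of remaking the indices overnight.
index = build_index(documents)

print(search(index, "Vermont court"))   # [3]
print(search(index, "the contract"))    # [1, 3]
```

The index here is rebuilt wholesale rather than updated in place, which is
only tolerable for the reason given above: updates are rare compared to
searches.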

There have also been some attempts at making hardware engines that stream
data from a disk as fast as the disk can provide it with full-track reads,
and scan the text as it goes by. None of them seem to have been very
successful, probably because reading a whole disk, even at full speed,
takes a long time if the disk is at all large. The Britton-Lee IDM has a
similar device which is used to speed up relational queries; it seems to
work well but only because it is embedded in a data base system which
structures and organizes data so that the speed-up board is not looking at
whole disks.

There are also systems that are hybrids between the Lexis approach and a
conventional data base. One I've seen from BRS divides each document into
sections such as sender, recipient, and separate paragraphs. This works
well if your documents are fairly stylized, as business correspondence
usually is, and lets you ask for "documents from Smith, to Jones, dated in
1978, containing a reference to 'grapefruit.'"

I'd love to hear about more technically interesting text databases. Note
that technologies like CD-ROMs in a sense only make the problem worse,
since they allow very large amounts of data with relatively slow access to
any part of it. There has to be some good way to organize it, and the
problem will soon be upon us.

--
John Levine, Javelin Software, Cambridge MA 617-494-1400
{ decvax | harvard | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA

The opinions above are solely those of a 12 year old hacker who has broken
into my account, and not those of my employer or any other organization.