Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site wjh12.UUCP
Path: utzoo!linus!vaxine!wjh12!grc
From: grc@wjh12.UUCP (crane)
Newsgroups: net.general,net.unix-wizards
Subject: UNIX system to house 140 mbyte unformatted textual dbase?
Message-ID: <476@wjh12.UUCP>
Date: Sun, 27-May-84 19:56:09 EDT
Article-I.D.: wjh12.476
Posted: Sun May 27 19:56:09 1984
Date-Received: Wed, 30-May-84 00:12:43 EDT
Organization: Harvard University PSR, Cambridge MA
Lines: 37

We have an unformatted textual database currently comprising 140
mbytes of text, which will grow to about 500 mbytes within the
next two years. Inverted indices have been developed (at 50% overhead
on top of the 140 mbytes of text), but for some applications (such as
fixed phrases or combinations of common words) it is necessary to
perform a linear search on the entire corpus.
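
For a rough sense of the disk space involved, here is a
back-of-the-envelope sketch. It assumes the 50% index overhead stays
the same as the corpus grows, which is only a guess: the figure is
known for the present 140 mbytes alone. The function name is made up
for illustration.

```python
def total_storage_mbytes(text_mbytes, index_overhead=0.50):
    """Text plus inverted indices, with the indices sized as a fraction of the text.

    index_overhead=0.50 is the figure quoted for the current corpus; applying
    it to the projected 500 mbytes is an assumption, not a measurement.
    """
    return text_mbytes * (1.0 + index_overhead)


for text in (140, 500):
    total = total_storage_mbytes(text)
    print(f"{text} mbytes of text -> {total:.0f} mbytes with indices")
```

On those assumptions the present corpus needs about 210 mbytes and the
projected one about 750 mbytes, before any working space.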

a) I am interested in benchmarks showing how fast different machines
can perform linear searches. In particular, I would like to know
how fast the command "egrep xxx /usr/dict/words" (where
/usr/dict/words ~= 200K) runs on a GOULD, PYRAMID, ZILOG or various
68K-based systems. We have access to a VAX 11/750 and 780, a PDP 11/44
and a PIXEL 100. Benchmarks from any other systems would be greatly
appreciated. The PIXEL is quite fast in core, but the disks
are ruinously slow: an otherwise idle PIXEL 100 (with 40 mbyte disks)
can spend only 30% of its time on an egrep. The rest of the time
it is evidently twiddling electrons waiting for more disk blocks.
Does anybody out there have a Sun with the Fujitsu Eagle?
	This dbase has a limited clientele, and the machine would not
need to field more than 4 searches or so at a time, but we could
easily use a more powerful system and would just as soon not dedicate
a system to this database.
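
To make the "30% of its time" kind of figure comparable across
machines, the measurement could be sketched roughly as below. This is
a stand-in for illustration, not the egrep benchmark itself: Python's
re module is not egrep's matcher, and the function names are invented
here. It reports wall-clock time, CPU time, and the ratio of the two.
A low ratio on an otherwise idle machine suggests the search is
waiting on the disk.

```python
import os
import re
import time


def time_linear_search(path, pattern):
    """Scan a file line by line and return (matching_lines, wall_secs, cpu_secs)."""
    regex = re.compile(pattern.encode())
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    matches = 0
    with open(path, "rb") as f:
        for line in f:
            if regex.search(line):
                matches += 1
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return matches, wall, cpu


def cpu_fraction(cpu_secs, wall_secs):
    """Share of elapsed time spent computing rather than waiting."""
    return cpu_secs / wall_secs if wall_secs > 0 else 0.0


if __name__ == "__main__":
    words = "/usr/dict/words"  # path from the post; may not exist on your system
    if os.path.exists(words):
        n, wall, cpu = time_linear_search(words, "xxx")
        print(f"{n} matching lines, {wall:.3f}s elapsed, "
              f"{cpu_fraction(cpu, wall):.0%} of that on the CPU")
```

A file that fits in the buffer cache will show mostly CPU time on a
second run, so a fair disk test needs a file much larger than memory.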

b) Does anyone out there know of a good way to deal with searching
this much data on a UNIX system? Any experiments in distributed
processing that could provide wide access cheaply? This is a read-only
dbase, so we could avoid the UNIX file system and store the data in
big blocks on a raw file system. Has anyone got some special hardware
hanging off of a UNIX system to perform this kind of task?
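
The "big blocks" idea might look something like the sketch below,
offered only as an illustration. It reads the data in large unbuffered
chunks and counts occurrences of a fixed phrase, carrying the last
len(phrase) - 1 bytes of each chunk forward so that a phrase straddling
a block boundary is not missed. The function name and the 1 mbyte
default are arbitrary choices, and whether the same loop works unchanged
against a raw device depends on the system's alignment rules.

```python
def count_phrase(path, phrase, block_size=1 << 20):
    """Count occurrences (overlapping ones included) of a fixed byte phrase.

    The file is read in blocks of up to block_size bytes with Python-level
    buffering turned off, so each read maps to one request for a big block.
    """
    if not phrase:
        raise ValueError("phrase must not be empty")
    overlap = len(phrase) - 1  # bytes to carry across a block boundary
    count = 0
    tail = b""
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            window = tail + block
            # Each occurrence is counted once, in the block holding its last byte.
            pos = window.find(phrase)
            while pos != -1:
                count += 1
                pos = window.find(phrase, pos + 1)
            # The tail is shorter than the phrase, so it cannot hold a whole match.
            tail = window[-overlap:] if overlap else b""
    return count
```

For example, count_phrase("corpus.txt", b"to be or not to be") would scan
a (hypothetical) corpus.txt for that fixed phrase in 1 mbyte reads.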

						Gregory Crane
						Harvard University