Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!usc!ucsd!ucbvax!agate!bionet!arisia!roo!cutting
From: cutting@parc.xerox.com (Doug Cutting)
Newsgroups: comp.arch
Subject: Re: >32 bits: why?
Message-ID: <CUTTING.90Oct26093955@elsinore.parc.xerox.com>
Date: 26 Oct 90 16:39:55 GMT
References: <AGLEW.90Oct23161224@bach.crhc.uiuc.edu>
	<1990Oct25.011034.3664@ingres.Ingres.COM>
	<1990Oct26.012815.11941@watdragon.waterloo.edu>
Sender: news@parc.xerox.com
Organization: Xerox PARC, Palo Alto, CA
Lines: 29
In-reply-to: tbray@watsol.waterloo.edu's message of 26 Oct 90 01:28:15 GMT

In article <1990Oct26.012815.11941@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:

   One counter-example is text databases.  Given the blurring between semantics
   and structure in text, and the need to support things like full text search,
   it is often necessary that pointers have character granularity.  And indexes,
   which need to be persistent, are just collections of pointers.  And text
   databases larger than 2^32 are starting to become rather un-rare.

   Of course, there are ways to work around this, just as there are ways to
   more-or-less do real computation on 16 bit computers.  But there are many
   wins from having your persistent pointers efficiently dealt with via
   word-oriented computer operations.  For example, a not-uncommon text database
   operation might be: which of these 1.87 million citations (pointers starting
   at 0x23fc8932 in the index) contain instances of both "Japan" (37,210
   occurrences in the database) and "telephone" (41,881 occurrences).  There
   are algorithms that can do the pointer manipulations and give you answers
   pretty quick, but they run a lot better if you can do a lot of the
   comparisons and moves in hardware.

In the text indexing literature one usually speaks of addressing words
within documents, not characters within databases.  2^32 is large
enough to hold both the number of documents in today's large text
databases and the number of words (or characters) in the largest
individual documents.  Perhaps you regard this as a workaround, but it
too allows optimizations.  When intersecting document-major citation
lists, one can skip all the occurrences within a document if that
document does not occur in the other citation list.
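
A rough sketch of that skipping, in Python for brevity.  The layout is
an assumption made for illustration, not any particular system's
format: each citation list holds one entry per document, sorted by
document number, and each entry carries the word offsets of the term
within that document.  The names (CitationList, intersect, japan,
telephone) and the sample data are hypothetical.

```python
from typing import List, Tuple

# Assumed layout: one entry per document, sorted by document number.
# Each entry is (doc_id, word offsets of the term in that document).
# Both doc_id and each offset are assumed to fit in 32 bits.
CitationList = List[Tuple[int, List[int]]]


def intersect(a: CitationList, b: CitationList) -> List[Tuple[int, List[int], List[int]]]:
    """Return (doc_id, offsets_in_a, offsets_in_b) for documents in both lists."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        doc_a, pos_a = a[i]
        doc_b, pos_b = b[j]
        if doc_a < doc_b:
            # doc_a is absent from b: skip all of its occurrences at once.
            i += 1
        elif doc_b < doc_a:
            # doc_b is absent from a: skip all of its occurrences at once.
            j += 1
        else:
            # Same document in both lists: keep the offsets for later use.
            out.append((doc_a, pos_a, pos_b))
            i += 1
            j += 1
    return out


# Made-up sample data, not taken from any real database.
japan = [(3, [12, 40]), (7, [5]), (9, [1, 2, 88])]
telephone = [(1, [4]), (7, [6, 19]), (9, [90]), (12, [3])]

both = intersect(japan, telephone)
print([doc for doc, _, _ in both])  # [7, 9]
```

The per-document offset lists are never examined unless the document
appears in both citation lists, which is the saving described above.
With a flat list of character pointers, every occurrence would have to
be compared.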

	Doug