Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!usc!ucsd!ucbvax!agate!bionet!arisia!roo!cutting
From: cutting@parc.xerox.com (Doug Cutting)
Newsgroups: comp.arch
Subject: Re: >32 bits: why?
Message-ID:
Date: 26 Oct 90 16:39:55 GMT
References: <1990Oct25.011034.3664@ingres.Ingres.COM> <1990Oct26.012815.11941@watdragon.waterloo.edu>
Sender: news@parc.xerox.com
Organization: Xerox PARC, Palo Alto, CA
Lines: 29
In-reply-to: tbray@watsol.waterloo.edu's message of 26 Oct 90 01:28:15 GMT

In article <1990Oct26.012815.11941@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:

   One counter-example is text databases. Given the blurring between
   semantics and structure in text, and the need to support things like
   full text search, it is often necessary that pointers have character
   granularity. And indexes, which need to be persistent, are just
   collections of pointers. And text databases larger than 2^32 are
   starting to become rather un-rare.

   Of course, there are ways to work around this, just as there are ways
   to more-or-less do real computation on 16 bit computers. But there
   are many wins from having your persistent pointers efficiently dealt
   with via word-oriented computer operations.

   For example, a not-uncommon text database operation might be: which
   of these 1.87 million citations (pointers starting at 0x23fc8932 in
   the index) contain instances of both "Japan" (37,210 occurrences in
   the database) and "telephone" (41,881 occurrences). There are
   algorithms that can do the pointer manipulations and give you answers
   pretty quick, but they run a lot better if you can do a lot of the
   comparisons and moves in hardware.

In the text indexing literature one usually speaks of addressing words
within documents, not characters within databases. 2^32 is large enough
to hold both the number of documents in today's large text databases and
the number of words (or characters) in the largest individual documents.
Perhaps you regard this as a workaround, but it too allows
optimizations. When intersecting document-major citation lists, one can
skip all the occurrences within a document if that document does not
occur in the other citation list.

Doug
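
A minimal sketch of the skipping described above, in Python. The layout
is an assumption made for illustration, not taken from any particular
system: each citation list is a sequence of (document number, word
positions) pairs sorted by document number. The function name and the
sample data are hypothetical.

```python
def intersect_doc_major(list_a, list_b):
    """Intersect two document-major citation lists.

    Each list holds (doc_id, positions) pairs sorted by doc_id.
    Returns (doc_id, positions_a, positions_b) for every document
    that appears in both lists.
    """
    i = j = 0
    matches = []
    while i < len(list_a) and j < len(list_b):
        doc_a, pos_a = list_a[i]
        doc_b, pos_b = list_b[j]
        if doc_a < doc_b:
            # doc_a is absent from list_b: step past the whole entry,
            # never looking at its individual occurrences.
            i += 1
        elif doc_b < doc_a:
            # Same skip, for a document absent from list_a.
            j += 1
        else:
            matches.append((doc_a, pos_a, pos_b))
            i += 1
            j += 1
    return matches


# Made-up sample data: word positions of each term within each document.
japan = [(3, [14, 92]), (7, [5]), (12, [1, 40, 41])]
telephone = [(7, [6, 300]), (9, [2]), (12, [8])]

both = intersect_doc_major(japan, telephone)
```

Only document numbers are compared in the loop, so the work is
proportional to the number of documents in the two lists rather than the
number of occurrences. Both the document number and the word position
fit in 32 bits under the assumptions stated above.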