Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!usc!ucsd!pacbell.com!att!att!watmath!watserv1!maytag!watdragon!watsol.waterloo.edu!tbray
From: tbray@watsol.waterloo.edu (Tim Bray)
Newsgroups: comp.arch
Subject: >32 bits: why?
Message-ID: <1990Oct27.173326.5806@watdragon.waterloo.edu>
Date: 27 Oct 90 17:33:26 GMT
References: <AGLEW.90Oct23161224@bach.crhc.uiuc.edu> <1990Oct25.011034.3664@ingres.Ingres.COM> <1990Oct26.012815.11941@watdragon.waterloo.edu> <CUTTING.90Oct26093955@elsinore.parc.xerox.com>
Sender: daemon@watdragon.waterloo.edu (Owner of Many System Processes)
Organization: University of Waterloo
Lines: 44

jpk@ingres suggested that >32-bit addressing for persistent objects might not
make any sense, given that the number of *objects* one addresses is usually
<< 2^32.

I replied that in text databases, character addressing is often required and
that 2^32 characters is not hard to come by.
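
For scale, here is a rough sketch of the arithmetic in Python. The function
name and the sample sizes are mine, purely for illustration, and I assume one
byte per character. 2^32 characters is 4 GiB of text; a pointer that must
reach every character of anything larger needs more than 32 bits.

```python
def pointer_bits(n_chars):
    """Bits needed so that every character offset 0..n_chars-1 is addressable."""
    if n_chars < 1:
        raise ValueError("need at least one character")
    # The largest offset is n_chars - 1; bit_length() gives its width in bits.
    return max(1, (n_chars - 1).bit_length())


GIB = 2 ** 30

# Illustrative sizes only (assumptions, not measurements of any real database).
for size in (2 ** 32, 2 ** 32 + 1, 16 * GIB):
    print(size, "characters ->", pointer_bits(size), "bit offsets")
```

A database of exactly 2^32 characters still fits in 32-bit offsets; one more
character and it does not.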

cutting@parc.xerox.com (Doug Cutting) writes:
>In the text indexing literature one usually speaks of addressing words
>within documents, not characters within databases.  2^32 is large
>enough to hold both the number of documents in today's large text
>databases and the number of words (or characters) in the largest
>individual documents.

(This really belongs in comp.text.databases or comp.database.text, which
would be nice groups if they existed.)

Cutting is right if you're content to address *words* in pre-cooked
*documents*.  But:

1. For many applications indexing words is counter-productive; indexing
   arbitrary-length byte strings is much more useful.  The techniques for
   doing this pretty well require byte granularity across the whole database.
2. Assuming that you know what the "documents" are at index creation time
   is just plain wrong, given the nature of text.  Users should have the
   option of defining new "document" structures at arbitrary locations in
   the database at any time based on their own criteria.  This is why
   relational technology is sometimes a poor match for text applications.
3. Most databases contain many types of documents, which will typically nest
   within each other and overlap.  If a particular index point is contained
   within instances of document types "article", "section", "subsection",
   "paragraph", "page", and "cross reference", you're going to have trouble
   building an index structure that efficiently supports this mapping without
   byte granularity.
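
To make points 1-3 concrete, here is a toy sketch in Python.  It is my own
illustration, not a description of any particular product: I use a naively
built suffix array as one example of a byte-granularity index, and every name
in it (build_suffix_array, find, containing, REGIONS) is hypothetical.  The
index stores plain byte offsets into the database; "document" regions are
just (type, start, end) byte ranges that can be declared after the index
exists, and they may nest or overlap freely.

```python
def build_suffix_array(text):
    """Return all byte offsets of text, sorted by the suffix starting there.

    Naive construction by sorting slices; fine for a toy, far too slow
    and memory-hungry for a real database.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, pattern):
    """Return the byte offsets at which pattern occurs, in ascending order."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    # Binary search for the first suffix whose m-byte prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are adjacent in the sorted order.
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


def containing(regions, offset):
    """Return every (type, start, end) region whose byte range covers offset."""
    return [r for r in regions if r[1] <= offset < r[2]]


TEXT = b"alpha beta\ngamma beta\n"
SA = build_suffix_array(TEXT)

# Regions declared after indexing; "span" overlaps both lines.
REGIONS = [
    ("article", 0, 22),
    ("line", 0, 11),
    ("line", 11, 22),
    ("span", 4, 18),
]

# b"a b" straddles a word boundary, so a word index could not answer it.
for pattern in (b"beta", b"a b"):
    for hit in find(TEXT, SA, pattern):
        print(pattern, hit, [kind for kind, _, _ in containing(REGIONS, hit)])
```

Because both the index entries and the region boundaries are byte offsets
into the whole database, each of them is a persistent pointer that has to
span the database's full size.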

So, there are important application classes where persistent pointers of >32
bits are necessary.

But I'm not objective - our company is based on selling software that deals
with texts using a model constrained by #3 above.  I'm really hoping there
are lots of other application types putting on pressure to leapfrog the
32-bit curve.

Cheers, Tim Bray, Open Text Systems