Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!usc!ucsd!pacbell.com!att!att!watmath!watserv1!maytag!watdragon!watsol.waterloo.edu!tbray
From: tbray@watsol.waterloo.edu (Tim Bray)
Newsgroups: comp.arch
Subject: >32 bits: why?
Message-ID: <1990Oct27.173326.5806@watdragon.waterloo.edu>
Date: 27 Oct 90 17:33:26 GMT
References: <1990Oct25.011034.3664@ingres.Ingres.COM> <1990Oct26.012815.11941@watdragon.waterloo.edu>
Sender: daemon@watdragon.waterloo.edu (Owner of Many System Processes)
Organization: University of Waterloo
Lines: 44

jpk@ingres suggested that >32-bit addressing for persistent objects
might not make any sense, given that the number of *objects* one
addresses is usually << 2^32. I replied that in text databases,
character addressing is often required and that 2^32 characters is not
hard to come by.

cutting@parc.xerox.com (Doug Cutting) writes:
>In the text indexing literature one usually speaks of addressing words
>within documents, not characters within databases. 2^32 is large
>enough to hold both the number of documents in todays large text
>databases and the number of words (or characters) in the largest
>individual documents.

(This should really be in comp.text.databases or comp.database.text,
which would be nice groups if they existed).

Cutting is right if you're content to address *words* in pre-cooked
*documents*. But:

1. For many applications indexing words is counter-productive;
   indexing arbitrary-length byte strings is much more useful. The
   techniques for doing this pretty well require database byte
   granularity.

2. Assuming that you know what the "documents" are at index creation
   time is just plain wrong, given the nature of text. Users should
   have the option of defining new "document" structures at arbitrary
   locations in the database at any time based on their own criteria.
   This is why relational technology is sometimes a poor match for
   text applications.

3. Most databases contain many types of documents, which will
   typically nest within each other and overlap. If a particular index
   point is contained within instances of document type "article",
   "section", "subsection", "paragraph", "page", and "cross reference",
   you're going to have trouble building an index structure that
   efficiently supports this mapping without byte granularity.

So, there are important application classes where persistent pointers
>32 bits are necessary. But I'm not objective - our company is based on
selling software that deals with texts using a model constrained by #3
above. I'm really hoping there are lots of other application types
putting on pressure to leapfrog the 32-bit curve.

Cheers, Tim Bray, Open Text Systems
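
A rough illustration of points 2 and 3, for anyone who wants something
concrete to poke at. This is only a sketch under my own assumptions, not
a description of any particular product: the RegionSet class, its
define() and containing() methods, and the sample byte ranges are all
made up for the example. It treats each "document" type as a list of
byte ranges laid over the raw text, so new types can be added at any
time and may nest or overlap, and it shows an index point whose byte
offset does not fit in 32 bits.

```python
from bisect import bisect_right

class RegionSet:
    """Hypothetical table of named region types over one big byte stream."""

    def __init__(self):
        # kind -> sorted list of (start, end) byte ranges, end exclusive.
        # Ranges of one kind are assumed not to overlap each other;
        # ranges of different kinds may nest or overlap freely.
        self.regions = {}

    def define(self, kind, spans):
        """Add or replace a region type without touching the text index."""
        self.regions[kind] = sorted(spans)

    def containing(self, offset):
        """Return {kind: (start, end)} for each region holding offset."""
        hits = {}
        for kind, spans in self.regions.items():
            # Index of the last span whose start is <= offset.
            i = bisect_right(spans, (offset, float("inf"))) - 1
            if i >= 0 and spans[i][0] <= offset < spans[i][1]:
                hits[kind] = spans[i]
        return hits

GIB = 2 ** 30  # bytes in one gibibyte

big = RegionSet()
big.define("article", [(0, 3 * GIB), (3 * GIB, 6 * GIB)])
big.define("section", [(5 * GIB, 5 * GIB + 4000)])
# A "page" defined later, overlapping the section rather than nesting in it.
big.define("page", [(5 * GIB + 3000, 5 * GIB + 9000)])

HIT = 5 * GIB + 3500          # byte offset of some index point
NEEDS_WIDE_POINTER = HIT >= 2 ** 32

print(NEEDS_WIDE_POINTER)     # whether a 32-bit pointer is too small
print(big.containing(HIT))    # every region type containing the hit
```

The lookup is one binary search per region type, which is fine for a
handful of types. The point is only that the regions are expressed in
byte offsets, so once the text passes 2^32 bytes the offsets themselves
need more than 32 bits.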