Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!mcgill-vision!snorkelwacker!usc!samsung!umich!yale!cs.yale.edu!zenith-steven
From: zenith-steven@cs.yale.edu (Steven Ericsson Zenith)
Newsgroups: comp.arch
Subject: Re: Big files, and lots of 'em: 32 bits is not enough
Message-ID: <1990Aug9.142622@cs.yale.edu>
Date: 9 Aug 90 18:26:20 GMT
References: <5539@darkstar.ucsc.edu> <13285@yunexus.YorkU.CA> <30728@super.ORG> <13667@cbmvax.commodore.com> <40644@mips.mips.COM> <1990Aug8.222644.23683@watdragon.waterloo.edu>
Sender: news@cs.yale.edu
Reply-To: zenith-steven@cs.yale.edu (Steven Ericsson Zenith)
Organization: Yale University Computer Science Dept., New Haven, CT 06520-2158
Lines: 61
Nntp-Posting-Host: king.systemsy.cs.yale.edu

In article <1990Aug8.222644.23683@watdragon.waterloo.edu>, tbray@watsol.waterloo.edu (Tim Bray) writes:
|> An example: text database. In a textbase, you must have addressability to the
|> byte, not to the record. Also, it is very very convenient to regard all the
|> text in your universe as being in one linear address space. 32 bits worth of
|> text is not very much text in real-world terms. Here is some 'ls' output from
|> a directory containing the electronic Oxford English Dictionary, Second
|> Edition, and some supporting files.
|>
|> -r--r----- 1 tbray 572728830 Sep 7 1989 oed-2e
|> -r--r----- 1 tbray 179728816 Sep 7 1989 oed-2e.struct
|> -r--r----- 1 tbray 475589360 Sep 8 1989 oed-2e.tree

Can you explain to us what these files contain and how the data in them
is structured/stored/encoded?

|> About 28 bits worth right there. But I want a database with the OED and the
|> complete Shakespeare and Chemical Abstracts and the complete Library of
|> Congress Catalogue and a couple decades' worth of AP wire service; that's
|> almost enough text to be really useful. But seriously folks, there's lots of
|> insurance companies and research institutions and government departments with
|> *lots* more than 4 Gb sitting around...

Isn't there a preferable relative means of addressing your data? Surely
it is more extensible, and then you don't have to worry about the limits
of word size. Not that I'm arguing for small words, but is linear
addressing of data really the burning issue? How do you manage
distributed data? I know: decode the address into smaller components.
So why do you want long words? Why not use several smaller words to
construct an address in the first place?

I address these comments with your particular data set, text, in mind.
Do you *really* want a means to linearly address the documents you
describe? What particular advantage does this give you over the natural
decomposition of the data? When we get to spaces of this size, would
some paging mechanism be preferable?

--
Steven Ericsson Zenith * email: zenith@cs.yale.edu
Fax: (203) 466 2768 | voice: (203) 432 1278
"The tower should warn the people not to believe in it." - P.D.Ouspensky
Yale University Dept of Computer Science
51 Prospect St New Haven CT 06520 USA
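
To make the "several smaller words" question concrete, here is a minimal
sketch (in Python, for brevity) of a two-part address: a component index
plus a byte offset within that component. The names to_pair and to_linear
and the table layout are hypothetical, chosen only for illustration; they
are not taken from the textbase described in the quoted article. The sizes
are those shown in the quoted 'ls' listing.

```python
from bisect import bisect_right
from itertools import accumulate

# Component sizes in bytes, copied from the quoted 'ls' listing.
SIZES = {
    "oed-2e": 572728830,
    "oed-2e.struct": 179728816,
    "oed-2e.tree": 475589360,
}

NAMES = list(SIZES)
# STARTS[i] is the linear offset at which component i begins.
STARTS = [0] + list(accumulate(SIZES.values()))[:-1]
TOTAL = sum(SIZES.values())


def to_pair(linear):
    """Decode a linear byte offset into (component index, offset within it)."""
    if not 0 <= linear < TOTAL:
        raise ValueError("offset outside the collection")
    index = bisect_right(STARTS, linear) - 1
    return index, linear - STARTS[index]


def to_linear(index, offset):
    """Encode (component index, offset) back into one linear byte offset."""
    if not 0 <= offset < SIZES[NAMES[index]]:
        raise ValueError("offset outside the component")
    return STARTS[index] + offset


if __name__ == "__main__":
    index, offset = to_pair(1_000_000_000)
    print(NAMES[index], offset)
```

In this sketch each half of the pair stays within a 32-bit word as long as
no single component exceeds 2**32 bytes, even if the collection as a whole
grows past that. The cost is the table lookup on every conversion, which is
the trade-off the questions above are asking about.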