Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2.fluke 9/24/84; site tpvax.fluke.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!mhuxt!houxm!vax135!cornell!uw-beaver!fluke!inc
From: inc@fluke.UUCP (Gary Benson)
Newsgroups: net.internat,net.nlang
Subject: Re: Texts in other languages
Message-ID: <738@tpvax.fluke.UUCP>
Date: Mon, 2-Dec-85 13:45:20 EST
Article-I.D.: tpvax.738
Posted: Mon Dec 2 13:45:20 1985
Date-Received: Tue, 3-Dec-85 20:21:00 EST
References: <517@harvard.UUCP>
Organization: John Fluke Mfg. Co., Inc., Everett, WA
Lines: 34
Keywords: Standard texts, compression
Xref: watmath net.internat:106 net.nlang:3828

> For an experiment in text compression, I would find it useful to have
> a collection of texts in a variety of languages. Ideally, I would
> like a half-dozen distinct texts, each 2000-15000 words long, in each
> language. The texts should be in a consistent (documented)
> transcription, preferably without formatting commands. The texts need
> not be selected to be `representative' of the language. For instance,
> technical papers are fine. The languages in which I am interested are
> French, Italian, German, (Modern) Greek, Arabic, and Turkish. If you
> have texts in other languages, please let me know.
>
> If you could send me mail describing the texts you might be able to
> provide, we can find some way of transferring them later.
>
> Thanks
> -s
>
> Macrakis@Harvard.{Harvard.EDU,ARPA,uucp,csnet}
> @Harvunxh.bitnet

When Xerox bought Diablo and got into the high-speed character printer
business, they used a computer to generate a paragraph of "standard English
text". It was nonsense to read of course, but contained the proper
distribution of letters, word lengths, and sentence lengths. They used the
text to time their printers. Why not use a "standard text" from each of the
languages you are interested in?
It seems to me that way you would get a clearer picture of the performance
of a compression algorithm *on text in the language of interest* than you
will with a transcription, which, to my mind at least, is really just a term
meaning a translation.

--
Gary Benson * John Fluke Mfg. Co. * PO Box C9090 * Everett WA * 98206
MS/232-E = = {allegra} {uw-beaver} !fluke!inc = = (206)356-5367
_-_-_-_-_-_-_-_-ascii is our god and unix is his profit-_-_-_-_-_-_-_-_-_-_-_