Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!cbosgd!ihnp4!inuxc!iuvax!bsu-cs!neubauer
From: neubauer@bsu-cs.UUCP
Newsgroups: comp.std.internat,sci.lang
Subject: Re: Computers and human languages (was Re: What is a byte)
Message-ID: <1043@bsu-cs.UUCP>
Date: Sat, 22-Aug-87 20:04:50 EDT
Article-I.D.: bsu-cs.1043
Posted: Sat Aug 22 20:04:50 1987
Date-Received: Sun, 23-Aug-87 19:36:08 EDT
References: <111@quick.UUCP> <2842@ulysses.homer.nj.att.com>
Organization: CS Dept, Ball St U, Muncie, Indiana
Lines: 45
Xref: utgpu comp.std.internat:148 sci.lang:1125

In article <111@quick.UUCP>, srg@quick.UUCP (Spencer Garrett) writes:
> I was told once (by a respected linguist, as I recall) that English and
> Russian are the ONLY two languages written with unaccented alphabets.  I
> know you have to add the qualifier "modern" to make that true, and maybe
> "major" as well, although I don't know of any exceptions right off.  I don't
   ^^^^^ that might do it
> know whether he didn't count Katakana and Hiragana as alphabets or whether

They are not alphabets, they are syllabaries, i.e., each symbol represents
a whole syllable.

> one cannot (or normally would not) write Japanese entirely in one or both
> of these scripts.  He seemed to think that an unaccented alphabet was a
> substantial advantage in an information age, and I would tend to agree.

So would I.

In article <2842@ulysses.homer.nj.att.com>, jss@hector..UUCP (Jerry Schwarz) writes:
> I quote from a draft of the Rationale of the proposed
> ANSI C standard, section 4.4:
>       The English language uses 26 letters derived from the
>       Latin alphabet.  The set of letters suffices for English,
>       Swahili, and Hawaiian; all other living languages use
>       either the Latin aphabet plus other characters, or other
>       non Latin aphabets or syllabaries.
> They cite no reference for this piece of trivia.

Just as well, since it is not true.  Another counterexample (from off the
top of my head): Hmong.
If necessary, we could undoubtedly come up with more, but there is really
no point.  We don't really need to worry about it for C programs, since
the characters needed for those are already known.  What we do need to
worry about is how to set up computer facilities, e.g., keyboards, and how
to represent the modified letters in languages that DO have diacritics.

It has already been established that simply using the high bit of an 8-bit
byte as a modified/unmodified flag will not do, both because a single
letter can carry several different diacritics within a given language, and
because of multi-lingual text.  It is certainly far less elegant to simply
assign a byte from the upper half of the byte range (i.e., with the high
bit set) to each known modified letter.  If we stick to the Latin
alphabet, though, there are probably enough unassigned bytes to do it.
That will leave very odd sets of bit patterns to represent the letters of
a given language, but the alternative would appear to be to scrap ASCII
altogether if we intend to make some kind of rational scheme of it.

--
Paul Neubauer
UUCP: {ihnp4,seismo}!{iuvax,pur-ee}!bsu-cs!neubauer
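
The two encoding schemes discussed above can be compared in a short sketch.
This is an illustration only, written in Python for brevity.  The letters
chosen, the byte values starting at 0x80, and the names flag_encode,
table_encode and UPPER_HALF are all hypothetical; none of them comes from
any standard or existing character set.

```python
# Scheme 1: use the high bit as a single "modified" flag on an ASCII letter.
HIGH_BIT = 0x80

def flag_encode(base, modified):
    """Return the byte for an ASCII letter, setting the high bit if modified."""
    code = ord(base)
    return code | HIGH_BIT if modified else code

# French alone puts four different diacritics on 'e'.
# e-acute, e-grave, e-circumflex, e-diaeresis
french_e = ["\u00e9", "\u00e8", "\u00ea", "\u00eb"]

# With one flag bit, every modified 'e' collapses onto the same byte.
flag_bytes = {letter: flag_encode("e", True) for letter in french_e}
collisions = len(french_e) - len(set(flag_bytes.values()))

# Scheme 2: give each modified letter its own byte in the upper half
# (0x80-0xFF).  The assignment order is arbitrary (an assumption here).
# Added letters: a-umlaut, o-umlaut, u-umlaut, n-tilde
modified_letters = french_e + ["\u00e4", "\u00f6", "\u00fc", "\u00f1"]
UPPER_HALF = {letter: 0x80 + i for i, letter in enumerate(modified_letters)}

def table_encode(text):
    """Encode text: ASCII bytes pass through, modified letters use the table."""
    out = bytearray()
    for ch in text:
        if ch in UPPER_HALF:
            out.append(UPPER_HALF[ch])
        elif ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            raise ValueError(f"no byte assigned for {ch!r}")
    return bytes(out)
```

Under the flag scheme the four French forms of 'e' are indistinguishable,
while the table scheme keeps them apart and lets French and German letters
share one text.  The cost is the one noted above: the bytes for any one
language's letters end up scattered in no rational order.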