Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!mcvax!enea!kuling!andersa
From: andersa@kuling.UUCP (Anders Andersson)
Newsgroups: comp.std.internat
Subject: Re: Character representation
Message-ID: <476@kuling.UUCP>
Date: Sat, 22-Aug-87 16:58:50 EDT
Article-I.D.: kuling.476
Posted: Sat Aug 22 16:58:50 1987
Date-Received: Sun, 23-Aug-87 21:42:50 EDT
References: <2171@enea.UUCP> <709@maccs.UUCP> <2183@enea.UUCP> <719@maccs.UUCP>
Reply-To: andersa@kuling.UUCP (Anders Andersson)
Organization: Uppsala University, Sweden
Lines: 50

In article <719@maccs.UUCP> gordan@maccs.UUCP (Gordan Palameta) writes:
>t umlaut or q cedilla would probably be used very rarely, nor is it likely
>that anyone would go to the trouble of designing a font to accommodate such
>characters.  Another cost of such generality would be that accents and other
>marks would probably have to be indicated by escape sequences in conjunction
>with the unmodified letter.  This would make string-processing software more
>complicated (and slower), and text would be longer.

The existence of something like four different 94-character "Latin" sets
suggests that 8 bits wouldn't suffice anyway, although I haven't counted
the exact number of existing glyph combinations. I believe Welsh includes
some strange accented consonants, but I don't remember which (maybe w^).
If you also take Vietnamese into account, which allows several accents on
one letter at the same time, you'd definitely overflow the table.

TeX provides for arbitrary combinations of accents, and I think this
approach is quite simple (although I don't suggest TeX as the encoding
scheme to be used for files in general). I don't think anybody has to
design a font by hand for each and every combination, as the acute accent
over e looks pretty much the same as the acute accent over o, and the
combination could be done automatically at display time.
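A minimal sketch in C of composing an accented letter at display time
rather than drawing every combination by hand. Everything in it is an
assumption made for illustration: the 8x8 bitmap cell, the Glyph and
SpecialCase types, and the overlay() and compose() names are hypothetical
and are not taken from TeX or from any real display system. It also assumes
that accents are drawn in rows the base letter leaves empty, and it allows
a small table of hand-drawn exceptions for pairs where mechanical
composition looks wrong.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GLYPH_ROWS 8   /* assumed 8x8 bitmap cell, one byte per row */

typedef struct { uint8_t rows[GLYPH_ROWS]; } Glyph;

/* A hand-drawn exception for one (base, accent) pair that should not be
   composed mechanically. The ids are arbitrary tags chosen by the caller. */
typedef struct { char base; char accent; Glyph glyph; } SpecialCase;

/* General rule: OR the accent bitmap onto the base bitmap.
   Assumes the accent occupies rows that the base letter leaves empty. */
Glyph overlay(Glyph base, Glyph accent)
{
    for (int r = 0; r < GLYPH_ROWS; r++)
        base.rows[r] |= accent.rows[r];
    return base;
}

/* Use a hand-drawn glyph if one is listed, else apply the general rule. */
Glyph compose(char base_id, Glyph base, char accent_id, Glyph accent,
              const SpecialCase *special, size_t n_special)
{
    assert(special != NULL || n_special == 0);
    for (size_t i = 0; i < n_special; i++)
        if (special[i].base == base_id && special[i].accent == accent_id)
            return special[i].glyph;
    return overlay(base, accent);
}
```

With this layout one acute-accent bitmap serves e, o and any other base
letter, so the font needs one glyph per letter and one per accent plus the
exception table, not one glyph per combination. Several accents on one
letter can be handled by calling overlay() repeatedly.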
Some characters will need special treatment though, like the Swedish
capital A with a circle above (the two should usually touch each other)
and the Polish bar-crossed L. The amount of programming and CPU power
needed for this depends on the display quality and resolution you require.

If this general approach turns out to be the most practical one
technically, some people may of course go hog wild putting circles under X
and cedillas over 7, but there is as little point in stopping them as in
preventing people from writing "fiYw#s" with a proportional font. Just
apply the general accent attachment rule and they'll be quiet...

>	if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)
>with very little loss of efficiency.  To accommodate perverse languages like
>Spanish and Polish which insist on two-letter combinations for sorting,

What about the "Mac" or "Mc" prefix in English (Scottish?) proper names?
I agree this example is a little extreme in comparison to the Spanish
"graphemes" ch, ll and rr (?), as well as Czech ch. Maybe the English
don't mind seeing "McDonald" sorted after "Machiavelli", or whatever the
rule is/was - has it been abolished by now?

There are different kinds of sorting even within one language, depending
on the context. Donald E. Knuth provides a wonderful collection of rules
for bibliographic use in the beginning of his "Fundamentals ..." volume on
Sorting & Searching, such as ignoring articles and spelling out numbers.
These rules don't apply to filenames in a UNIX directory, I think!
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)
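
A minimal sketch in C of sorting with two-letter combinations, as
discussed above. It is an illustration under stated assumptions, not
the coll[] scheme from the quoted article: the names next_weight() and
spanish_compare() and the weight numbering are hypothetical, the input is
assumed to be ASCII, and only the traditional Spanish ch and ll are
handled (rr, Czech ch and the Mac/Mc question are left out).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the collation weight of the next element of *p and advance *p.
   Plain letters get even weights (a=2, b=4, ...), so that a digraph can
   take the odd weight just after its first letter: ch=7 follows c=6,
   ll=25 follows l=24. Assumes ASCII, where 'a'..'z' are contiguous. */
static int next_weight(const char **p)
{
    const char *s = *p;
    int c = tolower((unsigned char)s[0]);
    if (c == '\0')
        return 0;                       /* end of string sorts first */
    int n = tolower((unsigned char)s[1]);
    if ((c == 'c' && n == 'h') || (c == 'l' && n == 'l')) {
        *p = s + 2;                     /* consume both letters */
        return 2 * (c - 'a' + 1) + 1;
    }
    *p = s + 1;
    if (c >= 'a' && c <= 'z')
        return 2 * (c - 'a' + 1);
    return 1;   /* assumption: everything else sorts before the letters */
}

/* Compare two strings element by element; returns -1, 0 or 1. */
int spanish_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;;) {
        int wa = next_weight(&a);
        int wb = next_weight(&b);
        if (wa != wb)
            return wa < wb ? -1 : 1;
        if (wa == 0)
            return 0;                   /* both strings ended together */
    }
}
```

Under these weights "chico" sorts after "cuna" but before "dama", and
"llama" sorts after "luz", which a plain byte comparison would not give.
The cost over a single-character table is one extra lookahead per letter.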