Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!mcvax!enea!sommar
From: sommar@enea.UUCP (Erland Sommarskog)
Newsgroups: comp.std.internat
Subject: Character representation
Message-ID: <2171@enea.UUCP>
Date: Tue, 11-Aug-87 17:08:37 EDT
Article-I.D.: enea.2171
Posted: Tue Aug 11 17:08:37 1987
Date-Received: Fri, 14-Aug-87 01:43:29 EDT
Reply-To: sommar@enea.UUCP (Erland Sommarskog)
Followup-To: comp.std.internat
Organization: ENEA DATA Svenska AB, Sweden
Lines: 30

Two things have inspired me to write this article:
1) Reading the (proposed) standards ISO/Latin 1-4.
2) The discussion "What is a byte".

Reading the standards, you discover a whole lot of letters you never
dreamt of, yet they have something in common. (I'm only talking about
Latin letters, but it applies to Greek and Cyrillic as well.) With a
few exceptions, it is the same letters that reappear, just modified in
some way: they have accents, cedillas, rings, dots, strokes, etc. Thus,
many are combinations of two or more characters.

The standards are an attempt to satisfy the requirements of the
different languages by assigning each combination an integer value.
But isn't a character a more complicated data type than a simple
enumeration type? In some languages the combination may constitute a
new letter ("a" with a ring or with dots, "o" with dots in Swedish); in
others you can apply accents and other signs without affecting the
sorting (e.g. French, Italian).

I think that the simple representation of characters is entirely due
to the dominant position of the English language in the computer world.
If computers had been invented in France, the problem would have been
solved. (And if they had been Swedish, Englishmen would have to accept
"v" and "w" being equivalent.)

The conclusion is that a more sophisticated approach must be taken.
However, I must admit that I do not have any bright proposals right
now. Still, think about it!

--
Erland Sommarskog
ENEA Data, Stockholm
sommar@enea.UUCP
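
To make the point about combinations and sorting concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a proposal: it borrows Unicode's canonical decomposition (which is not part of the ISO/Latin drafts discussed above) as one possible way to split a letter into a base and its marks. The names `decompose`, `french_key`, `swedish_key` and `SWEDISH_ALPHABET` are hypothetical, and both sort keys are deliberate simplifications of real collation rules.

```python
import unicodedata

def decompose(ch):
    """Split one character into (base letter, names of its combining marks)."""
    parts = unicodedata.normalize("NFD", ch)
    return parts[0], [unicodedata.name(m) for m in parts[1:]]

def french_key(word):
    """Simplified key: accents are dropped, so only base letters are compared."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# a-z followed by a-ring, a-dots, o-dots, which are separate letters after "z".
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Simplified key: the three extra letters sort after "z"; "w" counts as "v"."""
    composed = unicodedata.normalize("NFC", word.lower())
    key = []
    for c in composed:
        if c == "w":
            c = "v"  # assumption taken from the remark about "v" and "w" above
        # Characters outside the alphabet all sort last in this sketch.
        pos = SWEDISH_ALPHABET.find(c)
        key.append(pos if pos >= 0 else len(SWEDISH_ALPHABET))
    return key

words = ["zebra", "\u00e4rlig", "apa"]
print(decompose("\u00e5"))              # base letter plus its mark
print(sorted(words, key=french_key))    # the dotted "a" sorts among the a's
print(sorted(words, key=swedish_key))   # the dotted "a" sorts after "z"
```

The same stored character yields two different orderings depending on the language rule applied, which is the sense in which a character behaves like a structured value and not a bare integer code.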