Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!mcvax!enea!sommar
From: sommar@enea.UUCP (Erland Sommarskog)
Newsgroups: comp.std.internat
Subject: Character representation
Message-ID: <2171@enea.UUCP>
Date: Tue, 11-Aug-87 17:08:37 EDT
Article-I.D.: enea.2171
Posted: Tue Aug 11 17:08:37 1987
Date-Received: Fri, 14-Aug-87 01:43:29 EDT
Reply-To: sommar@enea.UUCP(Erland Sommarskog)
Followup-To: comp.std.internat
Organization: ENEA DATA Svenska AB, Sweden
Lines: 30

Two things inspired me to write this article:
1) Reading the (proposed) standards ISO/Latin 1-4.
2) The discussion "What is a byte".
Reading the standards, you discover that there are a great many
letters you never dreamt of, yet they have something in common.
(I'm only talking about Latin letters, but it applies to Greek and
Cyrillic as well.) With a few exceptions, the same letters
reappear; they are just modified in some way. They have accents,
cedillas, rings, dots, strokes etc. Thus, many are combinations of
two or more characters.
  
The standards are an attempt to satisfy the requirements of the
different languages by assigning each combination an integer
value. But isn't a character a more complicated data type
than just a simple enumeration type? In some languages the
combination may constitute a new letter ("a" with ring and dots,
"o" with dots in Swedish); in others you can apply accents and
other signs without affecting the sorting (e.g. French, Italian).
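
To make the point concrete, here is a minimal sketch in Python of a
character treated as a base letter plus modifiers, with the sorting
rule left to the language. Everything in it is an assumption made
for illustration: the function names are hypothetical, the standard
unicodedata module does the decomposition (nothing of the kind is
part of the proposed standards), and the two rules are simplified.

```python
import unicodedata

def decompose(ch):
    """Split one character into (base letter, tuple of combining marks)."""
    parts = unicodedata.normalize("NFD", ch)
    return parts[0], tuple(parts[1:])

def accent_insensitive_key(word):
    # Simplified French/Italian-style rule (an assumption): accents
    # and other marks are ignored, only base letters are compared.
    decomposed = unicodedata.normalize("NFD", word.lower())
    return [c for c in decomposed if not unicodedata.combining(c)]

# Swedish alphabet: a-ring, a-dots and o-dots are letters of their
# own and come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    # Simplified Swedish-style rule (an assumption): the three extra
    # letters keep their own rank, any other marked letter sorts as
    # its base letter.
    key = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch not in SWEDISH_ALPHABET:
            ch = decompose(ch)[0]
        key.append(SWEDISH_ALPHABET.find(ch))
    return key

words = ["zebra", "\u00e4pple", "apple"]
by_base_letter = sorted(words, key=accent_insensitive_key)
by_swedish_rule = sorted(words, key=swedish_key)
```

The same three words come out in a different order under each rule,
which a single table of integer values cannot express on its own.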
  I think that the simple representation of characters is entirely
due to the dominant position of the English language in the computer
world. If computers had been invented in France, the problem would
already have been solved. (And if they had been Swedish, Englishmen
would have had to accept "v" and "w" being equivalent.)
  The conclusion is that a more sophisticated approach must be taken.
However, I must admit that I do not have any bright proposals right
now. Still, do think about it!
-- 

Erland Sommarskog       
ENEA Data, Stockholm    
sommar@enea.UUCP