Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84 + RN 4.3; site inset.UUCP
Path: utzoo!linus!philabs!cmcl2!seismo!mcvax!ukc!stc!inset!mikeb
From: mikeb@inset.UUCP (Mike Banahan)
Newsgroups: net.internat
Subject: character sets
Message-ID: <719@inset.UUCP>
Date: Tue, 8-Oct-85 07:14:47 EDT
Article-I.D.: inset.719
Posted: Tue Oct  8 07:14:47 1985
Date-Received: Fri, 11-Oct-85 08:36:55 EDT
Reply-To: mikeb@inset.UUCP (Mike Banahan)
Organization: The Instruction Set Ltd., London, UK.
Lines: 95
Xpath: stc stc-a

Pete Delaney suggests that character sets would be a good place to start.

He's right - it's a horrible area.

The first problem that strikes typical C programmers is how they should
represent characters outside the normal ASCII set. They then start thinking
about using the `top' bit to extend the range of usable characters up to 255.
Somebody throws in a suggestion that the Japanese will want around 7000
(seven thousand) characters, so the next idea is to start using shift
sequences.

In fact, there are a whole bunch of industry `standards' for this sort
of thing. For those of us who can get by with 256 characters, in Western
Europe (including Iceland), this is not a bad solution. Draft ISO Standard
(DIS) 8859-1 gives us what we think we need. ISO 2022 gives a suggested set
of shifting mechanisms which allow the top 128 characters of 8859-1 to be
switched on the fly, so that in 8 bits I can produce documents
in English, French, German, Icelandic and so on. If I want to throw in some
Greek (whose characters *aren't* in the top half of 8859-1), then
I can use either a locking or non-locking shift sequence which says `the
top 128 characters are now some other set' and in this way get some Greek
in there.

And so it goes on, up to ways of getting 16 bit characters.
In fact one of the problems here is that there are so many standards,
that there aren't any, if you see what I mean.

But there are problems. First, characters aren't fixed length any more.
You should see what *that* does to C code. Fixed length arrays no longer
hold a fixed number of characters, and you can't index into them to find
the nth character, because the byte you land on may be a shift code, or
may be preceded by one, in which case it means something else.
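
To make the indexing problem concrete, here is a minimal sketch in C. It
assumes a made-up encoding with one non-locking shift byte that changes the
meaning of the single byte following it; the value of SHIFT and the names
char_count() and char_offset() are my own illustrations, not part of any
standard or proposal.

```c
#include <stddef.h>

/* Hypothetical non-locking shift code: it changes the meaning of the
 * single byte that follows it. The value is an assumption for this sketch. */
#define SHIFT 0x8E

/* Count characters (not bytes) in the first len bytes of buf. */
size_t char_count(const unsigned char *buf, size_t len)
{
    size_t i = 0, n = 0;

    while (i < len) {
        /* A shift code plus the byte it modifies is one character. */
        i += (buf[i] == SHIFT && i + 1 < len) ? 2 : 1;
        n++;
    }
    return n;
}

/* Return the byte offset at which character number n (0-based) starts,
 * or len if the buffer holds fewer than n + 1 characters. */
size_t char_offset(const unsigned char *buf, size_t len, size_t n)
{
    size_t i = 0;

    while (i < len && n > 0) {
        i += (buf[i] == SHIFT && i + 1 < len) ? 2 : 1;
        n--;
    }
    return i;
}
```

With the four bytes 'a', SHIFT, 0x61, 'b' the buffer holds three characters,
and the third one starts at byte offset 3 rather than 2. Finding the nth
character has become a scan from the start of the buffer instead of a
constant-time array index.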

Toupper() and tolower() have to be warned what the current top half
of the codeset is.
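
A rough sketch of what that might look like, assuming ASCII in the bottom
half and the layout of the published ISO 8859-1 in the top half. The enum
top_half and the function cs_toupper() are hypothetical names invented for
this illustration; they are not an existing library interface.

```c
/* Hypothetical tags for whatever set currently occupies codes 128-255. */
enum top_half { TOP_LATIN1, TOP_OTHER };

/* Upper-case one 8-bit code, given the set selected for the top half.
 * Assumes the bottom half is ASCII. */
int cs_toupper(int c, enum top_half top)
{
    if (c < 0x80)
        return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;

    /* In ISO 8859-1 the lower-case letters 0xE0-0xFE sit 0x20 above
     * their upper-case forms, except 0xF7, which is not a letter. */
    if (top == TOP_LATIN1 && c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return c - 0x20;

    /* Any other set: no mapping is known in this sketch, so leave it. */
    return c;
}
```

The same code 0xFC becomes 0xDC when the top half is 8859-1 and is left
alone otherwise, so the caller has to carry the current shift state around
with every character it wants to convert.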

And much, much more.

Moving on from character sets to interpreting their meaning, we tread
on a particularly obnoxious little serpent: Regular Expressions.
This is a famous little problem in its own right, and it is caused
by ranges in REs. If the current codeset doesn't use a consecutive
encoding for the characters in its repertoire, what does
	[a-z]
mean?????
It's more obvious with a concrete example: let's use German and the
convention that <u"> means u with an umlaut. What does the last
regular expression mean? Does it include <u"> or not? Does it really
mean "all alphabetic characters" (in which case does it embrace
Greek alpha through omega?) and if it does, does it include vowels
with an umlaut? If not, do they have to be put in explicitly?
How, if I want to, do I write a regular expression explicitly to match
all alphabetic characters with or without umlauts?
How, with grep, do I find only those lines with at least one umlaut?
This problem rolls on and on and on. It's even better with the kanji
ideographic languages :-).
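
A small C sketch of the range problem, assuming text in ISO 8859-1 with no
shift sequences in it. The function names are made up for this example.
The first function is what a naive byte-wise reading of [a-z] amounts to;
the other two answer the grep question by listing the six German umlauted
vowels explicitly.

```c
#include <stddef.h>

/* A byte-wise reading of [a-z]: a plain comparison of code values. */
int in_a_to_z(unsigned char c)
{
    return c >= 'a' && c <= 'z';
}

/* 1 if c is a German umlauted vowel in ISO 8859-1, upper or lower case. */
int is_umlaut_latin1(unsigned char c)
{
    switch (c) {
    case 0xC4: case 0xD6: case 0xDC:    /* upper-case A, O, U with umlaut */
    case 0xE4: case 0xF6: case 0xFC:    /* lower-case a, o, u with umlaut */
        return 1;
    default:
        return 0;
    }
}

/* 1 if the line contains at least one umlauted vowel. */
int line_has_umlaut(const unsigned char *line, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (is_umlaut_latin1(line[i]))
            return 1;
    return 0;
}
```

Here in_a_to_z() accepts 'u' but rejects 0xFC, the 8859-1 code for <u">,
so the range silently leaves out a letter that any German reader would
place in the alphabet. The explicit list works, but only for this one
codeset.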

Collating sequences become very interesting round about now - but that's
a whole article to itself!

Back to character encoding methods. The current AT&T proposals are based
on ISO 2022, in a draft document released to the /usr/group/uk working
party, dated June 24th, 1985. Copies of it, and other relevant literature
received so far, can be obtained by writing to
	Mrs. J. H. Burley,
	Secretary,
	/usr/group/uk,
	8, Chequer Street,
	St. Albans, Herts,
	AL1 3YJ
	England.
and saying that you wish to be put on the Internationalisation mailing list.

For my own part, I believe that the discussion of how to encode stuff is
premature. I think that it is more important to find out what `characters'
the users want first. If a solution that cannot easily handle such features
as `all european and asian characters in different fonts and point sizes'
is proposed, yet the users want exactly those features, then we have let them
down. If they say that they have got used to working in English and don't
want anything different, then there is no point in changing.

We know for a fact, though, that the latter, English only, is already not an
option. The time may have come for a much more radical solution, with
an abstract object-oriented view of character handling. I am personally
convinced that it has, and have prepared a paper on the topic for those who
wish to see it. It is a little large to post to the news net, but I will
mail it to those who want to see it. (It uses pic for the diagrams; sorry).

Please, let's see some real debate on these topics. THEY MATTER.
Internationalisation may be the next big hurdle for computers to overcome.
Users want to use their own language and characters; the technical
problems are fascinating and the market opportunities immense!
-- 
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb