Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84 + RN 4.3; site inset.UUCP
Path: utzoo!linus!philabs!cmcl2!seismo!mcvax!ukc!stc!inset!mikeb
From: mikeb@inset.UUCP (Mike Banahan)
Newsgroups: net.internat
Subject: character sets
Message-ID: <719@inset.UUCP>
Date: Tue, 8-Oct-85 07:14:47 EDT
Article-I.D.: inset.719
Posted: Tue Oct 8 07:14:47 1985
Date-Received: Fri, 11-Oct-85 08:36:55 EDT
Reply-To: mikeb@inset.UUCP (Mike Banahan)
Organization: The Instruction Set Ltd., London, UK.
Lines: 95
Xpath: stc stc-a

Pete Delaney suggests that character sets would be a good place to start. He's right - it's a horrible area.

The first problem that strikes typical C programmers is how they should represent characters outside the normal ASCII set. They then start thinking about using the `top' bit to extend the range of usable characters up to 255. Somebody throws in a suggestion that the Japanese will want around 7000 (seven thousand) characters, so the next idea is to start using shift sequences.

In fact, there are a whole bunch of industry `standards' for this sort of thing. For those of us in Western Europe (including Iceland) who can get by with 256 characters, this is not a bad solution. Draft ISO Standard (DIS) 8859-1 gives us what we think we need. ISO 2022 gives a suggested set of shifting mechanisms which allow the top 128 characters of 8859-1 to be switched on the fly, so that in 8 bits I can produce documents in English, French, German, Icelandic and so on. If I want to throw in some Greek (whose characters *aren't* in the top half of 8859-1), then I can use either a locking or non-locking shift sequence which says `the top 128 characters are now some other set' and in this way get some Greek in there. And so it goes on, up to ways of getting 16 bit characters. In fact one of the problems here is that there are so many standards that there aren't any, if you see what I mean.

But there are problems.
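To make the shifting idea concrete before getting to those problems, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the two shift bytes, the names SHIFT_TO_GREEK and SHIFT_TO_LATIN, and the tiny tables are invented, and they are NOT the real ISO 2022 sequences or the full 8859-1 repertoire. It only shows how a locking shift changes what the top half of the codeset means.

```python
# Toy scheme, invented for illustration: NOT the real ISO 2022 byte sequences.
SHIFT_TO_GREEK = 0x0E   # hypothetical locking shift: top half becomes Greek
SHIFT_TO_LATIN = 0x0F   # hypothetical locking shift: top half becomes Latin

# Tiny sample tables for the top half (bytes >= 0x80); real sets are far larger.
LATIN_TOP = {0xE9: "é", 0xFC: "ü"}
GREEK_TOP = {0xE1: "α", 0xF9: "ω"}


def decode(data):
    """Return the list of characters encoded in data under the toy scheme."""
    top = LATIN_TOP                  # shift state carried from byte to byte
    chars = []
    for b in data:
        if b == SHIFT_TO_GREEK:
            top = GREEK_TOP          # a shift byte changes state, yields no character
        elif b == SHIFT_TO_LATIN:
            top = LATIN_TOP
        elif b < 0x80:
            chars.append(chr(b))     # the bottom half is always ASCII
        else:
            chars.append(top.get(b, "?"))  # meaning depends on the current state
    return chars


# "für ", then a shift, one Greek letter, a shift back, one accented letter.
data = bytes([0x66, 0xFC, 0x72, 0x20, 0x0E, 0xE1, 0x0F, 0xE9])
chars = decode(data)
print(len(data), len(chars))         # byte count versus character count
print(hex(data[5]), chars[5])        # byte at index 5 versus character at index 5
```

In this sample the eight bytes hold only six characters, and the byte at index 5 is not the character at index 5. The decoder also has to carry state along with it, which is the part that hurts.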
First, characters aren't fixed length any more. You should see what *that* does to C code. Fixed length arrays aren't fixed in length any more; you can't index into them to find the nth character, because if it's preceded by a shift code it will mean something else. Toupper() and tolower() have to be warned what the current top half of the codeset is. And much, much more.

Moving on from character sets to interpreting their meaning, we tread on a particularly obnoxious little serpent: Regular Expressions. This is a famous little problem in its own right, and it is caused by ranges in REs. If the current codeset doesn't use a consecutive encoding for the characters in its repertoire, what does [a-z] mean?

It's more obvious with a concrete example: let's use German and the letter ü (u with an umlaut). What does the last regular expression mean? Does it include ü or not? Does it really mean "all alphabetic characters" (in which case does it embrace Greek alpha through omega?) and if it does, does it include vowels with an umlaut? If not, do they have to be put in explicitly? How, if I want to, do I write a regular expression explicitly to match all alphabetic characters with or without umlaute? How, with grep, do I find only those lines with at least one umlaut?

This problem rolls on and on and on. It's even better with the kanji ideographic languages :-). Collating sequences become very interesting round about now - but that's a whole article to itself!

Back to character encoding methods. The current AT&T proposals are based on ISO 2022, and are set out in a draft document released to the /usr/group/uk working party, dated June 24th, 1985. Copies of it, and other relevant literature received so far, can be obtained by writing to

	Mrs. J. H. Burley,
	Secretary, /usr/group/uk,
	8, Chequer Street,
	St. Albans,
	Herts, AL1, 3YJ
	England.

and saying that you wish to be put on the Internationalisation mailing list.
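To come back to the [a-z] question for a moment, here is how one particular engine answers it. This is only a sketch in Python, whose re module treats a range as a run of consecutive character codes; other tools may well decide differently, and none of this says what the answer *ought* to be. The names UMLAUTS, in_plain_range and lines_with_umlaut are invented for the example, and it assumes that each umlauted vowel is stored as a single character.

```python
import re

# Assumption: each umlauted vowel is one precomposed character.
UMLAUTS = "äöüÄÖÜ"


def in_plain_range(ch):
    """True if ch matches [a-z], which this engine reads as a range of codes."""
    return re.fullmatch(r"[a-z]", ch) is not None


def lines_with_umlaut(lines):
    """Keep only the lines containing at least one explicitly listed umlaut."""
    pattern = re.compile("[" + UMLAUTS + "]")
    return [line for line in lines if pattern.search(line)]


print(in_plain_range("u"), in_plain_range("ü"))
print(lines_with_umlaut(["uber", "über", "Tür", "Tor"]))
```

Under these assumptions the range takes plain u and leaves ü out, so the umlauts have to be put in explicitly, which is one possible answer to the grep question and not a very satisfying one.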
For my own part, I believe that the discussion of how to encode stuff is premature. I think that it is more important to find out what `characters' the users want first. If a solution that cannot easily handle such features as `all European and Asian characters in different fonts and point sizes' is proposed, yet the users want exactly those features, then we have let them down. If they say that they have got used to working in English and don't want anything different, then there is no point in changing, though we know for a fact that the latter, English only, is already not an option.

The time may have come for a much more radical solution, with an abstract object-oriented view of character handling. I am personally convinced that it has, and have prepared a paper on the topic. It is a little large to post to the news net, but I will mail it to those who want to see it. (It uses pic for the diagrams; sorry).

Please, let's see some real debate on these topics. THEY MATTER. Internationalisation may be the next big hurdle for computers to overcome. Users want to use their own language and characters; the technical problems are fascinating and the market opportunities immense!

--
Mike Banahan, Technical Director, The Instruction Set Ltd.
mcvax!ukc!inset!mikeb