Xref: utzoo comp.emacs:4563 comp.lang.c:13817 comp.sys.ibm.pc:20856
Path: utzoo!utgpu!attcan!uunet!seismo!sundc!pitstop!sun!decwrl!labrea!rutgers!tut.cis.ohio-state.edu!cwjcc!hal!nic.MR.NET!umn-d-ub!umn-cs!bungia!jhereg!mark
From: mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn)
Newsgroups: comp.emacs,comp.lang.c,comp.sys.ibm.pc
Subject: Re: Programming and international character sets.
Message-ID: <208@jhereg.Jhereg.MN.ORG>
Date: 3 Nov 88 14:36:59 GMT
References: <532@krafla.rhi.hi.is> <8804@smoke.BRL.MIL> <207@jhereg.Jhereg.MN.ORG> <621@quintus.UUCP>
Reply-To: mark@jhereg.MN.ORG (Mark H. Colburn)
Organization: NAPS International, St. Paul, MN
Lines: 39

In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

There is more to it than just moving to 16-bit characters.  There are a
number of places where a character sequence needs to be recognized.  Often
that character sequence is in 8-bit or 7-bit ASCII.

The draft ANSI C and POSIX standards both have the notion of collation
sequences; that is, some idea of how to sort characters in different
locales.  The collation sequence can vary from locale to locale.  I would
encourage you to look in the draft C standard for more details.  Collation
sequences can be used for more than just internationalization, however.
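
As a rough sketch of what locale-dependent collation looks like from a C
program, the fragment below sorts an array of strings with strcoll(), which
compares according to the LC_COLLATE category of the current locale.  The
function names compare_by_locale and sort_in_locale are made up for this
example, and which locale names are accepted (other than "C") is up to the
implementation.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: compares two strings using the current locale's
   collation rules (LC_COLLATE) rather than raw byte values. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sort n strings under the named locale (hypothetical helper).
   Returns 0 on success, -1 if the locale is not available. */
int sort_in_locale(const char **names, size_t n, const char *locale)
{
    if (setlocale(LC_COLLATE, locale) == NULL)
        return -1;
    qsort(names, n, sizeof names[0], compare_by_locale);
    return 0;
}
```

In the "C" locale this orders strings the same way strcmp() does; under
another locale the same call may produce a different order without any
change to the program.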

Consider the phone book which all of us have sitting around.  In the US and
most other English speaking countries, the phone book has some rather odd
collation sequences in it.  Most notably, any names beginning with "Mc" or
"Mac" come before the other "M" names in the phone book.  It would be useful
for some applications to define a collation sequence which would provide
that particular behaviour.
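
One way such a phone-book ordering might be approximated by hand is shown
below.  This is only a sketch: phonebook_compare and mac_prefix_len are
names invented for the example, and the rule that a "Mc"/"Mac" prefix must
be followed by an uppercase letter is a heuristic of mine (so that "Macy"
is not treated as a "Mac" name), not anything taken from a standard.

```c
#include <ctype.h>
#include <string.h>

/* Length of a leading "Mc"/"Mac" prefix, or 0 if there is none.
   Heuristic: the prefix must be followed by an uppercase letter,
   so "MacDonald" matches and "Macy" does not. */
static size_t mac_prefix_len(const char *name)
{
    if (strncmp(name, "Mc", 2) == 0 && isupper((unsigned char)name[2]))
        return 2;
    if (strncmp(name, "Mac", 3) == 0 && isupper((unsigned char)name[3]))
        return 3;
    return 0;
}

/* strcmp-style comparison that places Mc/Mac names ahead of all
   other names beginning with 'M'. */
int phonebook_compare(const char *a, const char *b)
{
    size_t pa = mac_prefix_len(a);
    size_t pb = mac_prefix_len(b);

    /* Both are Mc/Mac names: order by what follows the prefix,
       falling back to the full name to break ties. */
    if (pa && pb) {
        int r = strcmp(a + pa, b + pb);
        return r ? r : strcmp(a, b);
    }
    /* Exactly one is a Mc/Mac name: it goes first only when the
       other name also starts with 'M'. */
    if (pa && b[0] == 'M')
        return -1;
    if (pb && a[0] == 'M')
        return 1;

    /* Otherwise plain byte order is good enough for this sketch. */
    return strcmp(a, b);
}
```

With this comparison "McDonald" sorts ahead of "Mabry", while "Adams" and
"Nelson" stay where ordinary byte order puts them.  The point is that the
two- and three-character prefixes act as single collating elements even
though every character involved is plain ASCII.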

Now then, "Mc" and "Mac" are not (and should not be) represented as 16-bit
characters.  Other examples include the German sharp-s character, which
could be represented as a unique character, but most Germans would still
type 'ss' rather than hunting for a new key.
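
To illustrate the sharp-s case, here is a sketch of a comparison that
treats the single sharp-s character and the two-character sequence 'ss' as
equal.  It assumes the text is in ISO 8859-1, where sharp s is the byte
0xDF; that encoding choice and the names SHARP_S, next_unit and
compare_expanding_sharp_s are assumptions made for the example.

```c
#include <stddef.h>

#define SHARP_S 0xDF  /* sharp s in ISO 8859-1 (assumed encoding) */

/* Return the next collation unit from *s, expanding a sharp s into
   two 's' units.  *pending carries the second 's' between calls. */
static int next_unit(const unsigned char **s, int *pending)
{
    int c;

    if (*pending) {
        c = *pending;
        *pending = 0;
        return c;
    }
    c = **s;
    if (c == 0)
        return 0;           /* end of string: do not advance */
    (*s)++;
    if (c == SHARP_S) {
        *pending = 's';
        return 's';
    }
    return c;
}

/* strcmp-style comparison in which sharp s collates exactly as "ss". */
int compare_expanding_sharp_s(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    int pend_a = 0, pend_b = 0;

    for (;;) {
        int ca = next_unit(&pa, &pend_a);
        int cb = next_unit(&pb, &pend_b);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}
```

Here one 8-bit character expands to two collating units, which is the
reverse of the "Mc"/"Mac" situation; neither case is helped by making the
characters wider.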

16-bit characters are good for some things, such as Kanji or other Asian
code sets, but may be less useful in a number of other areas.  Requiring
16-bit characters puts a large burden of unused memory on those applications
which only use 8-bit characters.  For that reason alone, ANSI would be
justified in not requiring 16-bit characters.  However, I don't believe that
there is anything in the standard which would preclude a conforming ANSI C
implementation from having 16-bit characters.
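
A small sketch of both points follows.  The helper names storage_bytes and
char_is_at_least_16_bits are hypothetical; CHAR_BIT is the <limits.h> macro
giving the number of bits in a char, for which the standard sets a minimum
of 8 but, as far as I can tell, no maximum.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Bytes needed to store text, including its terminator, when every
   character occupies unit_size bytes. */
size_t storage_bytes(const char *text, size_t unit_size)
{
    return (strlen(text) + 1) * unit_size;
}

/* Nonzero if this implementation's char is at least 16 bits wide. */
int char_is_at_least_16_bits(void)
{
    return CHAR_BIT >= 16;
}
```

For a five-letter word, storage_bytes() gives 6 with one-byte units and 12
with two-byte units, so text that only ever needs 8 bits per character
doubles in size.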

-- 
Mark H. Colburn                  "They didn't understand a different kind of 
NAPS International                smack was needed, than the back of a hand, 
mark@jhereg.mn.org                something else was always needed."