Xref: utzoo comp.emacs:4563 comp.lang.c:13817 comp.sys.ibm.pc:20856
Path: utzoo!utgpu!attcan!uunet!seismo!sundc!pitstop!sun!decwrl!labrea!rutgers!tut.cis.ohio-state.edu!cwjcc!hal!nic.MR.NET!umn-d-ub!umn-cs!bungia!jhereg!mark
From: mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn)
Newsgroups: comp.emacs,comp.lang.c,comp.sys.ibm.pc
Subject: Re: Programming and international character sets.
Message-ID: <208@jhereg.Jhereg.MN.ORG>
Date: 3 Nov 88 14:36:59 GMT
References: <532@krafla.rhi.hi.is> <8804@smoke.BRL.MIL> <207@jhereg.Jhereg.MN.ORG> <621@quintus.UUCP>
Reply-To: mark@jhereg.MN.ORG (Mark H. Colburn)
Organization: NAPS International, St. Paul, MN
Lines: 39

In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

There is more to it than just moving to 16-bit characters.  There are a
number of places where a character sequence needs to be recognized.
Often that character sequence is in 8-bit or 7-bit ASCII.

The ANSI and POSIX drafts both have the notion of collation sequences;
that is, some idea of how to sort characters in different locales.  The
collation sequence can vary from locale to locale.  I would encourage
you to look in the draft C standard for more details.

Collation sequences can be used for more than just
internationalization, however.  Consider the phone book which all of us
have sitting around.  In the US and most other English-speaking
countries, the phone book has some rather odd collation sequences in
it.  Most notably, any names beginning with "Mc" or "Mac" come before
the rest of the "M" entries in the phone book.  It would be useful for
some applications to define a collation sequence which would provide
that particular behaviour.

Now then, "Mc" and "Mac" are not (and should not be) represented as
16-bit characters.
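
To make the phone book case concrete, here is a rough sketch in C of a
comparison routine that puts "Mc" and "Mac" names ahead of the other
"M" names.  It is only an illustration under stated assumptions: the
names has_mc_prefix, phonebook_cmp and sort_phonebook are made up for
this example, the prefix test is deliberately naive (it would also
catch a name like "Mack"), and it is hand-rolled on top of strcmp and
qsort rather than going through the locale machinery (setlocale and
strcoll) that the draft C standard provides.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: nonzero if the name starts with "Mc" or "Mac".
 * Naive on purpose; a real phone book has more rules than this. */
static int has_mc_prefix(const char *name)
{
    return strncmp(name, "Mc", 2) == 0 || strncmp(name, "Mac", 3) == 0;
}

/* Hypothetical phone-book comparison.  Among names starting with 'M',
 * the "Mc"/"Mac" names sort first; everything else is plain strcmp. */
int phonebook_cmp(const char *a, const char *b)
{
    if (a[0] == 'M' && b[0] == 'M') {
        int pa = has_mc_prefix(a);
        int pb = has_mc_prefix(b);

        if (pa != pb)
            return pa ? -1 : 1;
    }
    return strcmp(a, b);
}

/* Adapter so that qsort can call phonebook_cmp on an array of strings. */
static int cmp_entries(const void *x, const void *y)
{
    const char *const *a = x;
    const char *const *b = y;

    return phonebook_cmp(*a, *b);
}

/* Sort an array of n names in place using the phone-book ordering. */
void sort_phonebook(const char **names, size_t n)
{
    assert(names != NULL);
    qsort(names, n, sizeof names[0], cmp_entries);
}
```

Given the names Miller, MacDonald, Adams, McBride and Mason,
sort_phonebook orders them as Adams, MacDonald, McBride, Mason, Miller,
whereas a plain strcmp sort would put Mason ahead of McBride.  Note
that every character involved is still an ordinary 8-bit char; only the
ordering rule has changed.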

Other examples include the German sharp-s character (ß), which could be
represented as a unique character, but most Germans would still type
'ss' rather than hunting for a new key.

16-bit characters are good for some things, such as Kanji or other
Asian code sets, but may be less useful in a number of other areas.

Requiring 16-bit characters puts a large burden of unused memory on
those applications which only use 8-bit characters.  For that reason
alone, ANSI would be justified in not requiring 16-bit characters.
However, I don't believe that there is anything in the standard which
would preclude a conforming ANSI C implementation from having 16-bit
characters.
-- 
Mark H. Colburn          "They didn't understand a different kind of
NAPS International        smack was needed, than the back of a hand,
mark@jhereg.mn.org        something else was always needed."