Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!mailrus!uflorida!haven!adm!smoke!gwyn
From: gwyn@smoke.BRL.MIL (Doug Gwyn )
Newsgroups: comp.lang.c
Subject: Re: Programming and international character sets.
Message-ID: <8822@smoke.BRL.MIL>
Date: 2 Nov 88 17:58:14 GMT
References: <532@krafla.rhi.hi.is> <8804@smoke.BRL.MIL> <207@jhereg.Jhereg.MN.ORG> <621@quintus.UUCP>
Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>)
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 22

In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

From time to time I remind people that "byte" does not imply 8 bits.
There is nothing in the proposed C standard that precludes an
implementation choosing to use 16 bits for its character types,
and/or providing "stub" functions for the locale and wide-character
stuff.  The main reason all the extra specification for multibyte
character sequences is present is that a majority of vendors had
already decided to take the multibyte approach, as opposed to the
much cleaner method of allocating sufficiently wide data to handle
all relevant code sets.  To accommodate existing approaches, it was
necessary to come up with adequate specifications, which has been done.
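
To illustrate the point that "byte" does not imply 8 bits, here is a
minimal sketch that asks the implementation how wide its byte is.
CHAR_BIT and UCHAR_MAX come from <limits.h>; the helper names
(bits_per_byte, bits_in, char_can_hold) are hypothetical and exist only
for this illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of bits in one C "byte" (i.e. one char) on this implementation.
   The standard requires at least 8; it does not require exactly 8. */
static int bits_per_byte(void)
{
    return CHAR_BIT;
}

/* Storage width, in bits, of an object that occupies nbytes C bytes. */
static size_t bits_in(size_t nbytes)
{
    /* Guard against overflow in the multiplication below. */
    assert(nbytes <= (size_t)-1 / CHAR_BIT);
    return nbytes * (size_t)CHAR_BIT;
}

/* Hypothetical check: can a single unsigned char hold every code value
   up to max_code?  True for a 16-bit code set only if chars are at
   least 16 bits wide. */
static int char_can_hold(unsigned long max_code)
{
    return max_code <= (unsigned long)UCHAR_MAX;
}
```

On an implementation with 16-bit chars, char_can_hold(0xFFFFUL) would be
true and no multibyte machinery would be needed for a 16-bit code set;
with 8-bit chars it is false.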

The main problem we face with 16-bit chars is that a majority of
X3J11 insisted that sizeof(char)==1, so the smallest C-addressable
unit (i.e. "byte") is necessarily the same size as char.  Thus, in
an implementation based on an 8-bit byte-addressable architecture,
if individual byte accessibility is desired in C, the implementation
must necessarily make chars 8 bits, and if large code sets are
necessary, then it HAS to use multibyte sequences for them.
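
As a rough sketch of what the multibyte route looks like to a program:
since sizeof(char) is 1, a string in a large code set is an array of
bytes in which one character may span several elements, and it has to
be stepped through with the library's multibyte functions.  The example
below uses the standard mblen() from <stdlib.h>; count_mb_chars is a
hypothetical name, and the result depends on the current locale's
encoding (in the default "C" locale, plain ASCII text is one byte per
character).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Count the characters (not bytes) in the multibyte string s, as
   interpreted in the current locale.  Returns -1 if an invalid or
   incomplete sequence is encountered. */
static long count_mb_chars(const char *s)
{
    size_t left = strlen(s);   /* bytes remaining, excluding the '\0' */
    long n = 0;

    assert(sizeof(char) == 1); /* always true: char is the C byte */

    (void)mblen(NULL, 0);      /* reset any shift state */
    while (left > 0) {
        int len = mblen(s, left);  /* bytes used by the next character */
        if (len <= 0)
            return -1;
        s += len;
        left -= (size_t)len;
        n++;
    }
    return n;
}
```

With 8-bit chars and a multibyte encoding selected via setlocale(), the
count returned can be smaller than strlen(s); with chars wide enough to
hold every code value, the two would always agree.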