Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!mailrus!uflorida!haven!adm!smoke!gwyn
From: gwyn@smoke.BRL.MIL (Doug Gwyn )
Newsgroups: comp.lang.c
Subject: Re: Programming and international character sets.
Message-ID: <8822@smoke.BRL.MIL>
Date: 2 Nov 88 17:58:14 GMT
References: <532@krafla.rhi.hi.is> <8804@smoke.BRL.MIL> <207@jhereg.Jhereg.MN.ORG> <621@quintus.UUCP>
Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) )
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 22

In article <621@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>The kludges being proposed for C & UNIX just so that a sequence of
>"international" characters can be accessed as bytes rather than pay
>the penalty of switching over to 16 bits are unbelievable.

From time to time I remind people that "byte" does not imply 8 bits.

There is nothing in the proposed C standard that precludes an
implementation choosing to use 16 bits for its character types,
and/or providing "stub" functions for the locale and wide-character
stuff. The main reason all the extra specification for multibyte
character sequences is present is that a majority of vendors already
had decided to take such an approach as opposed to the much cleaner
method of allocating sufficiently wide data to handle all relevant
code sets. To accommodate existing approaches, it was necessary to
come up with adequate specifications, which has been done.

The main problem we face with 16-bit chars is that a majority of
X3J11 insisted that sizeof(char)==1, so the smallest C-addressable
unit (i.e. "byte") is necessarily the same size as char. Thus, in
an implementation based on an 8-bit byte-addressable architecture,
if individual byte accessibility is desired in C, the implementation
must necessarily make chars 8 bits, and if large code sets are
necessary, then it HAS to use multibyte sequences for them.
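
To make the sizeof(char)==1 point concrete, here is a minimal sketch in C. It assumes a hosted implementation providing <limits.h> and the mbstowcs() function from the standard's multibyte support. The helper names (check_char_model, bits_per_byte, count_chars) and the MAX_WIDE buffer size are made up for illustration and do not come from the standard or from the discussion above.

```c
#include <assert.h>   /* assert */
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */
#include <stdlib.h>   /* mbstowcs */

#define MAX_WIDE 64   /* illustrative buffer size, chosen arbitrarily */

/* Properties that hold on every conforming implementation:
   a char is exactly one "byte", and a byte has at least 8 bits. */
void check_char_model(void)
{
    assert(sizeof(char) == 1);
    assert(CHAR_BIT >= 8);
}

/* Number of bits in one C byte.  This is 8 on most machines, but an
   implementation is free to choose 16 (or more) for its char types. */
int bits_per_byte(void)
{
    return CHAR_BIT;
}

/* Count the characters in a multibyte string under the current locale.
   Counts at most MAX_WIDE characters; returns (size_t)-1 if the string
   contains an invalid multibyte sequence. */
size_t count_chars(const char *mbs)
{
    wchar_t wide[MAX_WIDE];

    return mbstowcs(wide, mbs, MAX_WIDE);
}
```

Called from a main() of your own, bits_per_byte() reports how wide the implementation chose to make char, and count_chars() shows the multibyte route: on an implementation with 8-bit chars and a large code set, the number of characters it returns can be smaller than what strlen() reports for the same string, since strlen() counts bytes.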