Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!husc6!cmcl2!brl-adm!brl-smoke!gwyn
From: gwyn@brl-smoke.ARPA (Doug Gwyn )
Newsgroups: comp.lang.c
Subject: Re: What is a byte
Message-ID: <6216@brl-smoke.ARPA>
Date: Sat, 1-Aug-87 23:29:39 EDT
Article-I.D.: brl-smok.6216
Posted: Sat Aug  1 23:29:39 1987
Date-Received: Sun, 2-Aug-87 10:57:43 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP>
Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>)
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 41

In article <851@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>I haven't given this issue a whole lot of thought, but it seems to me that
>"short char" should be the smallest object which is addressable in C, and
>should define the units of sizeof; "long char" should denote whatever is
>necessary to represent the native character set.  On a bit-addressable machine
>in an Arabic- or Japanese-language environment, one might have "short char" be
>1 bit, "char" be 8, and "long char" be 16.

That is a bit more generous than my proposal, but it follows the same line
of thought.  I would prefer that a (char) be capable of holding an entire
basic textual unit, since many applications are already based on that
assumption.  A separate (long char) would necessitate a whole extra
collection of str*()-like library routines, which the portable programmer
would have to be careful to use instead of the str*() functions; might as
well simply make (char) be the right thing and not introduce a new type.
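
To make the cost concrete, here is a minimal sketch of what two such
parallel routines might look like.  Everything named here is hypothetical:
long_char is only a typedef standing in for the proposed (long char), and
lstrlen() and lstrchr() are invented names, not part of any library.

```c
#include <stddef.h>

/* Hypothetical stand-in for the proposed (long char); not a real C type. */
typedef unsigned short long_char;

/* Hypothetical analogue of strlen() for long_char strings:
   counts elements up to, but not including, the terminating zero. */
size_t lstrlen(const long_char *s)
{
    const long_char *p = s;

    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* Hypothetical analogue of strchr(): returns a pointer to the first
   occurrence of c (the terminator counts as a match), or NULL. */
long_char *lstrchr(const long_char *s, long_char c)
{
    for (;; s++) {
        if (*s == c)
            return (long_char *)s;
        if (*s == 0)
            return NULL;
    }
}
```

Each routine is a near copy of its str*() counterpart with only the element
type changed, which is the duplication a wide enough (char) would avoid.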

Using all three possible char lengths would not pose a serious problem
if the str*() functions were changed to require (long char) and if
implementations made (char) and (long char) the same size, at least for
now.  There aren't many bit-addressable architectures at present (more's
the pity), so most international implementations could make (short char)
8 bits and (char) or (long char) 16 bits.
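
As a rough illustration of that 8/16 arrangement, assuming typedefs in place
of the proposed types (short_char and long_char are hypothetical names, and
C as it stands cannot express a 1-bit addressable object, so sizeof here
still counts in units of unsigned char):

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins; neither name is a real C type. */
typedef unsigned char  short_char;  /* smallest addressable unit */
typedef unsigned short long_char;   /* room for a 16-bit character set */

/* Width of each type in bits, derived from sizeof and CHAR_BIT. */
size_t short_char_bits(void) { return sizeof(short_char) * CHAR_BIT; }
size_t long_char_bits(void)  { return sizeof(long_char) * CHAR_BIT; }
```

On a typical byte-addressed machine these would report 8 and 16, but only
the lower bounds (at least 8 and at least 16) are guaranteed.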

>If this is to be phased in without breaking a lot of programs, X3J11 should
>immediately bless all three names, but insist that they all be the same size.
>(Which restriction should be deprecated, to be removed in the next standard.)

I don't think it's within the realm of practical politics to say that the
problem will not be solved until the next issue of the standard.  It would
be better if it could be solved now without too much breakage of existing
non-internationalized code.  (Internationalized code is already vendor-
specific, due to lack of agreement on a universal approach.  Any good
solution will require at least some vendors to eventually change.)

On a related note, I see that the /usr/group people are trying to change
regular expressions from character-code based to language alphabet based
(as though there were always a universal collating order for a given
language!).  This is a most unfortunate direction, since it ruins simple,
well-behaved algorithms written in terms of (sufficiently wide) (char)s.
I wish they would not plow ahead with this until the multi-byte character
issue is resolved, since that may well affect the practical possibilities.
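
The contrast can be sketched as follows.  This is only an illustration
under assumptions: wide_char is a hypothetical typedef for a sufficiently
wide (char), both function names are invented, and strcoll() is used merely
as one example of a locale-dependent comparison, not as what /usr/group
actually proposes.

```c
#include <string.h>

/* Hypothetical stand-in for a "sufficiently wide" (char). */
typedef unsigned short wide_char;

/* Code-based range test, as in [lo-hi]: true if lo <= c <= hi
   when the three values are compared as numeric character codes. */
int in_code_range(wide_char c, wide_char lo, wide_char hi)
{
    return lo <= c && c <= hi;
}

/* Collation-based range test on one-character strings.  The answer
   depends on whatever collating order the current locale supplies. */
int in_collating_range(const char *c, const char *lo, const char *hi)
{
    return strcoll(lo, c) <= 0 && strcoll(c, hi) <= 0;
}
```

The first test is a pair of integer comparisons whose result depends only
on the character codes; the second can give different answers for the same
characters depending on the locale in effect.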