Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!rutgers!seismo!brl-adm!brl-smoke!gwyn
From: gwyn@brl-smoke.ARPA (Doug Gwyn )
Newsgroups: net.lang.c
Subject: Re: sizeof(char)
Message-ID: <5251@brl-smoke.ARPA>
Date: Thu, 6-Nov-86 21:18:34 EST
Article-I.D.: brl-smok.5251
Posted: Thu Nov  6 21:18:34 1986
Date-Received: Fri, 7-Nov-86 23:39:30 EST
References: <4617@brl-smoke.ARPA> <657@dg_rtp.UUCP> <55@cartan.Berkeley.EDU> <663@dg_rtp.UUCP> <5141@brl-smoke.ARPA> <8907@sun.uucp>
Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>)
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 93

Guy missed the meaning of my reference to bitmap display programming.
What I really care about in this context is support for direct bit
addressing.  I know for a fact that one reason we don't HAVE this on
some current architectures is the lack of access to the facility from
high-level languages.  I would like it to be POSSIBLE for some designer
of an architecture likely to be used for bit-mapped systems to decide
to make bits directly addressable.  I know I have often wished that I
had bit arrays in C when programming bitmap display applications.
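
For illustration only, this is roughly what one ends up writing today in
place of a real bit array.  The names BIT_BYTES, bit_set, bit_clear, and
bit_test are invented for this sketch and are not part of any proposal;
the sketch assumes only that <limits.h> supplies CHAR_BIT.

```c
#include <assert.h>
#include <limits.h>	/* CHAR_BIT: number of bits in a (char) here */
#include <stddef.h>

/* Hypothetical helpers, invented for this sketch: emulate a bit array
   on top of (unsigned char) storage using shifts and masks. */

/* number of (unsigned char)s needed to hold nbits bits */
#define BIT_BYTES(nbits)	(((nbits) + CHAR_BIT - 1) / CHAR_BIT)

void bit_set( unsigned char *map, size_t nbits, size_t i )
{
	assert( i < nbits );	/* the hardware cannot check this for us */
	map[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
}

void bit_clear( unsigned char *map, size_t nbits, size_t i )
{
	assert( i < nbits );
	map[i / CHAR_BIT] &= (unsigned char)~(1u << (i % CHAR_BIT));
}

int bit_test( const unsigned char *map, size_t nbits, size_t i )
{
	assert( i < nbits );
	return (map[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
}
```

Every access costs a divide, a remainder, and a shift or mask written
out by hand, which is exactly the bookkeeping that directly addressable
bits would make unnecessary.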

The 8-bit byte was an arbitrary packaging decision (made by IBM for
the System/360 family, by DEC for the PDP-11, and by some others, but
definitely not by EVERY vendor).  There are already some 9-, 10-, and
12-bit oriented C implementations; I would like to give implementors
the OPTION of choosing to use 16-bit (char)s even if their machine can
address individual 8-bit bytes or even individual bits.

The idea of a "character" is that of an individually manipulable
primitive unit of text.  The idea of "byte" is that of an individually
addressable unit of storage.  From one point of view, it doesn't matter
what the two basic types would be called if and when this distinction is
made in the C language.  However, in X3J11 practically everything that
now refers to (char) arrays is designed principally for text applications,
while practically everything that refers to arbitrary storage uses
(void *), not (char *).  (The one exception is strcoll(), which
specifically produces a (char[]) result; Prosser and I discussed this
and agreed that this was acceptable for its intended use.  In a good
implementation using my (char)/(short char) distinction, it would be
POSSIBLE to maintain a reasonable default collating sequence for (char)s
so that a kludge like strcoll() would not normally be necessary.)
Using (long char) for genuine text characters would conflict with
existing definitions for text-oriented functions, which is the main
reason I decided that (char) is STILL the proper type for text units.

I realize that many major vendors in the international UNIX market
have already adopted "solutions" to the problem of "international
character sets"; however, each has taken a different approach!  There
is nothing in my proposal to preclude an implementor from continuing
to force sizeof(char)==sizeof(short char) and preserving his previous
vendor-specific "solution"; however, what I proposed ALLOWS an
implementor to choose a much cleaner solution if he so desires,
without forcing him to if he prefers other methods, and it also allows
nybble- or bit-addressable architectures to be nicely supported at the
C language level.  The trade-off is between more compact storage (as
in AT&T's approach) requiring kludgery to handle individual textual
units, versus a clean, simple model of characters and storage cells
that supports uncomplicated, straightforward programming.

It happens that the text/binary stream distinction of X3J11 fits the
corresponding character/byte distinction very nicely.  The only wart
is for systems like UNIX that allow mixing of text-stream operations,
such as scanf(), with binary-stream operations, such as fread(); there
is a potential alignment problem in doing this.  (By the way, I also
propose new functions [f]getsc()/[f]putsc() for getting/putting single
(short char)s; this is necessary for the semantic definition of
fread()/fwrite() on binary streams.  In my original proposal these
were called [f]getbyte()/[f]putbyte(), but the new names are better.)
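
As a rough sketch of why a single-cell primitive is needed, here is how
fread() on a binary stream might be described in terms of one.  This is
an assumption-laden illustration, not text from the proposal: no current
compiler accepts (short char), so a typedef of (unsigned char) stands in
for it, and the names short_char, fgetsc_sketch, and fread_sketch are
invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in only: (short char) does not exist in any current compiler,
   so (unsigned char) plays the role of the storage cell. */
typedef unsigned char short_char;

/* Hypothetical fgetsc(): fetch one storage cell from a binary stream.
   Here it is simply layered on getc(). */
int fgetsc_sketch( FILE *fp )
{
	return getc( fp );
}

/* fread() expressed as repeated single-cell reads;
   returns the number of complete elements read. */
size_t fread_sketch( void *ptr, size_t size, size_t nelem, FILE *fp )
{
	short_char *p = (short_char *)ptr;
	size_t done, i;
	int c;

	assert( ptr != NULL && fp != NULL );
	if ( size == 0 )
		return 0;
	for ( done = 0; done < nelem; ++done )
		for ( i = 0; i < size; ++i )
		{
			if ( (c = fgetsc_sketch( fp )) == EOF )
				return done;	/* partial element not counted */
			*p++ = (short_char)c;
		}
	return done;
}
```

The size argument here counts storage cells, which is what sizeof
reports, so a call such as fread_sketch( bar, sizeof bar[0], SIZE, fp )
means the same thing whatever size (char) turns out to have.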

ANY C implementation that makes a real distinction between characters
and bytes is going to cause problems for people porting their code
to it.  The choices are, first, whether to ever make such a distinction,
and second, if so, how to do so.  I believe the distinction is
important, and much prefer a clean solution over one that requires
programmers to convert text data arrays back and forth, or to keep
track of two sets of otherwise identical library functions.  As with
function prototypes, a transition period can exist during which (char)
and (short char) have the same size, which is no worse than the current
situation, and implementors could choose when, if ever, to split these
types apart.

Please note that there is not much impact of my proposal on current
good C coding practice; for example, the following continue to work
no matter what choices the C implementor has made:

	struct foo bar[SIZE], barcpy;
	unsigned nelements = sizeof bar / sizeof bar[0];
	fread( bar, sizeof(struct foo), SIZE, fp );
	fread( bar, sizeof bar, 1, fp );
	memcpy( &barcpy, &bar[3], sizeof(struct foo) );
	/* the above requires casting anyway if prototype not in scope */

	char str[] = "text";
	printf( "\"%s\" contains %d characters\n", str, (int)strlen( str ) );

While it is POSSIBLE to run into problems, such as in using the
result of strlen() as the length of a memcpy() operation, these
don't arise so often that it is hopeless to make the transition.
One thing is for sure: if we don't make the character/byte distinction
POSSIBLE in the formal ANSI C standard, it will be too late to do
it later.  The absolute minimum necessary is to remove the
requirement that sizeof(char)==1 from the standard, although this
opens up a hole in the spec that needs plugging by a proposal like
mine (X3J11/86-136, revised to fit the latest draft proposed standard
and to change the names of the primitive byte get/put functions).
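
To make the strlen()/memcpy() hazard mentioned above concrete, here is a
minimal sketch; copy_text() is a name invented for this example.
strlen() counts (char)s while memcpy() counts storage units, so the
count is scaled by sizeof(char) before it is used as a size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* copy_text() is hypothetical, invented for this sketch.
   dstlen is the capacity of dst measured in (char)s, not in bytes. */
char *copy_text( char *dst, size_t dstlen, const char *src )
{
	size_t nchars = strlen( src ) + 1;	/* text units, incl. the '\0' */

	assert( nchars <= dstlen );
	/* scale the character count to a storage size for memcpy() */
	memcpy( dst, src, nchars * sizeof(char) );
	return dst;
}
```

The multiplication costs nothing on an implementation where
sizeof(char) is 1, and the same source stays correct on one where it is
not.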