Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!rutgers!seismo!brl-adm!brl-smoke!gwyn
From: gwyn@brl-smoke.ARPA (Doug Gwyn )
Newsgroups: net.lang.c
Subject: Re: sizeof(char)
Message-ID: <5251@brl-smoke.ARPA>
Date: Thu, 6-Nov-86 21:18:34 EST
Article-I.D.: brl-smok.5251
Posted: Thu Nov 6 21:18:34 1986
Date-Received: Fri, 7-Nov-86 23:39:30 EST
References: <4617@brl-smoke.ARPA> <657@dg_rtp.UUCP> <55@cartan.Berkeley.EDU> <663@dg_rtp.UUCP> <5141@brl-smoke.ARPA> <8907@sun.uucp>
Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) )
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 93

Guy missed the meaning of my reference to bitmap display programming. What I really care about in this context is support for direct bit addressing. I know for a fact that one reason we don't HAVE this on some current architectures is the lack of access to the facility from high-level languages. I would like it to be POSSIBLE for some designer of an architecture likely to be used for bit-mapped systems to decide to make bits directly addressable. I know I have often wished that I had bit arrays in C when programming bitmap display applications.

The 8-bit byte was an arbitrary packaging decision (made by IBM for the System/360 family, by DEC for the PDP-11, and by some others, but definitely not by EVERY vendor). There are already some 9-, 10-, and 12-bit oriented C implementations; I would like to give implementors the OPTION of choosing to use 16-bit (char)s even if their machine can address individual 8-bit bytes or even individual bits.

The idea of a "character" is that of an individually manipulable primitive unit of text. The idea of "byte" is that of an individually addressable unit of storage. From one point of view, it doesn't matter what the two basic types would be called if and when this distinction is made in the C language.
However, in X3J11 practically everything that now refers to (char) arrays is designed principally for text application, while practically everything that refers to arbitrary storage uses (void *), not (char *). (The one exception is strcoll(), which specifically produces a (char[]) result; Prosser and I discussed this and agreed that this was acceptable for its intended use. In a good implementation using my (char)/(short char) distinction, it would be POSSIBLE to maintain a reasonable default collating sequence for (char)s so that a kludge like strcoll() would not normally be necessary.) Using (long char) for genuine text characters would conflict with existing definitions for text-oriented functions, which is the main reason I decided that (char) is STILL the proper type for text units.

I realize that many major vendors in the international UNIX market have already adopted "solutions" to the problem of "international character sets"; however, each has taken a different approach! There is nothing in my proposal to preclude an implementor from continuing to force sizeof(char)==sizeof(short char) and preserving his previous vendor-specific "solution"; however, what I proposed ALLOWS an implementor to choose a much cleaner solution if he so desires, without forcing him to if he prefers other methods, and it also allows nybble- or bit-addressable architectures to be nicely supported at the C language level. The trade-off is between more compact storage (as in AT&T's approach) requiring kludgery to handle individual textual units, versus a clean, simple model of characters and storage cells that supports uncomplicated, straightforward programming.

It happens that the text/binary stream distinction of X3J11 fits the corresponding character/byte distinction very nicely. The only wart is for systems like UNIX that allow mixing of text-stream operations, such as scanf(), with binary-stream operations, such as fread(); there is a potential alignment problem in doing this.
(By the way, I also propose new functions [f]getsc()/[f]putsc() for getting/putting single (short char)s; this is necessary for the semantic definition of fread()/fwrite() on binary streams. In my original proposal these were called [f]getbyte()/[f]putbyte(), but the new names are better.)

ANY C implementation that makes a real distinction between characters and bytes is going to cause problems for people porting their code to it. The choices are, first, whether to ever make such a distinction, and second, if so, how to do so. I believe the distinction is important, and much prefer a clean solution over one that requires programmers to convert text data arrays back and forth, or to keep track of two sets of otherwise identical library functions. As with function prototypes, a transition period can exist during which (char) and (short char) have the same size, which is no worse than the current situation, and implementors could choose when if ever to split these types apart.

Please note that there is not much impact of my proposal on current good C coding practice; for example, the following continue to work no matter what choices the C implementor has made:

    struct foo bar[SIZE], barcpy;
    unsigned nelements = sizeof bar / sizeof bar[0];
    fread( bar, sizeof(struct foo), SIZE, fp );
    fread( bar, sizeof bar, 1, fp );
    memcpy( &barcpy, &bar[3], sizeof(struct foo) );
    /* the above requires casting anyway if prototype not in scope */
    char str[] = "text";
    printf( "\"%s\" contains %d characters\n", str, strlen( str ) );

While it is POSSIBLE to run into problems, such as in using the result of strlen() as the length of a memcpy() operation, these don't arise so often that it is hopeless to make the transition.

One thing for sure, if we don't make the character/byte distinction POSSIBLE in the formal ANSI C standard, it will be too late to do it later.
The absolute minimum necessary is to remove the requirement that sizeof(char)==1 from the standard, although this opens up a hole in the spec that needs plugging by a proposal like mine (X3J11/86-136, revised to fit the latest draft proposed standard and to change the names of the primitive byte get/put functions).
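
As a rough illustration of the strlen()/memcpy() pitfall mentioned above, the sketch below simulates an implementation in which a text unit is larger than a storage unit. Everything named here is hypothetical and exists only for the illustration: text_unit stands in for a wide (char), and text_len() and text_copy() stand in for strlen() and a string copy. None of it is the (char)/(short char) syntax of X3J11/86-136, and it assumes sizeof(unsigned short) is greater than 1, which is typical but not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* HYPOTHETICAL: a text unit wider than one storage unit.  This typedef
 * only simulates an implementation where sizeof(char) != 1. */
typedef unsigned short text_unit;

/* HYPOTHETICAL analogue of strlen(): counts text units up to the
 * terminating zero unit, NOT storage units. */
size_t text_len(const text_unit *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

/* HYPOTHETICAL helper: copy a terminated text string into dst, which
 * has room for dst_units text units.  memcpy() takes a length in
 * storage units, so the count of text units must be scaled by
 * sizeof(text_unit) before it is used as a copy length. */
void text_copy(text_unit *dst, size_t dst_units, const text_unit *src)
{
    size_t n = text_len(src) + 1;   /* include the terminator */

    assert(n <= dst_units);         /* caller must supply enough room */
    memcpy(dst, src, n * sizeof(text_unit));
}
```

Passing text_len(src) by itself as the memcpy() length would copy only part of the string wherever sizeof(text_unit) exceeds 1; code that already writes its lengths in terms of sizeof, as in the struct foo fragment, is unaffected.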