Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!cmcl2!husc6!mit-eddie!ll-xn!nike!lll-crg!lll-lcc!pyramid!oliveb!sun!guy
From: guy@sun.uucp (Guy Harris)
Newsgroups: net.lang.c
Subject: Re: sizeof(char)
Message-ID: <9053@sun.uucp>
Date: Fri, 7-Nov-86 19:21:56 EST
Article-I.D.: sun.9053
Posted: Fri Nov  7 19:21:56 1986
Date-Received: Sat, 8-Nov-86 18:39:21 EST
References: <4617@brl-smoke.ARPA> <657@dg_rtp.UUCP>
Organization: Sun Microsystems, Inc.
Lines: 121

> Guy missed the meaning of my reference to bitmap display programming.
> What I really care about in this context is support for direct bit
> addressing.

I am not at all convinced that anybody *should* care about this, at least
from the standpoint of bitmap display programming.  If a vendor permits you
to bang bits on a display, they should provide you with routines to do this;
frame buffers are not all the same, and code that works well on one display
may not work well at all on another.  Furthermore, some hardware may do some
bit-banging operations for you; if you approach the display at the right
level of abstraction, this can be done transparently, but not if you just
write into a bit array.

Furthermore, it's not clear that displays should be programmed at the
bit-array level anyway; James Gosling and David Rosenthal have made what I
consider a very good case against doing this (and no, I don't consider it a
good case just because I work at Sun and we're trying to push NeWS).

> I know for a fact that one reason we don't HAVE this on some current
> architectures is the lack of access to the facility from
> high-level languages.

If that is the case, then the architect made a mistake.  If it's really
important, they can extend the language.  Yes, this means a non-standard
extension; however, the only way to get it to be a standard extension is to
get *every* vendor to adopt it, regardless of whether they support bit
addressing or not.  In the case of C, this means longer-than-32-bit "void *"
on lots of *existing* machines; I don't think the chances of this happening
are very good at all.
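
A minimal sketch of what a vendor-supplied routine for bit addressing might
look like, assuming a hypothetical "struct bit_addr" and helpers named
bit_at(), bit_store(), and bit_load() (none of these come from any real
library).  The extra "bit" member is the information that a byte pointer
alone does not carry, which is why a standard bit-addressing "void *" would
have to be wider than a byte address:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical bit address: a byte pointer plus a bit index in that byte. */
struct bit_addr {
    unsigned char *byte;
    unsigned bit;               /* 0 .. CHAR_BIT-1 */
};

/* Form the address of bit number n, counting from the start of base. */
struct bit_addr bit_at(unsigned char *base, size_t n)
{
    struct bit_addr a;
    a.byte = base + n / CHAR_BIT;
    a.bit = (unsigned)(n % CHAR_BIT);
    return a;
}

/* Set the addressed bit if value is nonzero, clear it otherwise. */
void bit_store(struct bit_addr a, int value)
{
    unsigned char mask = (unsigned char)(1u << a.bit);
    if (value)
        *a.byte |= mask;
    else
        *a.byte &= (unsigned char)~mask;
}

/* Return the addressed bit as 0 or 1. */
int bit_load(struct bit_addr a)
{
    return (*a.byte >> a.bit) & 1;
}
```

A caller would write, say, bit_store(bit_at(buf, 10), 1); the routines, not
the language, know how bits are laid out.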

> I would like it to be POSSIBLE for some designer of an architecture
> likely to be used for bit-mapped systems to decide to make bits directly
> addressable.

It is ALREADY possible to do this.  The architect merely has to avoid
thinking "if I can't get at this feature from unextended ANSI C, I shouldn't
put it in."  The chances are very slim indeed that there will be a standard
way to do bit addressing in ANSI C, since this would require ANSI C to
mandate that all implementations support it, and would require ANSI C to be
rather more different from current C implementations than most vendors would
like.

> The idea of a "character" is that of an individually manipulable
> primitive unit of text.

As I've already pointed out, it is quite possible that there may be more
than one such notion on a system.

> However, in X3J11 practically everything that now refers to (char)
> arrays is designed principally for text application, while practically
> everything that refers to arbitrary storage uses (void *), not (char *).

However, you're now introducing a *third* type; when you are dealing with
arbitrary storage, sometimes you use "void *" as a pointer to arbitrary
storage and sometimes you use "short char" as an element of arbitrary
storage.

> In a good implementation using my (char)/(short char) distinction, it
> would be POSSIBLE to maintain a reasonable default collating sequence
> for (char)s so that a kludge like strcoll() would not normally be
> necessary.)

This is simply not true, unless the "normally" here is being used as an
escape clause to dismiss many natural languages as abnormal.  Some languages
do *not* sort words with a character-by-character comparison (e.g., German).
One *might* give ligatures like the sharp s ("SS") "char" codes of their own -
but you'd have to deal with existing documents with two "S"es in them, and you'd
either have to convert them "on the fly" in standard I/O (in which case
you'd have to have standard I/O know what language the file was in) or
convert them *en bloc* when you brought the document over from a system with
8-bit "char"s.  (Oh, yes, you'd still have to have standard I/O handle 8-bit
and 16-bit "char"s, and conversion between them, unless you propose to make
this new whizzy machine require text file conversion when you bring files
from or send files to machines with boring obsolete old 8-bit "char"s.)

Furthermore, I don't know how you sort words in Oriental languages, although
I remember people saying there *is* no unique way of sorting them.
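
The German case can be sketched as follows.  This is only an illustration
under stated assumptions: the text is taken to be ISO 8859-1, where the
sharp s is code 0xDF, and german_compare() and next_unit() are hypothetical
names.  Only the one rule mentioned above (sharp s collates as two "s"es) is
handled; a real collation routine would have many more rules.

```c
#include <string.h>

#define SHARP_S 0xDF    /* assumed ISO 8859-1 code for the sharp s */

/* Fetch the next collating unit; a sharp s yields 's' twice. */
static int next_unit(const unsigned char **p, int *pending)
{
    if (*pending) {
        *pending = 0;
        return 's';
    }
    if (**p == '\0')
        return 0;
    if (**p == SHARP_S) {
        (*p)++;
        *pending = 1;
        return 's';
    }
    return *(*p)++;
}

/* Compare two strings, treating a sharp s as "ss".
   Returns -1, 0, or 1, in the manner of strcmp(). */
int german_compare(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    int pend_a = 0, pend_b = 0;

    for (;;) {
        int ca = next_unit(&pa, &pend_a);
        int cb = next_unit(&pb, &pend_b);
        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}
```

With this, german_compare("Ma\xDF" "e", "Mast") is negative, while strcmp()
on the same arguments is positive, because 0xDF compares greater than any
ASCII letter.  (The literal is split so that the "e" is not read as part of
the hex escape.)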

> Using (long char) for genuine text characters would conflict with
> existing definitions for text-oriented functions, which is the main
> reason I decided that (char) is STILL the proper type for text units.

If you're going to internationalize an existing program, changing it to use
"lstrcpy" instead of "strcpy" is the least of your worries.  I see no
problem whatsoever with having the existing text-oriented functions handle
8-bit "char"s.  Furthermore, since not every implementation that supports
large character sets is going to adopt 16-bit "char"s, you're going to need
two sets of text-oriented functions in the specification anyway.
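
To show how small the second set of routines can be, here is a sketch of an
"lstrcpy".  The typedef long_char is a hypothetical stand-in for "long
char" (assumed here to be an unsigned short), and the function body is just
the usual strcpy() loop with the element type changed:

```c
#include <string.h>

typedef unsigned short long_char;   /* hypothetical stand-in for "long char" */

/* Hypothetical "lstrcpy": copy a zero-terminated long_char string,
   terminator included, and return the destination. */
long_char *lstrcpy(long_char *dst, const long_char *src)
{
    long_char *d = dst;

    while ((*d++ = *src++) != 0)
        ;
    return dst;
}
```

Existing code that calls strcpy() on 8-bit "char"s is untouched; only code
that handles the large character set calls the new routine.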

> The trade-off is between more compact storage (as in AT&T's approach)
> requiring kludgery to handle individual textual units, versus a clean,
> simple model of characters and storage cells that supports uncomplicated,
> straightforward programming.

What is this "kludgery"?  You need two classes of string manipulation
routines.  Big Deal.  You need to convert some encoded representation in a
file to a 16-bit-character representation when you read the file, and
convert it back when you write it back.  Big Deal.  This would presumably be
handled by library routines.  If you're going to read existing text files
without requiring them to be blessed by a conversion utility, you'll have
to do that in your scheme as well.  You need to remember to properly declare
"char" and "long char" variables, and arrays and pointers to same.  Big Deal.

I am not convinced that the "char"/"long char" scheme is significantly less
"clean", "simple", "uncomplicated", or "straightforward" than the "short
char"/"char" scheme.
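
The conversion step mentioned above might look like the following sketch.
The file encoding here is made up purely for illustration (bytes below 0x80
stand for themselves; a byte of 0x80 or more is the high half of a two-byte
character), and long_char_t and decode_text() are hypothetical names:

```c
#include <stddef.h>

typedef unsigned short long_char_t; /* hypothetical stand-in for "long char" */

/* Decode srclen bytes of the made-up encoding into dst, which has room for
   dstlen elements including the terminating zero.  Returns the number of
   characters stored, not counting the terminator. */
size_t decode_text(long_char_t *dst, size_t dstlen,
                   const unsigned char *src, size_t srclen)
{
    size_t i = 0, n = 0;

    if (dstlen == 0)
        return 0;
    while (i < srclen && n + 1 < dstlen) {
        if (src[i] < 0x80) {
            dst[n++] = src[i++];            /* one-byte character */
        } else if (i + 1 < srclen) {
            dst[n++] = (long_char_t)((src[i] << 8) | src[i + 1]);
            i += 2;                         /* two-byte character */
        } else {
            break;                          /* truncated pair: stop here */
        }
    }
    dst[n] = 0;
    return n;
}
```

A matching encoder would be called when writing the file back; the program
itself only ever sees the 16-bit representation.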

> While it is POSSIBLE to run into problems, such as in using the
> result of strlen() as the length of a memcpy() operation, these
> don't arise so often that it is hopeless to make the transition.

Sigh.  No, it isn't necessarily HOPELESS; however, you have not provided ANY
evidence that the various problems caused by changing the meaning of "char"
would be preferable to any disruption to the "clean" models caused by adding
"long char".  (Frankly, I'd rather keep track of two types of string copy
routines and character types than keep track of all the *existing* code that
would have to have "char"s changed to "short char".)
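
The strlen()/memcpy() problem from the quoted text can be sketched like
this.  text_char is a hypothetical stand-in for a "char" that is wider than
one storage cell (assumed here to be an unsigned short), and text_strlen()
and text_copy() are illustrative names.  A length counted in characters has
to be scaled before it is handed to memcpy(), which counts storage cells:

```c
#include <stddef.h>
#include <string.h>

typedef unsigned short text_char;   /* hypothetical wide "char" */

/* Count characters up to the terminating zero, as strlen() would. */
size_t text_strlen(const text_char *s)
{
    size_t n = 0;

    while (s[n] != 0)
        n++;
    return n;
}

/* Copy a string, terminator included.  Passing the character count
   (n) straight to memcpy() would copy too few storage cells whenever
   sizeof(text_char) is greater than 1, so the count is scaled. */
text_char *text_copy(text_char *dst, const text_char *src)
{
    size_t n = text_strlen(src) + 1;

    memcpy(dst, src, n * sizeof *src);
    return dst;
}
```

Existing code that writes memcpy(dst, src, strlen(src) + 1) has no such
scaling, which is the kind of code that would have to be found and changed.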
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)