Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!columbia!rutgers!ll-xn!adelie!axiom!linus!philabs!mcnc!duke!jwg
From: jwg@duke.UUCP (Jeffrey William Gillette)
Newsgroups: net.lang.c,net.micro.pc
Subject: Signed Chars - What Foolishness Revisited!
Message-ID: <8776@duke.duke.UUCP>
Date: Sat, 1-Nov-86 10:52:29 EST
Article-I.D.: duke.8776
Posted: Sat Nov  1 10:52:29 1986
Date-Received: Tue, 4-Nov-86 03:32:29 EST
Organization: Humanities Computing Center, Duke University
Lines: 104
Xref: mnetor net.lang.c:6261 net.micro.pc:7507


A few weeks ago I vented my hostilities on MSC's support (or lack 
thereof) for extended ASCII characters - specifically for their
decision to make type 'char' default to a signed quantity.  I asked
if other compilers defaulted to signed, and what justification existed
for such a policy.  I would like to thank those who were kind enough
to respond to my questions, summarize the arguments as I understand
them, and come back for a rebuttal.

1)	Microsoft C

MSC does, in fact, claim quite explicitly in the library manual that
'isupper', 'islower', etc. are defined only when 'isascii' is true.
Thus, with regard to my original complaint about 'isupper', the 
compiler is not broken; it is simply wrong!  
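
To make the documented restriction concrete, here is a minimal sketch of
the guard a caller has to write when 'isupper' is defined only for ASCII
values.  The name 'guarded_isupper' is hypothetical (nothing in the MSC
library is called that), and the range test simply stands in for
'isascii'.

```c
#include <ctype.h>

/* Hypothetical helper: call isupper only for values in the 7-bit
 * ASCII range, the only range for which the manual defines it. */
int guarded_isupper(int c)
{
    if (c < 0 || c > 0x7F)   /* equivalent to failing 'isascii' */
        return 0;            /* outside ASCII: report "not upper" */
    return isupper(c) != 0;
}
```

The guard keeps the call well defined, but it also shows the cost: every
accented capital is reported as "not upper" no matter what the host
system thinks of it.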

The MSC "Language Reference" distinguishes two types of character
sets.  The "representable" character set includes all symbols which
are meaningful to the host system.  The "C" character set, a subset
of the former, includes all characters which have meaning to the compiler.
I assume this distinction allows the compiler, for example, to process
strings containing non-ASCII characters, or to handle quoted non-ASCII 
characters in 'if' or 'case' statements.

It seems to me that any 'isbar' macro *ought* to apply to the full set
of characters which can be represented in the system, not only to those
used by the compiler.  For the PCDOS environment this includes characters
with umlauts, acute and grave accents, etc.  Thus I argue that Microsoft
has made the wrong decision in failing to support the full character
environment of their target system.
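
As a rough sketch of what an 'isbar' macro covering the whole
representable set could look like, consider a table with one entry per
byte value.  Everything here is an assumption for illustration: the names
'init_upper_table' and 'isupper_full' are hypothetical, and the extended
code points are taken from the IBM PC character set (code page 437); a
real table would be filled in for the actual host.

```c
#include <limits.h>

/* Illustrative classification table indexed by the full byte value.
 * The extended entries assume the IBM PC character set (code page 437);
 * a real table would be filled in for the actual host system. */
static unsigned char upper_table[UCHAR_MAX + 1];

void init_upper_table(void)
{
    /* Assumed code points: C-cedilla, A-umlaut, A-ring, E-acute,
     * AE, O-umlaut, U-umlaut, N-tilde. */
    static const unsigned char extended[] =
        { 0x80, 0x8E, 0x8F, 0x90, 0x92, 0x99, 0x9A, 0xA5 };
    int c;
    unsigned i;

    for (c = 'A'; c <= 'Z'; c++)   /* this loop assumes ASCII ordering */
        upper_table[c] = 1;
    for (i = 0; i < sizeof extended; i++)
        upper_table[extended[i]] = 1;
}

/* Hypothetical 'isupper' replacement defined for every byte value.
 * Converting to unsigned char first keeps a signed char argument
 * from producing a negative index. */
int isupper_full(int c)
{
    return upper_table[(unsigned char)c];
}
```

After one call to 'init_upper_table', 'isupper_full' gives an answer for
all 256 byte values, whether the argument arrived as a signed or an
unsigned char.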

2)	Signed char default

It appears that an accident of history - the architecture of the PDP-11 -
brought about the implementation of 'signed' chars.  Since then there 
appears to be a split between compilers that default to signed chars 
and those that default to unsigned.

The only argument for signed char default appears to be that some old 
PDP and VAX code will break without signed char defaults.  I could say
that this seems to me a better argument for rewriting the faulty code,
but I understand why many implementors do not want to rewrite large
amounts of established utilities.

I would suggest that the proper way to handle portability problems is
that of (believe it or not) the Microsoft 4.0 compiler.  Several of you
called attention to the new command line switch that will default chars
to unsigned.  This seems a relatively painless way to support code that
requires a particular char default.  My bone of contention, however, is
that this scheme is exactly backwards.  Code that uses signed chars will
not handle half of the system's character set, and thus I must
deliberately and consciously choose to set a command line switch every
time I compile a program, or my program will not work acceptably on my
system!
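
The failure is easy to see in a small sketch.  The two functions below
(hypothetical names) spell out 'signed char' and 'unsigned char'
explicitly, so the result does not depend on which default a given
compiler picks; the behaviour described for the signed case assumes a
two's-complement machine.

```c
/* Widen a byte to int the way a signed-char compiler does. */
int widen_as_signed(unsigned char byte)
{
    signed char c = (signed char)byte;  /* high bit becomes the sign */
    return c;
}

/* Widen the same byte the way an unsigned-char compiler does. */
int widen_as_unsigned(unsigned char byte)
{
    unsigned char c = byte;
    return c;
}
```

For the byte 0x81 (u-umlaut in the IBM PC set), 'widen_as_unsigned'
yields 129, while 'widen_as_signed' yields a negative number (typically
-127), which is exactly the value that walks off the front of a
table-driven 'isupper'.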

3)	What is a 'char' anyway?

Some of you called attention to K&R's discussions of the char type.
K&R definitely present 'char' as system specific.

	a single byte, capable of holding one character in 
	the local character set. (p. 34)

Following this statement is a table which presents the 'char' type
as 8-bit ASCII on the PDP-11, 9-bit ASCII on the Honeywell 6000,
8-bit EBCDIC on the IBM 370, and 8-bit ASCII on the Interdata 8/32.
On the following page is an explanation of character constants and
the differing numerical values associated with '0' in ASCII and EBCDIC.

My point is that K&R clearly sets forth the 'char' type as a logical
quantity which is implementation specific.  They are willing to 
include ASCII and EBCDIC in the definition, and, I assume, any other
arbitrary representation scheme that will fit into "a single byte".
By this definition, any code that depends on the mathematical properties 
of characters (e.g. that, in ASCII, A-Z and a-z are contiguous) is 
inherently non-portable!
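
A short sketch of the difference, with hypothetical function names.  The
first version leans on the numeric layout of the character set; the
second asks the library, which is supposed to know the local set.

```c
#include <ctype.h>

/* Non-portable: assumes 'a'..'z' occupy consecutive codes, which is
 * true of ASCII but not of EBCDIC. */
int lower_by_range(int c)
{
    return c >= 'a' && c <= 'z';
}

/* Portable: lets the library answer for the local character set.
 * The cast keeps a signed char value from reaching islower negative. */
int lower_by_library(int c)
{
    return islower((unsigned char)c) != 0;
}
```

On an ASCII machine the two agree for the unaccented letters.  On an
EBCDIC machine the range test also accepts the non-letter codes that sit
in the gaps between the letter groups.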

4)	What difference does it make?

None - if we want to continue to insist that English is the official
language of C and UNIX!  There is, however, a market of people who 
want to sed with ninyas or awk with cedillas.  There may, in fact,
be a system just around the corner for users who want to diff in
Kanji!  Unfortunately all of these are out of luck, since the
aforementioned code only works with 7-bit characters.  At this point in
time I am still trying to explain to my colleagues in the Humanities
Computing Lab why their new $10,000 Apollo supermicro can't display
a simple umlaut!

I guess the point of this rave should be summarized.  Now that hardware
no longer restricts us to 7-bit character sets, isn't it time we see
*forward* compatible compilers that default to the native character
set of their host system, and isn't it time we start writing (or 
rewriting) portable UNIX code that will work on systems whether 
characters display in ASCII, EBCDIC, Swedish, or Amharic?


Jeffrey William Gillette		uucp: mcnc!ethos!ducall!jeff
Humanities Computing Facility		bitnet: DYBBUK @ TUCCVM
Duke University