Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!columbia!rutgers!ll-xn!adelie!axiom!linus!philabs!mcnc!duke!jwg
From: jwg@duke.UUCP (Jeffrey William Gillette)
Newsgroups: net.lang.c,net.micro.pc
Subject: Signed Chars - What Foolishness Revisited!
Message-ID: <8776@duke.duke.UUCP>
Date: Sat, 1-Nov-86 10:52:29 EST
Article-I.D.: duke.8776
Posted: Sat Nov 1 10:52:29 1986
Date-Received: Tue, 4-Nov-86 03:32:29 EST
Organization: Humanities Computing Center, Duke University
Lines: 104
Xref: mnetor net.lang.c:6261 net.micro.pc:7507

[]

A few weeks ago I vented my hostilities on MSC's support (or lack
thereof) for extended ASCII characters - specifically for their
decision to make type 'char' default to a signed quantity.  I asked
if other compilers defaulted to signed, and what justification
existed for such a policy.

I would like to thank those who were kind enough to respond to my
questions, summarize the arguments as I understand them, and come
back for a rebuttal.

1)  Microsoft C

MSC does, in fact, claim quite explicitly in the library manual that
'isupper', 'islower', etc. are defined only when 'isascii' is true.
Thus, with regards to my original complaint about 'isupper', the
compiler is not broken, it is simply wrong!

The MSC "Language Reference" distinguishes two types of character
sets.  The "representable" character set includes all symbols which
are meaningful to the host system.  The "C" character set, a subset
of the former, includes all characters which have meaning to the
compiler.  I assume this distinction allows, e.g. the compiler to
process strings containing non-ASCII characters, or to handle quoted
non-ASCII characters in 'if' or 'case' statements.

It seems to me that any 'isbar' macro *ought* to apply to the full
set of characters which can be represented in the system, not only
to those used by the compiler.  For the PCDOS environment this
includes characters with umlauts, acute and grave accents, etc.

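To make the 'isupper' complaint concrete, here is a minimal sketch of
the usual workaround, assuming an 8-bit 'char' and a compiler that
treats plain 'char' as signed.  On such a compiler a byte like 0x81
(u-umlaut in the IBM PC character set) widens to a negative int, which
is outside anything the <ctype.h> routines are documented to accept.
The names byte_value and is_upper_byte are hypothetical, made up for
this illustration; they are not part of MSC or any other library.

```c
#include <ctype.h>

/* Hypothetical helper: map a plain char to the non-negative value
 * (0..UCHAR_MAX) of the byte it holds, whether or not the compiler
 * treats plain char as signed. */
int byte_value(char c)
{
    return (unsigned char)c;
}

/* Hypothetical wrapper: classify a byte without ever handing
 * isupper() a negative argument.  Returns 1 or 0. */
int is_upper_byte(char c)
{
    return isupper(byte_value(c)) != 0;
}
```

This only removes the negative-argument hazard.  Whether an accented
capital is then reported as upper case still depends on the tables
the library ships with, which is exactly the decision at issue here.
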
Thus I argue that Microsoft has made the wrong decision in failing to
support the full character environment of their target system.

2)  Signed char default

It appears that an accident of history - the architecture of the
PDP-11 - brought about the implementation of 'signed' chars.  Since
then there appears to be a split between compilers that default to
signed chars and those that default to unsigned.

The only argument for signed char default appears to be that some old
PDP and VAX code will break without signed char defaults.  I could
say that this seems to me a better argument for rewriting the faulty
code, but I understand why many implementors do not want to rewrite
large amounts of established utilities.

I would suggest that the proper way to handle portability problems is
that of (believe it or not) the Microsoft 4.0 compiler.  Several of
you called attention to the new command line switch that will default
chars to unsigned.  This seems a relatively painless way to support
code that requires char defaults.

My bone of contention, however, is that this scheme is exactly
backwards.  Code that uses signed chars will not handle half of the
system's character set, and thus I must deliberately and consciously
choose to set a command line switch every time I compile a program,
or my program will not work acceptably on my system!

3)  What is a 'char' anyway?

Some of you called attention to K&R's discussions of the char type.
K&R definitely present 'char' as system specific.

	a single byte, capable of holding one character in the
	local character set.  (p. 34)

Following this statement is a table which presents the 'char' type as
8-bit ASCII on the PDP-11, 9-bit ASCII on the Honeywell 6000, 8-bit
EBCDIC on the IBM 370, and 8-bit ASCII on the Interdata 8/32.  On the
following page is an explanation of character constants and the
differing numerical values associated with '0' in ASCII and EBCDIC.

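K&R's point about '0' can be sketched in a few lines.  The idea is to
write the character constant and let the compiler supply the local
code, rather than hard-coding 48 (ASCII) or 240 (EBCDIC).  The names
digit_value and is_letter are hypothetical.  The digit test assumes
the ten digits have consecutive codes, which holds in both ASCII and
EBCDIC; letters are not consecutive in EBCDIC, so the sketch leaves
those to isalpha().

```c
#include <ctype.h>

/* Hypothetical helper: numeric value of a decimal digit character,
 * or -1 if c is not a digit.  Uses '0' rather than a literal code,
 * so the same source works under either character set. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}

/* Hypothetical helper: letter test that leaves the layout of the
 * alphabet to the library's tables instead of comparing against
 * 'a'..'z' directly.  Returns 1 or 0. */
int is_letter(char c)
{
    return isalpha((unsigned char)c) != 0;
}
```

For example, digit_value('7') is 7 on either kind of machine, while a
test written as (c >= 'a' && c <= 'z') would also accept some
non-letters under EBCDIC.
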
My point is that K&R clearly sets forth the 'char' type as a logical
quantity which is implementation specific.  They are willing to
include ASCII and EBCDIC in the definition, and, I assume, any other
arbitrary representation scheme that will fit into "a single byte".
By this definition, any code that depends on the mathematical
properties of characters (e.g. that, in ASCII, A-Z and a-z are
contiguous) is inherently non-portable!

4)  What difference does it make?

None - if we want to continue to insist that English is the official
language of C and UNIX!  There is, however, a market of people who
want to sed with ninyas or awk with cedillas.  There may, in fact, be
a system just around the corner for users who want to diff in Kanji!
Unfortunately all of these are out of luck, since the aforementioned
code only works with 7-bit characters.  At this point in time I am
still trying to explain to my colleagues in the Humanities Computing
Lab why their new $10,000 Apollo supermicro can't display a simple
umlaut!

I guess the point of this rave should be summarized.  Now that
hardware no longer restricts us to 7-bit character sets, isn't it
time we see *forward* compatible compilers that default to the native
character set of their host system, and isn't it time we start
writing (or rewriting) portable UNIX code that will work on systems
whether characters display in ASCII, EBCDIC, Swedish, or Amharic!

-- 
Jeffrey William Gillette        uucp: mcnc!ethos!ducall!jeff
Humanities Computing Facility   bitnet: DYBBUK @ TUCCVM
Duke University