Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site fortune.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxl!ihnp4!fortune!rpw3
From: rpw3@fortune.UUCP
Newsgroups: net.unix
Subject: Re: Does C depend on ASCII? - (nf)
Message-ID: <3249@fortune.UUCP>
Date: Sun, 6-May-84 07:06:18 EDT
Article-I.D.: fortune.3249
Posted: Sun May 6 07:06:18 1984
Date-Received: Mon, 7-May-84 00:49:30 EDT
Sender: notes@fortune.UUCP
Organization: Fortune Systems, Redwood City, CA
Lines: 58

#R:utcsstat:-187300:fortune:26900056:000:2297
fortune!rpw3    May  6 03:13:00 1984

[ After this, let's move this to "net.lang.c", shall we? ]

Many, many programs I have seen depend on certain characteristics of
ASCII, though how much of the total sequence is wired in surely varies
from program to program. This has GOT to be a major factor in the cost
of porting UNIX to a non-ASCII machine.

Most of what I have seen included at least the following hard
dependencies:

1.  The numbers are contiguous (no gaps). Kernighan & Ritchie [pp20-21]:

	"This particular program relies heavily on the properties of
	the character representation of digits. For example, the test

		if (c >= '0' && c <= '9') ...

	determines whether the character in "c" is a digit. If it is,
	the numeric value of that digit is

		c - '0'

	This works only if '0', '1', etc., are positive and in
	increasing order, and if there is nothing but digits between
	'0' and '9'. Fortunately, this is true for all conventional
	character sets."

    Note particularly the word "all" in that last sentence. Again
    [page 39], in the sample "atoi(s)", the same assumption is made.

2.  The lowercase letters (as a class) are contiguous, as are the
    uppers. Some programs know that 'A' + 040 == 'a', some don't.
    Some only depend on 'a' > 'A' (so that 'x' - 'X' is a positive
    number).

    Interestingly, most of the programs I have seen DON'T assume any
    fixed distance between '9' and 'A'. When converting hexadecimal
    input they adjust for letters by subtracting 'A' - ('9' + 1) from
    the value of the letter, so that 'A' lands right after '9' and the
    usual "- '0'" then yields 10.

3.  The ASCII control characters exist, and have values of 'X' - 0100
    for any control character <^X> (where 'X' is the upper-case letter
    of similar appearance). It is known (for example) that
    newline == '\n' == 'J' - 0100, and that 'H' - 0100 is a backspace.

In sum, many programs assume ASCII, or at least certain properties of
the collating sequence. The ones mentioned above are certainly not a
complete list of what you may find when trying to use another character
set, but they are a few "biggies". The use of "_ctype[]" can help, but
many programs do not use it with consistency.

Sorry 'bout that...

Rob Warnock

UUCP:	{ihnp4,ucbvax!amd70,hpda,harpo,sri-unix,allegra}!fortune!rpw3
DDD:	(415)595-8444
USPS:	Fortune Systems Corp, 101 Twin Dolphin Drive, Redwood City, CA 94065
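
For anyone who wants to see the three dependencies above side by side,
here is a minimal sketch in C. It is not taken from any particular
program; every function name in it (digit_value, hex_value, ctrl,
portable_lower, check_charset_assumptions) is made up for illustration.
The last function is one possible way for a program that does wire
these properties in to at least check them once at startup.

```c
#include <assert.h>
#include <ctype.h>

/* All names below are hypothetical, chosen only for illustration. */

/* Dependency 1: the digits are contiguous and increasing. */
int digit_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;                      /* not a digit */
}

/* Dependency 2: each case is contiguous and 'a' > 'A'.  No fixed
 * distance between '9' and 'A' is assumed; the gap is computed. */
int hex_value(int c)
{
    if (c >= 'a' && c <= 'f')
        c -= 'a' - 'A';             /* fold to upper case */
    if (c >= 'A' && c <= 'F')
        c -= 'A' - ('9' + 1);       /* slide 'A'..'F' down next to '9' */
    else if (c < '0' || c > '9')
        return -1;                  /* not a hex digit */
    return c - '0';
}

/* Dependency 3 (ASCII only): <^X> has the value 'X' - 0100. */
int ctrl(int upper)
{
    return upper - 0100;
}

/* The <ctype.h> route: no arithmetic on character codes at all.
 * The argument must be an unsigned char value or EOF. */
int portable_lower(int c)
{
    return isupper(c) ? tolower(c) : c;
}

/* Aborts if the host character set breaks the assumptions above. */
void check_charset_assumptions(void)
{
    assert('9' - '0' == 9);
    assert('z' - 'a' == 25 && 'Z' - 'A' == 25);
    assert('a' > 'A');
    assert(ctrl('J') == '\n' && ctrl('H') == '\b');
}
```

Calling check_charset_assumptions() first thing in main() turns a
silent misbehavior on a non-ASCII machine into an immediate, visible
failure. Only portable_lower() avoids the assumptions altogether.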