Path: utzoo!attcan!uunet!mcvax!diku!thorinn
From: thorinn@diku.dk (Lars Henrik Mathiesen)
Newsgroups: comp.lang.c
Subject: Re: Re: trigraphs in X3J11
Message-ID: <3853@diku.dk>
Date: 28 May 88 15:09:37 GMT
References: <1988May25.212902.1904@utzoo.uucp> <5215@ico.ISC.COM>, <10949@apple.Apple.Com> <3655@pasteur.Berkeley.Edu>
Organization: DIKU, U of Copenhagen, DK
Lines: 82

In article <3655@pasteur.Berkeley.Edu> faustus@ic.Berkeley.EDU (Wayne A. Christopher) writes:
>Nobody has said what the existing practice is with regard to European
>character sets.

I posted an article the other day, but maybe it didn't get past mcvax.
I shall include it here.

>I think trigraphs are a trick of American terminal manufacturers who
>want to fool Europeans into thinking they can use their terminals for
>writing programs.

Think again: If we use American ASCII-only terminals on an operating
system and compiler designed for ASCII, as most of them are, there's no
problem in writing C code, only in getting our national characters in
the output. I think a similar confusion may be part of the reason why
trigraphs are so badly conceived.

My prior article follows; I apologize if it's been seen before, but I
haven't seen any signs that it has.

As one who regularly uses a non-ASCII terminal setup, I'd better explain
a little. In Danish (my native language) we have three `extra' letters
which we much prefer to use when writing Danish text - it is possible to
get by with two-letter replacements, but it's not very readable. By the
way, these are not `accented letters;' they are separate letters of the
alphabet, with their own place at the end of the sorting sequence. Much
the same applies to German, Swedish, Norwegian, and many other European
languages.

That's not usually a problem as most modern terminals have provisions
for various national character sets, which are defined in an ISO standard.
This standard allows the glyphs at some eight or ten positions to vary,
including @, $, [, \, ], {, | and }. The latter six are used for the
non-ASCII letters in Danish, as they follow the other letters nicely.

So, the X3J11 people think, the poor Europeans can't use ASCII: we'll
have to invent some kludge to bring C to their benighted shores. The
only excuse for inventing something so horrible is that it only breaks a
very few programs, and that it won't be used anyway.

You see, over here we get by just fine without trigraphs. The less
fortunate are stuck with a national character set, and have to put up
with seeing the various punctuation as letters - they are not as
visually distinctive (and the brackets and braces don't pair naturally),
but with a little attention to layout one gets by quite well. And it's
_much_ better than trigraphs.

The lucky ones have terminals which can switch between ASCII and
national character sets. If not for the warped minds of the terminal
manufacturers, this would be the perfect solution. But we (at this
institute) have yet to see a terminal with an escape sequence to switch
character sets, or (and this is worse) one whose keyboard layout did
_not_ change with the character set shown on the screen. (And none of
them had LCD keytops). So we have to pay the importer to hack new PROMs
to enable us to switch without moving the keys around. But I digress.

By the way, I find that it's easier to read Danish with ASCII characters
than it is to parse convoluted C code in Danish characters, so I hardly
ever bother to switch any more.

To make it pleasant to use C and national letters in the same file,
there would have to be _convenient_ replacements for the ASCII
characters in question, and it would have to allow the national letters
to be used in identifiers (trigraphs don't). This cannot be done as an
extension of the ASCII C input format because the national letters are
punctuation in ASCII.
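
To make the clash concrete, here is a minimal sketch in C. It is not
taken from any compiler; the names to_trigraphs and as_danish are
invented for illustration. It assumes the Danish ISO 646 variant puts
the capital letters AE, O-slash and A-ring at the positions of [ \ ]
and the small ones at { | }, and it emits those letters as UTF-8 bytes
so the result can be viewed on a present-day terminal. The trigraph
column covers only these six characters; the draft defines a few more.

```c
#include <stddef.h>

/* Third character of the trigraph for c, or 0 if c is not one of the
   six code positions the post says carry Danish letters. */
static char trigraph_tail(char c)
{
    switch (c) {
    case '[':  return '(';
    case '\\': return '/';
    case ']':  return ')';
    case '{':  return '<';
    case '|':  return '!';
    case '}':  return '>';
    default:   return 0;
    }
}

/* UTF-8 bytes of the Danish letter assumed to sit at c's code position,
   or NULL if c keeps its ASCII meaning. Hex escapes keep this file ASCII. */
static const char *danish_glyph(char c)
{
    switch (c) {
    case '[':  return "\xC3\x86";  /* capital AE     */
    case '\\': return "\xC3\x98";  /* capital O-slash */
    case ']':  return "\xC3\x85";  /* capital A-ring  */
    case '{':  return "\xC3\xA6";  /* small ae        */
    case '|':  return "\xC3\xB8";  /* small o-slash   */
    case '}':  return "\xC3\xA5";  /* small a-ring    */
    default:   return NULL;
    }
}

/* Copy src to out, spelling the six characters as trigraphs.
   Returns the length written (without the NUL), or -1 if out is too small. */
static int to_trigraphs(const char *src, char *out, size_t cap)
{
    size_t n = 0;
    if (cap == 0)
        return -1;
    for (; *src; src++) {
        char tail = trigraph_tail(*src);
        size_t need = tail ? 3 : 1;
        if (n + need >= cap)
            return -1;
        if (tail) {
            /* Written char by char so this file contains no trigraph itself. */
            out[n++] = '?';
            out[n++] = '?';
            out[n++] = tail;
        } else {
            out[n++] = *src;
        }
    }
    out[n] = '\0';
    return (int)n;
}

/* Copy src to out as a Danish-variant terminal would display it.
   Returns the length in bytes (without the NUL), or -1 if out is too small. */
static int as_danish(const char *src, char *out, size_t cap)
{
    size_t n = 0;
    if (cap == 0)
        return -1;
    for (; *src; src++) {
        const char *g = danish_glyph(*src);
        size_t need = g ? 2 : 1;
        if (n + need >= cap)
            return -1;
        if (g) {
            out[n++] = g[0];
            out[n++] = g[1];
        } else {
            out[n++] = *src;
        }
    }
    out[n] = '\0';
    return (int)n;
}
```

Given the source text a[i], to_trigraphs produces a??(i??) while
as_danish produces the letter a, capital AE, i, capital A-ring: the
very bytes that index an array in ASCII read as part of a word on the
national terminal, which is why those letters cannot simply be admitted
into identifiers.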

Now we're talking about an alternate input format for C - we'll have to
tell the compiler if a given source file is in the `old' or the `new'
format. On the other hand this frees us to use extra keywords etc.

The new format shouldn't use any characters that may be replaced in
national character sets. The tokens [ ] { } | || (and in some compilers
|=) must be replaced; one off-the-cuff possibility is (. .) beg end or
cor (or=). We need a new pre-processor escape and a new string escape,
which can't very well be keywords. // might be a possibility for both,
as it's rare in C, but does it look too much like JCL?

This new format could probably be implemented by a little lex
pre-pre-processor; national characters in identifiers would have to be
encoded somehow (e.g. using Q as an escape), increasing the identifier
length. This would cause problems with symbolic debuggers and short-name
compilers, but could easily be retrofitted on old compilers (write your
own cc ...). Oh well, it wouldn't be portable anyway. Hey, anybody from
GNU reading this?

By the way, Standard Pascal is designed so that it can be written
without certain ASCII characters: It allows (. .) for [ ] (indexing),
and (* *) for { } (comments). Since e.g. .5 is a legal constant, this
may cause unexpected parse errors for programmers who're unaware of the
feature.

--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcvax!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.