Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!zaphod.mps.ohio-state.edu!think.com!yale!cmcl2!adm!smoke!gwyn
From: gwyn@smoke.brl.mil (Doug Gwyn)
Newsgroups: comp.std.c
Subject: Re: How to write Trigraph like character sequences in a string
Message-ID: <16339@smoke.brl.mil>
Date: 5 Jun 91 08:16:38 GMT
References: <1991Jun3.011539.17430@tkou02.enet.dec.com> <16332@smoke.brl.mil> <1991Jun5.005958.9597@tkou02.enet.dec.com>
Organization: U.S. Army Ballistic Research Laboratory, APG, MD.
Lines: 72

In article <1991Jun5.005958.9597@tkou02.enet.dec.com> diamond@jit533.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>Does this mean that in a national character set that doesn't have [|\ etc.,
>and a proposed implementation accepts the entire national character set
>plus trigraphs, failure to support [|\ etc. through direct means without
>trigraphs will make the implementation non-conforming?

The C source character set, in terms of the standard, has no necessary relation to the code set used for "the national character set", although in many current implementations there happens to be a close relationship. The standard says the specified set of C source characters, which are the result of conversion from external representations, must be supported, but it doesn't specify details of the conversion. This is independent of the trigraph mapping that occurs at a slightly later phase of translation.

You might recall the "Software Tools" Ratfor translator implementation; it converted external source file characters, which might for example be coded in EBCDIC, to a universal internal form (which happened to be ASCII in that example), to take advantage of the known properties of the internal representation (e.g. contiguity of the alphabetic character codes) for fast processing, and converted back to external characters upon output (it was a text-to-text format translator, not a true compiler).
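
The input-side mapping described above can be sketched roughly as follows. This is only an illustration, not code from "Software Tools" or from any compiler: the names site_map, build_table, and convert_input are invented here, the internal form is assumed to be ASCII (so the sketch must be compiled where character constants are ASCII), and the letter, digit, and space positions are the usual EBCDIC ones. The punctuation entries are deliberately left to a site-supplied table, since those are exactly the codes that differ from site to site.

```c
#include <stddef.h>

#define UNMAPPED 0   /* internal code used when no mapping is defined */

/* One site-specific entry: external code -> internal character.
 * (Hypothetical structure, invented for this sketch.) */
struct site_map {
    unsigned char external;
    char internal;
};

static unsigned char to_internal[256];

/* Map a run of consecutive external codes onto consecutive internal ones.
 * Assumes the internal (host) character constants are ASCII. */
static void map_run(unsigned char ext_first, char int_first, int count)
{
    int i;

    for (i = 0; i < count; i++)
        to_internal[ext_first + i] = (unsigned char)(int_first + i);
}

/* Build the table: fixed EBCDIC letter/digit/space runs first, then
 * whatever the site says about its "funny" characters. */
void build_table(const struct site_map *site, size_t n)
{
    size_t i;

    for (i = 0; i < 256; i++)
        to_internal[i] = UNMAPPED;

    map_run(0x81, 'a', 9);   /* a-i */
    map_run(0x91, 'j', 9);   /* j-r */
    map_run(0xA2, 's', 8);   /* s-z */
    map_run(0xC1, 'A', 9);   /* A-I */
    map_run(0xD1, 'J', 9);   /* J-R */
    map_run(0xE2, 'S', 8);   /* S-Z */
    map_run(0xF0, '0', 10);  /* 0-9 */
    to_internal[0x40] = ' ';

    for (i = 0; i < n; i++)
        to_internal[site[i].external] = (unsigned char)site[i].internal;
}

/* Convert n external characters to internal form, once, at the boundary. */
void convert_input(const unsigned char *ext, char *out, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        out[i] = (char)to_internal[ext[i]];
}
```

A site whose files carry '[' as code 0xAD (a made-up convention for this example) would pass { 0xAD, '[' } in its table. Everything downstream of convert_input() can then rely on the internal form's properties, such as contiguous letters, without knowing what the external code set was.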

C compiler implementations [that don't need to provide support for Japanese-style multibyte character encodings in C source files] could readily map whatever the site-specific conventions are for external representation of funny characters (traditionally displayed as a vertical-bar glyph, etc.) used in C source code to a more convenient internal form (probably a 7-bit code) for subsequent processing. Such conventions are not the business of any C standard; they necessarily depend on highly site-dependent characteristics.

>I thought that the exact opposite of this was previously decided, that
>trigraphs were sufficient in such cases.

No, and trigraphs are a horrible invention that seems mainly to serve to lead people in the wrong direction for coping with character set problems. While trigraphs may at first appear to "solve" the code set issue by permitting translation of any strictly conforming C source to a "lowest common denominator" set of characters that even ISO 646 sites claim to support, in order to accomplish this function one needs to come up with utilities for translating generic C into full trigraphed form, as well as the inter-code-set text file transfer and translation facilities that are always necessary for data interchange. The latter have to solve the differing-code-set issues anyway. (Note that there are radically different code sets among the sites that receive this newsgroup; to a large extent similar problems have already been overcome in the development of internetworking.)

If I may add a general observation about code set issues, particularly multibyte encodings: It seems to me that the people designing software facilities, hardware, and standards concerning the issues generally fail to appreciate a crucial design point: The sooner you can map everything into a uniform format with simple, clean properties, the better off you are.
Instead, we keep seeing designs that require the users of the services to face algorithmic complexity, because the data being operated upon has been left in a complex encoded form instead of being turned into the previously mentioned uniform format with nice properties. Algorithms naturally reflect the underlying structure of the data. If you'd like to be able to code programs that deal with text in a simple manner, as seen in early UNIX utilities such as "wc", you need to keep the form in which text is seen by program code as simple as possible. For example, all text characters must be handled as one "character" type, a complete unit of which would be returned per call to getchar(); that would obviate the need for wchar_t and the (rapidly growing) library of functions for helping applications deal with nonunitized, fragmented, and stateful characters. This would mean that in some environments a 16-bit datum would be required for representing a single character, but we ended up with that anyway, in the form of wchar_t, without the benefits of a simple program interface to text units.

Since the character problems have to do with people, their details should be pushed as far out from the application (thus as close to the users) as possible. I think X3J11, prompted by certain vendors who were already committed to complicated solutions, made the wrong choice here. I would hope that other computer engineers learn from this example how not to "solve" such problems.
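
To make the getchar() point above concrete, here is a minimal sketch of what "one complete character per call" could look like. It is an assumption-laden illustration, not anything the standard provides: text_char, text_getc, and count_text are invented names, and the external encoding is a hypothetical one in which a byte below 0x80 is a whole character and a byte of 0x80 or above is the first half of a two-byte character. The only place that knows about the encoding is text_getc(); the wc-style loop sees uniform units.

```c
#include <stdio.h>

typedef long text_char;      /* wide enough for a 16-bit unit plus an EOF value */
#define TEXT_EOF (-1L)

/* Return the next complete character from fp, or TEXT_EOF.
 * The two-byte rule here is a hypothetical encoding for illustration. */
text_char text_getc(FILE *fp)
{
    int lead, trail;

    lead = getc(fp);
    if (lead == EOF)
        return TEXT_EOF;
    if (lead < 0x80)
        return (text_char)lead;          /* single-byte character */

    trail = getc(fp);
    if (trail == EOF)
        return TEXT_EOF;                 /* truncated character: treat as end of input */
    return ((text_char)lead << 8) | (text_char)trail;
}

/* wc-style counting written purely in terms of whole characters. */
void count_text(FILE *fp, long *chars, long *words, long *lines)
{
    text_char c;
    int in_word = 0;

    *chars = *words = *lines = 0;
    while ((c = text_getc(fp)) != TEXT_EOF) {
        ++*chars;
        if (c == '\n')
            ++*lines;
        if (c == ' ' || c == '\t' || c == '\n')
            in_word = 0;
        else if (!in_word) {
            in_word = 1;
            ++*words;
        }
    }
}
```

Note that count_text() has the same shape as the classic single-byte word counter; the multibyte details stay at the edge, which is the design choice being argued for.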