Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!elroy.jpl.nasa.gov!ncar!hsdndev!cmcl2!adm!smoke!gwyn
From: gwyn@smoke.brl.mil (Doug Gwyn)
Newsgroups: comp.std.c
Subject: Re: wchar_t values
Message-ID: <15651@smoke.brl.mil>
Date: 1 Apr 91 06:45:30 GMT
References: <1006@sranha.sra.co.jp>
Organization: U.S. Army Ballistic Research Laboratory, APG, MD.
Lines: 166

In article keld@login.dkuug.dk (Keld Jørn Simonsen) writes:
>erik@srava.sra.co.jp (Erik M. van der Poel) writes:
>>  C        ISO DIS 10646/4    wchar_t
>>  L'c'     032/032/032/099    000/000/000/099
>>  L'\t'    009/128/128/128    000/000/000/009
>Erik writes: ANSI C does not handle 10646 properly -> let's change 10646!

No, he didn't say that, and his suggestion seemed reasonable to me.

>ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601
>correctly. So ANSI C multibyte specifications *cannot* be used on any
>multibyte de jure character set.

I think you have mixed up multibyte character sequences with wchar_t.
They are NOT the same thing! That is why there are interconversion
functions specified in the C standard.

The advice X3J11 received during development of this aspect of the
standard, from such organizations as NTSCJ who have a major stake in
so-called multibyte character encodings, was that the mechanisms in the
C standard were adequate for this purpose. Unless you can explain WHAT
it is that you think is wrong, I suggest that your comments be ignored.
I haven't seen a significant technical argument against the "wide
character" mechanisms in the C standard; what I have seen are
misunderstandings. Perhaps you should refer to P.J. Plauger's model
standard C library implementation in his new book, to see what is
actually involved in exploiting and implementing these facilities.

>Also the character standards should be the base standards and
>programming language standards build on these and provide appropriate
>functionality to cover the standard character sets.

WHICH "character standard"? There are so many to choose from, all of
them botched in one way or another. That is why programming language
standards should be INDEPENDENT of any particular choice of character
code set, rather than based on one choice that may not be appropriate
for many of the potential users of the language.

In the case of C, the only requirements on the basic source and
execution character sets are that there be at least 96 distinct values,
that the values assigned by the C implementation to represent digit
glyphs be a contiguous ascending sequence, that there be three
additional distinct values in the execution set, and that all the
previously mentioned internal values be distinct from zero. THE MAPPING
BETWEEN INTERNAL CODE SET VALUES AND THE ENVIRONMENT CAN, AND SHOULD, BE
DEFINED BY THE C IMPLEMENTATION. Thus, a straight 6-bit external code
set could not have a one-to-one correspondence between external
"characters" and C source OR execution "character" values, and in such a
system environment there would have to be at least one added convention
for representing the full set of internal C characters, with support
tools to facilitate working with such special text files.

Mapping is an extremely important mathematical concept, with particular
relevance to applications involving multiple alphabets. (As a former
cryptanalyst I am especially sensitive to this.) That is why the VERY
FIRST STEP in translating a C program, as spelled out in the C standard
("Translation Phases", section 2.1.1.2 in X3.159-1989), is the
application of a MAPPING from physical (i.e. external) source file
characters to (internal) C source characters.
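
To make the 6-bit case concrete, here is a minimal sketch in C of what
one such "added convention" might look like. Everything in it is an
assumption made for illustration: the code assignments, the shift code
EXT_SHIFT, and the names plain, shifted and map_external are
hypothetical and are not taken from the C standard or from any real
implementation.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL 6-bit external code set (an assumption for illustration):
   codes 0..62 select a character from one of two tables, and code 63 is
   a shift prefix meaning "take the next code from the second table". */
#define EXT_SHIFT 63u

/* Positions not listed are zero-initialized and treated as unassigned.
   Between them the two tables hold 96 distinct characters. */
static const char plain[63] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ \n()*+,-./:;<=>?!\"#%&'[]_";
static const char shifted[63] =
    "abcdefghijklmnopqrstuvwxyz\\^{|}~\t\v\f";

/* Map n external codes to internal characters and NUL-terminate out.
   Returns the number of characters written, or (size_t)-1 if a code is
   out of range or unassigned, a shift is left dangling, or out is too
   small. */
static size_t map_external(const unsigned char *ext, size_t n,
                           char *out, size_t outcap)
{
    size_t i = 0, w = 0;

    assert(ext != NULL && out != NULL);
    while (i < n) {
        const char *table = plain;
        unsigned code = ext[i++];
        char c;

        if (code == EXT_SHIFT) {
            if (i >= n)
                return (size_t)-1;      /* shift with nothing after it */
            table = shifted;
            code = ext[i++];
        }
        if (code >= EXT_SHIFT)
            return (size_t)-1;          /* not a valid table index */
        c = table[code];
        if (c == '\0' || w + 1 >= outcap)
            return (size_t)-1;          /* unassigned code or no room */
        out[w++] = c;
    }
    if (w >= outcap)
        return (size_t)-1;              /* no room for the terminator */
    out[w] = '\0';
    return w;
}
```

Under these made-up tables the external sequence 1, 10, 36, 63, 0 comes
out as the four internal characters '1', 'A', ' ', 'a'. The point is
only that the correspondence is a mapping chosen by the implementation,
not a property of the language.
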
Systems with record-oriented text files can exploit this mapping
subphase to introduce line delimiter internal characters ("new-line"
characters in the C source character set), and systems that lack
standard representations for some of the required C source characters
can take advantage of this mapping subphase to interpret, for example,
digraph representations for the characters not normally considered to
be represented in the "native" code set. This is a simple and clean
approach to satisfying the C source character set requirements.

Indeed, X3J11 explained this to you in the third public review response
document. Judging from your continued pursuit of more obtrusive
solutions for your own particular limited character set problem, it
would appear that you either did not understand the X3J11 response or
that, for reasons of your own, you wish to ignore it. For purposes of
documentation for those who have not seen the response X3J11 gave long
ago, here it is:

    In response to Letter #177, Doc. No. X3J11/88-134:

    Summary of issue:
        Proposal for more readable supplement to trigraphs.

    X3J11 response:
        The Committee discussed this proposal but decided against it.

    We cannot support this proposal for a number of reasons. Trigraphs
    were intended to provide a universally portable format for the
    *transmission* of C programs; they were never intended to be used
    for day-to-day reading and writing of programs. Should it be
    necessary to do so, however, the preprocessor can already be used
    to improve their readability (exact macro names and definitions are
    not provided as the Committee prefers to avoid stylistic issues).
    As larger character sets become more and more popular, the chances
    of having to deal with a "deficient" character set become smaller
    and smaller. Conversion between the current trigraph
    representations and "normal" representations can be done simply in
    a context-free manner, but this is not possible with the proposed
    notation.
    Also, there are a number of difficulties with the infix subscript
    operator where empty brackets would have been used. Either the
    operator must be allowed as a postfix unary operator as well as a
    binary operator, or the grammar must be extended to allow empty
    parentheses to appear in those contexts where empty brackets can.
    Although these problems are by no means insurmountable, we feel
    that the current trigraphs are adequate for their intended use and
    that no further enhancements are necessary.

    Translation phase 1 actually consists of two parts, first the
    mapping (about which we say very little) from the external host
    character set to the C source character set, then the replacement
    of C source trigraph sequences with single C source characters.
    (Note that the C source characters represented in our documents in
    Courier font need not appear graphically the same in the host
    environment, although a reasonable implementation will make them as
    nearly so as possible.) The kind of mapping you propose can in fact
    be done in the first part of translation phase 1, and several such
    "convenience" mappings are already common practice. However,
    attempting to standardize this mapping is outside the scope of the
    C Standard, since what is appropriate may depend on the
    capabilities of the specific hardware, availability of fonts, and
    so forth.

    Although the Committee regrets any "no" votes on either the
    national or international proposed standards, we feel we must
    represent our best judgement on technical issues. We hope you will
    reconsider your objection to the current specification.

Note that your "trigraph alternative" proposals had been discussed many
times in the standards committees, and still were resoundingly defeated
during a joint X3J11/WG14 meeting.
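
The second part of translation phase 1, and the "context-free" property
claimed for it in the response, can be illustrated with a minimal C
sketch. The nine replacements are the standard trigraphs (two question
marks followed by one of = ( / ) ' < ! > - standing for # [ \ ] ^ { | }
~ respectively); the function names trigraph_char and replace_trigraphs
are hypothetical, and this is an illustration under those assumptions
rather than a description of how any particular translator is written.

```c
#include <assert.h>
#include <stddef.h>

/* Given the third character of a candidate trigraph, return the single
   source character it stands for, or 0 if it does not form a trigraph. */
static char trigraph_char(char c)
{
    switch (c) {
    case '=':  return '#';
    case '(':  return '[';
    case '/':  return '\\';
    case ')':  return ']';
    case '\'': return '^';
    case '<':  return '{';
    case '!':  return '|';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Replace every trigraph in the NUL-terminated string s, in place, and
   return the new length. Each decision looks only at the next three
   characters and never at the surrounding syntax (strings, comments,
   and so on), which is what makes the conversion context-free. */
static size_t replace_trigraphs(char *s)
{
    size_t r = 0, w = 0;

    assert(s != NULL);
    while (s[r] != '\0') {
        char t;

        /* s[r + 2] is only read when s[r + 1] is a question mark, so
           the access never runs past the terminating NUL. */
        if (s[r] == '?' && s[r + 1] == '?' &&
            (t = trigraph_char(s[r + 2])) != 0) {
            s[w++] = t;
            r += 3;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
    return w;
}
```

When trying this out, write test input with the question marks escaped
(for example "?\?=" in a string literal), since a translator with
trigraph support enabled would otherwise perform the replacement itself
before the function ever ran.
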
The only reason this issue is still "on the table" for WG14 is that
there was some political maneuvering at the SC22 level in the absence
of anybody who could represent the actual issues and history, and SC22
mistakenly thought, on the basis of your argumentation, that there was
a problem that needed to be solved, and thus directed that work begin
on a normative addendum to the ISO C standard to address this
"problem".

Later, the Japanese in particular thought that it would be appropriate
to add more support for multibyte character sequences to the ISO C
standard as part of this normative addendum. Your original hobby horse
had nothing to do with multibyte character sequences, and so far as I
can determine, the Japanese have not found any problem with them other
than the desire for more standard library functions to make their use
more convenient.

It is also worth noting that there is continuing discussion of this
issue on the X3J11 and WG14 electronic mailing lists.

>I hope that this problem will be a historical one with the appearance
>of 10646.

Surely you should be able to see the possible problems with 10646? The
very idea of using 32 bits to represent a character is bound to meet
stiff opposition, particularly from users of small systems, who already
have more efficient solutions to the "problem" of a diversity of
alphabets. It seems to me that 10646 is one of the technically worst
character-set standards yet to be adopted. No wonder there has been
renewed interest in other standards such as "Unicode" (about which I
know little at present other than that it has a broad base of industry
support).

One does not solve "people problems" by simply adopting a technical
standard. History provides much evidence of that.

DISCLAIMER: None of the above should be construed as an official X3J11
position, not even the attempt to cite from an X3J11 document. However,
I believe that I have correctly represented the situation as I
understand it.