Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!usc!snorkelwacker.mit.edu!bloom-beacon!eru!hagbard!sunic!dkuug!dkuugin!keld
From: keld@login.dkuug.dk (Keld J|rn Simonsen)
Newsgroups: comp.std.c
Subject: Re: wchar_t values
Message-ID: <keld.670436534@dkuugin>
Date: 31 Mar 91 16:22:14 GMT
References: <990@sranha.sra.co.jp> <keld.670360834@dkuugin> <1006@sranha.sra.co.jp>
Sender: news@slyrf.dkuug.dk
Lines: 66

erik@srava.sra.co.jp (Erik M. van der Poel) writes:

>As several people have guessed, the real reason for bringing up the
>wchar_t issue is because I am wondering how ISO 10646 can be used in
>the C language. Personally, I think that we should use it as follows:

>	C	ISO DIS 10646/4		wchar_t

>	L'c'	032/032/032/099		000/000/000/099
>	L'\t'	009/128/128/128		000/000/000/009

>I think that this is the most reasonable way to do it since it seems
>to conform to ANSI C.

Erik writes: ANSI C does not handle 10646 properly -> let's change 10646!
I do not think this is the right way of reasoning.

ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601
correctly. So the ANSI C multibyte specifications *cannot* be used with
any multibyte de jure character set. That seems to me to be a fault in
ANSI C. Also, the character standards should be the base standards, and
programming language standards should build on these and provide
appropriate functionality to cover the standard character sets.

If another programming language, or maybe some communication
standard, then has other requirements for a universal character standard,
should the character standard also be changed to accommodate that use?
And what if the different requirements are contradictory? Should that
lead to different character set standards? Well, that is what happened
in the past, with the ISO 646 and 8859 standards in programming languages
and 6937/T.61 in the communications world. I hope that this problem
will be a historical one with the appearance of 10646.

>However, I don't really care what encoding we use for wchar_t, as long
>as implementors who wish to use 10646 for wchar_t all agree on one
>encoding. So we should create an international standard that specifies
>how to use 10646 as a processing code in C. If this spec appears some
>time after 10646 becomes an IS, implementors might do things
>differently. So the spec should appear together with 10646. Perhaps in
>a normative annex in 10646?

It could also appear in the ISO C addendum that is being worked on
by WG14. I think that is the most natural place: as a base standard
for other JTC1 work, 10646 should not reference the ISO C standard.

I have some ideas on how to solve it in C:

1. Include a table in the runtime library that maps ASCII characters
into the current execution character set. This table is replaced
on each new call to setlocale(). L'c' then refers to the table
entry for ASCII 'c', which holds the current wchar_t value of 'c'.
Effectiveness: quite good, just a value fetched through a pointer
instead of an immediate value. For widechar characters this may even
be without any loss, as the widechar value may have to be stored in
a 2- or 4-byte location anyway.

2. Have a function which returns a character given its charmap name
(POSIX term). This has the generality that not only ASCII
characters can be handled in this way. Say, a character <c,> (c-cedilla)
can also be tested for in this way.
Effectiveness: less good, needs a function call and a table lookup
on a name (hashed or the like).
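
To make the two ideas concrete, here is a rough sketch in C. It is
only an illustration under stated assumptions, not a proposal for the
actual interface: the names ascii_to_wide, load_ascii_table,
WIDE_ASCII, demo_charmap and wc_from_charmap_name are all made up
here, the "offset" parameter merely stands in for whatever mapping a
real locale would define, and the two charmap values are example
values only. A linear search is used for brevity where a real
implementation would probably hash the name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Idea 1 (hypothetical): one slot per 7-bit ASCII character, holding
 * that character's wchar_t value in the current execution set. */
static wchar_t ascii_to_wide[128];

/* Hypothetical hook that setlocale() would call. The offset is a
 * stand-in for the real, locale-defined mapping. */
static void load_ascii_table(wchar_t offset)
{
    for (int i = 0; i < 128; i++)
        ascii_to_wide[i] = (wchar_t)(offset + i);
}

/* What L'c' could compile to: a table reference, not an immediate. */
#define WIDE_ASCII(c) (ascii_to_wide[(unsigned char)(c) & 0x7F])

/* Idea 2 (hypothetical): look a character up by its charmap name. */
struct charmap_entry {
    const char *name;
    wchar_t     value;
};

/* Example entries only; a real table would come from the locale. */
static const struct charmap_entry demo_charmap[] = {
    { "<c>",  0x0063 },
    { "<c,>", 0x00E7 },
};

/* Stores the value in *out and returns 1 if the name is known,
 * otherwise returns 0 and leaves *out untouched. */
static int wc_from_charmap_name(const char *name, wchar_t *out)
{
    assert(name != NULL && out != NULL);
    size_t n = sizeof demo_charmap / sizeof demo_charmap[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(demo_charmap[i].name, name) == 0) {
            *out = demo_charmap[i].value;
            return 1;
        }
    }
    return 0;
}
```

With this, a test such as "wc == WIDE_ASCII('c')" costs one indexed
load and keeps working after the table is reloaded, while
wc_from_charmap_name("<c,>", &wc) pays for a call and a name search
but is not limited to ASCII characters.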

Maybe we should have both ways of handling the identity of widechars.

Keld Simonsen