Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!apple!bionet!agate!ucbvax!bloom-beacon!eru!hagbard!sunic!dkuug!dkuugin!keld
From: keld@login.dkuug.dk (Keld J|rn Simonsen)
Newsgroups: comp.std.c
Subject: Re: wchar_t values
Message-ID: <keld.670868616@dkuugin>
Date: 5 Apr 91 16:23:36 GMT
References: <keld.670360834@dkuugin> <1006@sranha.sra.co.jp> 	<keld.670436534@dkuugin> <15651@smoke.brl.mil> <keld.670719584@dkuugin> <HARKCOM.91Apr5091146@spinach.pa.yokogawa.co.jp>
Sender: news@slyrf.dkuug.dk
Lines: 65

harkcom@spinach.pa.yokogawa.co.jp writes:

>In article <keld.670719584@dkuugin> keld@login.dkuug.dk
>   (Keld J|rn Simonsen) writes:

> =}JIS X 0208 (basic Japanese 16-bit standard)  /035/099

>   JIS X 0208 doesn't cover the ASCII characters. It has a double
>sized (zenkaku) English character set though. 'c' in all three of
>the popular multibyte encodings (EUC, JIS, SJIS) is 0x63 (same as
>ASCII). The most common wide character format (UJIS) has 'c' as
>0x0063 (ASCII in 2 bytes).

I understand what Al is saying: row 2 in the Japanese, Chinese
and Korean basic 16-bit character sets, which all contain what to
me looks like complete ASCII, is in fact not ASCII but double-sized
English characters. When coding, at least in Japan, the programmer
usually combines the 16-bit character set with ASCII in an encoding
which is 8/16 bits (or 7/14 bits).

(Now I do not have great luck in saying what I think other people mean:-(

>   I don't know the encodings for the Chinese & Korean well, but the
>standards don't seem to cover 'c'...

I have my information from the ECMA registry of character sets,
and I really doubt that this information is incorrect or
that I have misread it.

> =}None of these values have the nice property of having ASCII 'c'
> =}extend into these values when loading as a 16-bit or 32-bit int.

>   See above...

My points still hold. You could have trouble handling wide
characters in clean 16-bit de jure standards. Apparently people
out there don't program widechars in these character sets (true 16-bit),
but always combine them with other character sets.

> =}think there is a problem
> =}and they have not yet been able to solve it.

>   A problem with ISO 10646? A problem with the 'East-asian de jure'
>character sets in reference to wchar_t? 

WG14 has received a letter from SC2 pointing out an apparent problem with
10646: the characters in the C repertoire in 10646 canonical form
are different from a sign-extended single-byte character. I have been
actioned by WG14 to respond to SC2.

>   Your apparent knowledge of the JIS standard shows you have little
>room to point...
 
Well, my knowledge can always be improved. Still, the facts I have
presented on 16-bit character sets are true. They may be
irrelevant, as these sets are used in combination with other character
sets in an encoding. And the whole problem with using 10646 (and other
multibyte character sets) in widechar strings may be nonexistent.
I really hope there is no problem; then we do not need to make
changes anywhere. But we should write some explanations of how this
is supposed to function, as quite a few people have had problems
with this. I think the best place to write such interpretations is in
the forthcoming ISO C addendum.

Keld Simonsen