Xref: utzoo comp.std.c:1068 comp.std.internat:488
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!purdue!haven!adm!smoke!gwyn
From: gwyn@smoke.BRL.MIL (Doug Gwyn)
Newsgroups: comp.std.c,comp.std.internat
Subject: Re: Hex escape for quoted multibyte character
Keywords: ANSI, hexadecimal, escape, multibyte, wide character, quote
Message-ID: <10125@smoke.BRL.MIL>
Date: 26 Apr 89 07:57:09 GMT
References: <101058@sun.Eng.Sun.COM>
Reply-To: gwyn@brl.arpa (Doug Gwyn)
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 43

In article <101058@sun.Eng.Sun.COM> kuro%shochu@Sun.COM (Teruhiko Kurosaka - Sun Intercon) writes:
>	char *the_multibyte_char="\x8eabcd";	/* I-1 */

No, other than the null-byte terminator there is just one char in that
string.  Its value is implementation-dependent but is very likely
either 0x8E or 0xCD.

>However, I noticed, the draft sometimes use the word "character" and
>"byte" interexchangably.

It always uses these terms interchangeably; the difference is merely
one of emphasis.  See their definitions in section 1.6.  Note also that
"multibyte character" is defined as a separate concept, and that the
occurrence of the word "character" in the phrase "multibyte character"
is not covered by the definition given for just "character".  This is
an unfortunate property of technical English, and perhaps we should
have invented some other name for "multibyte character", but nobody
could think of an acceptable alternative.

>	char *the_multibyte_char="\x8e\xab\xcd";	/* I-2 */

Correct.  You could also simply place the Kanji or whatever character
directly between the " marks, although that would make your source code
less portable, since different implementations would interpret the
bytes in your multibyte source character in different ways, some of
them perhaps invalid syntactically.  (For example, one of the bytes
might represent the " mark in some other implementation.)

>	wchar_t *the_wide_char_str=L"\xbcde";	/* II-1 */

Correct.

>	whcar_t *the_wide_char_str=L"\xbc\xde";	/* II-2 */
	[ wchar_t]

No, this string contains three distinct values: 0x00BC, 0x00DE, and
0x0000.

>	whcar_t the_wide_char=L'\xbcde';	/* III-1 */
	[ wchar_t]

Correct, assuming you fix the typographic error as indicated.

>My personal choices are I-2, II-I and III-1.

The Standard agrees with you (or vice versa).
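
The element counts given above can be checked mechanically.  What
follows is a minimal sketch and not part of the original article; the
identifiers (mb_str, split_str, wide_one, wide_two, wide_ch,
check_hex_escapes) are invented for illustration.  It assumes that
wchar_t can hold 0xBCDE as a positive value, and it leaves out the I-1
literal because many current compilers diagnose "\x8eabcd" as out of
range for char.  The split_str line shows adjacent-literal
concatenation as one way to end a hex escape early, a point the
article itself does not raise.

```c
#include <assert.h>
#include <stddef.h>   /* size_t, wchar_t */

/* I-2: every \x escape stops at the next backslash, giving three
   chars followed by the terminating null. */
static const char mb_str[] = "\x8e\xab\xcd";

/* Not from the article: escapes are converted before adjacent
   literals are joined, so this is 0x8E, 'a', 'b', 'c', 'd', null. */
static const char split_str[] = "\x8e" "abcd";

/* II-1: a single wide value 0xBCDE plus the terminator
   (assumption: wchar_t is wide enough to represent it). */
static const wchar_t wide_one[] = L"\xbcde";

/* II-2: two wide values, 0x00BC and 0x00DE, plus the terminator. */
static const wchar_t wide_two[] = L"\xbc\xde";

/* III-1: a wide character constant with the same value as II-1. */
static const wchar_t wide_ch = L'\xbcde';

/* Returns 1 once every count and value described above has held. */
int check_hex_escapes(void)
{
    assert(sizeof mb_str == 4);
    assert((unsigned char)mb_str[0] == 0x8E);
    assert((unsigned char)mb_str[1] == 0xAB);
    assert((unsigned char)mb_str[2] == 0xCD);

    assert(sizeof split_str == 6);
    assert(split_str[1] == 'a');

    assert(sizeof wide_one / sizeof wide_one[0] == 2);
    assert(wide_one[0] == 0xBCDE);

    assert(sizeof wide_two / sizeof wide_two[0] == 3);
    assert(wide_two[0] == 0x00BC);
    assert(wide_two[1] == 0x00DE);
    assert(wide_two[2] == 0);

    assert(wide_ch == wide_one[0]);
    return 1;
}
```

Calling check_hex_escapes() from main should return 1 on an
implementation meeting the stated assumption; a failed assert points
at the literal whose interpretation differs.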