Path: utzoo!attcan!uunet!husc6!uwvax!umn-d-ub!nic.MR.NET!shamash!nis!ems!srcsip!colburn
From: colburn@src.honeywell.COM (Mark H. Colburn)
Newsgroups: comp.lang.c
Subject: Re: Programming and international character sets.
Message-ID: <11481@srcsip.UUCP>
Date: 7 Nov 88 19:53:01 GMT
References: <532@krafla.rhi.hi.is> <8804@smoke.BRL.MIL> <207@jhereg.Jhereg.MN.ORG> <427@sdrc.UUCP>
Reply-To: colburn@sip7.UUCP (Mark H. Colburn)
Organization: Honeywell Systems & Research Center, Camden, MN
Lines: 42

In article <427@sdrc.UUCP> scjones@sdrc.UUCP (Larry Jones) writes:
>You seem to have missed a key point in the internationalization
>stuff - you don't use multi-byte characters directly, you convert
>them into wchar_t's using the functions in sections 4.10.7 and
>4.10.8. wchar_t is an integral type (probably short or int) that
>is large enough to hold ANY character value.

This is not always true, although it would make things much easier if
it were. You see, there is no way to take a converted string given
back to you by strxform() and turn it back into its native form. That
means there is no way to make modifications to multi-byte strings.
This would be a serious deficiency (and it is the one which I was
attempting to address in my last article). Strxform is only good for
reading strings, not writing them.

For example, how would you do a regular expression replacement if you
do not know where the next character is? What if you need to parse a
string and need to know what the data in the string is? Strxform
translates characters into an implementation-defined format. That
means that there is no way to portably do anything with the generated
string, other than compare it to another string...

[ description of wchar_t types...]

>You can also pass them to the is*() and to*() functions provided
>you've setlocale() to a locale that supports additional
>characters. If you look at sections 4.3 and 4.4, you will see
>that they are all locale dependent.
You can NOT pass a wchar_t type to the is*() functions, at least not
portably. The is*() and to*() functions are defined as:

	int toupper(int c);

There is no guarantee that the width of a wchar_t is less than or
equal to that of an integer, or that its value can be represented as
an integer. As a matter of fact, in the (draft) C standard and the
POSIX standards and drafts, there are hints that it may be at least 4
characters wide.

One of the bugs which I pointed out was that the draft C standard does
indeed say that the is*() and to*() functions are locale dependent,
but I see no way that they can be truly locale dependent when they are
defined as they are.
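
To make the is*()/to*() point concrete, here is a minimal sketch. It is
not anything taken from the draft, and the names fits_ctype_range(),
guarded_toupper() and check_examples() are hypothetical. It relies on
one thing the standard does say about these functions: the int argument
must be representable as an unsigned char or be equal to EOF. Any
wchar_t value outside that range has to be kept away from toupper(), so
the wrapper simply hands such values back unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical helper: nonzero if wc lies in the only range for which
 * is*() and to*() are defined, i.e. values representable as unsigned
 * char.  The cast to unsigned long turns a negative wchar_t (where the
 * type is signed) into a very large value, so one comparison covers
 * both ends of the range.
 */
int fits_ctype_range(wchar_t wc)
{
    return (unsigned long)wc <= (unsigned long)UCHAR_MAX;
}

/*
 * Hypothetical wrapper: upper-case wc through toupper() when that call
 * is defined; otherwise return wc untouched rather than pass toupper()
 * a value it is not required to handle.
 */
wchar_t guarded_toupper(wchar_t wc)
{
    if (fits_ctype_range(wc))
        return (wchar_t)toupper((int)wc);
    return wc;
}

/*
 * Checks written for the default "C" locale, which is in effect until
 * the program calls setlocale().
 */
void check_examples(void)
{
    /* A character from the basic set is case-mapped as usual. */
    assert(guarded_toupper(L'a') == L'A');

    /* A value above UCHAR_MAX never reaches toupper(). */
    assert(guarded_toupper((wchar_t)0x3b1) == (wchar_t)0x3b1);
}
```

Note that the second check is exactly the complaint: the wide value
comes back unconverted because toupper(int) offers no defined way to
handle it, whatever locale has been selected.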