Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!cs.utexas.edu!uunet!drivax!marking
From: marking@drivax.UUCP (M.Marking)
Newsgroups: comp.std.c
Subject: Re: Multibyte characters
Message-ID:
Date: 5 Jul 90 03:26:37 GMT
References: <1467@inset.UUCP>
Sender: marking@drivax.UUCP
Reply-To: marking@drivax.UUCP
Organization: Digital Research (Japan) Inc.
Lines: 60

mikeb@inset.UUCP (Mike Banahan) writes:

) Let's say that I do have a multibyte execution character set which supports
) for the sake of argument, English and Greek, with Greek using a shift-in
) shift-out mechanism.

) A string of the form "abc@d" is valid C (using @ to represent the Greek
) character `alpha').

) It will contain 8 bytes, counting the shift-in, shift-out and the null
) at the end.

) Presumably the integral constant '@' is a three-byte constant, no matter
) what it may look like?

I don't know about Greek, but I have seen situations where the mbchar
itself is three bytes, so with the shift-in and shift-out you have five
bytes. Not all schemes use shift-in/shift-out: some don't know about
shifts at all, and some have an implicit shift after each character, so
the string is *always* in the initial state. For others, the shift is
implied by the initial character of the multibyte sequence being in
certain ranges. Furthermore, some schemes use characters of mixed
lengths, so that a string might consist of a mixture of 1-, 2-, and
3-byte characters.

(My apologies if you want to know about Greek specifically, but my
presumption is that we want to write code that will work in a variety
of locales.)

) An alternative interpretation is that it violates
) the constraint in 2.2.1.2 `a .. character constant .. shall begin
) and end in the initial shift state', but presumably I can expect my
) implementation to do the necessary good deeds and put a shift-out
) in there too.

Good question. In Japanese, there are no separate shift characters, so I
don't know what compilers do when there are. Anyone?

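To make the point about mixed-length characters concrete, here is a
minimal sketch of walking a multibyte string with the standard mblen()
function rather than assuming one byte per character. The function name
mb_count_chars is made up for illustration; the result depends entirely
on the current locale, and I have not tried this against an encoding
that uses explicit shift characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count characters (not bytes) in a multibyte
 * string, as interpreted by the current locale.
 * Returns -1 if an invalid multibyte sequence is found. */
long mb_count_chars(const char *s)
{
    long count = 0;
    size_t remaining;

    assert(s != NULL);
    remaining = strlen(s);

    /* A null pointer argument resets mblen() to the initial shift state. */
    (void)mblen(NULL, 0);

    while (*s != '\0') {
        /* n is the byte length of the next character, which may be
         * 1, 2, 3 or more bytes depending on the encoding. */
        int n = mblen(s, remaining);

        if (n < 0)
            return -1;      /* invalid sequence in this locale */
        if (n == 0)
            break;          /* reached the terminating null character */

        s += n;
        remaining -= (size_t)n;
        count++;
    }
    return count;
}
```

In the default "C" locale every character is a single byte, so this
simply agrees with strlen(); the difference only shows up after a call
to setlocale() selects a multibyte encoding.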
) Since it is a three-byte constant (assuming I'm right), then can I be
) sure that I do not get overflow when I assign it to a char variable?

A char is not a multibyte char, so truncation or overflow or whatever
is the likely result. The type char is still a single scalar value, so
an array of them is needed for multibyte data.

) 3.1.3.4 says that the value of a multi-character character constant
) will be implementation-defined, and 3.2.1.2 says that (paraphrase)
) demoting an int to a char gives an implementation-defined result.
) So to call it `overflow' is perhaps overstating the case, but I clearly
) end up in implementation-defined territory twice over.

You can test MB_LEN_MAX (for the compiler's worst case) or MB_CUR_MAX
(for the current locale's worst case) to check how many bytes you might
need to hold the value.

My question: do MB_LEN_MAX and MB_CUR_MAX include shift characters in
locales that use them? If not, my recollection is that the old ANSI spec
on extended characters allows multibyte shift sequences, so how do we
know the maximum length of a shift sequence (in or out)?

My experiences with shift characters antedate the introduction of
multibyte and wide characters into C. Any information on current use of
shift characters here would be appreciated.
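
As a rough sketch of the buffer-sizing advice above: an array of
MB_LEN_MAX chars is enough to hold one multibyte character in any
locale the implementation supports, and wctomb() reports how many bytes
it actually used. The function names store_mbchar and
encoding_has_shift_states are made up for illustration. On the shift
question, my reading of the wctomb() description is that the count it
returns includes any change in shift state and is never greater than
MB_CUR_MAX, but I would like to see that confirmed on a real
shift-based implementation.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* MB_CUR_MAX, wchar_t, wctomb, mblen */

/* Hypothetical helper: store one wide character as a multibyte
 * sequence. buf must have room for at least MB_LEN_MAX bytes.
 * Returns the number of bytes written, or -1 if wc has no
 * multibyte representation in the current locale. */
int store_mbchar(char *buf, wchar_t wc)
{
    assert(buf != NULL);

    /* A null pointer argument resets wctomb() to the initial shift state. */
    (void)wctomb(NULL, 0);

    return wctomb(buf, wc);
}

/* Hypothetical helper: nonzero if the current locale's multibyte
 * encoding is state-dependent, i.e. uses shift states. */
int encoding_has_shift_states(void)
{
    return mblen(NULL, 0) != 0;
}
```

A caller would declare `char buf[MB_LEN_MAX];` and pass it in. A single
char object is only guaranteed to be large enough when the value
returned is 1.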