Xref: utzoo comp.emacs:4547 comp.lang.c:13731 comp.sys.ibm.pc:20772
Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!nrl-cmf!ukma!cwjcc!hal!nic.MR.NET!shamash!nis!sialis!jhereg!mark
From: mark@jhereg.Jhereg.MN.ORG (Mark H. Colburn)
Newsgroups: comp.emacs,comp.lang.c,comp.sys.ibm.pc
Subject: Re: Programming and international character sets.
Message-ID: <207@jhereg.Jhereg.MN.ORG>
Date: 1 Nov 88 16:13:39 GMT
References: <532@krafla.rhi.hi.is> <8804@smoke.BRL.MIL>
Reply-To: mark@jhereg.MN.ORG (Mark H. Colburn)
Organization: NAPS International, St. Paul, MN
Lines: 121

In article <8804@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) ) writes:
>In article <532@krafla.rhi.hi.is> kjartan@rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>How difficult is it convert american/english programs so that they can
>>be used to handle foreign text? [etc.]
>
>Where have you been the last few years?  This subject area is known as
>"internationalization" and has been the featured topic of special issues
>of several journals, including UNIX Review and UNIX/World.  The draft
>proposed ANSI/ISO C standard specifically addresses this issue (it is
>one of the reasons production of the final standard was delayed).

Unfortunately, the C standard is still lacking in this area.  It is
true that the attempt was made; however, X3J11 will have to go through
another round if the standard is to be truly internationalized.

One problem is that, although the standard supports multi-byte
characters, which are required for a number of languages around the
world, especially those in Asia, no support is provided for passing
those characters to any of the is...() or to...() functions.  Since all
of the is...() and to...() functions take an integer parameter, it
would be impossible to evaluate a multi-byte character with them.

Another problem is that an application has no way of portably
determining where the current character in a string ends and the next
begins; you can't just use ch++ to advance to the next character
anymore.
Moving backwards through a string is even harder.

There are some other problems with collation as well: some languages
may have several lowercase characters corresponding to a single
uppercase character, or vice versa.  This presents some problems when
using toupper() and tolower() to convert a character to its opposite
case.  In addition, in some languages and/or collation sequences there
are characters which do not have a corresponding opposite case (i.e.,
there is only an uppercase character with no corresponding lowercase
character in a code set).

To be fair, we did not uncover these deficiencies until just recently
(just after we sent our ballot in for the third public review), so
these may not have been issues specifically addressed by the committee.

There are some solutions to these problems which would allow for
internationalization without breaking any existing programs.  Here are
some suggestions:

1.  Develop some functions which provide the same functionality as the
    is...() functions but which take a character pointer as an
    argument.  For example:

        int wcislower(char *string)

2.  Develop some functions which provide the same functionality as the
    to...() functions but which return a character pointer.
    Unfortunately, these functions may need to allocate space in order
    for the transformation to work, or they may need to pass back a
    pointer to a static string which would then need to be copied.
    The latter is probably the way most implementations would do it,
    since it is essentially a table lookup.  For example:

        char *wctolower(char *string)

3.  Provide some functions to allow traversing a character string.
    These functions would return a pointer to the next character in
    the string as determined by the current locale.  For example:

        char *nextchar(char *string)
        char *prevchar(char *string, char *backup)

    These last two functions were presented at the latest IEEE POSIX
    meeting by one of the committee members to cope with this problem.
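
The two traversal functions named above can be sketched as follows.
This is only an illustration under stated assumptions: instead of
consulting the current locale it hard-codes a made-up two-byte encoding
(toy_charlen() is hypothetical and matches no real code set), and the
function bodies are a guess at the intended behavior, not the code that
was presented at the POSIX meeting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy encoding (an assumption, not a real code set):
 * a byte with the high bit set is the first byte of a two-byte
 * character; every other byte is a complete one-byte character. */
static size_t toy_charlen(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    if (p[0] == '\0')
        return 0;                   /* end of string */
    if ((p[0] & 0x80) && p[1] != '\0')
        return 2;                   /* lead byte plus trail byte */
    return 1;                       /* one byte (or a truncated lead byte) */
}

/* Return a pointer to the character after the one at string.
 * At the terminating NUL the pointer is returned unchanged. */
char *nextchar(char *string)
{
    return string + toy_charlen(string);
}

/* Return a pointer to the character before string.  backup must point
 * at a known character boundary at or before string; the function
 * scans forward from there, remembering the last boundary it passed.
 * If string == backup, backup itself is returned. */
char *prevchar(char *string, char *backup)
{
    char *prev = backup;
    char *cur = backup;

    assert(backup <= string);
    while (cur < string && *cur != '\0') {
        prev = cur;
        cur = nextchar(cur);
    }
    return prev;
}
```

Given the bytes 'a', 0xC1, 'b', 'c', NUL in s, the characters start at
offsets 0, 1 and 3, so prevchar(s + 3, s) returns s + 1 rather than
s + 2.  A trail byte can hold any value in this toy encoding, which is
why the scan has to start from a known boundary instead of inspecting
the byte just before string.
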
    The backup argument to prevchar() provides a pointer to a known
    character boundary that the function can use to scan forward in
    the string in order to determine where the actual character
    boundary of the previous character is.

4.  Some of the string functions would need to be revised as well,
    specifically strlen().

        int wcstrlen(char *string)

    This function would return the length of the string in characters
    according to the current locale setting.  Therefore the string
    "abss" would give a length of 4 in the C locale, but may return 3
    in a German locale.  This functionality could be folded into the
    current strlen(); however, programs still need to get the number
    of bytes in a string as well as the number of characters, so the
    old strlen() should not be replaced.

Internationalization is a tricky and involved problem.  Unfortunately,
it is not possible for an existing program to simply be recompiled
under an ANSI compiler and become internationalized.  A number of
changes to the application are required in order to provide for
maximally portable code.  However, it is possible to provide the
internationalization without breaking any existing code.

What has been discussed so far is character-level
internationalization, which is only one side of the fence.  The other
side is language translation of strings.  This is known as "messaging"
in the circles which talk about internationalization (let's overload
yet another computer science term...).

Messaging can be accomplished by developing messaging libraries which
contain the strings required by the application, translated into every
language which your application needs to support.  When you wish to
display a string, such as "press spacebar to continue", you call the
messaging library with a unique identifier which is associated with
your string, and the messaging library returns a string, based on the
current locale, which expresses the same idea as "press spacebar to
continue".
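
The lookup described above can be sketched in a few lines of C.  All of
the names here (msg_id, lang_id, msg_set_language(), msg_get()) are
hypothetical, the "current locale" is reduced to a single variable, and
the German text is only a placeholder translation; a real messaging
library would load its strings from external catalogs rather than a
compiled-in table.

```c
#include <assert.h>
#include <stddef.h>

/* Unique identifiers the application uses in place of literal strings. */
enum msg_id  { MSG_PRESS_SPACE, MSG_FILE_NOT_FOUND, MSG_COUNT };
enum lang_id { LANG_ENGLISH, LANG_GERMAN, LANG_COUNT };

/* One row per language, one column per message.  A NULL entry means
 * "not translated yet". */
static const char *const catalog[LANG_COUNT][MSG_COUNT] = {
    /* LANG_ENGLISH */ { "press spacebar to continue", "file not found" },
    /* LANG_GERMAN  */ { "Weiter mit der Leertaste", NULL },
};

/* Stand-in for the current locale setting. */
static enum lang_id current_lang = LANG_ENGLISH;

void msg_set_language(enum lang_id lang)
{
    assert((unsigned)lang < (unsigned)LANG_COUNT);
    current_lang = lang;
}

/* Return the text for id in the current language, falling back to
 * English when no translation is available. */
const char *msg_get(enum msg_id id)
{
    const char *text;

    assert((unsigned)id < (unsigned)MSG_COUNT);
    text = catalog[current_lang][id];
    return text != NULL ? text : catalog[LANG_ENGLISH][id];
}
```

The application then writes msg_get(MSG_PRESS_SPACE) wherever it used
to write the literal string, and the English fallback keeps a partly
translated catalog usable.
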
Messaging also requires some fancy footwork on the part of
applications, since displaying these messages is bound to be very
difficult: some languages read left-to-right, some read right-to-left,
and some, such as Mongolian, do both and even go diagonally.  Add
string attributes such as centering and justification, and character
attributes such as inverse, normal and blinking, and messaging becomes
very interesting indeed.

Internationalization is a relatively new field, and a number of things
still need to be ironed out, but I think that we are making progress,
and that progress should continue.
-- 
Mark H. Colburn          "They didn't understand a different kind of
NAPS International        smack was needed, than the back of a hand,
mark@jhereg.mn.org        something else was always needed."