Xref: utzoo comp.emacs:4527 comp.lang.c:13691 comp.sys.ibm.pc:20726
Path: utzoo!yunexus!geac!syntron!jtsv16!uunet!mcvax!hafro!krafla!kjartan
From: kjartan@rhi.hi.is (Kjartan R. Gudmundsson)
Newsgroups: comp.emacs,comp.lang.c,comp.sys.ibm.pc
Subject: Programming and international character sets.
Keywords: 8 bit characters
Message-ID: <532@krafla.rhi.hi.is>
Date: 28 Oct 88 00:27:38 GMT
Article-I.D.: krafla.532
Organization: University of Iceland
Lines: 81

How difficult is it to convert American/English programs so that they
can handle foreign text? The answer, of course, depends on the language
one has in mind. Most European nations use the Latin alphabet, and the
English are among them. Unfortunately, English uses very few characters
compared to other European languages, so the code set widely used by
Americans and the English, the ASCII character set, defines only 128
characters. It is a 7-bit character set.

In European countries other than England, ASCII is also widely used, but
with extensions. The extended character sets are 8-bit, allowing 256
characters. The problem, however, is that the extension is not standard.
One possibility is the IBM PC character set, another is HP's Roman-8,
DEC gives us the DEC Multinational character set, and the Macintosh has
yet another. So a program that, for example, converts lowercase letters
to uppercase has to be coded differently for each character set.

Let's look at some code from MicroEMACS:

input.org:   if (c>=0x00 && c<=0x1F)
input.org:   if (c>=0x00 && c<=0x1F)      /* C0 control -> C- */
main.org:    case 'a':  /* process error file */
main.org:    if ((c>=0x20 && c<=0xFF)) {  /* Self inserting. */
random.org:  if (*scan >= 'a' && *scan <= 'z')
random.org:  else if (c<0x20 || c==0x7F)
random.org:  else if (c<0x20 || c==0x7F)
region.org:  lputc(linep, loffs, c+'a'-'A');
region.org:  lputc(linep, loffs, c-'a'+'A');
region.org:  if (c>='a' && c<='z')
search.org:  else if (c < 0x20 || c == 0x7f)  /* control character */
word.org:    c += 'a'-'A';
word.org:    c += 'a'-'A';
word.org:    c -= 'a'-'A';
word.org:    c -= 'a'-'A';
word.org:    if (c>='a' && c<='z') {
word.org:    if (c>='a' && c<='z') {
word.org:    wordflag = ((ch >= 'a' && ch <= 'z') ||
word.org:    if (c>='a' && c<='z')

Ugly, isn't it? Another way of doing this is to use the "is..."
functions defined in ctype.h, an include file that comes with (almost)
all C compilers. Some of the above lines would then look like this:

basic.c:     else if (iscntrl(c))
display.c:   if (iscntrl(c))
display.c:   } else if (iscntrl(c)) {
eval.c:      *sp = tolower(*sp);
eval.c:      *sp = toupper(*sp);
eval.c:      if (islower(*sp) )
fileio.c:    if (iscntrl( fn[tel++] ) )
input.c:     if (iscntrl(buf[--cpos]) ) {
input.c:     if (iscntrl(buf[--cpos])) {
input.c:     c = toupper(c);
input.c:     c = toupper(c);              /*Force to upper */
input.c:     if ( islower(c) && ( SPEC != (SPEC & c) ))
input.c:     if (iscntrl(c) )             /* control key */
input.c:     if (iscntrl(c) )             /* control key */
input.c:     if (iscntrl(c) )             /* control key? */

This code is better: it has more style to it and is easier to port to a
different character set. (Most of the "is..." things are macros that use
the argument to index a table, mask the entry, and return a value that
is either zero or nonzero.)

Another bad habit of American programmers is this:

    character_value = (character_value & 0x7F);

Don't do this!! It throws away the eighth bit and with it every
character outside ASCII. If you must mask, use 0xFF instead:

    character_value = (character_value & 0xFF);

(Unless, of course, your machine breaks to pieces if it gets an 8-bit
character in its I/O channels.)

###############################################################################
#                                                                             #
#  Kjartan R. Gudmundsson                                                     #
#  Raudalaek 12                                                               #
#  105 Reykjavik                #  Internet: kjartan@rhi.hi.is                #
#                               #  uucp: ...mcvax!hafro!rhi!kjartan           #
#                                                                             #
###############################################################################
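
Appended note: the ctype.h approach recommended in the post can be
sketched as follows. This is a minimal illustration, not MicroEMACS
code; the helper names upcase_buffer and count_controls are hypothetical.
One detail the hard-coded ranges hide: on machines where char is signed,
an 8-bit character is negative, so it should be cast to unsigned char
before being handed to an "is..." or "to..." routine.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: upper-case a NUL-terminated buffer in place,
   leaving the choice of which characters are letters to the library
   instead of hard-coding the range 'a'..'z'. */
void upcase_buffer(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Cast first: passing a negative char value (an 8-bit character
           on a signed-char machine) to toupper() is undefined. */
        *s = (char)toupper((unsigned char)*s);
    }
}

/* Hypothetical helper: count control characters without writing
   c < 0x20 || c == 0x7F by hand. */
size_t count_controls(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (iscntrl((unsigned char)*s))
            n++;
    }
    return n;
}
```

Which 8-bit codes count as letters depends on the tables the C library
uses. In the default "C" locale only the ASCII letters are converted, so
a buffer such as "a\xe6z" comes back with its first and last characters
upper-cased and the middle byte left for a locale-aware library to
handle.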