Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!mcvax!kth!sunic!enea!sommar
From: sommar@enea.se (Erland Sommarskog)
Newsgroups: comp.lang.eiffel
Subject: Eiffel and national character sets
Message-ID: <101@enea.se>
Date: 8 Jul 89 20:40:27 GMT
Organization: Enea Data AB, Sweden
Lines: 188

In one of his many interesting articles Bertrand Meyer said that if
one wants changes in Eiffel, the next six months is the time to ask
for them. Since I think Eiffel has to be improved with regard to
supporting human languages other than English, I will try to
summarize the potential problem areas I'm aware of. I'm not an expert
in either Eiffel or character-set standards, so I will not try to
provide complete solutions, but merely point out the requirements.

Before I go any further, let me add that these problems are in no way
unique to Eiffel. For some time now, ISO has required support for
multinational character sets when approving or revising a language
standard. The problems connected with this vary from language to
language, as do the solutions.

I first give an overview of existing character sets. I then discuss
the various problem areas: operator and delimiter characters,
literals, identifiers and finally operations such as comparisons.

Standardized character sets
---------------------------
There are several character-set standards, which I briefly describe
since not everyone may be well acquainted with them.

ASCII     - The standard on which just about everything else is based.
EBCDIC    - Appears in some worlds, but none I have experience of.
ISO 646   - A seven-bit standard, in which some of the ASCII
            characters are replaced by national characters. Which
            replacements you get depends on the country you're in.
            For instance, in Sweden the left brace is replaced with
            a dotted "a"; in France it's an "e" with accent aigu.
ISO 2022  - A standard that describes how to switch between different
            character sets.
ISO 6937  - An eight-bit set, which doesn't seem to have been adopted
            very widely.
            Slots 0-127 are ASCII. From 161 and up are mute
            modifiers and national letters. With 6937 a dotted "a"
            is produced by first giving a diaeresis and then the
            "a". Virtually all languages with a Latin alphabet,
            except Vietnamese, could be written with this set.
ISO 8859  - Nine standards which all have ASCII in 0-127 and control
            characters in 128-159. Beyond that, the contents vary
            depending on the geographical area addressed. Five of
            the sets have Latin characters, and there is one each
            for Cyrillic, Hebrew, Arabic and Greek. I don't know
            whether all nine have finally been settled. Some may
            still be drafts. One could expect that by far the most
            commonly supported will be 8859/1, also known as
            Latin-1. Latin-1 covers most of the languages in Western
            Europe. (Excluded are Welsh and Catalan.) 8859 has no
            mute characters.
ISO 10646 - A multi-octet character set, which is under development.
            I know very little about it, but I doubt there is even a
            draft of it. There was a posting about it in
            comp.std.internat some time ago.

In the following I will concentrate on ISO 646 and 8859. Although I
personally find the ideas in 6937 appealing, its use in real life is
small, so I'm disregarding the problems that supporting this standard
would cause.

Operator and delimiter characters
---------------------------------
With an eight-bit set based on ASCII there is rarely any problem.
However, in a seven-bit world there is. Any programming language using
any of the characters @[\]^`{|}~ as an operator or a delimiter is
committing a crime in my eyes. Most of the national sets that ISO 646
defines replace these characters with national characters, and in
many cases these characters are letters. So in my eyes a notation
like

    class BIN_TREE [T]

is just as bad as:

    class BIN_TREE ZTQ

(Read Z and Q as opening and closing delimiters!)

Many languages that use these characters alleviate the problem by
providing alternative tokens. For instance, Ada allows you to use "!"
for "|". Many Pascal compilers allow "(."
and ".)" for [], and (* and *) are more common than {}.

Eiffel is a sinner in this field. With Dr. Meyer's origin in mind, I
assume he is not unaware of the problems, but has chosen to ignore
them. Still, I hope he will rethink and change his mind on this
issue. Letters as special characters are simply not a good idea. One
could argue that since we're moving into an eight-bit world, this is
a disappearing problem, but remember that the transition is slow. We
will live with seven-bit terminals and printers for quite a long time
still.

Now, what actual problems do we have in Eiffel? The occurrence of
brackets and braces in Eiffel is restricted to the class declaration
and the export clause, which gives less pain than if they could occur
anywhere. Anyway, finding replacement characters should be easy. (To
be honest, I don't really see why they were chosen in the first
place. Is there some lexical problem that prevents simple
parentheses?)

Worse is the backslash. Could you think of having to double all "W"s
in your string literals? Probably not, so you wouldn't pick "W" as
the escape character. Eiffel has adopted the bad habit from C of
using dotted "O" (which is how the backslash appears on my screen).
Here I not only want an alternate character, I also want to get rid
of the original. (On the whole, I am not fond of the C style of
writing character and string literals. Why use octal codes?)

Literals
--------
Which characters can I use in string and character literals? If we
forget the fatal backslash, Eiffel doesn't give me any problem if I'm
using any of the 8859 standards. I can just go ahead and use them.
(At least that is what its description implies. For what happens in
real life, see an adjacent article of mine.)

Other people will run into problems, though, mainly Japanese and
Chinese programmers, since there is no support for multi-byte sets.

As a side note, a language which really is evil here is Ada.
Ada explicitly forbids non-printing characters in literals, and
"non-printing" is defined from ASCII, so using the upper half of
Latin-1 in Ada is a real pain. Ada-9X will resolve this, but that's
another three or four years from now. Sigh.

Identifiers
-----------
Eiffel, like most other languages, allows the letters of the English
alphabet in identifiers. However, if you're writing code in your
native language, you may need to use other letters as well. To be
able to use the replacement characters in ISO 646 would be nice, but
it would be pointless to require that. But with the eight-bit sets in
8859, it is a fair requirement that all letters in these sets also
are permitted in identifier names.

The problem lies in the differences between the sets. In Latin-1,
161-191 are punctuation characters which you normally wouldn't think
of using in identifier names. 192-255 are letters, with 32 as the
difference between lower and upper case. (There are a few exceptions
which I disregard here for brevity.) In the other Latin sets, some
characters in the range 161-191 are also letters, with 16 as the case
difference. What the non-Latin sets look like, I don't know.

One could make this a very simple issue and just take Latin-1, on the
grounds that this is what will be used in the known world of
computing. However, I think this would be a fatal mistake. Should our
friends in Hungary, Russia and Greece be handicapped in the selection
of identifier characters? Do we know that "the known world of
computing" will forever restrict itself to places where Western
European languages are spoken?

Now then, how to support multiple sets? One idea would be to have a
directive that says which character set the source code was written
in. We must of course immediately discard the idea, since this is
impossible in a modular language like Eiffel. (What if we want to
inherit that Latin-2 class in our Latin-1 class?)
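
The Latin-1 layout described above (capitals in 192-222, small
letters 32 positions higher) lends itself to a very small folding
routine. The following is only a sketch, written in Python rather
than Eiffel; the names fold_latin1 and same_identifier are invented
for illustration. It also spells out the exceptions: 215 is the
multiplication sign and must be left alone, while 223 and 255 are
small letters with no capital inside the set. The other Latin sets,
with their difference of 16, would need tables of their own, which
this sketch does not attempt.

```python
# Sketch only: fold_latin1 and same_identifier are hypothetical names,
# not part of Eiffel or of any existing library.

CASE_OFFSET = 32            # upper/lower distance in ISO 8859-1
MULTIPLICATION_SIGN = 215   # sits among the capitals but is no letter


def fold_latin1(code: int) -> int:
    """Return the lower-case code for an ISO 8859-1 character code."""
    if 65 <= code <= 90:                       # ASCII A-Z
        return code + CASE_OFFSET
    if 192 <= code <= 222 and code != MULTIPLICATION_SIGN:
        return code + CASE_OFFSET              # accented capitals
    return code                                # everything else unchanged


def same_identifier(a: bytes, b: bytes) -> bool:
    """Compare two Latin-1 encoded identifiers, ignoring case."""
    return [fold_latin1(c) for c in a] == [fold_latin1(c) for c in b]
```

With this, same_identifier applied to the Latin-1 encodings of
"LÄNGD" and "längd" treats them as the same name, which is what a
case-insensitive language would need.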

As far as I can see, the only way to go is to allow all characters
>= 161 and then use the 32 and 16 differences for case folding. (A
case-sensitive language like C or Modula-2 has some advantage here.)

Operations
----------
When comparing two strings, the collating sequence often has little
to do with alphabetical order. The only languages I know of for which
it works with ISO 646 are English, Danish, Norwegian and maybe Dutch.
On the whole, one should remember that the character type in this
sense is not a simple enumeration. In many languages accents and
umlauts are taken into account only when no other character differs.
And some languages have pairs of letters that sort as one. (E.g. "ch"
in Spanish, "rz" in Polish.)

What you need is a set of extended comparison routines, a set of
predefined languages and a set of routines for loading your very own
sorting order. Eiffel is extremely well prepared in this area,
particularly with the addition of infix operators in 2.2. So all that
is needed is some additions to the class library. Of course I could
write them myself, but I think they should be in the standard
library, since this is the way strings should be compared. Using the
collating sequence is a very artificial way to do it.

Or are library additions really all we need? If we define a class
TRANSCRIPTED_STRING which codes a string to some internal format for
comparisons, we would like to write:

    t_str : TRANSCRIPTED_STRING;
    ...
    t_str := "Some string";

But even if our new class is an heir of STRING, the assignment is not
permitted. And defining a TRANSCRIPTED_CHARACTER for single elements
as an heir to CHARACTER is out of the question, since the latter
class is an expanded one and may not be inherited from.

One solution to these problems would of course be to include the
required operations within STRING and CHARACTER.
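
To make the requirements above a little more concrete, here is a
rough sketch of a comparison key with a loadable sorting order. It is
written in Python rather than Eiffel, and every name in it
(make_sort_key, base_letter, the alphabet and digraphs parameters) is
invented for illustration. It assumes a two-level scheme: letters
decide first, with letter pairs such as "ch" counted as one unit, and
accents are consulted only to break ties.

```python
import unicodedata

# Hypothetical illustration; none of these names come from Eiffel's
# class library or from any operating system interface.


def base_letter(ch: str) -> str:
    """Strip accents and umlauts from a single character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def make_sort_key(alphabet, digraphs=()):
    """Build a key function for a language-specific sorting order.

    alphabet: the collating units in order, e.g. [..., "c", "ch", "d"]
    digraphs: letter pairs that sort as one unit
    """
    rank = {unit: i for i, unit in enumerate(alphabet)}

    def key(word: str):
        text = word.lower()
        primary, secondary = [], []
        i = 0
        while i < len(text):
            pair = text[i:i + 2]
            unit = pair if pair in digraphs else text[i]
            i += len(unit)
            plain = "".join(base_letter(c) for c in unit)
            # A unit listed in the alphabet keeps its own rank; otherwise
            # fall back to the unaccented letter. Unknown units sort last.
            primary.append(rank.get(unit, rank.get(plain, len(rank))))
            secondary.append(unit)     # accents only break ties
        return (primary, secondary)

    return key
```

A Spanish-style order could be loaded by inserting "ch" after "c" in
the alphabet and passing {"ch"} as digraphs; a Swedish-style order by
appending "å", "ä" and "ö" after "z" so that they count as letters in
their own right instead of as accented variants.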
There is probably some performance penalty for Americans who don't
want more than simple ASCII comparison, but it's certainly a solution
that looks very appealing.

It should be added here that there are various operating systems, not
least in the Unix sphere, that support handling of more than one
human language, including run-time support for comparisons. But they
are often intended for C, and Eiffel gives room for much cleaner
interfaces.

In this article I have discussed very little of multi-byte
characters, since I have no experience of using them. However, they
should not be forgotten when addressing these problems. They write
programs in Japan too.
--
Erland Sommarskog - ENEA Data, Stockholm - sommar@enea.se
Bowlers on strike!