Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!gem.mps.ohio-state.edu!usc!ucsd!ames!uhccux!munnari.oz.au!cs.mu.oz.au!ok
From: ok@cs.mu.oz.au (Richard O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Non-ASCII characters, suggestion and question
Message-ID: <2422@munnari.oz.au>
Date: 13 Oct 89 15:10:33 GMT
Sender: news@cs.mu.oz.au
Lines: 88

Consider the problems of someone trying to write Prolog code which
handles words in a language other than English. There are at least
four contexts where the code may be used:
    - a national variant of ISO 646
    - ISO 8859/1 (or DEC MNCS, which is very close)
    - MS-DOS character set
    - Macintosh character set
{I omit all discussion of ISO 8859/N for N > 1 and alternative character
sets on the Macintosh or OS/2; not because I don't know about these
things but because this is hard enough already.}

Quintus Prolog already lets you write the magic number you need, using
C-style escapes. For example, the Old English word which became
"whether" is spelled h,w,ae,eth,e,r, which we could write as
'hw\xE6\xF0er'. The problems with this are
(a) it is very hard to tell which letters are intended by looking at hex
(b) if the same program is moved to another system which uses a
    different coding, the numbers stay put, which means that you get
    different characters
(c) the other system may not have any coding for these characters at
    all, but you aren't warned.

Why not write the characters you want directly?
(a) It may not be possible. Some editors on the PC do not give you
    direct access to the upper 128 characters.
(b) It is even less portable than writing escape codes.

If Prolog is to be used for writing programs that can be *portable*
between these environments, it is important that we should have some way
of indicating which characters we mean, so that they may be mapped
correctly and a warning may be given when they cannot be mapped.

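Problems (b) and (c) with hex escapes can be seen in a few lines. The sketch below is in Python rather than Prolog, only because its standard codecs happen to include ISO 8859/1 ("latin-1"), the MS-DOS code page 437 ("cp437") and the Macintosh Roman set ("mac_roman"). It assumes cp437 stands in for "the MS-DOS character set"; the names CHARSETS, decode_everywhere and can_encode are made up for the illustration.

```python
# The bytes that the atom 'hw\xE6\xF0er' would contain.
RAW = b"hw\xE6\xF0er"

# Assumed stand-ins for three of the environments named in the post.
CHARSETS = ("latin-1", "cp437", "mac_roman")


def decode_everywhere(data):
    """Interpret the same bytes under each character set."""
    return {name: data.decode(name) for name in CHARSETS}


def can_encode(char, charset):
    """True if the character has a code in the given character set."""
    try:
        char.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for name, text in decode_everywhere(RAW).items():
        # Same numbers, different characters on each system: problem (b).
        print(name, ascii(text))
    for name in CHARSETS:
        # Eth has no code at all in some of these sets: problem (c).
        print(name, "has eth:", can_encode("\u00f0", name))
```

Only the latin-1 reading gives ash followed by eth; the other two give unrelated characters for the same two bytes, and nothing warns about it.
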
The best scheme I have been able to come up with uses escape sequences
like

    \:<letter><letter>  for ligatures (ae, oe) and some others:
                        lower and upper case thorn -> th, TH,
                        ess-zet -> ss; and the copyright symbol
                        is \:co
    \:<letter><mark>    <mark> is  ` ^ ' " ~ , / . -
                        for grave, circumflex, acute, umlaut, tilde,
                        cedilla, slash, ring, macron (I should be so
                        lucky)
    \:<punct><punct>    E.g. \:!! and \:?? for inverted ! and ?,
                        \:<< and \:>> for Continental quotes.

"whether" would look like 'hw\:ae\:d/er' in this coding (I'm tempted to
code eth/Eth as dh/DH similar to thorn). It is hard to read, but it is
still better than 'hw\xE6\xF0er', and it means that if we read the code
in a system which hasn't got ash or eth the tokeniser can print an error
message and substitute something `close' (eth and thorn -> t, ash -> e,
\:<letter><mark> drops the diacritical mark, \:<punct><punct> turns into
<punct>). Having '\:<>' converted to '' is a lot better than having it
converted to garbage, particularly if you get an error message when it
happens.

I want to stress that I don't regard this as anything other than a
practical compromise; it would be better if the MS-DOS and Mac character
sets would dry up and blow away so that everyone was using ISO 8859/*
from now on, but that just isn't going to happen, and I think we need a
better way of coping than we have now.

So what's the question? The question is whether diacritical marks
should precede or follow the letter they modify. I prefer \:e' because
I read it as "e-acute" and so expect the diacritical mark second. But I
believe there is a French convention that involves writing the
diacritical mark first. There's also a question about whether the
characters I picked for the diacritical marks are ok.

I was hoping that the BSI committee might be relied on to do something
about this problem (it is, after all, a syntax problem), but (a) they
haven't and (b) one of the latest documents I have claims that escape
sequences aren't needed inside atoms anyway, so I think we have to do it
ourselves, and do it soon.

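To make the proposal concrete, here is a rough sketch of what a tokeniser might do with \: escapes. It is written in Python, not Prolog, and is not anybody's implementation. It uses Unicode as the intermediate form and Python's codecs to ask whether the target character set has the character, which the post does not specify. The tables are deliberately partial: they cover only escapes whose fallback the post states, plus o/ and the upper-case thorn fallback, which are assumed by analogy. MARKS, SPECIAL, lookup and decode_escapes are hypothetical names.

```python
import unicodedata

# Mark characters from the post, mapped to Unicode combining marks.
# Slash is handled in SPECIAL: it has no combining form that composes.
MARKS = {
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    "'": "\u0301",  # acute
    '"': "\u0308",  # umlaut
    "~": "\u0303",  # tilde
    ",": "\u0327",  # cedilla
    ".": "\u030a",  # ring
    "-": "\u0304",  # macron
}

# Two-character code -> (intended character, `close' substitute).
SPECIAL = {
    "ae": ("\u00e6", "e"),  # ash -> e
    "th": ("\u00fe", "t"),  # thorn -> t
    "TH": ("\u00de", "T"),  # assumption: upper case by analogy
    "d/": ("\u00f0", "t"),  # eth -> t
    "o/": ("\u00f8", "o"),  # assumption: slash dropped like other marks
    "!!": ("\u00a1", "!"),
    "??": ("\u00bf", "?"),
    "<<": ("\u00ab", "<"),
    ">>": ("\u00bb", ">"),
}


def lookup(pair):
    """Return (intended character, substitute) for a two-character code."""
    if pair in SPECIAL:
        return SPECIAL[pair]
    letter, mark = pair[0], pair[1]
    if letter.isalpha() and mark in MARKS:
        # Compose letter + mark; the substitute is the bare letter.
        return unicodedata.normalize("NFC", letter + MARKS[mark]), letter
    raise ValueError("unknown escape \\:" + pair)


def decode_escapes(text, charset):
    """Expand \\: escapes for a target charset; return (text, warnings)."""
    out, warnings = [], []
    i = 0
    while i < len(text):
        if text.startswith("\\:", i):
            pair = text[i + 2:i + 4]
            if len(pair) < 2:
                raise ValueError("truncated escape at offset %d" % i)
            wanted, substitute = lookup(pair)
            try:
                wanted.encode(charset)  # is it representable there?
                out.append(wanted)
            except UnicodeEncodeError:
                warnings.append("\\:%s is not in %s, using %r"
                                % (pair, charset, substitute))
                out.append(substitute)
            i += 4
        else:
            out.append(text[i])
            i += 1
    return "".join(out), warnings
```

Calling decode_escapes(r"hw\:ae\:d/er", "latin-1") keeps both ash and eth, while the same call with "cp437" keeps ash but replaces eth with t and returns one warning. The mark-second order (\:e') is baked into lookup; supporting the mark-first convention would mean swapping the two characters there.
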
If anyone can come up with a better suggestion, please do. But remember that it has to cover all the letters in the ISO 8859/1, MS-DOS and Mac character sets, and should be a wee bit open-ended in case we've missed something.