Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!gem.mps.ohio-state.edu!usc!ucsd!ames!uhccux!munnari.oz.au!cs.mu.oz.au!ok
From: ok@cs.mu.oz.au (Richard O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Non-ASCII characters, suggestion and question
Message-ID: <2422@munnari.oz.au>
Date: 13 Oct 89 15:10:33 GMT
Sender: news@cs.mu.oz.au
Lines: 88

Consider the problems of someone trying to write Prolog code
which handles words in a language other than English.  There
are at least four contexts where the code may be used:

    -	a national variant of ISO 646
    -	ISO 8859/1 (or DEC MNCS, which is very close)
    -	MS-DOS character set
    -	Macintosh character set

{I omit all discussion of ISO 8859/N for N > 1 and alternative
character sets on the Macintosh or OS/2; not because I don't know
about these things but because this is hard enough already.}

Quintus Prolog already lets you write the magic number you need,
using C-style escapes.  For example, the Old English word which
became "whether" is spelled h,w,ae,eth,e,r, which we could write
as 'hw\xE6\xF0er'.  The problems with this are

(a) it is very hard to tell which letters are intended by looking at
    the hex codes;
(b) if the same program is moved to another system which uses a different
    coding, the numbers stay put, which means that you get different
    characters;
(c) the other system may not have any coding for these characters at
    all, but you aren't warned.
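
To make (b) and (c) concrete, here is a small sketch in Python (not
Prolog, purely for illustration).  It assumes that Python's "latin-1",
"cp437" and "mac_roman" codecs are fair stand-ins for ISO 8859/1, the
MS-DOS set and the Macintosh set; the names raw, codecs, decoded and
has_coding are mine.

```python
# The bytes that the escapes in 'hw\xE6\xF0er' put into the atom.
raw = b"hw\xe6\xf0er"

# Assumption: these Python codec names stand in for three of the contexts.
codecs = {"ISO 8859/1": "latin-1", "MS-DOS": "cp437", "Macintosh": "mac_roman"}

# Problem (b): the numbers stay put, but the characters they name change.
decoded = {name: raw.decode(codec) for name, codec in codecs.items()}
for name, text in decoded.items():
    print(f"{name:12} {text!r}")


def has_coding(char, codec):
    """Return True if the character has a code in the given character set."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Problem (c): a target set may have no code for the character at all.
for name, codec in codecs.items():
    print(f"{name:12} has eth: {has_coding(chr(0xF0), codec)}")
```

Only the ISO 8859/1 reading gives back ash and eth; the other two
readings silently yield different characters, and neither of those
two sets has a code for eth to begin with.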

Why not write the characters you want directly?
(a) It may not be possible.  Some editors on the PC do not give you
    direct access to the upper 128 characters.
(b) It is even less portable than writing escape codes.

If Prolog is to be used for writing programs that can be *portable*
between these environments, it is important that we should have some
way of indicating which characters we mean, so that they may be mapped
correctly and a warning may be given when they cannot be mapped.

The best scheme I have been able to come up with uses escape sequences
like
	\: <first letter> <second letter>
		for ligatures (ae, oe) and some others: lower and
		upper case thorn -> th, TH; eszett -> ss; and the
		copyright symbol -> co
		
	\: <letter> <diacritical>
		<diacritical> is ` ^ ' " ~ , / . -
		for grave, circumflex, acute, umlaut, tilde,
		cedilla, slash, ring, macron (I should be so lucky)

	\: <other> <other>
		E.g. \:!! and \:?? for inverted ! and ?, \:<< and \:>>
		for Continental quotes.
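
A rough sketch of how a tokeniser might decode these escapes, in
Python for illustration only.  Assumptions: Unicode code points stand
in for whatever the local character set is; the names NAMED, COMBINING,
decode_pair and decode_escapes are hypothetical; eth is spelled d/; and
the slashed letters are listed by hand because Unicode has no canonical
composition for them.

```python
import unicodedata

# Pairs that are looked up directly rather than composed.
NAMED = {
    "ae": "\u00e6", "AE": "\u00c6",   # ash
    "oe": "\u0153", "OE": "\u0152",   # oe ligature
    "th": "\u00fe", "TH": "\u00de",   # thorn
    "ss": "\u00df", "co": "\u00a9",   # eszett, copyright
    "o/": "\u00f8", "O/": "\u00d8",   # slashed o
    "d/": "\u00f0", "D/": "\u00d0",   # eth (assumed spelling)
    "!!": "\u00a1", "??": "\u00bf",   # inverted ! and ?
    "<<": "\u00ab", ">>": "\u00bb",   # Continental quotes
}

# Diacritical characters mapped to Unicode combining marks.
COMBINING = {
    "`": "\u0300", "^": "\u0302", "'": "\u0301", '"': "\u0308",
    "~": "\u0303", ",": "\u0327", ".": "\u030a", "-": "\u0304",
}


def decode_pair(pair):
    """Return the character named by a two-character pair, or None."""
    if pair in NAMED:
        return NAMED[pair]
    letter, mark = pair[0], pair[1]
    if letter.isascii() and letter.isalpha() and mark in COMBINING:
        composed = unicodedata.normalize("NFC", letter + COMBINING[mark])
        if len(composed) == 1:        # a precomposed letter exists
            return composed
    return None


def decode_escapes(text):
    """Replace every \\:XY escape in text; raise on an unknown pair."""
    out, i = [], 0
    while i < len(text):
        if text.startswith("\\:", i):
            pair = text[i + 2:i + 4]
            char = decode_pair(pair) if len(pair) == 2 else None
            if char is None:
                raise ValueError(f"unknown escape \\:{pair}")
            out.append(char)
            i += 4
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(decode_escapes(r"hw\:ae\:d/er"))
```

Looking the pair up in the table before trying letter-plus-mark keeps
pairs like ss and co from being mistaken for a letter and a diacritical,
and leaves room to add new two-character names later.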


"whether" would look like 'hw\:ae\:d/er' in this coding (I'm tempted to
code eth/Eth as dh/DH, similar to thorn).  It is hard to read, but it is
still better than 'hw\xE6\xF0er'.  It also means that if we read the code
in a system which hasn't got ash or eth, the tokeniser can print an
error message and substitute something `close': eth and thorn -> t,
ash -> e, \:<letter><diacritical> drops the diacritical mark, and
\:<other><other> turns into <other>.  Having '\:<<hw\:ae\:ther\:>>'
converted to '<hweter>' is a lot better than having it converted to garbage,
particularly if you get an error message when it happens.
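
The substitution step might look something like this sketch (Python
again, for illustration).  It assumes the escapes have already been
turned into Unicode characters and that a Python codec name stands in
for the local character set, with "ascii" playing the part of a system
that has none of the extra letters.  CLOSE, close_substitute and
map_to_charset are hypothetical names.

```python
import unicodedata

# Hand-written `close' substitutes, following the rules given above.
CLOSE = {
    "\u00e6": "e", "\u00c6": "E",   # ash -> e
    "\u00f0": "t", "\u00d0": "T",   # eth -> t
    "\u00fe": "t", "\u00de": "T",   # thorn -> t
    "\u00ab": "<", "\u00bb": ">",   # Continental quotes
    "\u00a1": "!", "\u00bf": "?",   # inverted ! and ?
}


def close_substitute(char):
    """Return a nearby plain character, or None if there is none."""
    if char in CLOSE:
        return CLOSE[char]
    # Letter plus diacritical: decompose and keep only the base letter.
    base = unicodedata.normalize("NFD", char)[0]
    return base if base != char else None


def map_to_charset(text, codec):
    """Return (mapped text, messages) for the given target character set."""
    out, messages = [], []
    for char in text:
        try:
            char.encode(codec)
            out.append(char)
        except UnicodeEncodeError:
            sub = close_substitute(char)
            if sub is None:
                raise ValueError(f"no substitute for U+{ord(char):04X}")
            messages.append(f"U+{ord(char):04X} not in {codec}; using {sub!r}")
            out.append(sub)
    return "".join(out), messages


mapped, messages = map_to_charset("\u00abhw\u00e6\u00feer\u00bb", "ascii")
print(mapped)
for line in messages:
    print("warning:", line)
```

The point of returning the messages alongside the text is that the
substitution is never silent: every character that could not be mapped
is reported.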

I want to stress that I don't regard this as anything other than a
practical compromise; it would be better if the MS-DOS and Mac character
sets would dry up and blow away so that everyone was using ISO 8859/*
from now on, but that just isn't going to happen, and I think we need a
better way of coping than we have now.

So what's the question?

The question is whether diacritical marks should precede or follow the
letter they modify.  I prefer \:e' because I read it as "e-acute" and so
expect the diacritical mark second.  But I believe there is a French
convention that involves writing the diacritical mark first.

There's also a question about whether the characters I picked for the
diacritical marks are ok.

I was hoping that the BSI committee might be relied on to do something
about this problem (it is, after all, a syntax problem), but (a) they
haven't and (b) one of the latest documents I have claims that escape
sequences aren't needed inside atoms anyway, so I think we have to do it
ourselves, and do it soon.

If anyone can come up with a better suggestion, please do.  But remember
that it has to cover all the letters in the ISO 8859/1, MS-DOS and Mac
character sets, and should be a wee bit open-ended in case we've missed
something.