Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!uunet!mcsun!ukc!icdoc!sappho!cdsm
From: cdsm@sappho.doc.ic.ac.uk (Chris Moss)
Newsgroups: comp.lang.prolog
Subject: Re: Non-ASCII characters, suggestion and question
Message-ID: <1067@gould.doc.ic.ac.uk>
Date: 16 Oct 89 18:16:26 GMT
References: <2422@munnari.oz.au>
Sender: news@doc.ic.ac.uk
Reply-To: cdsm@doc.ic.ac.uk (Chris Moss)
Organization: Logic Group, Dept. of Computing, Imperial College, London, UK.
Lines: 63

Richard O'Keefe writes:
>Consider the problems of someone trying to write Prolog code
>which handles words in a language other than English.  

Your message prompted me to look at the latest Japanese proposal that
was sent out by Roger Scowen on 2 Oct, just before the Ottawa meeting
of the ISO Prolog standardization committee.
(Richard, they sent out your comments on I/O in the same mailing)
It's called "Multi-octet character sets in Prolog" by Makoto Negishi,
Yoshitomi Marisawa, Morihiko Tajima and Katsuhiko Nakamura, dated Sep. 1989.

I will try and summarise the proposals, and add my comments, indented.

1. It adds an "extended identifier indicator char" to the definition of
"identifier token" which is "implementation defined". "For example it may
include small letter a with grave accent, small letter a with acute accent,
etc. and Japanese characters". It similarly adds an "extended variable
indicator char" for starting variables.

    i.e. _any_ characters can be added for atoms within a strict definition
    of the standard. This would seem to make portability of programs across
    national boundaries rather nightmarish.
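    A rough illustration of the portability worry, as a Python sketch rather
    than anything taken from the proposal: the base character classes, the
    function name is_atom and the two "processors" below are all assumptions
    made up for the example.

```python
import string

# Assumed base classes: atoms start with a small letter and continue with
# letters, digits or underscore. The proposal leaves everything else
# "implementation defined", modelled here as a per-processor extra set.
BASE_START = set(string.ascii_lowercase)
BASE_CONT = set(string.ascii_letters + string.digits + "_")


def is_atom(text, extended=frozenset()):
    """Return True if text scans as one identifier token on this processor."""
    if not text:
        return False
    start = BASE_START | set(extended)
    cont = BASE_CONT | set(extended)
    return text[0] in start and all(c in cont for c in text[1:])


# Two hypothetical conforming processors with different extension sets.
french_processor = frozenset("àáâèéê")
strict_processor = frozenset()

print(is_atom("été", french_processor))  # True
print(is_atom("été", strict_processor))  # False
```

    The same source text is a valid atom on one processor and a syntax
    error on the other, although both stay within the letter of the standard.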

2. Collating sequence. It suggests the standard should only define an
alphabetical ordering within three groups of characters - small letters,
capital letters and digits. Anything else is based on an extended
collating sequence which is implementation defined.

    This thus seems to throw away even the rather ill-defined "subset of
    ISO 8859" which is referred to in 7.5 of the N40 document. Presumably
    any character set, even EBCDIC, would qualify.
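    To see what that freedom means in practice, here is a small Python
    sketch (not part of the proposal) that sorts the same names under ASCII
    and under one EBCDIC code page. Using Python's "cp037" codec as the
    EBCDIC stand-in is my assumption; the sample names are invented.

```python
# Both encodings keep a..z, A..Z and 0..9 in order within each group,
# which is all the proposal would require, but they order the three
# groups differently relative to one another.
names = ["b1", "B1", "a", "1a", "A"]

by_ascii = sorted(names, key=lambda s: s.encode("ascii"))
by_ebcdic = sorted(names, key=lambda s: s.encode("cp037"))

print(by_ascii)   # ['1a', 'A', 'B1', 'a', 'b1']
print(by_ebcdic)  # ['a', 'b1', 'A', 'B1', '1a']
```

    So a program relying on the standard order of terms could behave
    differently on two processors that both conform.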

3. Character equivalence. They define a bip called "set_equivalence_char"
which maps extended characters onto their equivalents in the base
character set. A call to this predicate sets up a dynamic equivalence.

    I assume this is basically for input routines - if one gets a multi-octet
    character which is also in the basic character set (8859?) then it
    is automatically converted. They suggest it can also be used for italic
    characters etc., and this wouldn't be symmetrical on output.
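    My reading of the idea, as a Python sketch. The two-argument form of
    set_equivalence_char and the helper normalise_input are guesses for
    illustration; the proposal's actual arguments may differ. The fullwidth
    Latin letters are used because JIS X 0208 duplicates the basic letters.

```python
# Dynamic table of equivalences: extended character -> base character.
equivalences = {}


def set_equivalence_char(extended, base):
    """Record that `extended` is to be read as `base` (hypothetical form)."""
    equivalences[extended] = base


def normalise_input(text):
    """Apply the current equivalences to text as it is read in."""
    return "".join(equivalences.get(c, c) for c in text)


set_equivalence_char("\uFF21", "A")  # FULLWIDTH LATIN CAPITAL LETTER A
set_equivalence_char("\uFF42", "b")  # FULLWIDTH LATIN SMALL LETTER B

print(normalise_input("\uFF21\uFF42c"))  # Abc

# The mapping is many-to-one, so it cannot be undone on output:
print(normalise_input("A") == normalise_input("\uFF21"))  # True
```

    Once both forms have collapsed to "A" there is no way to tell which one
    was read, which is the asymmetry on output.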

They don't address the way in which strings represent multi-octet characters
except by example - they refer to N32 and N34 which I don't appear to have
received  (the numbers refer to the ISO numbering for standardization
documents).  Examples are " $@!N (J" and " $@#A (J".  They mostly assume the
use of the Japanese standard JIS X 0208.
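For what it is worth, in the 7-bit form of JIS X 0208 each character is a
pair of octets in the range 0x21 to 0x7E, and its row and cell number are
obtained by subtracting 0x20 from each octet. The Python sketch below
applies that to the two-octet pairs "!N" and "#A" from the examples. That
these are the character pairs, and that the surrounding "$@" and "(J" are
ISO 2022 designation sequences which have lost their ESC octet, is only an
assumption on my part; the helper name kuten is invented.

```python
def kuten(pair):
    """Return (row, cell) for a two-octet JIS X 0208 code in 7-bit form."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two octets in the range 0x21..0x7E")
    return pair[0] - 0x20, pair[1] - 0x20


print(kuten(b"!N"))  # (1, 46)
print(kuten(b"#A"))  # (3, 33)
```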

    -----------------

Comment:
As far as I can see, these totally miss solving any of the problems!
How can one scan a program if one doesn't know what characters are used in
atoms, variables etc.? One needs some type of declarations to tell the
processor what to expect. I don't know why the representation of octets in
strings is so strange, maybe someone can enlighten me. But it doesn't solve any
of Richard's problems.

I could post the document to the net, tho it appears to be missing some
figures.

So much for now!
Chris Moss cdsm@doc.ic.ac.uk