Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!uunet!mcsun!ukc!icdoc!sappho!cdsm
From: cdsm@sappho.doc.ic.ac.uk (Chris Moss)
Newsgroups: comp.lang.prolog
Subject: Re: Non-ASCII characters, suggestion and question
Message-ID: <1067@gould.doc.ic.ac.uk>
Date: 16 Oct 89 18:16:26 GMT
References: <2422@munnari.oz.au>
Sender: news@doc.ic.ac.uk
Reply-To: cdsm@doc.ic.ac.uk (Chris Moss)
Organization: Logic Group, Dept. of Computing, Imperial College, London, UK.
Lines: 63

Richard O'Keefe writes:
>Consider the problems of someone trying to write Prolog code
>which handles words in a language other than English.

Your message prompted me to look at the latest Japanese proposal that
was sent out by Roger Scowen on 2 Oct, just before the Ottawa meeting
of the ISO Prolog standardization committee. (Richard, they sent out
your comments on I/O in the same mailing.)

It's called "Multi-octet character sets in Prolog" by Makoto Negishi,
Yoshitomi Marisawa, Morihiko Tajima and Katsuhiko Nakamura, dated
Sep. 1989.

I will try and summarise the proposals, and add my comments, indented.

1. It adds an "extended identifier indicator char" to the definition
of "identifier token" which is "implementation defined". "For example
it may include small letter a with grave accent, small letter a with
acute accent, etc. and Japanese characters". It similarly adds an
"extended variable indicator char" for starting variables.

    i.e. _any_ characters can be added for atoms within a strict
    definition of the standard. This would seem to make portability
    of programs across national boundaries rather nightmarish.

2. Collating sequence. It suggests the standard should only define an
alphabetical ordering within three groups of characters - small
letters, capital letters and digits. Anything else is based on an
extended collating sequence which is implementation defined.

    This thus seems to throw away even the rather ill-defined
    "subset of ISO 8859" which is referred to in 7.5 of the N40
    document.
    Presumably any character set, even EBCDIC, would qualify.

3. Character equivalence. They define a built-in predicate (bip)
called "set_equivalence_char" which maps extended characters onto
their equivalents in the base character set. A call to this predicate
sets up a dynamic equivalence.

    I assume this is basically for input routines - if one gets a
    multi-octet character which is also in the basic character set
    (8859?) then it is automatically converted. They suggest it can
    also be used for italic characters etc., and this wouldn't be
    symmetrical on output.

They don't address the way in which strings represent multi-octet
characters except by example - they refer to N32 and N34 which I
don't appear to have received (the numbers refer to the ISO numbering
for standardization documents). Examples are " $@!N (J" and
" $@#A (J". They mostly assume the use of the Japanese standard
JIS X 0208.

-----------------

Comment:

As far as I can see, these totally miss solving any of the problems!
How can one scan a program if one doesn't know what characters are
used in atoms, variables etc.? One needs some type of declarations to
tell the processor what to expect.

I don't know why the representation of octets in strings is so
strange, maybe someone can enlighten me. But it doesn't solve any of
Richard's problems.

I could post the document to the net, tho it appears to be missing
some figures.

So much for now!

Chris Moss
cdsm@doc.ic.ac.uk
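
To make the scanning point in the comment above concrete, here is a minimal sketch (in Python rather than Prolog, purely for illustration) of a scanner whose extra atom-start and variable-start characters have to be declared before it can tokenise anything outside the base letters. The function name `scan` and its two parameters are hypothetical; the proposal defines no such interface, and this is not how any particular Prolog system works.

```python
import string

# Base classes that a standard could fix for every processor.
SMALL = set(string.ascii_lowercase)
CAPITAL = set(string.ascii_uppercase)
ALNUM = SMALL | CAPITAL | set(string.digits) | {"_"}


def scan(text, extra_atom_start=frozenset(), extra_var_start=frozenset()):
    """Split text into (kind, token) pairs.

    extra_atom_start and extra_var_start stand in for the proposal's
    implementation-defined "extended identifier/variable indicator
    chars". Both parameters are hypothetical, for illustration only.
    """
    atom_start = SMALL | set(extra_atom_start)
    var_start = CAPITAL | {"_"} | set(extra_var_start)
    body = ALNUM | atom_start | var_start
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if ch in atom_start or ch in var_start:
            # If a character is declared in both classes, atom wins here.
            kind = "atom" if ch in atom_start else "var"
            j = i + 1
            while j < len(text) and text[j] in body:
                j += 1
            tokens.append((kind, text[i:j]))
            i = j
        else:
            # Anything undeclared is passed through one character at a time.
            tokens.append(("other", ch))
            i += 1
    return tokens


# Without a declaration the accented word falls apart:
print(scan("été X"))
# With "é" declared as an atom-start character it scans as one atom:
print(scan("été X", extra_atom_start={"é"}))
```

The same source text yields different tokens depending on what has been declared, which is the portability worry: unless the declarations travel with the program, two processors can disagree about where its atoms begin and end.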