Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site plus5.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!mhuxt!houxm!mtuxo!drutx!ihnp4!plus5!hokey%plus5
From: hokey%plus5@plus5.UUCP
Newsgroups: mod.std.mumps
Subject: Collating and pattern match outside of 7 bit ASCII
Message-ID: <889@plus5.UUCP>
Date: Fri, 20-Sep-85 20:19:21 EDT
Article-I.D.: plus5.889
Posted: Fri Sep 20 20:19:21 1985
Date-Received: Sat, 21-Sep-85 06:19:44 EDT
Sender: hokey@plus5.UUCP
Reply-To: hokey@plus5.uucp
Organization: Plus Five Computer Services, St. Louis
Lines: 74
Approved: hokey@plus5.uucp
Distribution:
Volume-Issue: 2.3

The issue of collating and pattern matching needs to be addressed
when Mumps runs in an environment other than 7 bit ASCII. The most
extreme example is EBCDIC. While it will be useful to have a 7 bit
ASCII emulation mode on an EBCDIC machine, there is also a need to
operate Mumps in the native character set.

I would like to see the requirements for pattern match codes,
$C()/$A() mapping, and collating sequence tailored to fit the
environment, in order to give implementors and users as much latitude
as possible.

This can best be done by specifying the behavior of pattern match
codes, $C()/$A(), and collating sequences on a per-character-set
basis, in addition to an overall, general specification.

Two other languages have already done this very thing: MAINSAIL and C.

MAINSAIL (MAchine INdependent Stanford Artificial Intelligence
Language) has this to say:

    2.2 CHARACTER SET

    MAINSAIL does not specify the exact character set; instead, only
    the following is guaranteed:

    1) A unique character corresponds to each of the following
       characters:

        ABCDEFGHIJKLMNOPQRSTUVWXYZ
        abcdefghijklmnopqrstuvwxyz
        0123456789
        ! " # $ & ' ( ) * + , - . / : ; < = > ?
        [ ] ^ (uparrow) _ (backarrow)
        space (blank)
        tab (horizontal tab)
        eol (end-of-line: one or two characters)
        eop (end-of-page)

       Of course MAINSAIL cannot guarantee the graphics associated
       with each character, but they should be chosen to approximate
       those above, which are from the (1963) ASCII character set.
       The graphics for the "^" (uparrow) and "_" (backarrow)
       characters were changed in the 1968 ASCII standard to be
       circumflex and underline, respectively. MAINSAIL allows "**"
       to be used in place of "^" (the exponentiation operator), and
       ":=" in place of "_" (the assignment operator).

    2) Associated with each character is an integer code. These
       character codes range from 0 to n, where n is at least 127.

    3) A...Z are alphabetically ordered, but not necessarily
       contiguous.

    4) a...z are alphabetically ordered, but not necessarily
       contiguous.

    5) 0...9 are numerically ordered and contiguous.

Aside from functions which test for uppercase/lowercase/alpha
characters, MAINSAIL also supplies prevAlpha(i) and nextAlpha(i),
which do the obvious things when given b...z, B...Z and a...y, A...Y,
respectively.

The proposed C Standard (ANSI X3J11/85-008) says:

    The following characters are required in the source character
    set: the 52 upper-case and lower-case characters of the English
    alphabet; the 10 decimal digits; the following 29 graphic
    characters:

        !"#%&'()*+,-./:;<=>?[\]^_{|}~

    the space character, and control characters representing
    horizontal tab, vertical tab, and form feed.

Functions exist to test if a given character is Alpha, Numeric,
Control, Printable, Punctuation (any printing character except SPACE,
digit, or letter), and several combinations of these types.

I was unable to find any information regarding either ordering or
relative positioning of characters.

Let's "open the doors" in the Standard to include IBM and non-English
languages in a way which maximizes usability.
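
To make the idea concrete, here is a minimal sketch in C of what
character-set-independent classification and stepping might look like.
It relies only on the <ctype.h> tests of the kind described above and
on the letters being alphabetically ordered; it does not assume they
are contiguous, so the same source should behave sensibly on an ASCII
or an EBCDIC machine. The names pattern_code() and next_alpha() are
hypothetical, invented for this illustration, and the mapping onto the
Mumps pattern codes C, N, U, L, P, E (including the treatment of SPACE
as punctuation) is an assumption, not text from any standard.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <string.h>

/*
 * Hypothetical helper: map a character code to a single Mumps-style
 * pattern code using only the <ctype.h> tests, so that no numeric
 * character value is hard-wired into the program.
 * Assumption: SPACE is reported as 'P' (punctuation), although C's
 * ispunct() excludes it.
 */
char pattern_code(int c)
{
    /* The <ctype.h> functions require an unsigned char value. */
    assert(c >= 0 && c <= UCHAR_MAX);

    if (iscntrl(c)) return 'C';              /* control     */
    if (isdigit(c)) return 'N';              /* numeric     */
    if (isupper(c)) return 'U';              /* upper case  */
    if (islower(c)) return 'L';              /* lower case  */
    if (c == ' ' || ispunct(c)) return 'P';  /* punctuation */
    return 'E';                              /* anything else */
}

/*
 * Hypothetical analogue of MAINSAIL's nextAlpha: return the letter
 * that follows c alphabetically within the same case. The result is
 * found by table lookup rather than by computing c + 1, because
 * letters need not have contiguous codes (in EBCDIC they do not).
 * Returns 0 if c is not a letter or is the last letter of its case.
 */
int next_alpha(int c)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    const char *p;

    /* Reject 0 (it would match the string terminator) and any
       value that does not fit in an unsigned char. */
    if (c <= 0 || c > UCHAR_MAX)
        return 0;

    if ((p = strchr(upper, c)) != NULL && p[1] != '\0')
        return p[1];
    if ((p = strchr(lower, c)) != NULL && p[1] != '\0')
        return p[1];
    return 0;
}
```

With these definitions, next_alpha('I') yields 'J' whether or not the
two letters have adjacent codes, and pattern_code('7') yields 'N'. A
per-character-set specification for Mumps could say the same kind of
thing: state which characters belong to each pattern class and how
they are ordered, without fixing their numeric values.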