Path: utzoo!censor!geac!torsqnt!lethe!yunexus!ists!helios.physics.utoronto.ca!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!cme!cam!koontz
From: koontz@cam.nist.gov (John E. Koontz X5180)
Newsgroups: comp.text
Subject: Polyglot List Issue
Keywords: character sets
Message-ID: <6600@alpha.cam.nist.gov>
Date: 14 Jan 91 16:27:18 GMT
Organization: National Institute of Standards & Technology, Gaithersburg, MD
Lines: 348

Since I cross-posted some comp.text contributions to the Polyglot list,
which inspired replies, I am posting the replies and some supporting
material here for the benefit of the original posters.

Date: Fri, 11 Jan 91 13:57:31 -0600
To: Polyglot@tira.uchicago.edu
From: Polyglot-request@tira.uchicago.edu
Subject: Polyglot Digest V2 #2

--------
__________________________ P O L Y G L O T _________________________

POLYGLOT -- A Mailing List Devoted to Multilingual Computing
The Center for Information and Language Studies

Contributions to: polyglot@tira.uchicago.edu
Administrative requests to: polyglot-request@tira.uchicago.edu
Anonymous ftp archive: tira.uchicago.edu:polyglot
____________________________________________________________________

Polyglot Digest       Friday, 11 Jan 1991       Volume 2 : Issue 2

Today's Topics:

    Administrivia
    Unicode Progress Report
    International character set requirements needed
    GNU Emacs and 8-bit text
    smtp interest
    8-bit cleaning, Unicode, etc.

------------------------------------------------------------

Date: Fri, 11 Jan 91 13:45:01 -0600
From: scott@sage.uchicago.edu (Scott Deerwester)
Subject: Administrivia

Well! It appears that the only problem with Polyglot was that most
people forgot about it! The distribution of issue 2/1 prompted a number
of responses to the international character set requirements discussion,
which form the bulk of issue 2. Two announcements complete the issue.
First, John Koontz forwards a Unicode progress report from Asmus Freytag.
The final article is an announcement about work that der Mouse is doing
on GNU emacs and 8-bit characters.

Enjoy! And *please* contribute! Submissions from various news groups
are welcome.

Scott Deerwester
Center for Information and Language Studies
University of Chicago

...

------------------------------

Date: Thu, 10 Jan 91 11:53:37 -0800
From: Tom McFarland
Subject: International character set requirements needed

Scott,

You might want to put out a correction to V2 #1. Both Unicode and
ISO 10646 are useful encodings; however, they are not as similar or as
closely related as various posters in V2 #1 suggest.

Tom McFarland
Hewlett-Packard, Co.
Interface Technology Operation
Internationalization Team

>From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>Newsgroups: comp.text
>
>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
>This is what Unicode is for.  Unicode should be considered the most
>useful and implementable subset of the draft standard ISO 10646.

Unicode can by no stretch of the imagination be considered a subset
(either proper or improper) of ISO 10646. While both attempt to address
the same objective, their similarities end there. ISO 10646 is a
standard being developed by official national representatives; Unicode
is a grass-roots, competing code set being proposed by a group of
vendors.

> ... The reason 16 bits are enough is that Asian pictographs which
>everyone would recognize as the same have been unified.  Thus, more
>than 31,000 characters have been reduced to about 20,000 slots.

Not everyone recognizes this. In fact, enough people disagreed to vote
down exactly this proposed change in the ISO group drafting 10646. As I
remember it, Japan was the major opponent of this modification.

>------------------------------
>
>Date: Fri, 04 Jan 91 09:44:29 -0700
>From: koontz@alpha.bldr.nist.gov (John E. Koontz)
>Subject: International character set requirements needed
>
>Forwarded message follows:
>
>From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>Newsgroups: comp.text
>
>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>>
>> Is UNICODE a true subset of ISO 10646?
>> Is there a well defined relation between ISO 10646 encoding and UNICODE?
>
>ISO 10646 is still in draft form.  Both questions are impossible to
>answer until 10646 gets finalized.

The draft is fairly stable, and the questions are not that difficult to
answer. UNICODE is not a true subset of ISO 10646; the two encoding
methods are similar only in their attempt to address the same problem
set. As for there being a well-defined relation between 10646 and
Unicode, the author answers his own question in the trailing
paragraphs: there is not a one-to-one mapping, and data may be lost
converting between the two.

>Disclaimer: I'm not an expert in this area.
>However, extrapolating from what I know, it appears that Unicode
>could be considered a 16-bit implementation of 10646.  The ISO 10646
>draft standard appears to permit 16-bit implementations of any subset
>thereof, for use in process code or communication.

ISO 10646 is very specific in the forms of use allowed. One key
difference that comes to mind is that ISO prohibits assigning
characters to row/column/plane/group values in the ranges 0x00-0x20
and 0x7f-0xA0, and to the value 0xff. ISO has done this in an attempt
to maintain some level of backwards compatibility with hardware and
software that recognize these values as control codes. Unicode actively
uses these values to achieve its compactness.

>It just so happens that Unicode covers all Asian characters
>enumerated by existing national standards, plus characters from
>languages that the 10646 draft hasn't even thought about.  So it may
>be a subset, but a largely complete subset.

>There have been attempts to convert Unicode to 10646 and back again,
>I believe with mostly good results.
>Of course, some data may be lost
>in the translation.

------------------------------

Date: Thu, 10 Jan 91 21:04:56 -0500
From: der Mouse
Subject: GNU Emacs and 8-bit text

I've been sort of wondering about a good place to mention this, and
today's polyglot digest reminded me of its existence :-)

I have extended the display support in GNU emacs 18.55.95 to support
display of 8-bit text. (I have offered my changes to Stallman, but he
tells me that version 19 already addresses the problem, and he'd rather
work on getting 19 out than on updating 18.*.) The changes can actually
be used for other things as well, as you'll see from the description
below....

The changes eliminate the ctl-arrow variable and create two new
functions:

set-chardisp
    Set the way a character displays in the current buffer (or set the
    default). The first argument is the character whose display is to
    be set, or nil; the second is the string it is to display as, or
    nil. (Each character of this string is assumed to occupy one screen
    position.) If the third argument is omitted or is nil, the current
    buffer's display is set; if it's a buffer, that buffer's display is
    set; otherwise, the default display is set. If the character is
    nil, all 256 entries of the table are set; if the string is nil,
    the display is set to the default (for a buffer, it uses the
    default value; for the default, the built-in default display is
    restored). Passing nil as both of the first two arguments works
    sensibly.

get-chardisp
    Get the way a character displays in the current buffer (or the
    default). The first argument is the character whose display string
    is to be returned, or nil. If the second argument is omitted or is
    nil, the current buffer's display is returned; if it's a buffer,
    that buffer's display is returned; otherwise, the default display
    string (used for buffers that haven't specifically set a string, or
    for contexts where no buffer is readily available) is returned.
    If the character is non-nil, that character's display string is
    returned; if not, a 256-element vector is returned, listing all the
    display strings for the buffer (or default) requested. The returned
    value is always a copy; modifying it will not affect the display.
    Use set-chardisp to change the display.

and two new variables:

default-special-tab-display
    Default special-tab-display for buffers that do not override it.
    This is the same as (default-value 'special-tab-display).

special-tab-display
    Display tabs by moving to tab stops (as opposed to displaying as
    control-I). Non-nil means to display by tabbing; nil means to
    display tabs as if they were any other control character.
    Automatically becomes local when set in any fashion.

(The idea is that if your display device can display 8-bit text
directly, you use set-chardisp to set each of the high-half characters
to display as itself (i.e., as a one-character string); if not, you can
do things like making e-acute and o-slash each display as an ASCII
stand-in string.)

The diffs are under 24K. I have not yet gotten around to doing the
corresponding things to the input code. I can mail the diffs, put them
up for anonymous ftp, or even post them somewhere. Let me know what you
think should be done (unless, of course, you don't care at all :-).

I also have not written any lisp code to use the new primitives.

der Mouse

old: mcgill-vision!mouse
new: mouse@larry.mcrcim.mcgill.edu

------------------------------

Date: Fri, 11 Jan 91 10:07:12 +0000
From: Glenn.Wright@UK.Sun.COM (Glenn Wright - Sun EHQ - Mktg)
Subject: smtp interest

Keld,

I noted your posting to polyglot, re: sendmail issues. Were you aware
that the IETF (Internet Engineering Task Force) is currently studying
the means by which non-ASCII mail can be sent using the SMTP protocol?
I believe they are working on an extended SMTP form (ESMTP). I wonder
if this will be successful?

Glenn Wright, Sun Microsystems.

Here is some information on the group:

IETP -
------

The goals of the group are the following:

o Incorporate compatibility with the new host requirements document.
o Allow binary data in message bodies and remove line length
  restrictions.
o Allow command pipelining (batched smtp).
o Enhance the maintainability/management of mail systems.
o Draft a management information base for use by network management
  systems.
o And perhaps expand the header alphabet somewhat.

Things the group does not intend to do:

o Attempt to mimic the functionality of X.400.
o Produce a major re-write of the rfc821/822 mail format.
o Make changes to the header structure.

Our strategy is to develop an update of rfc1154 (content-type header)
to better meet the needs of having multiple character sets and
encodings. Basically we want to separate the notion of content type
from the encoding of that content type. This should allow gateways
between binary and non-binary capable mail systems to make intelligent
choices about encoding data. We'll also most likely formalize the
content-length header field. We would then encourage people to publish
documents (rfcs) describing the data types and encodings which they
wish to use.

Members of the group will be working on formalization of the Text-Hex
encoding scheme. This encoding scheme allows for the representation of
8-bit characters as an ASCII escape sequence. This should allow a
variety of additional character types in mail headers without the need
for changing the header specifications. This encoding could also be
used on the bodies of messages that are in "mostly" 7-bit character
sets. Several European languages fall into this area.

------------------------------

Date: Fri, 11 Jan 91 17:26:45 +0100
From: macrakis@gr.osf.org
Subject: 8-bit cleaning, Unicode, etc.

A few comments on the international character set discussion:

1) Converting existing 7-bit programs or protocols to be 8-bit clean is
almost always `trivial', in some sense, but it does require a
non-negligible amount of work and even thought. If one subroutine uses
the top bit to mark some property or uses a particular character as a
sentinel, it has to be first identified and then fixed. And then some
other way has to be found to represent the character properties or
string length--this may have secondary effects in many places. I'm
thinking of both the troff and the sendmail/SMTP discussion.

2) troff, the Unix variant of the CTSS runoff program (1963!!) is
ancient technology, and making it 8-bit clean strikes me as a
rear-guard action.

3) The OSF/1 system is 8-bit clean (it may even have a clean troff).

4) GNU Emacs, unlike vi (which was a notorious user of the 8th bit, in
fact), has *always* been 8-bit clean. In fact, you can use it to edit
binary files! On the other hand, only the latest versions allow you to
type in and display 8-bit graphic characters.

5) There does NOT appear to be a clean, standard, and reasonable way to
specify which character set you're using in a given file, nor any way
to switch character sets within a file.

6) Latin-1 does indeed cover all of Western Europe, but does *not*
cover Greek, and therefore does not cover all the EEC.

7) The ISO 646 alternate national characters are handy for unilingual
environments, but are a disaster for multilingual environments.

8) Unicode seems very nice. Characters are a fixed 16 bits, which
greatly simplifies processing. However, there is the notion of
diacritical marks (accents, vowel points, etc.), any number of which
may follow a base character. Messier to handle (I do not have the full
spec so I don't know how it's done) are the several double diacritics
(modifying two characters at a time). Of course, most programs do not
care.

9) I have not seen ISO 10646, but it seems crazy to go from a
fixed-width 16-bit character set to a variable-width character set just
to represent the same Chinese character multiple times.

------------------------------

End of Polyglot Digest
**********************
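
Point 8 of the last article describes diacritical marks following a
base character, any number of them. A program that does care about this
has to treat a base character and its trailing marks as one unit. The
following is only a rough sketch of that idea in Python. It assumes
today's Unicode character database, as exposed by the standard
unicodedata module, rather than the 1991 draft under discussion, and
the function name group_with_marks is made up for illustration. Double
diacritics get no special handling here; they simply attach to the
preceding base character.

```python
import unicodedata


def group_with_marks(text):
    """Split text into clusters: a base character plus any combining
    marks that immediately follow it (hypothetical helper)."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero canonical combining
        # class for combining marks and 0 for ordinary base characters.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the current base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


# "e" + acute, "a" + diaeresis + macron, then a plain "x"
sample = "e\u0301a\u0308\u0304x"
clusters = group_with_marks(sample)
print([len(c) for c in clusters])   # cluster lengths in code points: [2, 3, 1]
```

A mark that appears with no base character in front of it ends up in a
cluster of its own, which is one of several possible choices.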