Path: utzoo!censor!geac!torsqnt!lethe!yunexus!ists!helios.physics.utoronto.ca!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!cme!cam!koontz
From: koontz@cam.nist.gov (John E. Koontz X5180)
Newsgroups: comp.text
Subject: Polyglot List Issue
Keywords: character sets
Message-ID: <6600@alpha.cam.nist.gov>
Date: 14 Jan 91 16:27:18 GMT
Organization: National Institute of Standards & Technology, Gaithersburg, MD
Lines: 348

Since I cross posted some comp.text contributions to the Polyglot list
which inspired replies, I am posting the replies and some supporting 
material here for the benefit of the original posters.

Date: Fri, 11 Jan 91 13:57:31 -0600
To: Polyglot@tira.uchicago.edu
From: Polyglot-request@tira.uchicago.edu
Subject:  Polyglot Digest V2 #2

--------
__________________________ P O L Y G L O T _________________________

    POLYGLOT --  A Mailing List Devoted to Multilingual Computing
	   The Center for Information and Language Studies

Contributions to: 			  polyglot@tira.uchicago.edu
Administrative requests to:	  polyglot-request@tira.uchicago.edu
Anonymous ftp archive:			  tira.uchicago.edu:polyglot
____________________________________________________________________
Polyglot Digest                           Friday, 11 Jan 1991
                      Volume 2 : Issue 2
	
Today's Topics:

                             Administrivia
                        Unicode Progress Report
            International character set requirements needed
                       GNU Emacs and 8-bit text
                             smtp interest
                     8-bit cleaning, Unicode, etc.

------------------------------------------------------------

Date:    Fri, 11 Jan 91 13:45:01 -0600
From:    scott@sage.uchicago.edu (Scott Deerwester)
Subject: Administrivia

Well!  It appears that the only problem with Polyglot was that most
people forgot about it!  The distribution of issue 2/1 prompted a
number of responses to the international character set requirements
discussion, which form the bulk of issue 2.  Two announcements
complete the issue.  First, John Koontz forwards a Unicode progress
report from Asmus Freytag.  The final article is an announcement about
work that der Mouse is doing on GNU emacs and 8-bit characters.

Enjoy!  And *please* contribute!  Submissions from various news groups
are welcome.

	Scott Deerwester
	Center for Information and Language Studies
	University of Chicago

...

------------------------------

Date:    Thu, 10 Jan 91 11:53:37 -0800
From:    Tom McFarland <tommc@hpcvlx.cv.hp.com>
Subject: International character set requirements needed

Scott,

You might want to put out a correction to V2 #1.  Both Unicode and ISO
10646 are useful encodings... however, they are not similar or closely
related as various posters in V2#1 indicate.

Tom McFarland
Hewlett-Packard, Co.
Interface Technology Operation
Internationalization Team
<tommc@cv.hp.com>


>From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>Newsgroups: comp.text
>
>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
>This is what Unicode is for.  Unicode should be considered the most
>useful and implementable subset of the draft standard ISO 10646.

Unicode can by no stretch of the imagination be considered a subset
(either proper or improper) of ISO 10646.  While both attempt to
address the same objective, their similarities end there.  ISO 10646
is a standard being developed by official national representatives;
Unicode is a grass roots based, competing code set being proposed by a
group of vendors.

> ...  The reason 16 bits are enough is that Asian pictographs which
>everyone would recognize as the same have been unified.  Thus, more
>than 31,000 characters have been reduced to about 20,000 slots.

Not everyone recognizes this.  In fact, enough people disagreed with
this to vote down exactly this proposed change in the ISO group
drafting 10646.  As I remember it, Japan was the major opponent to
this modification.


>------------------------------
>
>Date:    Fri, 04 Jan 91 09:44:29 -0700
>From:    koontz@alpha.bldr.nist.gov (John E. Koontz)
>Subject: International character set requirements needed
>
>Forwarded message follows:
>
>From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>Newsgroups: comp.text
>
>keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>>
>> Is UNICODE a true subset of ISO 10646?
>> Is there a well defined relation between ISO 10646 encoding and UNICODE?
>
>ISO 10646 is still in draft form.  Both questions are impossible to
>answer until 10646 gets finalized.

The draft is fairly stable, and the questions are not that difficult
to answer.  UNICODE is not a true subset of ISO 10646 - the two
encoding methods are similar only in their attempt to address the same
problem set.  As for there being a well defined relation between 10646
and Unicode... the author answers his own question in the trailing
paragraphs: there is not a one-to-one mapping and data may be lost
converting between the two.

>Disclaimer: I'm not an expert in this area.
>However, extrapolating from what I know, it appears that Unicode
>could be considered a 16-bit implementation of 10646.  The ISO 10646
>draft standard appears to permit 16-bit implementations of any subset
>thereof, for use in process code or communication.

ISO 10646 is very specific in the forms of use allowed.  One key
difference that comes to mind is that ISO prohibits assigning
characters to row/column/plane/group values in the ranges 0x00-0x20,
0x7f-0xA0, and 0xff.  ISO has done this in an attempt to maintain some
level of backwards compatibility with hardware/software that recognize
these values as control codes.  Unicode actively uses these values to
achieve its compactness.
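
The size of that trade-off can be sketched in a few lines of Python.
This is only an illustration of the octet ranges quoted above, not code
from either standard; the function names are made up for the example,
and it assumes the restriction applies to each octet of a two-octet
value independently.

```python
# Octet values that, per the ranges quoted above, the ISO 10646 draft
# keeps free of graphic characters: 0x00-0x20, 0x7F-0xA0 and 0xFF.
def is_reserved_octet(octet: int) -> bool:
    return octet <= 0x20 or 0x7F <= octet <= 0xA0 or octet == 0xFF


def octets_avoid_controls(code: int) -> bool:
    """True if neither octet of a 16-bit value falls in a reserved range."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    high, low = code >> 8, code & 0xFF
    return not (is_reserved_octet(high) or is_reserved_octet(low))


# Count the octet values left over, and the two-octet codes they allow.
usable = sum(1 for octet in range(256) if not is_reserved_octet(octet))
print(usable, usable * usable)  # 188 35344
```

Under that assumption, two octets give 188 x 188 = 35,344 usable
positions rather than the full 65,536, which is the room a code that
ignores the restriction gains.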

>It just so happens that Unicode covers all Asian characters
>enumerated by existing national standards, plus characters from
>languages that the 10646 draft hasn't even thought about.  So it may
>be a subset, but a largely complete subset.

>There have been attempts to convert Unicode to 10646 and back again,
>I believe with mostly good results.  Of course, some data may be lost
>in the translation.

------------------------------

Date:    Thu, 10 Jan 91 21:04:56 -0500
From:    der Mouse <mouse@lightning.McRCIM.McGill.EDU>
Subject: GNU Emacs and 8-bit text

I've been sort of wondering about a good place to mention this, and
today's polyglot digest reminded me of its existence :-)

I have extended the display support in GNU emacs 18.55.95 to support
display of 8-bit text.  (I have offered my changes to Stallman, but he
tells me that version 19 already addresses the problem, and he'd
rather work on getting 19 out than on updating 18.*.)  The changes can
actually be used for other things as well, as you'll see from the
description below....

The changes eliminate the ctl-arrow variable and create two new
functions:

set-chardisp

  Set the way a character displays in the current buffer (or set the
  default).  The first argument is the character whose display is to
  be set, or nil; the second is the string it is to display as, or
  nil.  (Each character of this string is assumed to occupy one screen
  position.)  If the third argument is omitted or is nil, the current
  buffer's display is set; if it's a buffer, that buffer's display is
  set; otherwise, the default display is set.  If the character is
  nil, all 256 entries of the table are set; if the string is nil, the
  display is set to the default (for a buffer, it uses the default
  value; for the default, the built-in default display is restored).
  Passing nil as both of the first two arguments works sensibly.

get-chardisp

  Get the way a character displays in the current buffer (or the
  default).  The first argument is the character whose display string is
  to be returned, or nil.  If the second argument is omitted or is nil,
  the current buffer's display is returned; if it's a buffer, that
  buffer's display is returned; otherwise, the default display string
  (used for buffers that haven't specifically set a string, or for
  contexts where no buffer is readily available) is returned.  If the
  character is non-nil, that character's display string is returned; if
  not, a 256-element vector is returned, listing all the display strings
  for the buffer (or default) requested.  The returned value is always a
  copy; modifying it will not affect the display.  Use set-chardisp to
  change the display.

and two new variables

default-special-tab-display

  Default special-tab-display for buffers that do not override it.
  This is the same as (default-value 'special-tab-display).

special-tab-display

  Display tabs by moving to tab stops (as opposed to displaying as
  control-I).  Non-nil means to display by tabbing; nil means to
  display tabs as if they were any other control character.
  Automatically becomes local when set in any fashion.

(The idea is that if your display device can display 8-bit text
directly, you use set-chardisp to set each of the high-half characters
to display as itself (ie, as a one-character string); if not, you can
do things like making e-acute display as <e'> and o-slash as <o/>.)
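
The behaviour described above can be modelled outside Emacs.  The
Python sketch below is a toy model, not the patch itself: the class and
method names are invented, the built-in default (^X for controls,
backslash-octal for the high half) is an assumption, and so is the use
of ISO 8859-1 codes for e-acute (0xE9) and o-slash (0xF8).  The model
has no notion of a current buffer, so the target is named explicitly: a
buffer name, or None for the default table.  It also assumes that an
unset buffer entry tracks the default rather than copying it.

```python
from typing import Dict, List, Optional


def builtin_display(code: int) -> str:
    """Assumed built-in default: ^X for controls, \\ooo for the high half."""
    if code < 0x20:
        return "^" + chr(code + 0x40)
    if code == 0x7F:
        return "^?"
    if code >= 0x80:
        return "\\%03o" % code
    return chr(code)


class CharDisplay:
    def __init__(self) -> None:
        self.default: List[str] = [builtin_display(c) for c in range(256)]
        # Per-buffer tables; a None entry falls back to the default table.
        self.buffers: Dict[str, List[Optional[str]]] = {}

    def set_chardisp(self, char: Optional[int], string: Optional[str],
                     buffer: Optional[str] = None) -> None:
        codes = range(256) if char is None else [char]
        if buffer is None:
            # Default table: a None string restores the built-in display.
            for c in codes:
                self.default[c] = builtin_display(c) if string is None else string
        else:
            # Buffer table: a None string reverts the entry to the default.
            table = self.buffers.setdefault(buffer, [None] * 256)
            for c in codes:
                table[c] = string

    def get_chardisp(self, char: Optional[int] = None,
                     buffer: Optional[str] = None):
        table = self.buffers.get(buffer) if buffer is not None else None
        # Always builds a fresh list, so callers cannot alter the display.
        resolved = [
            self.default[c] if table is None or table[c] is None else table[c]
            for c in range(256)
        ]
        return resolved if char is None else resolved[char]

    def render(self, data: bytes, buffer: Optional[str] = None) -> str:
        table = self.get_chardisp(None, buffer)
        return "".join(table[b] for b in data)


disp = CharDisplay()
disp.set_chardisp(0xE9, "<e'>", "notes")
disp.set_chardisp(0xF8, "<o/>", "notes")
print(disp.render(b"caf\xe9 s\xf8", "notes"))  # caf<e'> s<o/>
print(disp.render(b"caf\xe9"))                 # caf\351
```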

The diffs are under 24K.  I have not yet gotten around to doing the
corresponding things to the input code.  I can mail the diffs, put
them up for anonymous ftp, or even post them somewhere.  Let me know
what you think should be done (unless, of course, you don't care at
all :-).  I also have not written any lisp code to use the new
primitives.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

------------------------------

Date:    Fri, 11 Jan 91 10:07:12 +0000
From:    Glenn.Wright@UK.Sun.COM (Glenn Wright - Sun EHQ - Mktg)
Subject: smtp interest

Keld,

I noted your posting to polyglot, re: sendmail issues.  Were you aware
that the IETF (Internet Engineering Task Force) is currently studying
the means by which non-ASCII mail can be sent using the SMTP protocol?
I believe they are working on an extended SMTP form (ESMTP). I wonder
if this will be successful?

Glenn Wright,
Sun Microsystems.

Here is some information on the group:

IETP
- ------

The goals of the group are the following:

  o Incorporate compatibility with the new host requirements document.

  o Allow binary data in message bodies and remove line length restrictions

  o Allow command pipelining (batched smtp)

  o enhance the maintainability/management of mail systems

  o Draft a management information base for use by network management
    systems.

  o and perhaps expand the header alphabet somewhat.

Things the group does not intend to do:

  o attempt to mimic the functionality of X.400

  o produce a major re-write of the rfc821/822 mail format

  o make changes to the header structure.


Our strategy is to develop an update of rfc1154 (content-type header) to
better meet the needs of having multiple character sets and encodings.
Basically we want to separate out the notion of content type versus
the encoding of that content type.  This should allow gateways between
binary and non-binary capable mail systems to make intelligent choices
about encoding data.  We'll also most likely formalize the
content-length header field.  We would then encourage people to
publish documents (rfcs) describing the data types and encodings which
they wish to use.

Members of the group will be working on formalization of the Text-Hex
encoding scheme.  This encoding scheme allows for the representation
of 8 bit characters as an ascii escape sequence.  This should allow a
variety of additional character types in mail headers without the need
for changing the header specifications.  Also, this encoding
could be used on the bodies of messages that are "mostly" 7-bit
character sets.  Several European languages fall into this area.
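
The actual Text-Hex format is not given here, so the following Python
sketch is purely illustrative of the general idea of carrying 8-bit
characters as ASCII escape sequences.  The "=XX" escape form and the
function names are invented for the example and should not be read as
the working group's design.

```python
ESC = ord("=")  # hypothetical escape character; the real scheme may differ


def hex_escape(data: bytes) -> str:
    """Replace every 8-bit byte (and the escape itself) with =XX."""
    out = []
    for b in data:
        if b >= 0x80 or b == ESC:
            out.append("=%02X" % b)
        else:
            out.append(chr(b))
    return "".join(out)


def hex_unescape(text: str) -> bytes:
    """Invert hex_escape; raises ValueError on a malformed escape."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "=":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


# A name with one 8-bit character, assuming ISO 8859-1 (o-slash = 0xF8).
name = b"Keld J\xf8rn Simonsen"
print(hex_escape(name))  # Keld J=F8rn Simonsen
```

Text that is mostly 7-bit grows only slightly under such a scheme, and
the result passes through a 7-bit transport unchanged.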

------------------------------

Date:    Fri, 11 Jan 91 17:26:45 +0100
From:    macrakis@gr.osf.org
Subject: 8-bit cleaning, Unicode, etc.

A few comments on the international character set discussion:

1) Converting existing 7-bit programs or protocols to be 8-bit clean
   is almost always `trivial', in some sense, but it does require a
   non-negligible amount of work and even thought.  If one subroutine
   uses the top bit to mark some property or uses a particular
   character as a sentinel, it has to be first identified and then
   fixed.  And then some other way has to be found to represent the
   character properties or string length--this may have secondary
   effects in many places.

   I'm thinking of both the troff and the sendmail/SMTP discussion.

2) troff, the Unix variant of the CTSS runoff program (1963!!) is
   ancient technology, and making it 8-bit clean strikes me as a
   rear-guard action.

3) The OSF/1 system is 8-bit clean (it may even have a clean troff).

4) GNU Emacs, unlike vi (which was a notorious user of the 8th bit, in
   fact), has <<always>> been 8-bit clean.  In fact, you can use it to
   edit binary files!  On the other hand, only the latest versions
   allow you to type in and display 8-bit graphic characters.

5) There does NOT appear to be a clean, standard, and reasonable way
   to specify which character set you're using in a given file nor any
   way to switch character sets within a file.

6) Latin-1 does indeed cover all of Western Europe, but does <<not>>
   cover Greek, and therefore does not cover all the EEC.

7) The ISO 646 alternate national characters are handy for unilingual
   environments, but are a disaster for multilingual environments.

8) Unicode seems very nice.  Characters are a fixed 16 bits, which
   greatly simplifies processing.  However, there is the notion of
   diacritical marks (accents, vowel points, etc.), any number of
   which may follow a base character.  Messier to handle (I do not
   have the full spec so I don't know how it's done) are the several
   double diacritics (modifying two characters at a time).  Of course,
   most programs do not care.

9) I have not seen ISO 10646, but it seems crazy to go from a
   fixed-width 16-bit character set to a variable-width character set
   just to represent the same Chinese character multiple times.
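
As a concrete case of point 1, the Python sketch below shows what
happens when a routine borrows the top bit as a per-character marker,
and one possible "other way" of keeping the property.  The function
names and the choice of a separate set of positions are illustrative
assumptions, not taken from troff, sendmail, or any other program.

```python
FLAG = 0x80  # bit 7, borrowed by 7-bit code as a per-character marker


def legacy_store(text: bytes, marked: set) -> bytes:
    """Mark the given positions by setting the top bit in place."""
    return bytes(b | FLAG if i in marked else b for i, b in enumerate(text))


def legacy_load(stored: bytes):
    """Recover text and marks, assuming the text itself was 7-bit."""
    text = bytes(b & 0x7F for b in stored)
    marked = {i for i, b in enumerate(stored) if b & FLAG}
    return text, marked


def clean_store(text: bytes, marked: set):
    """8-bit clean variant: keep the marks beside the text, not in it."""
    return bytes(text), frozenset(marked)


def clean_load(stored):
    text, marked = stored
    return text, set(marked)


ascii_text = b"abc"
latin1_text = b"\xe9t\xe9"  # assumes ISO 8859-1, where e-acute is 0xE9

print(legacy_load(legacy_store(ascii_text, {1})))   # (b'abc', {1})
print(legacy_load(legacy_store(latin1_text, {1})))  # (b'iti', {0, 1, 2})
print(clean_load(clean_store(latin1_text, {1})))    # (b'\xe9t\xe9', {1})
```

With 8-bit input the legacy routine both damages the text and reports
every accented character as marked, and every caller that relied on the
in-band flag has to be found and changed to use the separate structure.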

------------------------------

End of Polyglot Digest
**********************