Path: utzoo!attcan!uunet!tut.cis.ohio-state.edu!ucsd!sdd.hp.com!elroy.jpl.nasa.gov!ames!haven!decuac!shlump.nac.dec.com!mountn.dec.com!minow
From: minow@mountn.dec.com (Martin Minow)
Newsgroups: comp.std.misc
Subject: Re: Int'l Character set (Was: Re: filen
Message-ID: <1723@mountn.dec.com>
Date: 28 Jun 90 15:07:19 GMT
References: <10720@<1990Jun1> <17000002@WL9.Prime.COM> <1679@mountn.dec.com> <91@lysator.liu.se> <1676@hulda.erbe.se> <106@lysator.liu.se>
Reply-To: minow@bolt.enet.dec.com (Martin Minow)
Organization: Digital Equipment Corporation
Lines: 41

In article <106@lysator.liu.se> aronsson@lysator.liu.se (Lars Aronsson) writes:
>Is it the manufacturers who are stupid as they use standards outside
>their intended scoop or is it standard authors who are stupid as they
>write standards with too narrow scoop? ... I think that ISO should have
>written a general standard for storing and random order retrieval of text
>(that is, no escape sequences) rather than the six versions of ISO 8859.
>Perhaps, I am stupid.

No, I doubt that Lars is stupid; nor are the manufacturers or standard
writers.  The vast majority of text will be written in a single language
family, and, hence, in a single variant of ISO-8859 (which has more than
six versions, but that is irrelevant).  A text editor, to use Lars' example,
generally would not see escape sequences in the embedded text (they would be
absorbed by the file/terminal read process), but would see a single 256 byte
dataspace (of which 94+95 are graphics).  Editors that do need to deal
with multiple ISO-8859 instances (say, an editor that must handle both
Swedish (ISO 8859-1) and Hebrew (ISO 8859-8)) must establish their own
internal mechanism for this problem.  (In the one I wrote, each character
was represented by a 16-bit quantity that encoded both character and
character set designator.)
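
A minimal sketch of such a 16-bit cell follows.  The post does not say
how the bits were laid out, so the layout here (designator in the high
byte, character code in the low byte), the designator values, and the
function names are all assumptions made for illustration.

```python
# Hypothetical designator values; the original editor's values are unknown.
LATIN1 = 1  # ISO 8859-1 (covers Swedish)
HEBREW = 8  # ISO 8859-8

# Map each designator to the Python codec for that ISO 8859 part.
CODECS = {LATIN1: "iso8859_1", HEBREW: "iso8859_8"}


def pack(charset: int, code: int) -> int:
    """Combine a designator and an 8-bit character code into one 16-bit cell."""
    if not 0 <= charset <= 0xFF or not 0 <= code <= 0xFF:
        raise ValueError("charset and code must each fit in 8 bits")
    return (charset << 8) | code


def unpack(cell: int) -> tuple[int, int]:
    """Split a 16-bit cell back into (designator, character code)."""
    return cell >> 8, cell & 0xFF


def cell_to_text(cell: int) -> str:
    """Decode one cell to a Unicode character using its own character set."""
    charset, code = unpack(cell)
    return bytes([code]).decode(CODECS[charset])
```

With this layout the same byte value can sit in one buffer under two
different character sets: pack(LATIN1, 0xE5) is a-ring, while
pack(HEBREW, 0xE0) is alef, and the editor never has to scan backwards
for an escape sequence to tell which is which.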

Manufacturers chose ISO 8859 (-1) in order to gain a consistent character
set representation among all applications and datafiles within a computer
system.  This seems to me to be a perfectly reasonable decision: having
multiple representations within a single system is, I can state from
experience, a mess.

The standard writers cannot, on the other hand, look outside the scope
of their "data transmission" standard.  For example, I might choose
to represent data within a system in a Huffman or Lempel-Ziv encoding:
as long as the external users see the ISO character set, I can claim
conformance to the standard without embarrassment.
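
As a rough illustration of that separation, the sketch below keeps text
compressed internally and hands out plain ISO 8859-1 bytes at the
boundary.  It uses Python's zlib (DEFLATE, which combines Lempel-Ziv
matching with Huffman coding) as a stand-in for whatever internal
encoding a system might choose; the store/fetch names are hypothetical.

```python
import zlib


def store(external: bytes) -> bytes:
    """Accept ISO 8859-1 bytes from outside; keep them compressed inside."""
    return zlib.compress(external)


def fetch(internal: bytes) -> bytes:
    """Return exactly the ISO 8859-1 bytes that were originally stored."""
    return zlib.decompress(internal)


text = "Smörgåsbord på svenska"
wire = text.encode("iso8859_1")  # what external users send and receive
stored = store(wire)             # internal form, never seen externally
```

Whatever store() does internally, fetch(stored) gives back the same
ISO 8859-1 bytes, which is all an external user can observe.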

There are, by the way, standards for storing and retrieval of text that
were developed by, among others, libraries and cancer registries.  Spending
a few hours in a good reference library will give you more standards
than you might wish.

Martin Minow
minow@bolt.enet.dec.com
The above does not represent the position of Digital Equipment Corporation