Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!unix.cis.pitt.edu!djbpitt
From: djbpitt@unix.cis.pitt.edu (David J Birnbaum)
Newsgroups: comp.std.internat
Subject: Re: Unicode vs ISO DIS 10646 (was universality of Latin-1)
Message-ID: <124144@unix.cis.pitt.edu>
Date: 4 May 91 20:55:19 GMT
References: <10003@plains.NoDak.edu> <ENAG.91May3200814@maud.ifi.uio.no> <1991May4.180549.29162@voa3.VOA.GOV>
Organization: University of Pittsburgh
Lines: 65

In article <1991May4.180549.29162@voa3.VOA.GOV> ck@voa3.VOA.GOV 
(Chris Kern) writes:

>I confess that I don't understand the problem.  Regardless of the
>attributes of the underlying language, is there some reason why I
>should care whether a character-diacritic combination is stored as
>one code or two as long as (a) its image is properly rendered when
>I need to look at it and (b) a program which consumes a text stream
>that includes such (character-diacritic) combinations can
>unambiguously determine its content?

Yes and no.  One could encode English logographically, but we don't
do it because (among other things) people don't process English text
logographically; they process it character by character.  Similarly,
we could encode each Hebrew consonant together with its vertically
aligned vowel points and cantillation marks as a single character,
but people don't work with Hebrew text this way.

One practical consequence of the choice between encoding a vowel
plus accentual diacritic as one code or as two is the way it affects
natural classes.  I can search for all words with long rising accents
in Serbocroatian (written as an acute) more easily if the acute is
a separate character.  I would not want to conduct such a search in
French, where letters with acute do not constitute a natural class
(i.e., where "acute" does not have an independent meaning), but in
Serbocroatian this is a natural type of search to make, comparable
to searching for all words that contain some particular letter.
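
A minimal sketch of that search in Python, to make the contrast
concrete.  Everything named here is an assumption for illustration:
the accent is taken to be the Unicode combining acute (U+0301), the
sample strings are made up, and the precomposed list covers only
lowercase vowels plus syllabic r.

```python
# Assumption: the long rising accent is stored as U+0301 (combining acute).
COMBINING_ACUTE = "\u0301"

# Needed only when each accented letter is a single code: every member of
# the class must be listed by hand (five vowels plus syllabic r, lowercase).
PRECOMPOSED_ACUTE = set("\u00e1\u00e9\u00ed\u00f3\u00fa\u0155")


def words_with_acute_sequence(text):
    """Accent stored as its own character: test for one code."""
    return [w for w in text.split() if COMBINING_ACUTE in w]


def words_with_acute_precomposed(text):
    """Accent fused into the letter: test for every member of the class."""
    return [w for w in text.split()
            if any(ch in PRECOMPOSED_ACUTE for ch in w)]


# Made-up sample.  "ku\u0107a" contains a consonant letter whose shape
# includes an acute; it is stored as one code and is not an accent.
as_sequence = "ru\u0301ka ku\u0107a voda"
as_precomposed = "r\u00faka ku\u0107a voda"

print(ascii(words_with_acute_sequence(as_sequence)))
print(ascii(words_with_acute_precomposed(as_precomposed)))
```

With the accent as a separate character the test is a single
membership check; with one code per accented letter, the program has
to carry the list that defines the natural class.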

As another example, I can strip the (orthographically optional)
accents from a Serbocroatian text more efficiently by searching for
and deleting the five accentual diacritics than by searching for
each accented vowel and replacing it with its unaccented
counterpart.  Again, this is not something one would normally want
to do for French.
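
The two stripping strategies might be sketched as follows.  This is
an illustration under assumptions, not a reference implementation:
the five diacritics are assumed to be the Unicode combining grave,
acute, macron, double grave, and inverted breve, and the replacement
table for the one-code-per-letter case is generated with Python's
unicodedata module rather than typed out.

```python
import unicodedata

# Assumption: the five accentual diacritics as Unicode combining marks
# (grave, acute, macron, double grave, inverted breve).
ACCENT_MARKS = "\u0300\u0301\u0304\u030f\u0311"

# Sequence storage: a deletion table with one entry per diacritic.
DELETE_MARKS = {ord(mark): None for mark in ACCENT_MARKS}


def strip_accents_sequence(text):
    """Delete the accent marks; base letters are left untouched."""
    return text.translate(DELETE_MARKS)


def build_precomposed_table(bases="aeiour"):
    """Map every single-code accented letter to its bare base letter."""
    table = {}
    for base in bases:
        for mark in ACCENT_MARKS:
            composed = unicodedata.normalize("NFC", base + mark)
            if len(composed) == 1:  # a precomposed code exists
                table[ord(composed)] = base
    return table


# Precomposed storage: one entry per accented letter (lowercase only here).
REPLACE_PRECOMPOSED = build_precomposed_table()


def strip_accents_precomposed(text):
    """Replace each accented letter with its unaccented counterpart."""
    return text.translate(REPLACE_PRECOMPOSED)


print(len(DELETE_MARKS), len(REPLACE_PRECOMPOSED))
```

The deletion table has five entries regardless of how many letters
can carry an accent; the replacement table grows with every letter
and every mark added, and doubles again once capitals are included.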

One other issue is efficient use of character cells.  If there is
a small number of vowels and a small number of accent marks (to
use a common example and imprecise terminology), there isn't a lot
at stake.  But take a system with lots of vowel letters, lots of
accent marks, and the possibility of multiple accent marks on a
single vowel, and you're talking about a lot of character cells if
each combination is to be treated as an indivisible unit.  And it is
precisely the writing systems where accent marks are productive
units, combined ad hoc with a natural class of letters (such as
vowels), that have this large number of combinations.
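
To put rough numbers on this, here is a small counting sketch.  The
counting model is an assumption made for illustration: each vowel may
carry any set of up to max_marks distinct marks, order is ignored,
and the letter counts passed in are hypothetical rather than taken
from any particular writing system.

```python
from math import comb


def cells_precomposed(letters, marks, max_marks):
    """Cells needed if every letter+marks combination gets its own code.

    Counts each letter bare (zero marks) plus every unordered set of
    1..max_marks distinct marks on it.
    """
    per_letter = sum(comb(marks, k) for k in range(max_marks + 1))
    return letters * per_letter


def cells_sequence(letters, marks):
    """Cells needed if letters and marks are coded separately."""
    return letters + marks


# Hypothetical inventories: (letters, marks, max marks per letter).
for letters, marks, max_marks in [(5, 5, 1), (7, 10, 3)]:
    print(letters, marks, max_marks,
          cells_precomposed(letters, marks, max_marks),
          cells_sequence(letters, marks))
```

Under this model the sequence encoding grows as the sum of the two
inventories, while the one-code-per-combination encoding grows as
their product, and faster still once marks can stack.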

At a certain level, the answer to your question is that it doesn't
matter.  This seems to be why 10646 and Unicode have been able to
take opposing positions on the issue; both are concerned with form,
rather than function, and anything that arrives at the correct form
fulfills the minimal requirements.  But there are plenty of writing
systems that aren't like English or like French and where you can
only support the full inventory of complex combinations either by
storing the combination as a sequence or by dedicating an extremely
large number of character cells.  For orthographies like this, the
former is more efficient and corresponds more directly to the types
of operations that users may want to perform on the text.

--David

=======================================================================
Professor David J. Birnbaum         djbpitt@vms.cis.pitt.edu [Internet]
The Royal York Apartments, #802     djbpitt@pittvms.bitnet   [Bitnet]
3955 Bigelow Boulevard              voice: 1-412-687-4653
Pittsburgh, PA  15123  USA          fax:   1-412-624-9714