Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!mcvax!enea!kuling!andersa
From: andersa@kuling.UUCP (Anders Andersson)
Newsgroups: comp.std.internat
Subject: Re: Character representation
Message-ID: <476@kuling.UUCP>
Date: Sat, 22-Aug-87 16:58:50 EDT
Article-I.D.: kuling.476
Posted: Sat Aug 22 16:58:50 1987
Date-Received: Sun, 23-Aug-87 21:42:50 EDT
References: <2171@enea.UUCP> <709@maccs.UUCP> <2183@enea.UUCP> <719@maccs.UUCP>
Reply-To: andersa@kuling.UUCP (Anders Andersson)
Organization: Uppsala University, Sweden
Lines: 50

In article <719@maccs.UUCP> gordan@maccs.UUCP (Gordan Palameta) writes:
>t umlaut or q cedilla would probably be used very rarely, nor is it likely
>that anyone would go to the trouble of designing a font to accomodate such
>characters.  Another cost of such generality would be that accents and other
>marks would probably have to be indicated by escape sequences in conjunction
>with the unmodified letter.  This would make string-processing software more
>complicated (and slower), and text would be longer.

The existence of something like four different 94-character "Latin" sets
suggests that 8 bits wouldn't suffice anyway, although I haven't counted the
exact number of existing glyph combinations. I believe Welsh includes some
strange accented consonants, but I don't remember which (maybe w^). If you
also take Vietnamese into account, which allows several accents used at the
same time, you'd definitely overflow the table. TeX provides for arbitrary
combinations of accents, and I think this approach is quite simple (although
I don't suggest TeX for the encoding scheme to be used for files in general).

I don't think anybody has to design a font by hand for each and every
combination, since the acute accent over e looks pretty much the same as
the acute accent over o, and the combination could be done automatically at
display time. Some characters will need special treatment though, like
capital Swedish A with circle above (they should usually touch each other)
and Polish bar-crossed L. The amount of programming and CPU power to be
used for this depends on what quality and resolution of display you require.
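
A minimal sketch in C of what I mean by a general attachment rule with a
short list of exceptions. Everything here is hypothetical: the accent codes,
the glyph names and the function names are made up for illustration and
don't come from any existing standard or font format. The generic rule just
centres the accent over the base letter; pairs in the exception table get
a hand-designed glyph instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accent codes, not taken from any standard. */
enum accent { ACC_NONE, ACC_ACUTE, ACC_GRAVE, ACC_CIRCUMFLEX,
              ACC_RING, ACC_STROKE };

/* How a (letter, accent) pair should be drawn. */
enum render { DRAW_PLAIN, DRAW_OVERLAY, DRAW_SPECIAL };

struct special { char base; enum accent acc; const char *glyph_name; };

/* Pairs that need a hand-designed glyph (names are made up). */
static const struct special specials[] = {
    { 'A', ACC_RING,   "Aring"  },  /* ring should touch the capital A */
    { 'L', ACC_STROKE, "Lslash" },  /* Polish L: bar crosses the letter */
    { 'l', ACC_STROKE, "lslash" },
};

/* Name of the special glyph, or NULL if the generic rule applies. */
const char *special_glyph(char base, enum accent acc)
{
    size_t i;
    for (i = 0; i < sizeof specials / sizeof specials[0]; i++)
        if (specials[i].base == base && specials[i].acc == acc)
            return specials[i].glyph_name;
    return NULL;
}

/* Decide how to draw the pair: plain, generic overlay, or special. */
enum render classify(char base, enum accent acc)
{
    if (acc == ACC_NONE)
        return DRAW_PLAIN;
    if (special_glyph(base, acc) != NULL)
        return DRAW_SPECIAL;
    return DRAW_OVERLAY;
}

struct pos { int x, y; };

/* Generic rule: centre the accent horizontally over the base letter
   and place it `gap` units above the top of the base. */
struct pos accent_offset(int base_w, int base_h, int accent_w, int gap)
{
    struct pos p;
    assert(base_w >= 0 && base_h >= 0 && accent_w >= 0);
    p.x = (base_w - accent_w) / 2;
    p.y = base_h + gap;
    return p;
}
```

With this, classify('e', ACC_ACUTE) and classify('o', ACC_ACUTE) both take
the generic overlay path, while classify('A', ACC_RING) asks for the special
glyph. A better display would want per-letter corrections (italic slant,
letters with ascenders), which is where the extra programming and CPU
power would go.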

If this general approach turns out to be the most practical one technically,
some people may of course go hog wild putting circles under X and cedillas
over 7, but there is as little point in stopping them as in preventing
people from writing "fiYw#s" with a proportional font. Just apply the
general accent attachment rule and they'll be quiet...

>     if (coll[c] >= FIRST_CHAR && coll[c] <= LAST_CHAR)
>with very little loss of efficiency.  To accomodate perverse languages like
>Spanish and Polish which insist on two-letter combinations for sorting,

What about the prefixes "Mac" and "Mc" in English (Scottish?) proper names?
I agree this example is a little extreme in comparison to the Spanish
"graphemes" ch, ll and rr (?), as well as Czech ch. Maybe the English
don't mind seeing "McDonald" sorted after "Machiavelli", or whatever the
rule is/was - has it been abolished by now?
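
To make the two-letter case concrete, here is a rough sketch in C of a
comparison that treats ch and ll as single collating elements, each sorting
right after c and l respectively, as in traditional Spanish dictionary
order. This is only an illustration under my own assumptions: the names
next_weight and digraph_compare are invented, the weights are simply twice
the character code (plus one for a digraph), I leave out rr since I'm not
sure it sorts separately, and accented letters are not handled at all.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the collation weight of the next element of *s and advance *s
   past it.  A plain character c weighs 2*c; the digraphs "ch" and "ll"
   weigh 2*c + 1, so they fall between their first letter and the next
   letter of the alphabet. */
static int next_weight(const char **s)
{
    const unsigned char *p = (const unsigned char *)*s;
    int c = tolower(p[0]);
    int d = (p[0] != '\0') ? tolower(p[1]) : 0;

    if ((c == 'c' && d == 'h') || (c == 'l' && d == 'l')) {
        *s += 2;
        return 2 * c + 1;
    }
    *s += 1;
    return 2 * c;
}

/* Case-insensitive comparison, element by element.
   Returns -1, 0 or 1 like a sign-normalised strcmp. */
int digraph_compare(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        int wa = next_weight(&a);
        int wb = next_weight(&b);
        if (wa != wb)
            return wa < wb ? -1 : 1;
    }
    if (*a == '\0' && *b == '\0')
        return 0;
    return *a == '\0' ? -1 : 1;   /* the shorter string sorts first */
}
```

With this, digraph_compare("chico", "cuna") is positive although a plain
strcmp puts "chico" first, and "llama" lands between "luz" and "mano". A
"Mc"/"Mac" rule could presumably be handled by a similar prefix check, but
I haven't tried to write that down.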

There are different kinds of sorting even within one language, depending on
the context. Donald E. Knuth provides a wonderful collection of rules for
bibliographic use at the beginning of his "Fundamentals ..." volume on
Sorting & Searching, such as ignoring articles and spelling out numbers.
These rules don't apply to filenames in a UNIX directory, I think!
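
As a small illustration of how context-dependent this is, here is a sketch
in C of just one such rule, ignoring a leading article before comparing.
It is my own simplification, not Knuth's rule set: only the English
articles "a", "an" and "the" are recognised, numbers are not spelled out,
and the names skip_article and bib_compare are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return a pointer past a leading "a ", "an " or "the " (any case),
   or the title itself if it does not start with an article followed
   by more text. */
const char *skip_article(const char *title)
{
    static const char *const articles[] = { "a", "an", "the" };
    size_t i, n;

    assert(title != NULL);
    for (i = 0; i < sizeof articles / sizeof articles[0]; i++) {
        const char *art = articles[i];
        /* Stops at the first mismatch, including the end of title. */
        for (n = 0; art[n] != '\0'; n++)
            if (tolower((unsigned char)title[n]) != art[n])
                break;
        if (art[n] == '\0' && title[n] == ' ' && title[n + 1] != '\0')
            return title + n + 1;
    }
    return title;
}

/* Case-insensitive comparison of two titles, ignoring leading articles.
   Negative, zero or positive, like strcmp. */
int bib_compare(const char *a, const char *b)
{
    a = skip_article(a);
    b = skip_article(b);
    while (*a != '\0' &&
           tolower((unsigned char)*a) == tolower((unsigned char)*b)) {
        a++;
        b++;
    }
    return tolower((unsigned char)*a) - tolower((unsigned char)*b);
}
```

A directory listing would keep using a plain strcmp on the same strings,
so a file called "The Zebra" stays under T there while a catalogue
files it under Z.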
-- 
Anders Andersson, Dept. of Computer Systems, Uppsala University, Sweden
Phone: +46 18 183170
UUCP: andersa@kuling.UUCP (...!{seismo,mcvax}!enea!kuling!andersa)