Newsgroups: comp.text
Path: utzoo!sq!lee
From: lee@sq.sq.com (Liam R. E. Quin)
Subject: Re: SGML question
Message-ID: <1990Sep11.144937.1860@sq.sq.com>
Keywords: SGML, ambiguity
Organization: SoftQuad Inc.
References: <583@helios.prosys.se> <141873@sun.Eng.Sun.COM> <1990Sep10.170717.7993@unx.sas.com>
Date: Tue, 11 Sep 90 14:49:37 GMT
Lines: 32

tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
> [...] Unicode, a 16-bit code set that includes all known languages of the
> world in a single, interchangeable code set.

bts@unx.sas.com (Brian T. Schellenberger) writes:
> [...] still coming in at less than 32,000 characters.  Thus, it can easily
> be fit into 15 bits.  [Then] the 127 most common characters [could fit in] 7
> bits.  Then if the eighth bit is set, we know it starts a 15-bit character.

It might make more sense to mandate that _all_ of the bytes of such an
extended character, except the last, have the top bit set.  Then there is
no limit imposed on the number of characters, although of course some
software might croak on 48-bit characters :-)  This also means that your
files aren't twice as big, or even four times as big, and you can still use
lots of glyphs.
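
To make that concrete, here is a rough sketch in Python.  The only rule I
actually proposed is that every byte except the last has the top bit set;
packing seven payload bits per byte, most significant group first, is an
assumption made for illustration, and the names encode_char and
decode_stream are made up.

```python
def encode_char(code: int) -> bytes:
    """Encode one character number as a byte sequence.

    Assumed layout (not part of the proposal itself): 7 payload bits per
    byte, most significant group first.  Every byte except the last has
    the top bit set; the last byte has it clear.
    """
    if code < 0:
        raise ValueError("character numbers must be non-negative")
    groups = [code & 0x7F]              # final byte: top bit clear
    code >>= 7
    while code:
        groups.append((code & 0x7F) | 0x80)  # earlier bytes: top bit set
        code >>= 7
    return bytes(reversed(groups))


def decode_stream(data: bytes) -> list[int]:
    """Decode a byte string into a list of character numbers."""
    chars = []
    value = 0
    pending = False                     # True while inside a multi-byte character
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            pending = True              # more bytes of this character follow
        else:
            chars.append(value)         # top bit clear: character is complete
            value = 0
            pending = False
    if pending:
        raise ValueError("input ends in the middle of a character")
    return chars
```

Under this layout plain 7-bit text is unchanged, anything up to 14 bits
takes two bytes, and larger character numbers simply take more bytes.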

This does mean that algorithms such as Boyer-Moore pattern matching may have
to look at one extra byte per probe, to make sure that the start of a match
isn't really the last byte of a multi-byte character.  With a shift-based
encoding, you'd have to look at every byte in the input to remember the
current mode.  With four-byte encodings you could have to look at up to
three bytes.  So my scheme is no worse than a plain two-byte encoding in
this way either.
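
The extra check might look something like the sketch below.  It uses
Python's built-in bytes.find rather than a real Boyer-Moore search, and
find_char_aligned is a made-up name; the point is only the one-byte
look-behind.  It assumes the pattern is itself a whole number of encoded
characters, so that its last byte has the top bit clear.

```python
def find_char_aligned(haystack: bytes, needle: bytes, start: int = 0) -> int:
    """Return the first offset >= start where needle begins on a character
    boundary, or -1 if there is none.

    A candidate match is on a boundary when it is at offset 0 or when the
    byte just before it has the top bit clear (i.e. that byte ended the
    previous character).  Only that single byte is inspected per probe.
    """
    pos = haystack.find(needle, start)
    while pos != -1:
        if pos == 0 or not (haystack[pos - 1] & 0x80):
            return pos                  # genuine match at a character boundary
        # The match began in the tail of a multi-byte character: skip it.
        pos = haystack.find(needle, pos + 1)
    return -1
```

For example, searching for b"A" in bytes([0x81, 0x41]) + b"A" skips the
0x41 at offset 1, which is the tail of a two-byte character, and reports
the real "A" at offset 2.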

You should also look at the work in progress by ISO committees such as
ISO/IEC JTC 1/SC 18/WG8 on Font Information Interchange (e.g. N1036), and
at ISO/IEC/DIS 9541-1.  They're busily working on ways of transmitting font
and glyph information about the place...

Lee
-- 
Liam R. E. Quin,  lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
/text/humour/quote: No such file or directory