Newsgroups: comp.text
Path: utzoo!sq!lee
From: lee@sq.sq.com (Liam R. E. Quin)
Subject: Re: SGML question
Message-ID: <1990Sep11.144937.1860@sq.sq.com>
Keywords: SGML, ambiguity
Organization: SoftQuad Inc.
References: <583@helios.prosys.se> <141873@sun.Eng.Sun.COM> <1990Sep10.170717.7993@unx.sas.com>
Date: Tue, 11 Sep 90 14:49:37 GMT
Lines: 32

tut@cairo.Sun.COM (Bill "Bill" Tuthill) writes:
> [...] Unicode, a 16-bit code set that includes all known languages of the
> world in a single, interchangeable code set.

bts@unx.sas.com (Brian T. Schellenberger) writes:
> [...] still coming in at less than 32,000 characters.  Thus, it can easily
> be fit into 15 bits.  [Then] the 127 most common characters [could fit in] 7
> bits.  Then if the eighth bit is set, we know it starts a 15-bit character.

It might make more sense to mandate that _all_ of the bytes of such an
extended character, except the last, have the top bit set.  Then there is
no limit imposed on the number of characters, although of course some
software might croak on 48-bit characters :-)

This also means that your files aren't twice as big, or even four times as
big, and you can still use lots of glyphs.

It does mean that algorithms such as Boyer-Moore pattern matching have to
look at no more than one extra byte per probe in some cases, to ensure that
a match isn't the last byte of a multi-byte character.  With a shift, you'd
have to look at every byte in the input to remember the current mode.  With
four-byte encodings you could have to look at up to three bytes.  So my
scheme is no worse than a plain two-byte encoding in this way either.

You should also look at the work in progress by ISO committees such as
ISO/IEC JTC 1/SC 18/WG8 on Font Information Interchange (e.g. N1036), and at
ISO/IEC/DIS 9541-1.  They're busily working on ways of transmitting font and
glyph information about the place...

Lee
-- 
Liam R. E. Quin, lee@sq.com, SoftQuad Inc., Toronto, +1 (416) 963-8337
/text/humour/quote: No such file or directory
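
For anyone who wants to play with the top-bit scheme suggested in the post above, here is a minimal sketch in Python.  It is only one possible reading of the idea: the post does not say how a character code is split across bytes, so the 7-bits-per-byte, most-significant-group-first layout is an assumption, and the names encode_char, decode and starts_character are made up for illustration.

```python
def encode_char(code):
    """Encode one character code as bytes.

    Assumed layout: 7 payload bits per byte, most significant group
    first.  Every byte except the last has its top bit set, so codes
    below 128 stay as single plain bytes.
    """
    if code < 0:
        raise ValueError("character codes must be non-negative")
    groups = [code & 0x7F]              # last byte: top bit clear
    code >>= 7
    while code:
        groups.append((code & 0x7F) | 0x80)   # earlier bytes: top bit set
        code >>= 7
    return bytes(reversed(groups))


def decode(data):
    """Decode a byte string into a list of character codes."""
    codes = []
    value = 0
    pending = False                     # True while inside a multi-byte character
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:
            pending = True              # more bytes of this character follow
        else:
            codes.append(value)         # top bit clear: character is complete
            value = 0
            pending = False
    if pending:
        raise ValueError("input ends inside a multi-byte character")
    return codes


def starts_character(data, i):
    """Return True if data[i] is the first byte of a character.

    This is the one extra byte a pattern matcher has to inspect: a
    character starts at i only if the byte before it (if there is one)
    has its top bit clear.
    """
    return i == 0 or not (data[i - 1] & 0x80)
```

With this layout the code 0x3041 encodes as the two bytes E0 41, whose second byte is identical to a plain "A".  A byte-wise search for "A" will therefore hit inside that character, and starts_character() is the single look-behind that rejects the false match.  For a longer pattern the same check applies to the byte just before the start of the candidate match.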