Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!unix.cis.pitt.edu!djbpitt
From: djbpitt@unix.cis.pitt.edu (David J Birnbaum)
Newsgroups: comp.std.internat
Subject: Re: Unicode vs ISO DIS 10646 (was universality of Latin-1)
Message-ID: <124144@unix.cis.pitt.edu>
Date: 4 May 91 20:55:19 GMT
References: <10003@plains.NoDak.edu> <1991May4.180549.29162@voa3.VOA.GOV>
Organization: University of Pittsburgh
Lines: 65

In article <1991May4.180549.29162@voa3.VOA.GOV> ck@voa3.VOA.GOV (Chris Kern) writes:

>I confess that I don't understand the problem. Regardless of the
>attributes of the underlying language, is there some reason why I
>should care whether a character-diacritic combination is stored as
>one code or two as long as (a) its image is properly rendered when
>I need to look at it and (b) a program which consumes a text stream
>that includes such (character-diacritic) combinations can
>unambiguously determine its content?

Yes and no. One could encode English logographically, but we don't
do it because (among other things) people don't process English text
logographically; they do it by character. Similarly, we can encode
Hebrew consonant plus vertically aligned vowel points and cantillation
marks as single characters, but people don't work with Hebrew text
this way.

One practical consequence of encoding vowel+accentual_diacritic
variously is the way it affects natural classes. I can search for
all words with long rising accents in Serbocroatian (graphically an
acute) more easily if the acute is a separate character. I would not
want to conduct such a search in French, where letters with acute do
not constitute a natural class (i.e., where "acute" does not have an
independent meaning), but this is not an unnatural type of search to
make in Serbocroatian, comparable to searching for all words with any
other letter.
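
The search described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's standard `unicodedata` module and the modern Unicode combining acute (U+0301), neither of which comes from the post, and the function names and sample words are hypothetical.

```python
import unicodedata

# Assumption: the "separate acute" is modelled as the modern Unicode
# combining acute accent, U+0301.
COMBINING_ACUTE = "\u0301"

def words_with_mark(text, mark=COMBINING_ACUTE):
    """Return the words of `text` that carry `mark` on any letter."""
    # NFD splits a precomposed letter into base letter + combining marks,
    # so a single test covers every vowel at once.
    return [w for w in text.split()
            if mark in unicodedata.normalize("NFD", w)]

# With one code per letter+accent combination, the same search has to
# enumerate every accented vowel explicitly.
PRECOMPOSED_ACUTE_VOWELS = "\u00e1\u00e9\u00ed\u00f3\u00fa"  # a, e, i, o, u with acute

def words_with_precomposed_acute(text):
    """Return the words containing any of the listed precomposed vowels."""
    return [w for w in text.split()
            if any(ch in PRECOMPOSED_ACUTE_VOWELS for ch in w)]

# Illustrative words only; the accentuation is not meant as authoritative.
sample = "r\u00faka voda gl\u00e1va"
print(words_with_mark(sample))
print(words_with_precomposed_acute(sample))
```

The first function treats the accent as a unit in its own right, which is the "natural class" point: the query names the accent, not a list of letters that happen to bear it.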
As another example, I can strip the (orthographically optional)
accents from a Serbocroatian text more efficiently by searching and
deleting the five accentual diacritics than by searching and replacing
each accented vowel by its unaccented counterpart. Again, this is not
something one would normally want to do for French.

One other issue is efficient use of character cells. If there is a
small number of vowels and a small number of accent marks (to use a
common example and imprecise terminology), there isn't a lot at stake.
But take a system with lots of vowel letters, lots of accent marks,
the possibility of multiple accent marks on a single vowel, and you're
talking about a lot of character cells if each one is to be treated as
an indivisible unit. And it is writing systems where accent marks are
productive units that are combined ad hoc with a natural class of
letters (such as vowels) that have this large number of combinations.

At a certain level, the answer to your question is that it doesn't
matter. This seems to be why 10646 and Unicode have been able to take
opposing positions on the issue; both are concerned with form, rather
than function, and anything that arrives at the correct form fulfills
the minimal requirements. But there are plenty of writing systems that
aren't like English or like French and where you can only support the
full inventory of complex combinations either by storing the combination
as a sequence or by dedicating an extremely large number of character
cells. For orthographies like this, the former is more efficient and
corresponds more directly to the types of operations that users may
want to perform on the text.

--David
=======================================================================
Professor David J. Birnbaum         djbpitt@vms.cis.pitt.edu [Internet]
The Royal York Apartments, #802     djbpitt@pittvms.bitnet   [Bitnet]
3955 Bigelow Boulevard              voice: 1-412-687-4653
Pittsburgh, PA  15123  USA          fax:   1-412-624-9714
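
The accent-stripping operation and the character-cell arithmetic discussed in the post can be sketched as below. This is a hedged illustration, not the author's method: it assumes Python's standard `unicodedata` module, and it assumes that the "five accentual diacritics" correspond to the modern Unicode combining marks listed in `ACCENT_MARKS`. The function names and the counts in the example are hypothetical.

```python
import unicodedata

# Assumption: the five accentual diacritics, modelled as modern Unicode
# combining marks (none of these code points appear in the post).
ACCENT_MARKS = {
    "\u0300",  # combining grave accent
    "\u0301",  # combining acute accent
    "\u0304",  # combining macron
    "\u030f",  # combining double grave accent
    "\u0311",  # combining inverted breve
}

def strip_accents(text):
    """Delete only the accentual marks, keeping other diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    # Recompose so that letters such as c-with-caron stay single characters.
    return unicodedata.normalize("NFC", kept)

def cells_if_precomposed(vowels, marks):
    """Cells needed when every vowel+mark pair gets its own code
    (at most one mark per vowel), plus the plain vowels."""
    return vowels + vowels * marks

def cells_if_sequenced(vowels, marks):
    """Cells needed when marks are stored as separate characters."""
    return vowels + marks

print(strip_accents("r\u00faka"))
print(cells_if_precomposed(10, 8), cells_if_sequenced(10, 8))
```

Because the caron (U+030C) is not in `ACCENT_MARKS`, letters that carry it as part of the alphabet survive the stripping, which is the distinction the post draws between accents as independent units and diacritics that are simply part of a letter. Allowing several marks per vowel would make the precomposed count grow faster still.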