Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rutgers!sri-unix!sri-spam!mordor!under!pom
From: pom@under..ARPA (Peter O. Mikes)
Newsgroups: sci.lang,comp.std.internat
Subject: Character representation
Message-ID: <15381@mordor.s1.gov>
Date: Tue, 18-Aug-87 19:41:46 EDT
Article-I.D.: mordor.15381
Posted: Tue Aug 18 19:41:46 1987
Date-Received: Thu, 20-Aug-87 06:13:52 EDT
Sender: news@mordor.s1.gov
Reply-To: pom@s1-under.UUCP ()
Organization: S-1 Project, LLNL
Lines: 75
Xref: mnetor sci.lang:1167 comp.std.internat:119

To: gordan@maccs.UUCP
Subject: Re: Character representation
Newsgroups: comp.std.internat, sci.lang
In-Reply-To: <719@maccs.UUCP>

In article <719@maccs.UUCP> you write:
>Followups to alt.universes.

I am sorry, but according to the latest QM, the multiple universes not only keep splitting, they also merge. This happens to be one such feedback from an Alternative Universe.

Besides, I have a VERY CONSTRUCTIVE (insider's info) FACT on cedillas, umlauts, haceks, and other such ... [modifiers], namely: in all languages I know, there are many kinds of modifier, but ANY PARTICULAR LETTER either has one or it does not. That means that we need to reserve just 1 bit (0 .. unmodified, 1 .. modified) to take care of dozens of languages. So, e.g., if the switch (ROM, printwheel, ..) is set to German, modified o will put two dots (umlaut) above the o. In Czech the same bit will put ' above 'aeiou' but will put an inverted ^ over consonants (since only 'aeiou' are allowed to have ' and only consonants can have the inverted ^), and so it goes.

>>But this doesn't
>>address all problems I mentioned. How to construct a general character
>>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>>enumarate isn't sufficient.

The problem you (somebody) mentioned is hereby addressed. To disprove my conjecture, name one language with a Latin-based alphabet and one letter in that alphabet which admits more than one modifier.
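The modifier-bit idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MODIFIER_TABLE` contents, the `render` function, and the use of Unicode combining marks are all hypothetical stand-ins for the "switch (ROM, printwheel, ..)" in the post. The tables hold a handful of letters only and are not a full description of either alphabet.

```python
import unicodedata

# Unicode combining marks, standing in for what the printwheel would draw.
ACUTE = "\u0301"      # the ' mark
CARON = "\u030C"      # the inverted ^ (hacek)
DIAERESIS = "\u0308"  # two dots (umlaut)

# Hypothetical per-language tables: which mark the single "modified" bit
# selects for a given base letter. Deliberately a small subset.
MODIFIER_TABLE = {
    "german": {"a": DIAERESIS, "o": DIAERESIS, "u": DIAERESIS},
    "czech": {
        "a": ACUTE, "i": ACUTE, "o": ACUTE, "y": ACUTE,
        "c": CARON, "r": CARON, "s": CARON, "z": CARON,
    },
}


def render(pairs, language):
    """Render (letter, modified_bit) pairs for the selected language."""
    table = MODIFIER_TABLE[language]
    out = []
    for letter, modified in pairs:
        if not modified:
            out.append(letter)
        elif letter in table:
            # Compose base letter and mark into one precomposed character.
            out.append(unicodedata.normalize("NFC", letter + table[letter]))
        else:
            raise ValueError(f"no modifier defined for {letter!r} in {language}")
    return "".join(out)


# The same bit pattern prints differently depending on the language switch.
print(render([("o", True)], "german"))
print(render([("o", True)], "czech"))
print(render([("s", True), ("o", False), ("t", False)], "czech"))
```

The design point is that the stored text carries one extra bit per letter, and everything language-specific lives in the table selected at output time.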
Oh, just BTW: using poor ASCII, which has no modifier bit, I am using the convention that the modifier is indicated by h (e.g. the word (modified_s)ot would appear as shot), which is quite wasteful, as a whole h is needed to perform the function of one bit. (I am not quite sure if all mono-anglo-phones realise that English is actually using pairs for sounds: English sh (or, perversely, Hungarian's sz) is actually one sound (soft s, or s^).) The difference is mostly in that English is ambiguous and arbitrary, and (on the positive side) makes collating based on singles; but anybody can accept that, since you get your pairs (sz) sorted in the same sequence (almost) always anyway.

In a related posting

>--- David Phillip Oster --My Good News: "I'm a perfectionist."
>Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."

WRITES

> There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
...............................
>this idea would also work for English. Assuming that the average
>English word takes 6*8 bits (average length of 5 + terminating space
>* 8 bit ascii) you could cut the disk space required for computer..

and I SAY, there is a reason: I would like to propose a criterion for (or attribute of) the coding of text. A coding is LOCAL (within n) if from each 3n bytes I can derive one (the middle) letter of the encoded text. In this sense, the codings based on pairs (Polish, Spanish, sh for s^, etc.) are all local (within 2), but a coding based on frequency of words is not (besides being language dependent). (Please recall that I consider ideographs to be 'words' made of strokes.)
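For reference, the nibble scheme in the quoted posting can be sketched as follows. This is a rough illustration, not the quoted author's implementation: the names `encode`, `decode`, `ESCAPE` and `FILLER` are hypothetical, the assignment of the 15th and 16th patterns to the values 14 and 15 is an assumption, and "the next byte" is taken to mean the next 8 bits. What the remaining byte patterns do is elided in the quote, so the sketch simply rejects symbols beyond the first 14 + 254.

```python
FILLER = 0xE  # assumed "15th pattern": pads the output to a whole byte
ESCAPE = 0xF  # assumed "16th pattern": the next 8 bits carry the symbol


def encode(text, symbols_by_frequency):
    """Pack text into nibbles, most frequent symbols first."""
    ranks = {sym: rank for rank, sym in enumerate(symbols_by_frequency)}
    nibbles = []
    for sym in text:
        rank = ranks[sym]
        if rank < 14:
            nibbles.append(rank)  # one nibble for the 14 most common
        elif rank < 14 + 254:
            byte = rank - 14      # escape, then 8 bits for the next 254
            nibbles += [ESCAPE, byte >> 4, byte & 0xF]
        else:
            raise ValueError("symbol is outside the part of the scheme quoted")
    if len(nibbles) % 2:
        nibbles.append(FILLER)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def decode(data, symbols_by_frequency):
    """Unpack nibbles back into text, reading from the first nibble."""
    nibbles = []
    for b in data:
        nibbles += [b >> 4, b & 0xF]
    out = []
    i = 0
    while i < len(nibbles):
        n = nibbles[i]
        if n == FILLER:
            i += 1
        elif n == ESCAPE:
            byte = (nibbles[i + 1] << 4) | nibbles[i + 2]
            out.append(symbols_by_frequency[14 + byte])
            i += 3
        else:
            out.append(symbols_by_frequency[n])
            i += 1
    return "".join(out)


# Illustrative frequency order for English letters plus space (an assumption).
SYMBOLS = " etaoinshrdlucmfwypvbgkjqxz"
packed = encode("the cat", SYMBOLS)
print(len(packed), decode(packed, SYMBOLS))
```

Note that this decoder walks the data from its first nibble; it has no way to tell, from an arbitrary position in the middle, whether a nibble is a symbol or half of an escaped byte.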
The coding based on frequency of characters is local, and (if we accept the above-explained modifier-bit convention) also language independent. I do believe that since we are discussing CHARACTER sets, we should leave out the codings based on dictionaries (word sets); they have their function, but are much more (application, language, etc.) dependent than the character sets. Let's reach some agreement on letters first.

Yours
Dr. pom - a scientist - (quite mad)
pom@under.s1.gov