Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!rutgers!sri-unix!sri-spam!mordor!under!pom
From: pom@under..ARPA (Peter O. Mikes)
Newsgroups: sci.lang,comp.std.internat
Subject: Character representation
Message-ID: <15381@mordor.s1.gov>
Date: Tue, 18-Aug-87 19:41:46 EDT
Article-I.D.: mordor.15381
Posted: Tue Aug 18 19:41:46 1987
Date-Received: Thu, 20-Aug-87 06:13:52 EDT
Sender: news@mordor.s1.gov
Reply-To: pom@s1-under.UUCP ()
Organization: S-1 Project, LLNL
Lines: 75
Xref: mnetor sci.lang:1167 comp.std.internat:119

To: gordan@maccs.UUCP
Subject: Re: Character representation
Newsgroups: comp.std.internat, sci.lang
In-Reply-To: <719@maccs.UUCP>

In article <719@maccs.UUCP> you write:
                            >Followups to alt.universes.
  I am sorry, but according to the latest QM, the multiple universes not
  only keep splitting, they also merge. This happens to be one such
  feedback from an Alternative Universe. Besides, I have a VERY CONSTRUCTIVE
  (insider's info) FACT on cedillas, umlauts, haceks, and other such
  [modifiers], namely: in all the languages I know there are many kinds of
  modifier, but ANY PARTICULAR LETTER either has one or it does not. That
  means we need to reserve just 1 bit (0 = unmodified, 1 = modified)
  to take care of dozens of languages.
  So, e.g., if the switch (ROM, printwheel, ...) is set to German, a modified
  o gets two dots (umlaut) above it; in Czech the same bit puts ' above
  'aeiou' but an inverted ^ over consonants (since only 'aeiou' are
  allowed to have ' and only consonants can have the inverted ^), and so
  it goes.
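
  A minimal sketch of the one-bit modifier in Python. The names (render,
  mark_for), the use of Unicode combining marks and the two-language table
  are assumptions made for illustration only; it implements the rule
  exactly as stated above and nothing more.

```python
import unicodedata

# Combining marks, given as Unicode code points. Unicode is an assumption
# of this sketch; any table of marks would do.
DIAERESIS = "\u0308"  # two dots (umlaut)
ACUTE = "\u0301"      # the ' mark
CARON = "\u030C"      # inverted ^ (hacek)

VOWELS = set("aeiou")


def mark_for(language, letter):
    """Return the single mark the language 'switch' selects for a letter."""
    if language == "German":
        return DIAERESIS
    if language == "Czech":
        # Rule as stated in the post: ' on vowels, inverted ^ on consonants.
        return ACUTE if letter.lower() in VOWELS else CARON
    raise ValueError("no table for language: " + language)


def render(letter, modified, language):
    """Render one letter from its base form plus the one-bit modifier flag."""
    if not modified:
        return letter
    # NFC composes base letter + combining mark into a single character
    # where a precomposed form exists.
    return unicodedata.normalize("NFC", letter + mark_for(language, letter))
```

  For example, render("o", True, "German") yields o with two dots, while
  render("s", True, "Czech") yields s with an inverted ^; the stored
  letter and bit are the same kind of data in both cases, only the switch
  differs.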

>>But this doesn't
>>address all problems I mentioned. How to construct a general character
>>with an arbitrary accent, umlaut or other diacritic mark? An 8-bit
>>enumarate isn't sufficient.

 The problem you (somebody) mentioned is hereby addressed.
 To disprove my conjecture, name one language with a Latin-based alphabet
 and one letter in that alphabet which admits more than one modifier.

   Oh, just BTW - using poor ASCII, which has no modifier bit, I am
   using the convention that the modifier is indicated by h (e.g. the
   word (modified_s)ot would appear as shot), which is quite wasteful,
   as a whole h is needed to perform the function of one bit.
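
   A rough sketch of reading that convention back into (letter, bit)
   pairs, in Python. The function name and the rule that an h directly
   after a not-yet-modified, non-h letter is always a modifier are
   assumptions of this sketch; real English spelling is not that tidy.

```python
def decode_h_convention(text):
    """Turn ASCII text into (letter, modified) pairs.

    Assumption: an 'h' right after a letter other than 'h' that is not
    already modified sets that letter's modifier bit; any other 'h' is
    an ordinary letter.
    """
    pairs = []
    for ch in text:
        if (ch == "h" and pairs
                and pairs[-1][0].isalpha()
                and pairs[-1][0] != "h"
                and not pairs[-1][1]):
            letter, _ = pairs[-1]
            pairs[-1] = (letter, True)   # the h costs a byte, carries one bit
        else:
            pairs.append((ch, False))
    return pairs
```

   With this rule "shot" becomes three pairs, the first being ('s', True),
   so four bytes of ASCII carry three letters and one bit.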
 
 (I am not quite sure if all mono-anglo-phones realise that English is
 actually using pairs for sounds: English sh (the perverse Hungarian's sz)
 is actually one sound (soft s, or s^). The difference is mostly that
 English is ambiguous and arbitrary and (on the positive side) collates
 on single letters; but anybody can accept that, since you get your
 pairs (sz, ...) sorted in the same sequence (almost) always anyway.)

 In a related posting 
>--- David Phillip Oster            --My Good News: "I'm a perfectionist."
>Arpa: oster@dewey.soe.berkeley.edu --My Bad News: "I don't charge by the hour."
                  WRITES
> There is no reason why we couldn't use a huffman encoding
>scheme: the 14 most common ideograms fit in a 4 bit nybble, the 15th
>pattern is a filler, and the 16th pattern means that the next byte
>encodes the 254 next most common ideograms, the 255 bit pattern
    ...............................
>this idea would also work for English. Assuming that the average
>English word takes 6*8 bits (average length of 5 + terminating space
>* 8 bit ascii) you could cut the disk space required for computer..
   and I SAY, there is a reason:
  I would like to propose a criterion for (or attribute of) the coding of
  text. A coding is LOCAL (within n) if from each 3n bytes I can derive
  one (the middle) letter of the encoded text. In this sense, the codings
  based on pairs (Polish, Spanish, sh for s^, etc.) are all local (within 2),
  but a coding based on the frequency of words is not (besides being
  language dependent). (Please recall that I consider ideographs to be
  'words' made of strokes.)
  A coding based on the frequency of characters is local and (if we accept
  the modifier-bit convention explained above) also language independent.
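
  To illustrate the locality criterion, here is a rough Python sketch
  using the h-for-modifier convention from earlier in this post. The
  function name letter_at and the rule that an h directly after a non-h
  letter is always a modifier are assumptions of the sketch. It looks at
  only three bytes, which is well inside the 3n = 6 bytes allowed for n = 2.

```python
def letter_at(coded, i):
    """Return (letter, modified) for the letter that covers byte i.

    Only bytes i-1, i and i+1 are inspected, so the result cannot
    depend on anything further away in the text.
    """
    prev = coded[i - 1] if i > 0 else ""
    cur = coded[i]
    nxt = coded[i + 1] if i + 1 < len(coded) else ""
    if cur == "h" and prev.isalpha() and prev != "h":
        return (prev, True)   # byte i is the modifier of the previous letter
    if nxt == "h" and cur.isalpha() and cur != "h":
        return (cur, True)    # byte i is a letter followed by its modifier
    return (cur, False)       # plain, unmodified character
```

  The answer for a given byte stays the same whatever is placed in front
  of or behind the window, e.g. letter_at("shot", 0) and
  letter_at("zzzzshot", 4) both give ('s', True).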

 I do believe that since we are discussing CHARACTER sets, we should
 leave out the codings based on dictionaries (word sets). They have their
 function, but they are much more dependent (on application, language, etc.)
 than the character sets. Let's reach some agreement on letters first.
 

	Yours  Dr. pom  -  a scientist  -   (quite mad)   pom@under.s1.gov