Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!sri-spam!mordor!under!pom
From: pom@under.UUCP
Newsgroups: comp.std.internat
Subject: generalised alphabets (standart coding of)
Message-ID: <15534@mordor.s1.gov>
Date: Tue, 1-Sep-87 13:00:53 EDT
Article-I.D.: mordor.15534
Posted: Tue Sep  1 13:00:53 1987
Date-Received: Wed, 2-Sep-87 07:21:51 EDT
Sender: news@mordor.s1.gov
Reply-To: pom@s1-under.UUCP ()
Organization: S-1 Project, LLNL
Lines: 78

Subject: Re: generalised alphabets
Newsgroups: sci.lang,comp.std.internat
References: <15488@mordor.s1.gov> <1209@pdn.UUCP>

In article <1296@houdi.UUCP> you write:
>In article <1209@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
>> A Proposal:
>> 
>> If every letter for any human alphabet, and every ideograph, were given
>> a unique 32-bit (for example) id number, it would be possible to create
>> a 'character look-up table' ...
>> 
>> Also, the possible human speech sounds should be given unique id
>> numbers (is 32 bits sufficient?), ....
>
>I'm even less certain about the representation of speech sounds.  The
>possible positions of the organs of articulation are continuously

	Yes, the char.map, by analogy with color.map, is a good proposal.
 It complements my earlier proposal to create a standard mechanism
 for coding the national alphabets, rather than forcing the same set
 of characters on everybody. Combining several proposals and ideas
 (and also borrowing Device Independence from computer graphics), we
 get the following:
	Certain applications require large data sets (files) to be
	interchanged between a) different machines and b) locations.
	In many cases these files can be represented by a sequence
	of a small number (let's say N) of symbols (generalised characters).
	When we ignore digram frequencies, we need L * log2(N) bits
	per file of length L (L symbols). When digram frequencies
	are exploited, the number of bits decreases to M% of that.
	Let's call log2(N) K1 and M*K1/100 K2; K is either K1 or K2.
  A bit of info for trivia lovers: Mr. Markov formulated his
  concept of Markov chains while studying the statistics of the Russian
  language. Then Shannon gave M for English. What is it?
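
A rough sketch of the bit counts above, in Python. The function names
are made up for illustration, and the value M = 50 in the example is a
placeholder, not a measured figure for any language.

```python
import math


def bits_per_symbol(n_symbols: int) -> float:
    """K1: bits per symbol when digram frequencies are ignored."""
    return math.log2(n_symbols)


def file_bits(n_symbols: int, length: int, m_percent: float = 100.0) -> float:
    """Total bits for a file of `length` symbols.

    m_percent = 100 gives K = K1; a smaller M gives K = K2 = M*K1/100.
    """
    k1 = bits_per_symbol(n_symbols)
    k = m_percent * k1 / 100.0
    return length * k


if __name__ == "__main__":
    print(file_bits(16, 1000))        # N = 16, so K1 = 4 bits per symbol
    print(file_bits(16, 1000, 50.0))  # M = 50 is an assumed, illustrative value
```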
      
      We need an agreed-on mechanism by which we tell R (the receiving
      computer): in the following stream of bits, take groups of K
      bits and interpret them using character lookup table T342 (for example).
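
One way this could look, as a minimal Python sketch. Everything concrete
here is an assumption for illustration: the contents of table T342, the
choice K = 3, packing the most significant bit first, and passing the
symbol count alongside the data rather than in an agreed header.

```python
# Hypothetical lookup tables known to both sender and receiver.
TABLES = {"T342": ["a", "b", "c", "d", "e", "f", "g", "h"]}  # N = 8, K = 3


def pack_codes(codes, k):
    """Pack integer codes (each < 2**k) into bytes, most significant bit first."""
    acc, nbits, out = 0, 0, bytearray()
    for c in codes:
        if not 0 <= c < (1 << k):
            raise ValueError("code does not fit in k bits")
        acc = (acc << k) | c
        nbits += k
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    if nbits:  # pad the last byte with zero bits
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack_codes(data, k, count):
    """Take `count` groups of k bits from the byte stream."""
    acc, nbits, codes = 0, 0, []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= k and len(codes) < count:
            nbits -= k
            codes.append((acc >> nbits) & ((1 << k) - 1))
            acc &= (1 << nbits) - 1
    return codes


def interpret(table_id, data, k, count):
    """Receiver side: map each k-bit group through the named table."""
    table = TABLES[table_id]
    return [table[c] for c in unpack_codes(data, k, count)]


if __name__ == "__main__":
    payload = pack_codes([1, 2, 3], 3)
    print(interpret("T342", payload, 3, 3))
```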

      "Interpret" can, for example, mean the following:
      a) If the receiver is an "ASCII only" printer which
         allows overstrikes, then represent symbol i (one of N) by
         this set of strikes (e.g. A^ will be an accented A).
      b) If it is a printer with an 'appropriate' printwheel
         (i.e. a daisy wheel with N spokes), represent
         symbol i by a single strike of spoke j(i).
      c) If the receiver is a bit-mapped CRT,
         use the graphic image stored in /user/public/T342 or defined
         by the following (cgi) graphic primitives ... which can be
         scaled, skewed into cursive, underlined, capitalised, ... etc.
         (It is patently wasteful to spend a bit on any of these, since
         when you capitalise, it is OFTEN either one char or a long
         sequence. The M and K2 introduced above make this aspect
         quantitative and general.)
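
Cases a) and b) might be sketched like this in Python. The table entries,
the spoke numbers and the use of a backspace character to produce an
overstrike are all assumptions made up for the example; case c) is left
out because it depends on the graphics package at hand.

```python
# Hypothetical char.map: one entry per symbol code, one field per receiver type.
CHAR_MAP = {
    0: {"overstrike": ["A"], "spoke": 17},
    1: {"overstrike": ["A", "^"], "spoke": 42},  # accented A
    2: {"overstrike": ["e", "'"], "spoke": 5},   # accented e
}


def render_overstrike(codes):
    """Case a): ASCII-only printer; strikes are joined by backspaces."""
    return "".join("\b".join(CHAR_MAP[c]["overstrike"]) for c in codes)


def render_daisy(codes):
    """Case b): daisy wheel; symbol i becomes a single strike of spoke j(i)."""
    return [CHAR_MAP[c]["spoke"] for c in codes]


if __name__ == "__main__":
    message = [0, 1, 2]
    print(repr(render_overstrike(message)))
    print(render_daisy(message))
```

The same stream of codes is sent to every receiver; only the column of
the table that gets consulted changes.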

	............ etc. (The char.map can also include a collating sequence
	and, as a special receiver, vocalisation. That is
	more complex than just a one-to-one phoneme(char) mapping, but it
	can be coded within the SAME framework as 'one of N phonemes'.)

	This covers any and all national alphabets, C sources AND,
	IMPORTANTLY, numerical data sets. I got an objection
	when I proposed that one special set, with N=16, be the (generalised)
	digits (i.e. 0, 1, ... 9, +, -, : (as triplet separator), EOF, etc.).
	The objection said: but we can do it in ASCII and we do not
	want to complicate this.
	My objection to the objection is as follows: often we do it
	in ASCII, but not always. I worked on a "large-scale numerical
	simulation project" which had to ship megabytes from a Cray to
	an Iris workstation (for display). ASCII was too slow, so extra
	coding was needed to interpret binary files. These applications
	will not go away; there will be more and more of them, and we
	do not want to go to binaries. That's what #SCII should do,
	WITHOUT forcing me to ship a bit for case (capital, lower case)
	with each symbol. That's absurd (in many applications).
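
To make the N=16 case concrete, here is a small Python sketch that packs
two 4-bit symbols per byte. The digits, +, -, : and EOF come from the
list above; '.' and 'E' are my own fillers for the two remaining slots,
and the code assignments are arbitrary.

```python
# 16 generalised digits; '.' and 'E' are assumed fillers, not part of the proposal.
SYMBOLS = list("0123456789+-:") + [".", "E", "EOF"]
CODE = {s: i for i, s in enumerate(SYMBOLS)}


def encode(text):
    """Pack a numeric string at 4 bits per symbol, terminated by EOF."""
    codes = [CODE[ch] for ch in text] + [CODE["EOF"]]
    if len(codes) % 2:  # pad to a whole number of bytes
        codes.append(CODE["EOF"])
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def decode(data):
    """Unpack symbols until the EOF symbol is seen."""
    out = []
    for byte in data:
        for code in (byte >> 4, byte & 0x0F):
            sym = SYMBOLS[code]
            if sym == "EOF":
                return "".join(out)
            out.append(sym)
    return "".join(out)


if __name__ == "__main__":
    text = "-12:+3.5E2"
    packed = encode(text)
    print(len(text), "ASCII bytes ->", len(packed), "packed bytes")
    print(decode(packed))
```

With these assumptions the ten-character example goes out in six bytes
and is still self-describing once the receiver knows the table.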


                                pom@under.s1.gov ||  @s1-under.UUCP