Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!sri-spam!mordor!under!pom
From: pom@under.UUCP
Newsgroups: comp.std.internat
Subject: generalised alphabets (standart coding of)
Message-ID: <15534@mordor.s1.gov>
Date: Tue, 1-Sep-87 13:00:53 EDT
Article-I.D.: mordor.15534
Posted: Tue Sep 1 13:00:53 1987
Date-Received: Wed, 2-Sep-87 07:21:51 EDT
Sender: news@mordor.s1.gov
Reply-To: pom@s1-under.UUCP ()
Organization: S-1 Project, LLNL
Lines: 78

Subject: Re: generalised alphabets
Newsgroups: sci.lang,comp.std.internat
References: <15488@mordor.s1.gov> <1209@pdn.UUCP>

In article <1296@houdi.UUCP> you write:
>In article <1209@pdn.UUCP>, alan@pdn.UUCP (Alan Lovejoy) writes:
>> A Proposal:
>>
>> If every letter for any human alphabet, and every ideograph, were given
>> a unique 32-bit (for example) id number, it would be possible to create
>> a 'character look-up table' ...
>>
>> Also, the possible human speech sounds should be given unique id
>> numbers (is 32 bits sufficient?), ....
>
>I'm even less certain about the representation of speech sounds. The
>possible positions of the organs of articulation are continuously

Yes, the char.map, in analogy to color.map, is a good proposal. It
complements my earlier proposal to create a standard mechanism for
coding national alphabets, rather than forcing the same set of
characters on everybody. Combining several proposals and ideas (and
also borrowing Device Independence from computer graphics), we get
the following:

Certain applications require large data sets (files) to be
interchanged between a) different machines and b) different locations.
In many cases these files can be represented by a sequence drawn from
a small number (let's say N) of symbols (generalised characters).
When we ignore digram frequencies, we need log2(N) * L bits per file
of length L (L symbols). When digram frequencies are exploited, the
number of bits decreases to M% of that.
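
The bit count just described can be written out as a small sketch. The function names and the sample numbers are my own illustration, not part of any proposal in this thread; in particular the 50 used for M below is an arbitrary placeholder, not a measured figure for any language.

```python
import math

def bits_per_file(n_symbols: int, length: int, m_percent: float = 100.0) -> float:
    """Estimated size in bits of a file of `length` symbols drawn from an
    alphabet of `n_symbols`, scaled to `m_percent` of the plain cost."""
    if n_symbols < 2 or length < 0:
        raise ValueError("need at least 2 symbols and a non-negative length")
    plain_cost = math.log2(n_symbols) * length  # log2(N) * L
    return plain_cost * m_percent / 100.0

def whole_bits_per_symbol(n_symbols: int) -> int:
    """Bits per symbol if every symbol must occupy a whole number of bits."""
    if n_symbols < 2:
        raise ValueError("need at least 2 symbols")
    return (n_symbols - 1).bit_length()

if __name__ == "__main__":
    # N = 16 symbols, L = 1000 symbols, digram statistics ignored.
    print(bits_per_file(16, 1000))
    # Same file with a placeholder M of 50 (purely illustrative).
    print(bits_per_file(16, 1000, m_percent=50.0))
    # A 26-letter alphabet needs 5 whole bits per symbol.
    print(whole_bits_per_symbol(26))
```

Note that log2(N) is a whole number only when N is a power of two; for other N a fixed-width code rounds up, which is what the second function shows.
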
Let's call log2(N) K1 and M*K1/100 K2; K is either K1 or K2.
A bit of info for trivia lovers: Mr. Markov formulated his concept of
Markov chains while studying the statistics of the Russian language.
Then Shannon gives M for English. What is it?

We need an agreed-on mechanism by which we tell R (the receiving
computer): "In the following stream of bits, take sets of K bits and
interpret them using character lookup table T342" (for example).

"Interpret" can, for example, mean the following:

a) If the receiver is an "ASCII only" printer which allows
   overstrikes, represent symbol i (one of N) by this set of strikes
   (e.g. A^ will be accented A).
b) If it is a printer with an 'appropriate' printwheel (i.e. a daisy
   wheel with N spokes), represent symbol i by a single strike of
   spoke j(i).
c) If the receiver is a bit-mapped CRT, use the graphic image stored in
   /user/public/T342, or defined by the following (CGI) graphic
   primitives ..., which can be scaled, skewed into cursive,
   underlined, capitalised, etc. (It is patently wasteful to spend a
   bit per symbol on any of these, since if you capitalise, it is
   OFTEN either one char or a long sequence. The M and K2 introduced
   above make this aspect quantitative and general.)

............ etc. (char.map can include the collating sequence and, as
a special receiver, vocalisation. That's more complex than a
one-to-one phoneme(char) mapping, but it can be coded within the SAME
framework as 'one of N phonemes'.)

This covers any and all national alphabets, C sources AND, IMPORTANTLY,
numerical data sets. I got an objection when I proposed, as one
special set of N=16, the (generalised) digits (i.e. 0, 1 ... 9, +, -,
: (as triplet separator), EOF, etc.). The objection said: but we can
do that in ASCII, and we do not want to complicate this. My objection
to the objection is as follows: often we do it in ASCII, but not
always. I worked on a "large-scale numerical simulation project" which
had to ship megabytes from a Cray to an Iris workstation (for display).
ASCII was too slow, so extra coding was needed to interpret binary
files. These applications will not go away; there will be more and
more of them, and we do not want to go to binaries. That's what #SCII
should do, WITHOUT forcing me to ship a bit for case (capital, lower
case) with each symbol. That's absurd (in many applications).

pom@under.s1.gov || @s1-under.UUCP
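
P.S. A minimal sketch of the N=16 "generalised digits" case, to make the saving concrete. Everything named here is an assumption for illustration: the table order, the two unassigned slots (the proposal only says "etc."), the high-nibble-first packing, and padding an odd-length stream with EOF. It is one possible reading of "take sets of K bits and look them up in a table", not a defined format.

```python
# Hypothetical 16-symbol table: 0..9, +, -, ':' (triplet separator), EOF,
# plus two slots left unassigned because the proposal does not name them.
DIGIT_TABLE = list("0123456789+-:") + ["<EOF>", "<unassigned-14>", "<unassigned-15>"]
K = 4  # bits per symbol, log2(16)

def pack(symbols):
    """Sender side: pack table symbols into bytes, K=4 bits each,
    high nibble first."""
    codes = [DIGIT_TABLE.index(s) for s in symbols]
    if len(codes) % 2:
        codes.append(DIGIT_TABLE.index("<EOF>"))  # pad odd length with EOF
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))

def unpack(data):
    """Receiver side: take sets of K bits and look each one up in the
    table, stopping at the first EOF symbol."""
    out = []
    for byte in data:
        for code in (byte >> 4, byte & 0x0F):
            symbol = DIGIT_TABLE[code]
            if symbol == "<EOF>":
                return out
            out.append(symbol)
    return out

if __name__ == "__main__":
    text = "-12:+345"
    packed = pack(list(text) + ["<EOF>"])
    # 8 ASCII bytes become 5 packed bytes (9 symbols padded to 10 nibbles).
    print(len(text), len(packed))
    print("".join(unpack(packed)))
```

A receiver that only has an ASCII printer would simply print the unpacked symbols; the point is that the 4-bit form is what travels over the wire.
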