Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!sri-spam!mordor!under!pom
From: pom@under..ARPA (Peter O. Mikes)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <15350@mordor.s1.gov>
Date: Thu, 13-Aug-87 15:29:34 EDT
Article-I.D.: mordor.15350
Posted: Thu Aug 13 15:29:34 1987
Date-Received: Sat, 15-Aug-87 09:59:18 EDT
Sender: news@mordor.s1.gov
Reply-To: pom@s1-under.UUCP ()
Organization: S-1 Project, LLNL
Lines: 100
Xref: mnetor comp.lang.c:3629 comp.std.internat:96

To: henry@utzoo.UUCP
Subject: Re: What is a byte
Newsgroups: comp.lang.c,comp.std.internat
In-Reply-To: <8404@utzoo.UUCP>
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP>
Organization: S-1 Project, LLNL
 
In article <8404@utzoo.UUCP> you  ( Henry Spencer ) write:
>
>font changing, sigh...), so a lot of those bits are being wasted.  Better
>to use some sort of font-switch (etc.) sequences, simultaneously giving
>more compact coding and more flexibility.
>--    ^ 
  This | is an IMPORTANT IDEA  :-| . In the so-called 'Daisy Sort', the
  sequence of characters on the printwheel is optimized, using the frequency
  of bigrams in the English language, in such a manner that characters which
  are frequent neighbours are near to each other (that makes for a faster
  printer). NOW, if I recall correctly, about 90% of movements are within
  ten-spokes distance and (another statistical fact) the special symbols
  and capitals are so rare that their spacing is irrelevant (except that
  digits tend to follow digits, so you place all digits next to each other).
  | |
   v
    => It is very wasteful to store English text using ASCII. 
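
  The 'within ten spokes' figure can be measured for any candidate wheel
  order and any sample of text. A minimal sketch, assuming a circular wheel
  where the shorter way round is always taken; the function names are made
  up for illustration, and no real Daisy Sort order is implied:

```python
def spoke_distances(text, wheel):
    """Yield the signed spoke distance between consecutive characters.

    `wheel` is a string listing the characters in their order around a
    hypothetical circular printwheel. Characters not on the wheel are skipped.
    """
    pos = {ch: i for i, ch in enumerate(wheel)}
    n = len(wheel)
    prev = None
    for ch in text:
        if ch not in pos:
            continue
        if prev is not None:
            d = (pos[ch] - pos[prev]) % n   # distance going one way round
            if d > n // 2:                  # shorter to go the other way
                d -= n
            yield d
        prev = ch


def fraction_within(text, wheel, limit):
    """Fraction of movements whose distance is at most `limit` spokes."""
    dists = list(spoke_distances(text, wheel))
    if not dists:
        return 0.0
    return sum(1 for d in dists if abs(d) <= limit) / len(dists)
```

  Feeding this a large sample of English and a trial wheel order would show
  how many movements fit into a short signed distance for that order.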
   
  ergo:  
        There are really just a few 'rational alternatives' for storing text:
         
   1)  4 bits: sign + 3-bit distance in the sort (of an imaginary standard
                                                  printwheel),
          with one code (0+000) being reserved to mean: the following
          4-bit word has another meaning (e.g. a long jump, or a jump to
          another subset of the character set, such as a switch of case
          UPPER/lower, digits + arithmetic signs, carriage motion controls...)
           
   2) 6 bits: 1 bit sign + 1 bit (distance/font switch) + 4 bits (either the
        distance to the next character within the given sort, or one of 16
        other subfont sorts)
  
   3) ...
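
  Alternative 1) can be sketched as follows. This is only an illustration
  under several assumptions that the scheme above leaves open: the wheel is
  treated as a straight line starting at position 0, the code 1+000 ("minus
  zero") is taken to mean "same spoke again", and the escape 0+000 is always
  followed by two 4-bit words giving an absolute position (the "long jump").
  The names ESCAPE, REPEAT, encode and decode are hypothetical.

```python
ESCAPE = 0b0000   # "0+000": the following 4-bit words have another meaning
REPEAT = 0b1000   # "1+000": assumed here to mean "same spoke again"


def encode(text, wheel):
    """Encode text as a list of 4-bit words (sign bit + 3-bit distance)."""
    pos = {ch: i for i, ch in enumerate(wheel)}   # wheel of up to 256 spokes
    nibbles = []
    cur = 0
    for ch in text:
        target = pos[ch]            # KeyError if the character is not on the wheel
        delta = target - cur
        if delta == 0:
            nibbles.append(REPEAT)
        elif -7 <= delta <= 7:
            sign = 0b1000 if delta < 0 else 0
            nibbles.append(sign | abs(delta))
        else:
            # long jump: escape, then the absolute position in two 4-bit words
            nibbles += [ESCAPE, target >> 4, target & 0xF]
        cur = target
    return nibbles


def decode(nibbles, wheel):
    """Invert encode(): walk the wheel and collect the characters."""
    out = []
    cur = 0
    it = iter(nibbles)
    for nib in it:
        if nib == ESCAPE:
            cur = (next(it) << 4) | next(it)
        elif nib != REPEAT:
            mag = nib & 0b0111
            cur += -mag if nib & 0b1000 else mag
        out.append(wheel[cur])
    return "".join(out)
```

  With a made-up frequency-ordered wheel such as
  " etaoinshrdlucmfwygpbvkxqjz", the seven characters of "the tea" fit into
  seven 4-bit words, and only a rare letter such as z costs a long jump.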


	Naturally, languages such as C would have different statistics
  and would probably merit a special sort, which would be marked by a
  six(?)-bit code at the beginning of the file/document (since the unix
  command 'file' would not work {it does not work too well anyway})
  specifying the (type of the file) = (appropriate daisy sort), such as:
  english text, numerical data, PostScript file, C source, ...
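
  A small sketch of reading such a marker, assuming it sits in the top six
  bits of the first byte. The tag values in SORTS are invented for the
  example; no numbers have been assigned to any sort.

```python
# Hypothetical tag values: nothing here is a standard, only an illustration.
SORTS = {
    0: "english text",
    1: "numerical data",
    2: "PostScript file",
    3: "C source",
}


def read_sort_tag(data):
    """Return (sort name, leftover 2 bits) from the first byte of a file."""
    first = data[0]
    tag = first >> 2              # top six bits select the daisy sort
    return SORTS.get(tag, "unknown"), first & 0b11
```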

    => It is ALSO very wasteful to store numerical data sets using ASCII. 

     Of course, in the numerical_data character subset we need characters
  for overflow and undefined values (NaN, Infinity, missing data point,
  end-of-file, end-of-row = end_of_vector, another-data-set, ...)
  and characters for the decimal point/comma, E, and the triplet separator,
  so that I can write
 
   6_234,567 = 6_234.567  = 0.623_456_7E4 to mean

  six thousand and 234.567. (The decimal comma (the European way) is preferred
  by the ISO SI standard, while the decimal point (the US way) is tolerated.)
  The (current ISO) triplet separator (namely blank, i.e. 1 000 for
  one thousand) MUST be changed, since blank is used in parsing.
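
  That the spellings with '_' as the triplet separator denote one value can
  be checked mechanically. A minimal sketch, assuming '_' is only ever a
  separator and that ',' and '.' both stand for the decimal sign; the name
  parse_number is made up for the example:

```python
from decimal import Decimal


def parse_number(s):
    """Parse a number with '_' triplet separators and ',' or '.' as decimal sign."""
    cleaned = s.replace("_", "").replace(",", ".")
    return Decimal(cleaned)


# The comma form, the point form and the exponent form of the same value.
same = (
    parse_number("6_234,567")
    == parse_number("6_234.567")
    == parse_number("0.623_456_7E4")
)
```

  Note that a blank separator would not survive this treatment, because the
  number would already have been split in two by whatever parsed the line.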

 Perhaps  1_000=E3 (and 10 = E3.101  ?) or 1:000 = 1E3 
                                    (with / only, being used for division?)
   
 Actually, for speed of parsing it would be highly preferable to AVOID
 alphabetic separators (. and ,) and letters to express numbers. 

  Perhaps we can write   3:456::4    to express
     three thousand four hundred fifty-six and four tenths

    and perhaps  1:+3  = 1:000 ( 1E3) and 5:-3 for  .005  ( 5e-3) ?
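
  One possible reading of this colon notation, as a sketch only: a ':'
  followed by exactly three digits is a triplet separator, '::' is the
  decimal sign, and ':' followed by a signed integer is a power of ten.
  These rules are an assumption pieced together from the examples here, and
  parse_colon is a hypothetical name.

```python
import re
from decimal import Decimal

# whole part with optional :ddd triplets, optional ::fraction, optional :+n / :-n
_COLON = re.compile(r"(\d+(?::\d{3})*)(?:::(\d+))?(?::([+-]\d+))?")


def parse_colon(s):
    """Parse e.g. '3:456::4', '1:000', '1:+3' or '5:-3' into a Decimal."""
    m = _COLON.fullmatch(s)
    if m is None:
        raise ValueError("not in colon notation: %r" % s)
    whole, frac, exp = m.groups()
    text = whole.replace(":", "")
    if frac:
        text += "." + frac
    if exp:
        text += "E" + exp
    return Decimal(text)
```

  Only digits, ':' , '+' and '-' are needed, so a parser never has to look
  at a letter or at text punctuation to read a number.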

In any case, we should be able to express all numbers using sixteen digit-type
    characters:  + -   0..9, (decimal sign), (exponent sign) (that's 13 or 14),
    and then perhaps either | or { } for c-style sets,
    and (one triplet separator) (e.g. : or _, not blank).
    We can then represent Infinity as ::: or +++ and NaN as +_+, etc.
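
  With sixteen characters, each one fits in 4 bits and two fit in a byte. A
  minimal sketch of such packing; the particular assignment of codes in
  ALPHABET (with '.' and 'E' standing in for the decimal and exponent signs
  and '_' as the triplet separator) is hypothetical, as are the names pack
  and unpack.

```python
# Hypothetical assignment of the sixteen 4-bit codes, in code order 0..15.
ALPHABET = "0123456789+-.E|_"


def pack(s):
    """Pack a string over ALPHABET into bytes, two characters per byte.

    Returns (data, count); the count is needed because all sixteen codes
    are in use, so an odd-length string cannot be padded unambiguously.
    """
    codes = [ALPHABET.index(ch) for ch in s]   # ValueError if ch is not in the set
    if len(codes) % 2:
        codes.append(0)                        # filler, cut off again by unpack()
    data = bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))
    return data, len(s)


def unpack(data, count):
    """Invert pack(): expand each byte into two characters, keep `count`."""
    codes = []
    for b in data:
        codes += [b >> 4, b & 0xF]
    return "".join(ALPHABET[c] for c in codes[:count])
```

  The nine characters of 6_234.567 then take five bytes instead of nine.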

  Anyway, I just wanted to say that Henry's pertinent reminder, that
  character sets and the grouping of characters into sets (or sub-fonts)
  affect the compactness of information storage, really points the way to an
  objective measure of the suitability of different coding methods for
  different uses - and that several categories of use, namely

1)  english text or just any plain text (i.e. prose)      (4 or 6 bits)
2)  numerical data sets (i.e. number or point sets)       (4 bits)
3)  C or just 'any programming language'
4)  carriage motions (tabs, form feeds, cursor addressing, ...??)
    modifiers (highlight, underline, typography, ...)
?)  ...?...
 

    are frequent enough and universal enough  to merit their own 
    character families or subfonts, binary representations
    and an international standard.