Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!rutgers!sri-spam!mordor!under!pom
From: pom@under..ARPA (Peter O. Mikes)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <15350@mordor.s1.gov>
Date: Thu, 13-Aug-87 15:29:34 EDT
Article-I.D.: mordor.15350
Posted: Thu Aug 13 15:29:34 1987
Date-Received: Sat, 15-Aug-87 09:59:18 EDT
Sender: news@mordor.s1.gov
Reply-To: pom@s1-under.UUCP ()
Organization: S-1 Project, LLNL
Lines: 100
Xref: mnetor comp.lang.c:3629 comp.std.internat:96

To: henry@utzoo.UUCP
In-Reply-To: <8404@utzoo.UUCP>
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP>

In article <8404@utzoo.UUCP> you (Henry Spencer) write:
>
>font changing, sigh...), so a lot of those bits are being wasted. Better
>to use some sort of font-switch (etc.) sequences, simultaneously giving
>more compact coding and more flexibility.
>--

This is an IMPORTANT IDEA :-| .

The so-called 'Daisy Sort' is the sequence of characters on a daisy
printwheel, optimized using the frequency of bigrams in the English
language so that characters which are frequent neighbours sit near each
other on the wheel (that makes for a faster printer). Now, if I recall
correctly, about 90% of movements are within ten spokes' distance, and
(another statistical fact) the special symbols and capitals are so rare
that their spacing is irrelevant (except that digits tend to follow
digits, so you place all digits next to each other).

=> It is very wasteful to store English text using ASCII.
ergo: There are really just a few 'rational alternatives' for storing text:

  1) 4 bits: sign + 3-bit distance in the sort (of an imaginary standard
     printwheel), with one code (0+000) reserved to mean: the following
     4-bit word has another meaning, e.g. a long jump, or a jump to
     another subset of the character set (such as switching the case
     UPPER/lower, digits + arithmetic signs, carriage motion controls, ...).

  2) 6 bits: 1-bit sign + 1 bit (distance/font switch) + 4 bits (either
     the distance to the next character within the given sort, or one of
     16 other subfont sorts).

  3) ...

Naturally, languages such as C would have different statistics and would
probably merit a special sort. It would be marked by a (six-bit?) code at
the beginning of the file/document (since the unix command 'file' would
not work {it does not work too well anyway}) specifying the (type of the
file) = (appropriate daisy sort), such as: English text, numerical data,
PostScript file, C source, ...

=> It is ALSO very wasteful to store numerical data sets using ASCII.

Of course, in the numerical_data character subset we need characters for
overflow and undefined values (NaN, Infinity, missing data point,
end-of-file, end-of-row = end_of_vector, another-data-set, ...) and
characters for the decimal point/comma, E, and a triplet separator, so
that I can write

    6_234,567 = 6_234.567 = 0.623_456_7E4

to mean six thousand and 234.567. (The decimal comma (European way) is
preferred by the ISO SI standard, while the decimal point (US way) is
tolerated.) The current ISO triplet separator (namely blank, i.e. 1 000
for one thousand) MUST be changed, since blank is used in parsing.
Perhaps 1_000=E3 (and 10 = E3.101 ?) or 1:000 = 1E3 (with / only being
used for division?).

Actually, for speed of parsing it would be highly preferable to AVOID
text-punctuation separators (. and ,) and letters when expressing numbers.
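To make the underscore notation above a bit more concrete, here is a rough C sketch of a reader for it. The function name parse_numeral is made up for illustration, and the rules are my reading of the examples: '_' is skipped as the triplet separator, either ',' or '.' counts as the decimal sign, and 'E' introduces a signed decimal exponent. It does not attempt the NaN/Infinity markers, and it does not guard against absurdly large exponents.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parse a numeral such as "6_234,567" or "1_000E3".
 *   '_'        triplet separator (ignored)
 *   ',' or '.' decimal sign (at most one)
 *   'E'        decimal exponent with optional sign
 * Returns 0 on success, -1 on a malformed string. */
int parse_numeral(const char *s, double *out)
{
    double value = 0.0;
    int frac_digits = 0;              /* digits seen after the decimal sign */
    int seen_decimal = 0, seen_digit = 0;
    int negative = 0;
    int exponent = 0;

    assert(s != NULL && out != NULL);

    if (*s == '+' || *s == '-')
        negative = (*s++ == '-');

    /* Mantissa: accumulate all digits as one integer, count the fraction. */
    for (; *s != '\0' && *s != 'E'; s++) {
        if (*s >= '0' && *s <= '9') {
            value = value * 10.0 + (*s - '0');
            frac_digits += seen_decimal;
            seen_digit = 1;
        } else if (*s == ',' || *s == '.') {
            if (seen_decimal)
                return -1;            /* two decimal signs */
            seen_decimal = 1;
        } else if (*s != '_') {
            return -1;                /* blank, letter, etc. */
        }
    }
    if (!seen_digit)
        return -1;

    /* Optional exponent part. */
    if (*s == 'E') {
        int exp_negative = 0, exp_digits = 0;
        s++;
        if (*s == '+' || *s == '-')
            exp_negative = (*s++ == '-');
        for (; *s >= '0' && *s <= '9'; s++) {
            exponent = exponent * 10 + (*s - '0');
            exp_digits++;
        }
        if (exp_digits == 0 || *s != '\0')
            return -1;
        if (exp_negative)
            exponent = -exponent;
    }

    /* Scale by the net power of ten. */
    exponent -= frac_digits;
    for (; exponent > 0; exponent--) value *= 10.0;
    for (; exponent < 0; exponent++) value /= 10.0;

    *out = negative ? -value : value;
    return 0;
}
```

With this sketch, parse_numeral("6_234,567", &x) and parse_numeral("6_234.567", &x) both succeed and leave roughly 6234.567 in x, while "1 000" is rejected because blank is not accepted as a separator.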
Perhaps we can write 3:456::4 to express three thousand four hundred
fifty-six and four tenths, and perhaps 1:+3 = 1:000 (1E3) and 5:-3 for
.005 (5e-3)?

In any case, we should be able to express all numbers using sixteen
digit-type characters: + - 0..9, (decimal sign), (exponent sign) (that's
13 or 14), and then perhaps either | or { } for C-style sets, and (one
triplet separator), e.g. : or _ (not blank). We can then represent
Infinity as ::: or +++ and NaN as +_+, etc.

Anyway, I just wanted to say that Henry's pertinent reminder, that
character sets and the grouping of characters into sets (or sub-fonts)
affect the compactness of information storage, really points the way to
an objective measure of the suitability of different coding methods for
different uses, and that several categories of use, namely

  1) English text or just any plain text (i.e. prose) (4 or 6 bits)
  2) numerical data sets (i.e. number or point sets) (4 bits)
  3) C or just 'any programming language'
  4) carriage motions (tabs, form feeds, cursor addressing, ...)
  ?) modifiers (highlight, underline, typography, ...)
  ?) ...?...

are frequent enough and universal enough to merit their own character
families or subfonts, binary representations and an international standard.
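For what it is worth, alternative 1) for text (sign + 3-bit distance, with 0+000 as the escape) can be sketched in C roughly as follows. Everything concrete here is an assumption of mine, not a worked-out standard: the WHEEL string is only a placeholder ordering (plain letter frequency, not a real bigram-optimized daisy sort), the escape is followed by two nibbles holding an absolute wheel position as the 'long jump', the otherwise unused code 1+000 ("minus zero") means "same spoke again", and nibbles are kept one per array element rather than packed two to a byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder wheel order (assumption): NOT an actual daisy sort. */
static const char WHEEL[] = " etaoinshrdlucmfwypvbgkqjxz";
#define WHEEL_LEN ((int)(sizeof WHEEL - 1))
#define ESCAPE 0x0u   /* 0+000: the next two nibbles are an absolute position */

/* Encode text as 4-bit codes, one code per nib[] element.
 * Returns the number of nibbles written, or -1 on error
 * (character not on the wheel, or output buffer too small). */
long wheel_encode(const char *text, uint8_t *nib, size_t cap)
{
    size_t n = 0;
    int pos = 0;                       /* wheel starts at spoke 0 */

    assert(text != NULL && nib != NULL);
    for (; *text != '\0'; text++) {
        const char *hit = strchr(WHEEL, *text);
        int target, delta;

        if (hit == NULL)
            return -1;
        target = (int)(hit - WHEEL);
        delta = target - pos;

        if (delta >= 1 && delta <= 7) {            /* +1 .. +7 */
            if (n + 1 > cap) return -1;
            nib[n++] = (uint8_t)delta;
        } else if (delta <= 0 && delta >= -7) {    /* -0 .. -7, -0 = repeat */
            if (n + 1 > cap) return -1;
            nib[n++] = (uint8_t)(0x8 | -delta);
        } else {                                   /* long jump via escape */
            if (n + 3 > cap) return -1;
            nib[n++] = ESCAPE;
            nib[n++] = (uint8_t)(target >> 4);
            nib[n++] = (uint8_t)(target & 0xF);
        }
        pos = target;
    }
    return (long)n;
}

/* Decode n nibbles back into a NUL-terminated string.
 * Returns the string length, or -1 on malformed input or a short buffer. */
long wheel_decode(const uint8_t *nib, size_t n, char *out, size_t cap)
{
    size_t i = 0, len = 0;
    int pos = 0;

    assert(nib != NULL && out != NULL && cap > 0);
    while (i < n) {
        uint8_t code = nib[i++];

        if (code == ESCAPE) {
            if (i + 2 > n) return -1;
            pos = (nib[i] << 4) | nib[i + 1];
            i += 2;
        } else if (code & 0x8) {
            pos -= code & 0x7;
        } else {
            pos += code;
        }
        if (pos < 0 || pos >= WHEEL_LEN || len + 1 >= cap)
            return -1;
        out[len++] = WHEEL[pos];
    }
    out[len] = '\0';
    return (long)len;
}
```

On this placeholder wheel, "tea" costs three nibbles (12 bits against 24 in 8-bit ASCII), while "zoo" needs two escapes and costs seven. That is the whole point of choosing the sort from bigram statistics: the better the ordering matches the text, the fewer long jumps are needed.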