Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!rutgers!cmcl2!lanl!opus!ted
From: ted@nmsu.edu (Ted Dunning)
Newsgroups: comp.ai.neural-nets
Subject: Re: Data Complexity
Message-ID: <TED.89Oct16113856@aigyptos.nmsu.edu>
Date: 16 Oct 89 17:38:56 GMT
References: <4542@imagen.UUCP>
Sender: news@nmsu.edu
Organization: NMSU Computer Science
Lines: 135
In-reply-to: ib@apolling's message of 15 Oct 89 22:08:08 GMT


ivan bach is very enthusiastic, but relatively uninformed about recent
developments in determining the complexity of finite sequences.

he also has an annoying tendency to post news that is more than 80
columns.

here is a partial commentary on his posting,

In article <4542@imagen.UUCP> ib@apolling (Ivan N. Bach) writes:

   Path: opus!lanl!cmcl2!nrl-cmf!think!ames!amdcad!sun!imagen!daemon
   From: ib@apolling (Ivan N. Bach)

   Claude Shannon, founder of the theory of information, ...  He
   decided to call this measure "entropy," because he realized that
   his formula:

	H = -K x sum of (pi x log pi) for all i=1 to i=a

shannon actually qualified this measure extensively.  it is only valid
if successive symbols are independent, and only valid for stationary
ergodic sequences.  none of these qualifications apply in most
biological systems.

   Lila Gatlin showed that Claude Shannon's formula could be used to
   calculate the entropy of a randomly chosen base in the DNA chain of
   bases in the nucleus of a living cell.  If all four types of bases
   in a DNA chain are equiprobable, the entropy of a randomly chosen
   base in the DNA chain is equal to: ... = 2 bits

actuallly lila gatlin showed that you can misuse shannon's formula to
get an upper bound on the amount of information in a dna chain.

   Computer programmers know that they have to use at least 2 bits to
   Cencode 4 different values.

real computer programmers know that this is not true.

   If all bases in a DNA chain are not equiprobable, the probability
   of a certain type of base occurring at a random position in a DNA
   chain is proportional to the number of bases of its type in the

	...  Therefore, the entropy of a randomly chosen base in the
   DNA chain of "Micrococcus lysodeikticus" is equal to:

	 = 1.87 bits

this is entirely wrong.  see below.

   Lila Gatlin came to the conclusion that the entropy of a complex
   system is equal to the sum of entropies of its components.

lila is again mistaken, unless what actually was said was that the
entropy of a system composed of _independent_ parts is the sum of the
entropies of its components.  of course _complex_ systems generally do
not allow such naive decomposition.

	[ arithmetic maunderings about bits and blueprints deleted ]

	[ fuzzy comparison of a turing machine to protein synthesis ]


   4. Instead of constructing a full-blown neural network, we could
   specify a "blueprint" for contructing that network.  I think that
   the amount of information in the blueprint can always be smaller
   than the amount of information in the network if there is any
   redundancy or repetition in the structure of the network.

actually the blueprint must be less than or equal to the network.
otherwise, we would just use the network as the blueprint.  see below
for a reasonable definition of information content.

   5. If you use a certain "alphaneural net, you should use each type
   of component with the same frequency if you want achieve the
   maximum information capacity.

of course maximum information _content_ (or capacity) would also be
obtained by making all components completely independent.  this is the
same as recommending that people use RAM for their neural net.  you
_don't_ want a system with maximum content.  you want a system that
encodes and carries meaning.  these are very different things.


----------------------------------------------------------------


shannon's formulae are simple enough to be misused extensively by all
kinds of people.  such misuse does not mean that they are any more
applicable than they were in 1949.  in particular, if you compute the
entropy of english text, you over-estimate the information content by
a factor of 4 or more.  for instance, this posting has an entropy of
4.6 bits per letter, while english is generally accepted to have an
information content of 1 bit per letter or less.  the difference is
entirely due to the assumption of independence between letters.

an even more dramatic example is a sequence formed by the regular
expression {abcd}*.  if we take any 100 characters from a random
location in such a sequence, and compute the entropy for the sequence,
we will find that we apparently have 200 bits of information.  in
fact, though, we have only 2 bits of information because once we know
the first letter in the sequence, we know all the letters.

a much more sound approach to the complexity of sequences was proposed
in the 60's by various soviet mathematicians, and was popularized for
use with chaotic systems by v.i. arnold in the 70's.  this model
defines the complexity of a finite string to be the size of the
smallest computer program which can be used to recreate the sequence.
in arnold's work, the sequence was the itinerary of a finite map, but
this does not really matter if we are starting with a sequence from a
finite alphabet.

obviously, the upper bound on the complexity of a finite sequence is
the sequence size itself, since we can easily write a program for, say,
a ram machine which builds a list containing the sequence in the same
number of steps as the original sequence is long.

the complexity of one of our 100 character substrings of {abcd}* is
much less, since we can write a very simple program which produces
one of these substrings.

in practice, information content of a complex sequence can be
estimated by building relatively simple models for inter-symbol
dependencies and then measuring the entropy of the errors the model
makes in predicting the next symbol.  this is nearly equivalent to the
approach used by arnold and co, since if you can't build a predictor
at all, then the error stream becomes equivalent to the original
sequence. 
--
ted@nmsu.edu
			Dem Dichter war so wohl daheime
			In Schildas teurem Eichenhain!
			Dort wob ich meine zarten Reime
			Aus Veilchenduft und Mondenschein