Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!rutgers!cmcl2!lanl!opus!ted From: ted@nmsu.edu (Ted Dunning) Newsgroups: comp.ai.neural-nets Subject: Re: Data Complexity Message-ID: Date: 16 Oct 89 17:38:56 GMT References: <4542@imagen.UUCP> Sender: news@nmsu.edu Organization: NMSU Computer Science Lines: 135 In-reply-to: ib@apolling's message of 15 Oct 89 22:08:08 GMT ivan bach is very enthusiastic, but relatively uninformed about recent developments in determining the complexity of finite sequences. he also has an annoying tendency to post news that is more than 80 columns. here is a partial commentary on his posting, In article <4542@imagen.UUCP> ib@apolling (Ivan N. Bach) writes: Path: opus!lanl!cmcl2!nrl-cmf!think!ames!amdcad!sun!imagen!daemon From: ib@apolling (Ivan N. Bach) Claude Shannon, founder of the theory of information, ... He decided to call this measure "entropy," because he realized that his formula: H = -K x sum of (pi x log pi) for all i=1 to i=a shannon actually qualified this measure extensively. it is only valid if successive symbols are independent, and only valid for stationary ergodic sequences. none of these qualifications apply in most biological systems. Lila Gatlin showed that Claude Shannon's formula could be used to calculate the entropy of a randomly chosen base in the DNA chain of bases in the nucleus of a living cell. If all four types of bases in a DNA chain are equiprobable, the entropy of a randomly chosen base in the DNA chain is equal to: ... = 2 bits actuallly lila gatlin showed that you can misuse shannon's formula to get an upper bound on the amount of information in a dna chain. Computer programmers know that they have to use at least 2 bits to Cencode 4 different values. real computer programmers know that this is not true. If all bases in a DNA chain are not equiprobable, the probability of a certain type of base occurring at a random position in a DNA chain is proportional to the number of bases of its type in the ... Therefore, the entropy of a randomly chosen base in the DNA chain of "Micrococcus lysodeikticus" is equal to: = 1.87 bits this is entirely wrong. see below. Lila Gatlin came to the conclusion that the entropy of a complex system is equal to the sum of entropies of its components. lila is again mistaken, unless what actually was said was that the entropy of a system composed of _independent_ parts is the sum of the entropies of its components. of course _complex_ systems generally do not allow such naive decomposition. [ arithmetic maunderings about bits and blueprints deleted ] [ fuzzy comparison of a turing machine to protein synthesis ] 4. Instead of constructing a full-blown neural network, we could specify a "blueprint" for contructing that network. I think that the amount of information in the blueprint can always be smaller than the amount of information in the network if there is any redundancy or repetition in the structure of the network. actually the blueprint must be less than or equal to the network. otherwise, we would just use the network as the blueprint. see below for a reasonable definition of information content. 5. If you use a certain "alphaneural net, you should use each type of component with the same frequency if you want achieve the maximum information capacity. of course maximum information _content_ (or capacity) would also be obtained by making all components completely independent. this is the same as recommending that people use RAM for their neural net. you _don't_ want a system with maximum content. you want a system that encodes and carries meaning. these are very different things. ---------------------------------------------------------------- shannon's formulae are simple enough to be misused extensively by all kinds of people. such misuse does not mean that they are any more applicable than they were in 1949. in particular, if you compute the entropy of english text, you over-estimate the information content by a factor of 4 or more. for instance, this posting has an entropy of 4.6 bits per letter, while english is generally accepted to have an information content of 1 bit per letter or less. the difference is entirely due to the assumption of independence between letters. an even more dramatic example is a sequence formed by the regular expression {abcd}*. if we take any 100 characters from a random location in such a sequence, and compute the entropy for the sequence, we will find that we apparently have 200 bits of information. in fact, though, we have only 2 bits of information because once we know the first letter in the sequence, we know all the letters. a much more sound approach to the complexity of sequences was proposed in the 60's by various soviet mathematicians, and was popularized for use with chaotic systems by v.i. arnold in the 70's. this model defines the complexity of a finite string to be the size of the smallest computer program which can be used to recreate the sequence. in arnold's work, the sequence was the itinerary of a finite map, but this does not really matter if we are starting with a sequence from a finite alphabet. obviously, the upper bound on the complexity of a finite sequence is the sequence size itself, since we can easily write a program for, say, a ram machine which builds a list containing the sequence in the same number of steps as the original sequence is long. the complexity of one of our 100 character substrings of {abcd}* is much less, since we can write a very simple program which produces one of these substrings. in practice, information content of a complex sequence can be estimated by building relatively simple models for inter-symbol dependencies and then measuring the entropy of the errors the model makes in predicting the next symbol. this is nearly equivalent to the approach used by arnold and co, since if you can't build a predictor at all, then the error stream becomes equivalent to the original sequence. -- ted@nmsu.edu Dem Dichter war so wohl daheime In Schildas teurem Eichenhain! Dort wob ich meine zarten Reime Aus Veilchenduft und Mondenschein