Xref: utzoo comp.ai.neural-nets:948 alt.cyb-sys:27
Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!uakari.primate.wisc.edu!aplcen!haven!uvaarpa!uvaee!aam9n
From: aam9n@uvaee.ee.virginia.EDU (Ali Minai)
Newsgroups: comp.ai.neural-nets,alt.cyb-sys
Subject: Generalization Criteria
Message-ID: <509@uvaee.ee.virginia.EDU>
Date: 20 Sep 89 19:42:58 GMT
Organization: EE Dept, U of Virginia, Charlottesville
Lines: 52


A few days ago, I posted a question on generalization criteria, and the
response has been very good. Several people sent me references, and I
will e-mail a list of these to anyone who is interested. Just send me
mail at
        aam9n@uvaee.ee.Virginia.EDU

At this point, let me be a little more specific about my question. I
am looking for a way to get away from the "test set" notion of measuring
generalization. A "test set" is arbitrary, and any generalization measure
it produces is good only relative to this set ---- and only partially so,
because the "test set" itself is only a fragment of the data on which
the estimator will be used. I realize that it is impossible to get away
from this relativity altogether. All measures must be relative to something
or other. In the case of particular applications, it makes sense to
use real test sets to check estimator performance, but in cases of purely
theoretical interest, it might be more appropriate to use *internal*
measures of performance. By "internal" I mean measures which use only
the parameters of the estimator itself, or, at least, are not based on
arbitrarily chosen sets of extraneous data. One good measure which I
have already seen is Rissanen's minimum description length (MDL)
criterion. Other entropy-related measures are also out there, and it is
mainly on these that
I seek information. Of course, if other novel measures have been used,
I will be more than interested.
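
To make the idea of an "internal", description-length-style measure
concrete, here is a minimal sketch in Python. It is not Rissanen's exact
formulation: it uses the common two-part approximation
(n/2) log2(RSS/n) + (k/2) log2(n) bits, where k is the number of fitted
parameters and RSS is the residual sum of squares on the training data.
The polynomial model family, the synthetic data, and the helper names
fit_polynomial and description_length are all assumptions made for this
illustration.

```python
import math

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (tiny problems only)."""
    k = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for the monomial basis.
    m = [[sum(x ** (i + j) for x in xs) for j in range(k)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def description_length(xs, ys, degree):
    """Approximate two-part code length in bits (assumed Gaussian residuals)."""
    n = len(xs)
    coeffs = fit_polynomial(xs, ys, degree)
    rss = sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
              for x, y in zip(xs, ys))
    rss = max(rss, 1e-12)                              # guard against log(0)
    data_bits = 0.5 * n * math.log2(rss / n)           # cost of the residuals
    model_bits = 0.5 * (degree + 1) * math.log2(n)     # cost of the parameters
    return data_bits + model_bits

# Hypothetical training data: a parabola plus small alternating "noise".
n = 20
xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
ys = [x * x + 0.05 * (-1) ** i for i, x in enumerate(xs)]

scores = {d: description_length(xs, ys, d) for d in range(5)}
best = min(scores, key=scores.get)
for d in sorted(scores):
    print(f"degree {d}: {scores[d]:8.2f} bits")
print("shortest description: degree", best)
```

The score is computed from the training data and the parameter count
alone, so no held-out set is involved. It still depends on choices made
up front (the model family and the coding of the parameters), which is
the relativity discussed above in another form.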

Here is a very simplified example of the kind of problem I am looking at:

Two networks A and B are trained on a set of two points
in (x,y)-space. Network A, which is very small, learns a straight line.
Network B, which is more complex, learns a sinusoid passing through the
points. I, on the other hand, was actually looking for a parabola (I
was stupid, okay! but this is only an example). I test my networks on
test points from my parabola, and both of them fail miserably. Why should
either of them be considered inadequate though? I had not given them enough
information. But this objection applies only relative to my desire ---- to
learn a parabola. If I remove this desire, there is no such thing as "enough"
information. Whatever is there is all there is. So how do I grade the
performance of my networks now? Do I say that network A is "better"
because it produced a "simpler" model? But that will get me into the
debate we have already been through on whether a polynomial of degree 4
is simpler than a simple exponential, and so forth. Thus, one aspect of
my question is, have people tried to use criteria for assigning a degree
of "simplicity" to functions? For example, if I had been using regular
samples of a time signal, I could use Shannon's sampling theorem and say
that any generalization involving frequencies higher than half the sampling
rate (the Nyquist limit) is a "false" one, and has "generated" information
gratuitously. Given arbitrarily spaced samples, is there a natural extension?
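
The two-point example and the sampling argument can be sketched together
in a few lines of Python. Everything specific here is an assumption
chosen for illustration: the training points (0,0) and (1,1), the target
parabola y = x^2, and the three candidate functions standing in for what
the networks might learn. With unit spacing between samples, the Nyquist
limit is 0.5 cycles per unit; the "fast" sinusoid (1.25 cycles per unit)
lies above it and the "slow" one (0.25 cycles per unit) lies below it.

```python
import math

# Assumed training set: two points that every candidate passes through.
train = [(0.0, 0.0), (1.0, 1.0)]

def line(x):            # stands in for network A
    return x

def fast_sinusoid(x):   # stands in for network B, 1.25 cycles per unit
    return math.sin(2.5 * math.pi * x)

def slow_sinusoid(x):   # 0.25 cycles per unit, below the Nyquist limit
    return math.sin(0.5 * math.pi * x)

def parabola(x):        # the function the experimenter had in mind
    return x * x

def mse(model, points):
    """Mean squared error of a model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

models = {"line": line, "fast": fast_sinusoid, "slow": slow_sinusoid}

# All candidates reproduce the training data (up to rounding).
train_err = {name: mse(f, train) for name, f in models.items()}

# All candidates fail on test points drawn from the parabola.
test_points = [(x, parabola(x)) for x in (-2.0, -1.0, 2.0, 3.0)]
test_err = {name: mse(f, test_points) for name, f in models.items()}

# On the regular sample grid the two sinusoids are aliases of each other:
# no set of integer-spaced samples can tell them apart.
alias_gap = max(abs(fast_sinusoid(n) - slow_sinusoid(n)) for n in range(-4, 5))

# Between the samples they disagree strongly.
off_grid_gap = abs(fast_sinusoid(0.5) - slow_sinusoid(0.5))

print("training error:", train_err)
print("test error vs parabola:", test_err)
print("max gap on grid:", alias_gap, " gap at x=0.5:", off_grid_gap)
```

In this sketch the sampling argument separates the two sinusoids (the
fast one asserts structure the samples cannot support) but says nothing
about the line versus the slow sinusoid, and nothing at all once the
samples are irregularly spaced, which is where the question stands.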

Thank you,

Ali Minai
aam9n@uvaee.ee.Virginia.EDU