Xref: utzoo comp.ai.neural-nets:947 alt.cyb-sys:26
Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!gem.mps.ohio-state.edu!rpi!leah!bingvaxu!cybsys
From: cybsys@bingvaxu.cc.binghamton.edu (CYBSYS-L Moderator)
Newsgroups: comp.ai.neural-nets,alt.cyb-sys
Subject: Re: Generalization Criteria
Message-ID: <2449@bingvaxu.cc.binghamton.edu>
Date: 20 Sep 89 12:49:58 GMT
References: <506@uvaee.ee.virginia.EDU> <2448@bingvaxu.cc.binghamton.edu>
Reply-To: cybsys@bingvaxu.cc.binghamton.edu (CYBSYS-L Moderator)
Organization: SUNY Binghamton, NY
Lines: 35

Really-From: Alex Martelli <AIBM002@ICINECA.BITNET>

     Training set and test set are not necessarily separate, i.e. you
may be using parts of the training set to check the "convergence" or
generality of the trained model.  E.g. assume that you have a system
which accepts a stream of data points, building an internal model that
gives a probability distribution for "value of next point
forthcoming".  As each new point arrives, the system first evaluates
the point's likelihood under the current model and outputs a "current
generality" measure, such as running perplexity over the last 10
points; only then does it use the point to update the training of the
internal model.  This is not philosophically very different from
training/test set duality, but it can be very practical: you judge
the model's convergence to generalization not from a single number
measuring generality, but from a time-varying measure of goodness of
fit to training data before the model was trained on them.  It may
be possible to use characteristics of this curve (is it flattening out
to some plateau?  is it going down steadily after a peak, meaning that
I am overtraining my model?  etc.) to decide when training can stop.
I was introduced to this approach by Dr. Frederick Jelinek of IBM
Research at Yorktown and his group, and I believe it is expounded in
their publications in the IEEE Transactions on Acoustics, Speech and
Signal Processing on speech recognition and language modeling.
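
To make the evaluate-then-update loop concrete, here is a minimal
sketch in Python.  It is not the IBM group's method: the model is a
deliberately trivial add-one-smoothed symbol counter, and the names
running_perplexity, alphabet_size and window are my own assumptions.
The only point is the ordering: each point is scored before the model
is allowed to train on it.

```python
import math
from collections import Counter, deque

def running_perplexity(stream, alphabet_size, window=10):
    """Score each point before training on it; return the perplexity curve.

    The "model" is an add-one-smoothed frequency count, a stand-in
    chosen only to keep the sketch short (an assumption, not the
    original system).
    """
    counts = Counter()
    total = 0
    recent = deque(maxlen=window)   # -log2 p of the last `window` points
    curve = []
    for x in stream:
        # 1. Evaluate the new point under the model as trained so far.
        p = (counts[x] + 1) / (total + alphabet_size)
        recent.append(-math.log2(p))
        # 2. Emit the running perplexity over the recent window.
        curve.append(2 ** (sum(recent) / len(recent)))
        # 3. Only now use the point to update the model.
        counts[x] += 1
        total += 1
    return curve

curve = running_perplexity("abababababababababab", alphabet_size=2)
```

Perplexity is an inverse measure of fit, so on this curve lower is
better: a plateau suggests convergence, and a sustained rise after a
minimum is the warning sign.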

A philosophically different approach is to evaluate the amount of
information, as extracted from training data, that would have to be
communicated to a hypothetical receiver armed only with the bare
structure of the model to allow the model-as-trained to be
reconstructed, and to weigh this amount against the information needed
to communicate the data themselves given the model-as-trained (say
with arithmetic or Huffman coding).  This also assumes a probabilistic
model of course, and like the first approach it has a strong flavor of
information theory.
This idea was mentioned to me in a private note by Dr. Rissanen of IBM
Research at Almaden, but I don't know whether it has actually been
tried or whether it is "just" an idea; you could try contacting
Dr. Rissanen himself about this.
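
As a rough illustration of the weighing described above (my own toy
construction, not anything taken from Dr. Rissanen's note), the
Python sketch below totals two costs for a 0/1 sequence: the bits
needed to send one coin-bias parameter at a chosen precision, plus
the ideal code length of the data given that parameter, which an
arithmetic coder would approach.  The names data_bits,
two_part_length and precision_bits are hypothetical, and the data
are assumed non-empty.

```python
import math

def data_bits(data, p_one):
    """Ideal code length in bits of a 0/1 sequence when P(1) = p_one."""
    return sum(-math.log2(p_one if x else 1.0 - p_one) for x in data)

def two_part_length(data, precision_bits):
    """Bits to send the trained parameter plus bits to send the data.

    precision_bits == 0 means "no parameter sent": the receiver falls
    back on the bare structure, here taken to be a fair coin.
    """
    if precision_bits == 0:
        p = 0.5
    else:
        levels = 2 ** precision_bits
        freq = sum(data) / len(data)
        idx = min(int(freq * levels), levels - 1)
        p = (idx + 0.5) / levels    # cell midpoint, so p is never 0 or 1
    return precision_bits + data_bits(data, p)

skewed = [1] * 90 + [0] * 10
best = min(range(0, 9), key=lambda b: two_part_length(skewed, b))
```

With skewed data the extra parameter bits pay for themselves; with
balanced data the zero-bit model comes out shorter, which is the
sense in which the total length penalizes needless model detail.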