Xref: utzoo comp.ai.neural-nets:947 alt.cyb-sys:26
Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!wuarchive!gem.mps.ohio-state.edu!rpi!leah!bingvaxu!cybsys
From: cybsys@bingvaxu.cc.binghamton.edu (CYBSYS-L Moderator)
Newsgroups: comp.ai.neural-nets,alt.cyb-sys
Subject: Re: Generalization Criteria
Message-ID: <2449@bingvaxu.cc.binghamton.edu>
Date: 20 Sep 89 12:49:58 GMT
References: <506@uvaee.ee.virginia.EDU> <2448@bingvaxu.cc.binghamton.edu>
Reply-To: cybsys@bingvaxu.cc.binghamton.edu (CYBSYS-L Moderator)
Organization: SUNY Binghamton, NY
Lines: 35

Really-From: Alex Martelli

The training set and the test set are not necessarily separate: you can use parts of the training data themselves to check the convergence, or generality, of the model being trained. For example, assume a system that accepts a stream of data points and builds an internal model giving a probability distribution for the value of the next point. As each new point arrives, the system first evaluates the point's likelihood under the current model and outputs a "current generality" measure, such as the running perplexity over the last 10 points; only then does it use the point to update the model.

This is not philosophically very different from the training/test set duality, but it can be very practical. You judge convergence toward generalization not from a single number measuring generality, but from a time-varying measure of goodness of fit on training data before the model was trained on them. It may then be possible to use characteristics of this curve to decide when training can stop (is it flattening out to some plateau? is it going down steadily after a peak, meaning I am overtraining my model? etc.).

I was introduced to this approach by Dr. Frederick Jelinek of IBM Research at Yorktown and his group, and I believe it is expounded in their publications on speech recognition and language modeling in the IEEE Transactions on Acoustics, Speech, and Signal Processing.
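
To make the "evaluate first, then train" loop concrete, here is a minimal sketch in Python. It is only an illustration, not the IBM group's method: the model is a toy symbol-frequency model with add-one smoothing, and the function name, the smoothing, and the fixed alphabet are all my assumptions. Only the window of 10 points comes from the description above.

```python
import math
from collections import Counter, deque

def prequential_perplexity(stream, alphabet, window=10):
    """Score each point BEFORE training on it; return the running-perplexity curve.

    Hypothetical toy model: symbol frequencies with add-one smoothing.
    Every symbol in `stream` is assumed to belong to `alphabet`.
    """
    counts = Counter()              # symbol counts seen so far (the "internal model")
    total = 0                       # number of points trained on so far
    k = len(alphabet)
    recent = deque(maxlen=window)   # log2-probabilities of the last `window` points
    curve = []
    for x in stream:
        # 1. Evaluate: probability of x under the model trained on earlier points only.
        p = (counts[x] + 1) / (total + k)
        recent.append(math.log2(p))
        # 2. Report: perplexity over the window = 2 ** (-mean log2-probability).
        curve.append(2 ** (-sum(recent) / len(recent)))
        # 3. Only now use x to update the model.
        counts[x] += 1
        total += 1
    return curve

if __name__ == "__main__":
    for step, ppl in enumerate(prequential_perplexity("a" * 20, "ab")):
        print(step, round(ppl, 3))
```

Since perplexity is a cost, lower is better here: a curve that falls and then flattens suggests the model has stopped improving on data it has not yet seen, which is the kind of signal that could be used to stop training.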
A philosophically different approach is to measure how much information, extracted from the training data, would have to be communicated to a hypothetical receiver who knows only the bare structure of the model, so that the receiver could reconstruct the model-as-trained. You then weigh this amount against the information needed to communicate the data themselves given the model-as-trained (coded, say, with arithmetic or Huffman coding). This also assumes a probabilistic model, of course, and like the first approach it has a strong flavor of information theory. The idea was mentioned to me in a private note by Dr. Rissanen of IBM Research at Almaden, but I don't know whether it has actually been tried or is "just" an idea; you could try contacting Dr. Rissanen himself about it.
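
As a rough illustration of weighing "bits to send the model" against "bits to send the data given the model", here is a small Python sketch. Everything specific in it is my assumption and not something taken from Dr. Rissanen's note: the model family (Markov chains of a given order over a small alphabet), the charge of (1/2) log2(n) bits per free parameter, and the function name.

```python
import math
from collections import Counter

def two_part_code_length(seq, alphabet, order):
    """Return (model_bits, data_bits) for a Markov model of the given order.

    Hypothetical sketch: parameters are fitted by relative frequency on `seq`,
    and each free parameter is charged 0.5 * log2(n) bits (an assumed
    approximation of the cost of sending it to the receiver).
    """
    k, n = len(alphabet), len(seq)
    ctx_counts = Counter()    # how often each context occurs
    pair_counts = Counter()   # how often each (context, next symbol) pair occurs
    for i in range(order, n):
        ctx = tuple(seq[i - order:i])
        ctx_counts[ctx] += 1
        pair_counts[(ctx, seq[i])] += 1

    # Data cost: ideal code length -log2 p(symbol | context) under the trained
    # model; the first `order` symbols have no context and are sent uniformly.
    data_bits = order * math.log2(k)
    for (ctx, _symbol), c in pair_counts.items():
        data_bits -= c * math.log2(c / ctx_counts[ctx])

    # Model cost: (k - 1) free probabilities for each of the k**order contexts.
    n_params = (k - 1) * k ** order
    model_bits = 0.5 * n_params * math.log2(n)
    return model_bits, data_bits

if __name__ == "__main__":
    seq = "ab" * 50
    for order in range(4):
        m, d = two_part_code_length(seq, "ab", order)
        print(order, round(m, 2), round(d, 2), round(m + d, 2))
```

On the strictly alternating sequence used in the example, the order-1 model gives the smallest total. Order 0 pays heavily in data bits, while higher orders pay for parameters that buy nothing, which is the trade-off the paragraph above describes.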