Xref: utzoo comp.ai.neural-nets:948 alt.cyb-sys:27
Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!uakari.primate.wisc.edu!aplcen!haven!uvaarpa!uvaee!aam9n
From: aam9n@uvaee.ee.virginia.EDU (Ali Minai)
Newsgroups: comp.ai.neural-nets,alt.cyb-sys
Subject: Generalization Criteria
Message-ID: <509@uvaee.ee.virginia.EDU>
Date: 20 Sep 89 19:42:58 GMT
Organization: EE Dept, U of Virginia, Charlottesville
Lines: 52

A few days ago, I posted a question on generalization criteria, and the
response has been very good. Several people sent me references, and I
will e-mail a list of these to anyone who is interested. Just send me
mail at aam9n@uvaee.ee.Virginia.EDU

At this point, let me be a little more specific about my question. I am
looking for a way to get away from the "test set" notion of measuring
generalization. A "test set" is arbitrary, and any generalization
measure it produces is good only relative to this set ---- and only
partially so, because the "test set" itself is only a fragment of the
data on which the estimator will be used.

I realize that it is impossible to get away from this relativity
altogether. All measures must be relative to something or other. In the
case of particular applications, it makes sense to use real test sets to
check estimator performance, but in cases of purely theoretical
interest, it might be more appropriate to use *internal* measures of
performance. By "internal" I mean measures which use only the parameters
of the estimator itself, or, at least, are not based on arbitrarily
chosen sets of extraneous data.

One good measure which I have already seen is Rissanen's minimum
description length criterion. Other entropy-related measures are also
out there, and it is mainly on these that I seek information. Of course,
if other novel measures have been used, I will be more than interested.

Here is a very simplified example of the kind of problem I am looking
at:

Two networks A and B are trained on a set of two points in (x,y)-space.
Network A, which is very small, learns a straight line. Network B, which
is more complex, learns a sinusoid passing through the points. I, on the
other hand, was actually looking for a parabola (I was stupid, okay! But
this is only an example). I test my networks on test points from my
parabola, and both of them fail miserably.

Why should either of them be considered inadequate, though? I had not
given them enough information. But this objection applies only relative
to my desire ---- to learn a parabola. If I remove this desire, there is
no such thing as "enough" information. Whatever is there is all there
is. So how do I grade the performance of my networks now? Do I say that
network A is "better" because it produced a "simpler" model? But that
will get me into the debate we have already been through on whether a
polynomial of degree 4 is simpler than a simple exponential, and so
forth.

Thus, one aspect of my question is: have people tried to use criteria
for assigning a degree of "simplicity" to functions? For example, if I
had been using regular samples on a time signal, I could use Shannon's
sampling theorem and say that any generalization involving frequencies
higher than half the sampling rate is a "false" one, and has "generated"
information gratuitously. Given arbitrary samples, is there a natural
extension?

Thank you,

Ali Minai
aam9n@uvaee.ee.Virginia.EDU
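
For readers who want to play with the two-point example, here is a
minimal sketch in Python with NumPy. Everything concrete in it is an
assumption made for illustration and is not taken from the post: the
training points (0,0) and (1,1), the parabola y = x^2, the sinusoid's
frequency B_FREQ, the test interval, and the use of closed-form
functions as stand-ins for trained networks A and B. It shows both
models fitting the training points exactly, both failing on test points
drawn from the parabola, and a Shannon-style check that treats the two
points as regular samples and flags any model frequency above half the
sampling rate.

```python
import numpy as np

# Assumed training set: two points that happen to lie on y = x**2.
x_train = np.array([0.0, 1.0])
y_train = x_train ** 2

# Assumed frequency of network B's sinusoid, in cycles per unit x.
# sin(2*pi*1.25*x) passes through (0, 0) and (1, 1).
B_FREQ = 1.25


def network_a(x):
    """Stand-in for the small network: the line through both points."""
    return x


def network_b(x):
    """Stand-in for the larger network: a sinusoid through both points."""
    return np.sin(2.0 * np.pi * B_FREQ * x)


def rmse(model, x, y):
    """Root-mean-square error of a model on the points (x, y)."""
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))


# Test points from the parabola the experimenter actually had in mind.
x_test = np.linspace(-1.0, 2.0, 61)
y_test = x_test ** 2

train_err_a = rmse(network_a, x_train, y_train)
train_err_b = rmse(network_b, x_train, y_train)
test_err_a = rmse(network_a, x_test, y_test)
test_err_b = rmse(network_b, x_test, y_test)

# Shannon-style check: treat the two points as regular samples with
# spacing 1, so frequencies above 1 / (2 * spacing) are not supported
# by the data.
sample_spacing = float(x_train[1] - x_train[0])
half_sampling_rate = 1.0 / (2.0 * sample_spacing)
a_is_false_generalization = 0.0 > half_sampling_rate  # a line has no oscillation
b_is_false_generalization = B_FREQ > half_sampling_rate

print("train RMSE  A:", train_err_a, " B:", train_err_b)
print("test RMSE   A:", test_err_a, " B:", test_err_b)
print("above half the sampling rate  A:", a_is_false_generalization,
      " B:", b_is_false_generalization)
```

Note that the frequency check uses only the sample spacing and the
model itself, not the parabola, so it is "internal" in the sense used
above. It still says nothing about arbitrarily spaced samples, which is
the open part of the question.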