Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!spool.mu.edu!munnari.oz.au!mel.dit.csiro.au!latcs1!sietsma
From: sietsma@latcs1.oz.au (Jocelyn Sietsma Penington)
Newsgroups: comp.ai.neural-nets
Subject: Re: generalization in NN's
Keywords: ldf generalization
Message-ID: <9881@latcs1.oz.au>
Date: 3 Apr 91 05:04:33 GMT
Article-I.D.: latcs1.9881
References: <1991Apr2.205240.24668@milton.u.washington.edu>
Reply-To: sietsma@latcs1.oz.au (Jocelyn Sietsma Penington)
Organization: Comp Sci, La Trobe Uni, Australia
Lines: 54

In article <1991Apr2.205240.24668@milton.u.washington.edu> nealiphc@milton.u.washington.edu (Phillip Neal) writes:
>I have a problem with the ability of a neural net to generalize.
...
>I break the data into a 400 observation training set and
>a 200 observation test set.
...  [NN does better on training set than linear discr. fn., but poorer on
      test set]
>And no matter how long I let the NN run, and no matter what
>number of hidden layer nodes, I always get about the same 
>results.
>
>I know I am violating the rule of thumb to have 10 times more
>training data than nodes in the net. But hey, data is expensive.

For starters, I think the rule of thumb quoted above is nonsense - it
doesn't take any notice of the characteristics of your data.  I think
it was calculated for training random inputs to random outputs, and who
wants to do that? 

The problem here may well be that you are actually training too long.
See the paper by Chauvin in NIPS 2, or by Weigend, Huberman and Rumelhart
(Predicting the future: a connectionist approach - Stanford-PDP-90-01, to 
appear in Int'l J. of Neural Systems) for graphs showing that as training 
continues, performance on the training set continuously improves, but 
performance on the test set reaches a maximum and then declines.

Unfortunately the only cures I know are expensive, either in data or time.

1. You can split your data set in three: training, cross-validation and testing.
Train, periodically checking the error rate on the cross-validation set.
When this starts to rise, stop training.  Use the test set to find the true
generalization performance.
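
A minimal sketch of the stopping rule in option 1, in Python.  The names
here (train_one_epoch, validation_error, patience) are hypothetical and
stand in for whatever your own simulator provides.  Because the
cross-validation error is usually noisy, the sketch tolerates a few
consecutive checks without improvement before it stops, and it returns the
weights saved at the best check rather than the final ones.

```python
import copy


def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=1000, check_every=1, patience=5):
    """Train until the cross-validation error stops improving.

    model            -- any object holding the weights (hypothetical)
    train_one_epoch  -- callable(model), updates the model in place
    validation_error -- callable(model) -> float, error on held-out data
    patience         -- consecutive checks without improvement that are
                        tolerated before training stops
    Returns (best_model, best_error, best_epoch).
    """
    best_model = copy.deepcopy(model)
    best_error = validation_error(model)
    best_epoch = 0
    bad_checks = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        if epoch % check_every != 0:
            continue
        err = validation_error(model)
        if err < best_error:
            # New best: keep a snapshot of the weights.
            best_model = copy.deepcopy(model)
            best_error = err
            best_epoch = epoch
            bad_checks = 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break
    return best_model, best_error, best_epoch
```

The test set is not touched anywhere in this loop; it is only used once,
afterwards, on the returned best_model.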

2. You can reduce the effective size of your network.  The two papers I
referenced above are about adding an extra cost term to the standard back-prop
of errors to encourage the network to eliminate unnecessary units or
connections.  This appears to prevent the overtraining problem.  Unfortunately
it greatly increases the time required for training, and getting the parameter
values right might be difficult. (I haven't tried these, so I don't know.)
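
To make the idea of an extra cost term concrete, here is a sketch of one
such penalty, of the form lam * sum(u / (1 + u)) with u = (w / w0)^2.  This
is an assumed illustrative form in the style of "weight elimination", not a
transcription from either paper, and lam and w0 are exactly the parameters
that may be hard to set.  The penalty gradient is simply added to the usual
error gradient in each weight update.

```python
def weight_penalty(weights, lam=1e-3, w0=1.0):
    """Cost term: lam * sum over weights of u / (1 + u), u = (w / w0)**2."""
    total = 0.0
    for w in weights:
        u = (w / w0) ** 2
        total += u / (1.0 + u)
    return lam * total


def weight_penalty_gradient(weights, lam=1e-3, w0=1.0):
    """Derivative of weight_penalty with respect to each weight."""
    grads = []
    for w in weights:
        u = (w / w0) ** 2
        # d/dw [u / (1 + u)] = (2 * w / w0**2) / (1 + u)**2
        grads.append(lam * (2.0 * w / w0 ** 2) / (1.0 + u) ** 2)
    return grads


def penalized_update(weights, error_grads, lr=0.1, lam=1e-3, w0=1.0):
    """One gradient-descent step on (error + penalty)."""
    pen = weight_penalty_gradient(weights, lam, w0)
    return [w - lr * (g + p) for w, g, p in zip(weights, error_grads, pen)]
```

With this form, a weight much larger than w0 costs close to lam no matter how
large it is, while a weight much smaller than w0 is pushed toward zero.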

2b.  You MIGHT get some improvement by taking your trained network as it is now
and removing any redundant units by one of the available pruning methods.  On
a toy problem, I have found that this improves generalization.  (Sietsma & Dow,
Neural Networks, 1991)  See Mozer & Smolensky, NIPS 1, and Le Cun, Denker & 
Solla, NIPS 2, for alternative methods of pruning trained networks.
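
As a rough illustration of what "redundant" can mean, the sketch below
removes hidden units whose output barely changes over the training set and
folds their (constant) contribution into the output biases.  It is a
simplified example under assumed data layouts (lists of lists, one hidden
layer), not a reproduction of any of the methods cited above; the name
prune_constant_units and the tolerance tol are hypothetical.

```python
def prune_constant_units(hidden_acts, out_weights, out_bias, tol=1e-3):
    """Drop hidden units whose output is (nearly) constant.

    hidden_acts -- one row per training pattern, each a list of
                   hidden-unit outputs
    out_weights -- out_weights[k][j] is the weight from hidden unit j
                   to output unit k
    out_bias    -- list of output-unit biases
    Returns (kept_indices, new_out_weights, new_out_bias).
    """
    n_hidden = len(hidden_acts[0])
    kept = []
    new_bias = list(out_bias)
    for j in range(n_hidden):
        column = [row[j] for row in hidden_acts]
        if max(column) - min(column) <= tol:
            # The unit acts like a bias: absorb it into the output biases.
            mean = sum(column) / len(column)
            for k in range(len(new_bias)):
                new_bias[k] += out_weights[k][j] * mean
        else:
            kept.append(j)
    new_weights = [[row[j] for j in kept] for row in out_weights]
    return kept, new_weights, new_bias
```

The weights into a removed unit are dropped as well (the rows of the
input-to-hidden matrix for indices not in kept_indices).  Check the pruned
network against the cross-validation set before trusting it.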

hope this helps,

Jocelyn
-- 
(Jocelyn Penington, a.k.a. Sietsma - feel free to use either)
Email: sietsma@LATCS1.oz.au            Address: Materials Research Laboratory
Phone: (03) 319 3775 or (03) 479 1057           PO Box 50, Melbourne 3032
This article does not commit me, LaTrobe Uni or M.R.L. to any act or opinion.