Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!apple!agate!shelby!portia.stanford.edu!whirlwind.Stanford.EDU!wan
From: wan@whirlwind.Stanford.EDU (Eric A. Wan)
Newsgroups: comp.ai.neural-nets
Subject: Re: Backpropagation with Newton's Method, and recurrence.  Source code.
Message-ID: <1990Nov21.213353.16460@portia.Stanford.EDU>
Date: 21 Nov 90 21:33:53 GMT
References: <7607@uwm.edu> <1248@helens.Stanford.EDU> <7684@uwm.edu>
Sender: news@portia.Stanford.EDU (USENET News System)
Organization: Stanford University, California, USA
Lines: 92

In article <7684@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes:
>In article <1248@helens.Stanford.EDU> wan@isl.Stanford.EDU (Eric A. Wan) writes:
>>You mention the method does not work well for f(x) = x^2.  In fact, Newton's
>>Method applied to the gradient of x^2 converges in one step.
>
>You lost me here.  Try using Newton's method to find a zero of x^2.  This is
>what I was describing.  It has the same (relatively poor) performance as
>binary search:
>
>		  x[n+1] = x[n] -  x[n]^2/(2*x[n]) = 1/2 * x[n].
>

I think you still don't quite understand what we mean.  Near the
current value of X_k look at the first two terms of the Taylor
expansion:

    f(X) = f(X_k) + grad(f)(X-X_k) + (1/2)[(X-X_k)^T]*H*(X-X_k) + O(X^3)

Now take the gradient of the equation (dropping higher order terms)

   grad(f) = grad(f(X_k)) + H*(X-X_k) 

Using Newton's to find the zero of the gradient (which corresponds to
a minimum if H is positive definite) yields the following algorithm

    X_{k+1) = X_k - H^(-1)*grad(f)      (1)

This is the version of Newton's Method for minimizing a function found
in most optimization texts.

Now consider the example of f(x) = x^2.  (grad = 2x, H = grad(grad) = 2)
From equation (1) with any initial value of x = x_0
 
   x_1 = x_0 - (1/2)*2x_0 = x_0 - x_0 = 0

One step convergence to the minimum!  For any general hyperdimensional
quadratic one step convergence is also attained (not so if you search
for a zero). 
   
>You have to concede: it's a valid point to say that the goal is to find a zero
>or near-zero of the error function.  It's no use finding any minima if they
>aren't 'near' zero.  And if there aren't any minima AT ALL near a zero on the
>surface, then how could it even make sense to talk about convergence in the
>first place?  A "solution" with a significant error is simply not a solution,
>even if it's "optimum".

Sorry, but we don't agree.  The goal of training neural networks is
finding a minimum which doesn't have to be a zero.  For a simple
example, suppose you want to classify points drawn from two highly
overlapping Gaussian distributions.  In this case the optimal solution
will have a large error.  If the network were to have a large number
of weights, you could cause it to classify the points in your training
set perfectly, and the error of the trained system would be very close
to zero.  Such a network, however, would perform very poorly outside
the training set because you are overfitting the training data.  In
order for good generalization in this situation, you must use a very
small network which is capable of implementing a function not much
more complex than the optimal boundary (which is a hyperellipsoid or a
hyperplane, depending on the relative covariance of the two
distributions).  In this case the minimum solution will clearly have a
large error.  This sort of situation occurs often in real
classification problems.  For the most part, the only case where this
is not true to some extent is in contrived logic problems that are
probably better solved by something other than neural networks.

If you manage to get a neural network to solve speech recognition with
zero mean squared error we will retract our statements.

>
>(On Newton's Method never fully converging):
>>On the other hand, it will also never reach a stable solution unless you
>>impose arbitrary halting rules.
>
>I pointed this out, but argued that it was a positive feature to have, to
>have the neural net autonomously determine when it learns and when it doesn't.
>The particular halting rule I described was not arbitrary but followed directly
>from the considerations above about when you can say something is "solved" and
>when it's not "solved".  You just provide a cutoff on the error function
>itself.

True your method provides a way to jump out of local minima, but to us
it certainly doesn't seem to pick a very good way to go about it.  If
it were to hit a local minimum exactly where the gradient is zero,
your algorithm will throw the weights to infinity, which doesn't seem
like such a good idea.  Why not just have your algorithm start
training at a new random point if the local minimum is above the
threshold you have chosen?  

Anyway, it still seems to make more sense to search for a minimum
rather than for a zero in the mse function.

E. Wan, M. Lehr, F. Beaufays, (Steve gave up)