Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!apple!agate!shelby!portia.stanford.edu!whirlwind.Stanford.EDU!wan From: wan@whirlwind.Stanford.EDU (Eric A. Wan) Newsgroups: comp.ai.neural-nets Subject: Re: Backpropagation with Newton's Method, and recurrence. Source code. Message-ID: <1990Nov21.213353.16460@portia.Stanford.EDU> Date: 21 Nov 90 21:33:53 GMT References: <7607@uwm.edu> <1248@helens.Stanford.EDU> <7684@uwm.edu> Sender: news@portia.Stanford.EDU (USENET News System) Organization: Stanford University, California, USA Lines: 92 In article <7684@uwm.edu> markh@csd4.csd.uwm.edu (Mark William Hopkins) writes: >In article <1248@helens.Stanford.EDU> wan@isl.Stanford.EDU (Eric A. Wan) writes: >>You mention the method does not work well for f(x) = x^2. In fact, Newton's >>Method applied to the gradient of x^2 converges in one step. > >You lost me here. Try using Newton's method to find a zero of x^2. This is >what I was describing. It has the same (relatively poor) performance as >binary search: > > x[n+1] = x[n] - x[n]^2/(2*x[n]) = 1/2 * x[n]. > I think you still don't quite understand what we mean. Near the current value of X_k look at the first two terms of the Taylor expansion: f(X) = f(X_k) + grad(f)(X-X_k) + (1/2)[(X-X_k)^T]*H*(X-X_k) + O(X^3) Now take the gradient of the equation (dropping higher order terms) grad(f) = grad(f(X_k)) + H*(X-X_k) Using Newton's to find the zero of the gradient (which corresponds to a minimum if H is positive definite) yields the following algorithm X_{k+1) = X_k - H^(-1)*grad(f) (1) This is the version of Newton's Method for minimizing a function found in most optimization texts. Now consider the example of f(x) = x^2. (grad = 2x, H = grad(grad) = 2) From equation (1) with any initial value of x = x_0 x_1 = x_0 - (1/2)*2x_0 = x_0 - x_0 = 0 One step convergence to the minimum! For any general hyperdimensional quadratic one step convergence is also attained (not so if you search for a zero). >You have to concede: it's a valid point to say that the goal is to find a zero >or near-zero of the error function. It's no use finding any minima if they >aren't 'near' zero. And if there aren't any minima AT ALL near a zero on the >surface, then how could it even make sense to talk about convergence in the >first place? A "solution" with a significant error is simply not a solution, >even if it's "optimum". Sorry, but we don't agree. The goal of training neural networks is finding a minimum which doesn't have to be a zero. For a simple example, suppose you want to classify points drawn from two highly overlapping Gaussian distributions. In this case the optimal solution will have a large error. If the network were to have a large number of weights, you could cause it to classify the points in your training set perfectly, and the error of the trained system would be very close to zero. Such a network, however, would perform very poorly outside the training set because you are overfitting the training data. In order for good generalization in this situation, you must use a very small network which is capable of implementing a function not much more complex than the optimal boundary (which is a hyperellipsoid or a hyperplane, depending on the relative covariance of the two distributions). In this case the minimum solution will clearly have a large error. This sort of situation occurs often in real classification problems. For the most part, the only case where this is not true to some extent is in contrived logic problems that are probably better solved by something other than neural networks. If you manage to get a neural network to solve speech recognition with zero mean squared error we will retract our statements. > >(On Newton's Method never fully converging): >>On the other hand, it will also never reach a stable solution unless you >>impose arbitrary halting rules. > >I pointed this out, but argued that it was a positive feature to have, to >have the neural net autonomously determine when it learns and when it doesn't. >The particular halting rule I described was not arbitrary but followed directly >from the considerations above about when you can say something is "solved" and >when it's not "solved". You just provide a cutoff on the error function >itself. True your method provides a way to jump out of local minima, but to us it certainly doesn't seem to pick a very good way to go about it. If it were to hit a local minimum exactly where the gradient is zero, your algorithm will throw the weights to infinity, which doesn't seem like such a good idea. Why not just have your algorithm start training at a new random point if the local minimum is above the threshold you have chosen? Anyway, it still seems to make more sense to search for a minimum rather than for a zero in the mse function. E. Wan, M. Lehr, F. Beaufays, (Steve gave up)