Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!att!linac!uwm.edu!csd4.csd.uwm.edu!markh From: markh@csd4.csd.uwm.edu (Mark William Hopkins) Newsgroups: comp.ai.neural-nets Subject: Using Newton's Method to speed up backpropagation. Message-ID: <7556@uwm.edu> Date: 10 Nov 90 07:21:32 GMT Sender: news@uwm.edu Organization: University of Wisconsin - Milwaukee Lines: 78 A typical intro. to backpropagation that goes into any kind of analysis in depth will derive the weight (and threshold) corrections used in backprop. from the Gradient Descent method applied to a certain least-squares error function. The Gradient Descent method tries to find the minimum of this function, which is supposed to represent the optimal state for the system to be in. Associated with this method is a constant, usually denoted as the Greek letter ETA, which controls the learning rate of the system. You have to set it "by hand". If you make it too large, the system oscillates and fails to learn, and if you set it too small, it learns too slow. So, the obvious idea occurred to me way back in November of last year: use Newton's Method to find the zero of the error function. Newton's Method does not generally get stuck in local minima because it's not even trying to find minima in the first place. And when you think about it, finding minima of the error function wasn't really even the goal in the first place -- finding the zero (or a near-zero) was. Also, when it converges it converges fast. You might think that using Newton's Method for backprop. is going to entail a computationally expensive operation of calculating derivatives, maybe a couple matrix inversions and divisions. However, the fact is that backprop. with Newton's Method is VIRTUALLY IDENTICAL to backprop. with Gradient Descent. What's the difference? Newton's Method calculates the optimal learning "constant" ETA on the fly. The only unfortunate things about the method is that it doesn't actually converge so fast when the function in question behaves like f(x) = x^2 near the zero, and the error function ALWAYS has this property near its zero. The other fault of the method is that it behaves erratically near local minima, and particularily near points where the error function almost but not quite reaches 0. Now the erratic behavior of Newton's Method near a local minima, when you think about it, is actually a desired feature. You WANT your neural net to jump out of a local minimum when it hits one. And because you're dividing by a learning constant that approaches 0 as you near the local minimum to determine ETA, the system becomes more "volatile" as it approaches such a point. The only time you don't want the system to jump around erratically is when you're actually nearing the zero. This is something that can actually be fixed quite easily by providing a cut-off: no more learning when the error falls below a certain threshold. Now what you have is a system that even controls, autonomously, when it learns: another desired feature. This also keeps the system from jumping out from near convergence when there is, in fact, no actual zero to the error function but just a near-zero. A simulation of a Newton's Method backprop. neural net provided some degree of conformation of these major points. A simple neural net was set up to learn the exclusive-or function. Backprop. without momentum terms exhibits horribly slow convergence to this function, even when its weights are initialized to values near convergence. The problem is that some parts of the underlying surface of the error function require learning constants, ETA, of (say) under 0.25, and other parts require learning constants of in excess of 1000000 (!). Newton's method determines this dynamically. When the weights are set initially near the optimum setting, the neural net literally zooms in on the zero in a relatively small numbner of steps (under a few hundred). The value of ETA can get as high as 1000000 before the learning "cut-off" mentioned above is applied. More generally, backpropagation with Newton's method will converge extremely quickly wheve gradient descent backprop. is left behind crawling along taking its own sweet time. Not everything's perfect, though. Certain settings of weights and thresholds on even this relatively simple training function caused the net to get stuck in a state where it was trying to emulate the exclusive or function with an or function. This is definitely related to Newton's method relative helplessness at points far away from the zero of the error function. If you play around with Newton's Method on a one-dimensional graph, you will find that it always gets stuck in V-shaped valleys -- or in two dimensions, in cone-shaped "sink holes". However, Gradient Descent gets stuck here too... But generally, using Newton's Method seems to provide a natural way to accelerate learning when and only when it needs acceleration.