Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!att!linac!uwm.edu!csd4.csd.uwm.edu!markh
From: markh@csd4.csd.uwm.edu (Mark William Hopkins)
Newsgroups: comp.ai.neural-nets
Subject: Using Newton's Method to speed up backpropagation.
Message-ID: <7556@uwm.edu>
Date: 10 Nov 90 07:21:32 GMT
Sender: news@uwm.edu
Organization: University of Wisconsin - Milwaukee
Lines: 78


   A typical intro. to backpropagation that goes into any kind of analysis in
depth will derive the weight (and threshold) corrections used in backprop.
from the Gradient Descent method applied to a certain least-squares error
function.

   The Gradient Descent method tries to find the minimum of this function,
which is supposed to represent the optimal state for the system to be in.
Associated with this method is a constant, usually denoted as the Greek letter
ETA, which controls the learning rate of the system.  You have to set it "by
hand".  If you make it too large, the system oscillates and fails to learn, and
if you set it too small, it learns too slow.

   So, the obvious idea occurred to me way back in November of last year: use
Newton's Method to find the zero of the error function.  Newton's Method does
not generally get stuck in local minima because it's not even trying to find
minima in the first place.  And when you think about it, finding minima of
the error function wasn't really even the goal in the first place -- finding
the zero (or a near-zero) was.  Also, when it converges it converges fast.

   You might think that using Newton's Method for backprop. is going to entail
a computationally expensive operation of calculating derivatives, maybe a
couple matrix inversions and divisions.  However, the fact is that backprop.
with Newton's Method is VIRTUALLY IDENTICAL to backprop. with Gradient
Descent.  What's the difference?  Newton's Method calculates the optimal
learning "constant" ETA on the fly.

   The only unfortunate things about the method is that it doesn't actually
converge so fast when the function in question behaves like f(x) = x^2 near
the zero, and the error function ALWAYS has this property near its zero.  The
other fault of the method is that it behaves erratically near local minima,
and particularily near points where the error function almost but not quite
reaches 0.

   Now the erratic behavior of Newton's Method near a local minima, when you
think about it, is actually a desired feature.  You WANT your neural net to
jump out of a local minimum when it hits one.  And because you're dividing
by a learning constant that approaches 0 as you near the local minimum to
determine ETA, the system becomes more "volatile" as it approaches such a
point.

   The only time you don't want the system to jump around erratically is when
you're actually nearing the zero.  This is something that can actually be fixed
quite easily by providing a cut-off: no more learning when the error falls
below a certain threshold.  Now what you have is a system that even controls,
autonomously, when it learns: another desired feature.  This also keeps the
system from jumping out from near convergence when there is, in fact, no
actual zero to the error function but just a near-zero.

   A simulation of a Newton's Method backprop. neural net provided some degree
of conformation of these major points.  A simple neural net was set up to
learn the exclusive-or function.  Backprop. without momentum terms exhibits
horribly slow convergence to this function, even when its weights are
initialized to values near convergence.  The problem is that some parts of the
underlying surface of the error function require learning constants, ETA, of
(say) under 0.25, and other parts require learning constants of in excess of
1000000 (!).

   Newton's method determines this dynamically.  When the weights are set
initially near the optimum setting, the neural net literally zooms in on the
zero in a relatively small numbner of steps (under a few hundred).  The
value of ETA can get as high as 1000000 before the learning "cut-off"
mentioned above is applied.  More generally, backpropagation with Newton's
method will converge extremely quickly wheve gradient descent backprop. is
left behind crawling along taking its own sweet time.

   Not everything's perfect, though.  Certain settings of weights and
thresholds on even this relatively simple training function caused the net to
get stuck in a state where it was trying to emulate the exclusive or function
with an or function.  This is definitely related to Newton's method relative
helplessness at points far away from the zero of the error function.

   If you play around with Newton's Method on a one-dimensional graph, you will
find that it always gets stuck in V-shaped valleys -- or in two dimensions,
in cone-shaped "sink holes".  However, Gradient Descent gets stuck here too...

   But generally, using Newton's Method seems to provide a natural way to
accelerate learning when and only when it needs acceleration.