Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!eecae!netnews.upenn.edu!rutgers!njin!princeton!phoenix!mbkennel
From: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Newsgroups: comp.ai.neural-nets
Subject: Re: GD and LSE
Message-ID: <7380@phoenix.Princeton.EDU>
Date: 25 Mar 89 05:20:35 GMT
References: <22206@shemp.CS.UCLA.EDU>
Reply-To: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Lines: 52

In article <22206@shemp.CS.UCLA.EDU> gblee@CS.UCLA.EDU () writes:
>I know another similar paper which attacked GD....
>
> Sutton, R. Two problems with BP and other steepest-descent learning
> procedures for networks.
>
> He points out:
>1. steepest descent is a particulary poor for surface containing "ravines"
>2. steepest descent results in high level of interference between learning
>with different patterns...
>
>Unfortunately, I forgot where this paper was published.
>Can anybody out there tell where it was published for us?
>
>--Geunbae Lee
>  AI lab, UCLA


1)  Simply going down the gradient direction has been known for ages
to be a poor heuristic if you're in steep valleys.  It's simply
good fortune that it works in as many cases as it does.  In fact,
the "momentum method" is a very simple way to alleviate this problem
by trying to go faster in the slowly-changing direction.

Conjugate gradient is specifically designed to deal with this kind
of problem.  The disadvantages are 1) algorithm much more complicated
to implement 2) it is not always faster.  In terms of the number
of "iterations", it's certainly faster than fixed-step steepest-descent,
but now each iteration requires a gradient evaluation and several
error evaluations for an approximate line-minimization.  Just try
doing that in silicon...

Often, many fast dumb steps beat a smart but slow method.  In my personal
experience, standard steepest descent w/ momentum is often the best in
terms of bottom-line wall-clock performance compared to fancier
algorithms.

2)  I believe that when you add more examples to the training set
the error surface becomes more "bumpy" locally, and thus the gradient
vector at each point doesn't necessarily point to the overall global
minimum as closely.  I'm not sure about this and am trying to verify it.
Using the momentum method helps for this too, because you effectively end up
averaging the gradient vector over several steps.

In a network whose error surface is non-linear in the weights
(multi-layer perceptron, e.g.) I'm not at all convinced that there
could ever be a general-purpose learning algorithm guaranteed to
give the global minimum solution.  In practical terms, the appropriate
criterion is only "good enough".

Matt Kennel
mbkennel@phoenix.princeton.edu