Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!eecae!netnews.upenn.edu!rutgers!njin!princeton!phoenix!mbkennel From: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) Newsgroups: comp.ai.neural-nets Subject: Re: GD and LSE Message-ID: <7380@phoenix.Princeton.EDU> Date: 25 Mar 89 05:20:35 GMT References: <22206@shemp.CS.UCLA.EDU> Reply-To: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) Organization: Princeton University, NJ Lines: 52 In article <22206@shemp.CS.UCLA.EDU> gblee@CS.UCLA.EDU () writes: >I know another similar paper which attacked GD.... > > Sutton, R. Two problems with BP and other steepest-descent learning > procedures for networks. > > He points out: >1. steepest descent is a particulary poor for surface containing "ravines" >2. steepest descent results in high level of interference between learning >with different patterns... > >Unfortunately, I forgot where this paper was published. >Can anybody out there tell where it was published for us? > >--Geunbae Lee > AI lab, UCLA 1) Simply going down the gradient direction has been known for ages to be a poor heuristic if you're in steep valleys. It's simply good fortune that it works in as many cases as it does. In fact, the "momentum method" is a very simple way to alleviate this problem by trying to go faster in the slowly-changing direction. Conjugate gradient is specifically designed to deal with this kind of problem. The disadvantages are 1) algorithm much more complicated to implement 2) it is not always faster. In terms of the number of "iterations", it's certainly faster than fixed-step steepest-descent, but now each iteration requires a gradient evaluation and several error evaluations for an approximate line-minimization. Just try doing that in silicon... Often, many fast dumb steps beat a smart but slow method. In my personal experience, standard steepest descent w/ momentum is often the best in terms of bottom-line wall-clock performance compared to fancier algorithms. 2) I believe that when you add more examples to the training set the error surface becomes more "bumpy" locally, and thus the gradient vector at each point doesn't necessarily point to the overall global minimum as closely. I'm not sure about this and am trying to verify it. Using the momentum method helps for this too, because you effectively end up averaging the gradient vector over several steps. In a network whose error surface is non-linear in the weights (multi-layer perceptron, e.g.) I'm not at all convinced that there could ever be a general-purpose learning algorithm guaranteed to give the global minimum solution. In practical terms, the appropriate criterion is only "good enough". Matt Kennel mbkennel@phoenix.princeton.edu