Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!warwick!esrmm
From: esrmm@warwick.ac.uk (Denis Anthony)
Newsgroups: comp.ai.neural-nets
Subject: Re: Are Conjugate Gradient algorithms any good?
Keywords: NETtalk, Conjugate Gradient algorithms, Back-propagation
Message-ID: <^9B&5R#@warwick.ac.uk>
Date: 5 Mar 91 09:56:56 GMT
References: <1991Mar4.142559.21857@daimi.aau.dk>
Sender: news@warwick.ac.uk (Network news)
Organization: Computing Services, Warwick University, UK
Lines: 58
Nntp-Posting-Host: clover

In article <1991Mar4.142559.21857@daimi.aau.dk> baronen@daimi.aau.dk (Carsten Greve) writes:
>Recently there has been much talk about the so-called Conjugate Gradient
>algorithms and their use in feed-forward neural networks. We have applied
>one of these algorithms, the Scaled Conjugate Gradient algorithm [1], on
>the NETtalk problem [2] with poor results.
>
>In their original experiment Sejnowski and Rosenberg used the conventional
>back-propagation algorithm with weight updates after each presentation of
>one word (word update). Our experiments with the same algorithm confirmed the
>results of Sejnowski and Rosenberg. However, experiments showed that
>back-propagation was unable to converge if the weights were updated only
>after the entire training set had been presented (epoch update).
>
>The SCG algorithm is reported, like several other Conjugate Gradient
>algorithms, to outperform ordinary back-propagation with epoch learning.
>Yet we found that the SCG algorithm was unable to match the performance of
>back-propagation when word updates were used instead. In the SCG algorithm
>weights are (normally) updated once after each presentation of the entire
>training set (epoch update).
>

Others have also found SCG less than wonderful, e.g.

%Q Chan L
%D 1990
%T Efficacy of Different Learning Algorithms of the Back Propagation Network (in Conference Proc. Computer and Communication Systems)
%J Proceedings of the IEEE Region 10

who found that it performed worse than dynamic learning rate adjustment.

I have experimented with the latter (dynamic learning rate and momentum adjustment) as in

%Q Vogl T.P, Mangis J.K, Rigler A.K, Zink W.T, and Alkon D.L
%D 1988
%T Accelerating the Convergence of the Back-Propagation Method
%J Biological Cybernetics
%V 59
%P 257-263

and found, as Carsten Greve reports for SCG, that epoch updates were inferior
to pattern updates. Specifically, epoch learning with dynamic adjustment was
worse than pattern learning without any parameter tuning. Changing the
algorithm to pattern learning with adjustment does improve convergence,
though for reasons I do not understand the convergence rate is initially
worse; after a while the learning rate shoots up and convergence surpasses
the "ordinary" method. Is there some theoretical reason why pattern learning
should be better? The maths in Rumelhart et al's original discussion assumes
epoch learning, but they say that with a small learning rate pattern learning
should be approximately the same. That is not what we find: pattern learning
seems to be better.
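
To make the two update schemes concrete, here is a minimal sketch of my own
(not the code used in any of the experiments above). It trains a single
linear unit with the delta rule on a toy, noise-free problem; the data, the
function names and the learning rate are all assumptions chosen for
illustration. On a problem this easy both schemes reach the same answer,
which is the small-learning-rate case Rumelhart et al describe; it says
nothing about how they compare on something like NETtalk.

```python
import random

def make_data(n=20, seed=0):
    # Toy, noise-free targets from the line t = 2x + 0.5 (an assumption).
    rng = random.Random(seed)
    return [(x, 2.0 * x + 0.5)
            for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]

def train_epoch(data, lr, epochs):
    # Epoch (batch) update: sum the gradient over the whole training
    # set, then change the weights once.
    w = b = 0.0
    for _ in range(epochs):
        gw = gb = 0.0
        for x, t in data:
            err = w * x + b - t
            gw += err * x
            gb += err
        w -= lr * gw
        b -= lr * gb
    return w, b

def train_pattern(data, lr, epochs):
    # Pattern update: change the weights after every single pattern.
    w = b = 0.0
    for _ in range(epochs):
        for x, t in data:
            err = w * x + b - t
            w -= lr * err * x
            b -= lr * err
    return w, b

data = make_data()
print("epoch  :", train_epoch(data, lr=0.01, epochs=500))
print("pattern:", train_pattern(data, lr=0.01, epochs=500))
```

The only difference between the two functions is where the weight change
happens: outside or inside the loop over patterns.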

The odd convergence curve that I got for dynamic learning with pattern
updates (slower than ordinary learning at first, then a change of slope) did
not occur in epoch dynamic learning, where I just got an asymptotic
convergence to a high error. It may be due to an initial descent in error
ravines: the increasing learning rate makes the weights oscillate from side
to side of the ravine, but each time at a lower total error, so the algorithm
keeps increasing the learning rate. Once it arrives at the base of the
ravine, it shoots along it, as that is then the steepest slope, and does so
more quickly with the higher learning rate.
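
The ravine picture can be tried out on a toy error surface. The sketch below
is only loosely in the spirit of the dynamic learning rate rule: it raises
the rate after a step that lowers the error and cuts it (rejecting the step)
after one that raises the error by more than a small tolerance. The
quadratic "ravine", the constants 1.05, 0.7 and 1.01, and the omission of
momentum are all my assumptions for illustration, not the settings used in
the experiments.

```python
def ravine_error(w):
    # Elongated quadratic bowl: steep across (w[0]), shallow along (w[1]).
    return 0.5 * (100.0 * w[0] ** 2 + w[1] ** 2)

def ravine_grad(w):
    return [100.0 * w[0], w[1]]

def adaptive_descent(w, lr=0.001, steps=200, grow=1.05, shrink=0.7, tol=1.01):
    w = list(w)
    err = ravine_error(w)
    history = [(err, lr)]                  # (error, learning rate) per step
    for _ in range(steps):
        g = ravine_grad(w)
        trial = [wi - lr * gi for wi, gi in zip(w, g)]
        trial_err = ravine_error(trial)
        if trial_err < err:
            w, err = trial, trial_err      # error fell: accept, raise the rate
            lr *= grow
        elif trial_err > tol * err:
            lr *= shrink                   # error rose too much: reject, cut the rate
        else:
            w, err = trial, trial_err      # small rise: accept, rate unchanged
        history.append((err, lr))
    return w, history

w_final, history = adaptive_descent([1.0, 1.0])
for step in range(0, len(history), 20):
    print(step, history[step])
```

While the rate is between 0.01 and 0.02 on this surface, w[0] changes sign
at every step but shrinks in magnitude, so the error keeps falling and the
rate keeps rising, which is the side-to-side behaviour described above.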

Is this reasonable? Any other ideas?

Denis.