Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!mips!pacbell.com!ucsd!sdcc6!cornelius!pluto
From: pluto@cornelius.ucsd.edu (Mark Plutowski)
Newsgroups: comp.ai.neural-nets
Subject: Re: Are Conjugate Gradient algorithms any good?
Keywords: NETtalk, Conjugate Gradient algorithms, Back-propagation
Message-ID: <pluto.668285404@cornelius>
Date: 6 Mar 91 18:50:04 GMT
References: <1991Mar4.142559.21857@daimi.aau.dk> <^9B&5R#@warwick.ac.uk>
Sender: news@sdcc6.ucsd.edu
Lines: 65

To recap the discussion so far: some quite interesting empirical results have
been reported comparing two gradient descent algorithms:

(1) a version of conjugate gradient ("Scaled Conjugate Gradient") 
(2) backpropagation.


Variations of these two algorithms are formed by processing the training examples 
in the following modes:

(a) batched (a.k.a. "epoch") learning:  the calculation of each weight 
update utilizes information from all of the available training examples.  

(b) pattern learning:  each weight update is performed after presentation 
of a single example.

(c) "Block" learning: this lies between (a) and (b) in the sense that each 
weight update utilizes a subset of the available training examples.
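
A minimal sketch of the three modes, in Python. The names update_batches and
block_size are hypothetical, chosen for illustration, and are not taken from
any of the reported experiments. The only difference between the modes is
which examples feed each weight update:

```python
def update_batches(n_examples, mode, block_size=None):
    """Yield lists of example indices; one weight update is made per list."""
    if mode == "batch":
        # (a) one update per pass, using every training example
        yield list(range(n_examples))
    elif mode == "pattern":
        # (b) one update after each single example
        for i in range(n_examples):
            yield [i]
    elif mode == "block":
        # (c) one update per block of block_size consecutive examples
        if block_size is None or block_size < 1:
            raise ValueError("block mode needs block_size >= 1")
        for start in range(0, n_examples, block_size):
            yield list(range(start, min(start + block_size, n_examples)))
    else:
        raise ValueError("unknown mode: %r" % (mode,))


# With 5 examples and blocks of 2, block mode makes three updates per pass.
print(list(update_batches(5, "block", block_size=2)))
```

Here the blocks are simply consecutive runs of examples; how a block is
chosen is a separate question that this sketch does not address.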


Remarks:
I was most interested in the report by Denis Anthony, i.e., 

	>... using epoch updates was inferior to pattern updates.

but it was unclear from context whether this compares (1a) with (1b),
or (2a) with (2b).


Regarding the use of the Scaled Conjugate Gradient algorithm in combination
(1b): unless there is some aspect of Scaled Conjugate Gradient that makes it
fundamentally different from the usual conjugate gradient techniques, I cannot
imagine many situations where (1b) could perform anywhere near as well as (1a).
Can someone clarify for me what SCG adds to the usual CG techniques?

In general, if you are doing a line search along the direction obtained by
conjugate gradient (which is the usual way a CG direction is turned into a
weight update), that direction should be computed using ALL of the training
examples.  Essentially, by doing a line search along a direction computed
from a single example, as in (1b), you will likely destroy learning obtained
over previous examples.
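
The concern above can be seen on a toy problem. This is only a sketch under
stated assumptions: a two-weight linear model with squared error, two made-up
training examples, and plain steepest descent with an exact line search
standing in for CG (the first CG direction is the steepest-descent direction).
All names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def example_error(w, x, y):
    """Squared error of the linear model w.x on a single example."""
    r = dot(w, x) - y
    return 0.5 * r * r


def line_search_step_on_example(w, x, y):
    """Exact line search along the negative gradient of ONE example's error."""
    r = dot(w, x) - y
    # The gradient is r * x; minimizing the example's error along -gradient
    # gives the step length alpha = 1 / |x|^2 for this linear model.
    alpha = 1.0 / dot(x, x)
    return [wi - alpha * r * xi for wi, xi in zip(w, x)]


# Two made-up examples (input vector, target).
examples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

w = [0.0, 0.0]
w = line_search_step_on_example(w, *examples[0])
err_first_before = example_error(w, *examples[0])  # example 1 is fit exactly

w = line_search_step_on_example(w, *examples[1])
err_first_after = example_error(w, *examples[0])   # example 1 is worse again

print(err_first_before, err_first_after)
```

The line search fits the current example as well as it can along that
direction, with no regard for what it does to the error on the example
seen just before.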

The math suggests that (1b) should fail miserably.  Variants of conjugate
gradient that allow (1b) to work at all would be most interesting, since
they are (seemingly) finding a conjugate gradient direction compatible
with all of the examples by looking at only a subset of them.
It seems plausible that (1c) should work better than (1b), yet not as well as
(1a), unless the subset of training examples is chosen judiciously.
I have some ideas about how to improve upon (1c) if anyone is interested in
trying them out on an application.

The comparison of (2a) and (2b) is most interesting.  Also of interest would
be a simultaneous comparison with a modified Gauss-Newton approach, where
the "Hessian" matrix is used to scale the weight update appropriately for
each weight.  (Recall that the backpropagation update is the special case
of the Gauss-Newton update obtained by replacing the Hessian with the
identity matrix.)

[Suggestion: throw away the off-diagonal terms and use only
the diagonal terms;  the resulting matrix is always invertible when the diagonal 
terms are non-zero.  This may alleviate the technical difficulties 
normally encountered with using the Hessian.  In a sense, this approach 
scales the learning rate appropriately for each weight.  Again, I would suspect 
that this would be best done by batching, but would be interested in any 
empirical evidence to the contrary.]
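
A minimal sketch of the diagonal-only suggestion, assuming the gradient and
the diagonal Hessian terms have already been computed (ideally over a batch).
The function name, the rate parameter, and the abs/floor guard against tiny
or negative curvature are illustrative assumptions, not part of the
suggestion above.

```python
def diagonal_scaled_update(w, grad, hess_diag, rate=1.0, floor=1e-8):
    """Return w_i - rate * grad_i / h_ii, using only the Hessian diagonal."""
    new_w = []
    for wi, gi, hi in zip(w, grad, hess_diag):
        # Assumption: use |h_ii| with a small floor so the division is safe
        # and the step still points downhill when the curvature is negative.
        scale = max(abs(hi), floor)
        new_w.append(wi - rate * gi / scale)
    return new_w


# Toy error surface E(w) = 0.5 * (100 * w0**2 + w1**2), evaluated at w = (1, 1):
# gradient = (100 * w0, w1), Hessian diagonal = (100, 1).
w = [1.0, 1.0]
grad = [100.0, 1.0]

scaled = diagonal_scaled_update(w, grad, [100.0, 1.0])  # per-weight scaling
plain = diagonal_scaled_update(w, grad, [1.0, 1.0])     # identity "Hessian"

print(scaled, plain)
```

With the identity in place of the Hessian the update reduces to an ordinary
gradient step with a single learning rate, which badly overshoots along the
steep weight in this toy case; the diagonal terms give each weight its own
effective rate.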