Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!mips!pacbell.com!ucsd!sdcc6!cornelius!pluto
From: pluto@cornelius.ucsd.edu (Mark Plutowski)
Newsgroups: comp.ai.neural-nets
Subject: Re: Are Conjugate Gradient algorithms any good?
Keywords: NETtalk, Conjugate Gradient algorithms, Back-propagation
Message-ID:
Date: 6 Mar 91 18:50:04 GMT
References: <1991Mar4.142559.21857@daimi.aau.dk> <^9B&5R#@warwick.ac.uk>
Sender: news@sdcc6.ucsd.edu
Lines: 65

Reiterating the discussion so far:

Some quite interesting empirical results have been reported comparing
two gradient descent algorithms:

(1) a version of conjugate gradient ("Scaled Conjugate Gradient")
(2) backpropagation.

Variations of these two algorithms are formed by processing
the training examples in the following modes:

(a) batched (a.k.a. "epoch") learning: the calculation of each
    weight update utilizes information from all of the available
    training examples.
(b) pattern learning: each weight update is performed after
    presentation of a single example.
(c) "block" learning: this lies between (a) and (b) in the sense
    that each weight update utilizes a subset of the available
    training examples.

Remarks:

I was most interested in the report by Denis Anthony, i.e.,

>... using epoch updates was inferior to pattern updates.

except that it was unclear from context whether this refers to a
comparison of (1a) with (1b), or of (2a) with (2b).

Regarding the use of the Scaled Conjugate Gradient algorithm to
obtain the combination (1b): unless there is some aspect of Scaled
Conjugate Gradient that makes it fundamentally different from the
usual conjugate gradient techniques, I cannot imagine many situations
where (1b) could perform anywhere near as well as (1a). Can someone
clarify for me what SCG brings to the usual CG techniques?

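To make the three modes (a), (b) and (c) concrete, here is a minimal sketch
using plain gradient descent (not SCG) on a toy linear least-squares problem.
Everything in it is an assumption made for illustration: the function names,
the learning rate, the data, and the target weights are hypothetical and do
not come from any of the experiments discussed in this thread. The only thing
that changes between the three runs is how many examples feed each weight update.

```python
import random

def gradient(w, examples):
    """Mean gradient of 0.5 * (w.x - y)^2 over the given examples."""
    g = [0.0] * len(w)
    for x, y in examples:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / len(examples)
    return g

def train(examples, block_size, epochs=200, lr=0.5, seed=0):
    """Gradient descent where each weight update uses `block_size` examples.

    block_size == len(examples) -> (a) batch / epoch learning
    block_size == 1             -> (b) pattern learning
    anything in between         -> (c) block learning
    """
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # new presentation order on every pass
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            g = gradient(w, block)
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Toy, noise-free data from the hypothetical target y = 2*x0 - 1*x1.
data_rng = random.Random(1)
examples = []
for _ in range(20):
    x = (data_rng.uniform(-1.0, 1.0), data_rng.uniform(-1.0, 1.0))
    examples.append((x, 2.0 * x[0] - 1.0 * x[1]))

w_batch = train(examples, block_size=len(examples))  # mode (a)
w_pattern = train(examples, block_size=1)            # mode (b)
w_block = train(examples, block_size=5)              # mode (c)
```

On a noise-free linear problem like this one all three modes should reach
the same weights, so the sketch only illustrates the bookkeeping. It says
nothing about which mode is better on a real network.
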
In general, if you are doing a line search along the direction
obtained by conjugate gradient (which is the usual way the CG
direction is used to obtain a weight update), this direction should
be obtained using ALL of the training examples. Essentially, by doing
a line search along the direction obtained by (1b), you will likely
destroy learning obtained over previous examples.

The math suggests that (1b) should fail miserably. Variants of
conjugate gradient that allow (1b) to work at all would be most
interesting, since they are (seemingly) finding a conjugate gradient
direction compatible with all of the examples by looking at only a
subset of the examples. It seems plausible that (1c) should work
better than (1b), yet not as well as (1a) (unless this subset of
training examples is chosen judiciously). I have some ideas about
how to improve upon (1c) if anyone is interested in trying it out
on an application.

The comparison of (2a) and (2b) is most interesting. Of further
interest would be a simultaneous comparison with a modified
Gauss-Newton approach (where the "Hessian" matrix is used to scale
the weight update appropriately for each weight). (Recall that the
backpropagation update is, up to the learning rate, the special case
of the Gauss-Newton update obtained by setting the Hessian to the
identity matrix.)

[Suggestion: throw away the off-diagonal terms and use only the
diagonal terms; the resulting matrix is always invertible when the
diagonal terms are non-zero. This may alleviate the technical
difficulties normally encountered with using the Hessian. In a
sense, this approach scales the learning rate appropriately for
each weight. Again, I would suspect that this would be best done
by batching, but would be interested in any empirical evidence to
the contrary.]
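
The diagonal-only suggestion above can be sketched as follows, again on a
toy linear least-squares problem in batch mode. This is a sketch under
stated assumptions, not a recipe for a real network: the helper names, the
data, the step count, and the plain-descent learning rate are all
hypothetical. For a linear model with squared error the Gauss-Newton
"Hessian" is the mean of x x^T, so its diagonal is just the mean of each
squared input. The second input is deliberately given ten times the range of
the first, so that a single global learning rate suits one weight poorly.

```python
import random

def loss_and_grad(w, data):
    """Mean of 0.5 * (w.x - y)^2 over all examples, and its gradient."""
    n = len(data)
    loss = 0.0
    g = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += 0.5 * err * err / n
        for i, xi in enumerate(x):
            g[i] += err * xi / n
    return loss, g

def hessian_diagonal(data, dim):
    """Diagonal of the Gauss-Newton matrix for a linear model: mean of x_i^2."""
    n = len(data)
    return [sum(x[i] * x[i] for x, _ in data) / n for i in range(dim)]

def descend(data, dim, steps, scale):
    """Batch descent with a separate step size scale[i] for each weight."""
    w = [0.0] * dim
    for _ in range(steps):
        _, g = loss_and_grad(w, data)
        w = [wi - si * gi for wi, si, gi in zip(w, scale, g)]
    return w, loss_and_grad(w, data)[0]

# Hypothetical data: x1 spans ten times the range of x0; target y = 2*x0 - x1.
rng = random.Random(0)
data = []
for _ in range(50):
    x = (rng.uniform(-1.0, 1.0), rng.uniform(-10.0, 10.0))
    data.append((x, 2.0 * x[0] - 1.0 * x[1]))

h_diag = hessian_diagonal(data, 2)  # must be non-zero to be invertible

# "Hessian = identity": one global learning rate (assumed value 0.01).
w_plain, loss_plain = descend(data, 2, steps=50, scale=[0.01, 0.01])

# Diagonal Hessian: each weight's step is divided by its own curvature.
w_diag, loss_diag = descend(data, 2, steps=50, scale=[1.0 / h for h in h_diag])

print("plain:", loss_plain, " diagonal:", loss_diag)
```

Note that dropping the off-diagonal terms ignores any correlation between
inputs. The sketch uses independent inputs, which is the favourable case
for the diagonal approximation.
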