Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!ucsd!sdcsvax!beowulf!demers
From: demers@beowulf.ucsd.edu (David E Demers)
Newsgroups: comp.ai.neural-nets
Subject: Re: Back Propagation question... (follow up)
Message-ID: <6532@sdcsvax.UCSD.Edu>
Date: 30 May 89 19:23:47 GMT
References: <226@cs.columbia.edu>
Sender: nobody@sdcsvax.UCSD.Edu
Reply-To: demers@beowulf.UCSD.EDU (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Lines: 64

In article <226@cs.columbia.edu> camargo@cs.columbia.edu (Francisco Camargo) writes:
>I'm re-posting my previous message together with a reply that I received from
>Tony Plate and my reply to him. I'd really appreciate comments on this issue.
>-----------------------------------------------------------------------------
|In article <224@cs.columbia.edu> [Francisco Camargo] writes:
||How should one compute the weight adjustments in BackProp ?
||From reading PDP, one gathers the impression that the DELTAS
||should be acumulated over all INPUT PATTERNS and only then
||a STEP is taken towards the gradient. Robins Monroe suggests
||a stochastic algorithm with proved convergency if one takes one
||step at each pattern presentation, but dumps its effect by a factor
||1/k where "k" is the presentation number. Other people,(from codes
||that I've seen flying around) seems to take a STEP a each presentation
||a don't take into account any dumping factors. I've tried myself both
||approaches and they all seem to work. After all, which is the correct way
||of adjusting the weights ? Acumulate the errors over all patterns ? Or, work
||towards the minimum as new patterns are presented.Which are the implications?
-----------------------------------------------------------------------------
[Tony replies]
| There are two standard methods of doing the updates, sometimes called
| "batch" and "online" learning.
|
| In "batch" learning, all the changes are accumulated for one pass through
| all the examples.
| At the end of the pass (or "epoch") the update is made.
| Some people use a momentum term, and some change the learning rate during
| learning. Using momentum seems to be generally a good thing, and it's
| easy to do. Automatically changing the learning rate is much harder.

[No it's not...]

>------------------------------------------------------------------------------
[Francisco tries to explain what he means by "dumping", and the
"Robins Monroe" algorithm...]

>"dumping" factor is required for guaranteed convergence. I tried the "online"
>method and it does seem to perform better. But, WHY does it work ? How come it
>converges so well (despite of making {a_k}=1) ?
>
>I am familiar with the use of "momentum" in the learning process, but I
>really want to understand more the theoretical reasons for the "online"
>method. Having started my studies with the "batch" mode, it seems a little
>like black magic that the "online" method works.
>
>I have the paper by Hinton, "Connectionist Learning Procedures", CMU-CS-87-115.
>Is this the paper you refered to ? Any other improvements to this work?

Sorry to quote so much of the prior postings, but I thought it worth
it to retain context.

I am not sure that I fully understand Francisco's question. But
I'll answer it anyway :-)

Essentially, what backpropagation is trying to do is to achieve
a minimum mean squared error (MSE) by following the gradient of the
error as a function of the weights. The "batch" method works well
because you get a good picture of the true gradient after seeing all
of the input-output pairs. However, as long as the corrections go
"downhill", we will converge (possibly to a local rather than a
global minimum). Making weight changes after presentation of each
training example will not necessarily follow the true gradient at
every step, but with a small learning rate, in the aggregate we will
still be moving downhill (reducing MSE).

Dave
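
A minimal sketch of the two update schemes discussed above, not code from any of the posters. To keep it short it assumes a single linear unit y = w*x + b rather than a multilayer net, so the gradient is computed directly instead of by backpropagation. The toy data, the learning rate of 0.05, the 200 epochs, and all function names are made up for illustration. It only shows that both the "batch" step (one update per epoch) and the "online" step (one small update per pattern) drive the MSE down on this problem.

```python
import random


def mse(weights, data):
    """Mean squared error of the linear unit y = w*x + b over all patterns."""
    w, b = weights
    return sum((w * x + b - t) ** 2 for x, t in data) / len(data)


def pattern_gradient(weights, x, t):
    """Gradient of the squared error (w*x + b - t)**2 for a single pattern."""
    w, b = weights
    err = w * x + b - t
    return (2 * err * x, 2 * err)


def batch_epoch(weights, data, lr):
    """'Batch': accumulate the gradient over every pattern, then step once."""
    gw = gb = 0.0
    for x, t in data:
        dw, db = pattern_gradient(weights, x, t)
        gw += dw
        gb += db
    n = len(data)
    w, b = weights
    return (w - lr * gw / n, b - lr * gb / n)


def online_epoch(weights, data, lr):
    """'Online': take a small step after each pattern presentation."""
    w, b = weights
    for x, t in data:
        dw, db = pattern_gradient((w, b), x, t)
        w -= lr * dw
        b -= lr * db
    return (w, b)


def train(epoch_fn, data, lr=0.05, epochs=200):
    """Run the given per-epoch update rule from zero initial weights."""
    weights = (0.0, 0.0)
    for _ in range(epochs):
        weights = epoch_fn(weights, data, lr)
    return weights


random.seed(0)
# Hypothetical toy data: targets follow t = 2x - 1 plus a little noise.
data = [(x, 2 * x - 1 + random.gauss(0, 0.1))
        for x in (random.uniform(-1, 1) for _ in range(20))]

start_error = mse((0.0, 0.0), data)
batch_weights = train(batch_epoch, data)
online_weights = train(online_epoch, data)

print("start  MSE:", start_error)
print("batch  MSE:", mse(batch_weights, data))
print("online MSE:", mse(online_weights, data))
```

With a fixed learning rate the online version settles into a small neighbourhood of the minimum rather than exactly on it. Shrinking the step by 1/k, as in the damped scheme Francisco describes, trades that residual jitter for slower progress.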