Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!ucsd!sdcsvax!beowulf!demers
From: demers@beowulf.ucsd.edu (David E Demers)
Newsgroups: comp.ai.neural-nets
Subject: Re: Back Propagation question... (follow up)
Message-ID: <6532@sdcsvax.UCSD.Edu>
Date: 30 May 89 19:23:47 GMT
References: <226@cs.columbia.edu>
Sender: nobody@sdcsvax.UCSD.Edu
Reply-To: demers@beowulf.UCSD.EDU (David E Demers)
Organization: EE/CS Dept. U.C. San Diego
Lines: 64

In article <226@cs.columbia.edu> camargo@cs.columbia.edu (Francisco Camargo) writes:
>I'm re-posting my previous message together with a reply that I received from
>Tony Plate and my reply to him. I'd really appreciate comments on this issue.
>-----------------------------------------------------------------------------
|In article <224@cs.columbia.edu> [Francisco Camargo] writes:
||How should one compute the weight adjustments in BackProp ?
||From reading PDP, one gathers the impression that the DELTAS
||should be accumulated over all INPUT PATTERNS and only then
||a STEP is taken towards the gradient. Robins Monroe suggests
||a stochastic algorithm with proved convergence if one takes one
||step at each pattern presentation, but dumps its effect by a factor
||1/k where "k" is the presentation number. Other people (from codes
||that I've seen flying around) seem to take a STEP at each presentation
||and don't take into account any dumping factors. I've tried both
||approaches myself and they all seem to work. After all, which is the correct way
||of adjusting the weights? Accumulate the errors over all patterns? Or work
||towards the minimum as new patterns are presented? What are the implications?
-----------------------------------------------------------------------------
[Tony replies]
| There are two standard methods of doing the updates, sometimes called
| "batch" and "online" learning.
|
| In "batch" learning, all the changes are accumulated for one pass through
| all the examples.  At the end of the pass (or "epoch") the update is made.
| Some people use a momentum term, and some change the learning rate during
| learning.  Using momentum seems to be generally a good thing, and it's
| easy to do.  Automatically changing the learning rate is much harder.

[No it's not...]
>------------------------------------------------------------------------------
[Francisco tries to explain what he means by "dumping" (i.e., damping),
and the "Robins Monroe" (Robbins-Monro) algorithm...]
>"dumping" factor is required for guaranteed convergence. I tried the "online"
>method and it does seem to perform better. But WHY does it work? How come it
>converges so well (despite setting {a_k}=1)?
>
>I am familiar with the use of "momentum" in the learning process, but I 
>really want to understand more the theoretical reasons for the "online"
>method. Having started my studies with the "batch" mode, it seems a little
>like black magic that the "online" method works.
>
>I have the paper by Hinton, "Connectionist Learning Procedures", CMU-CS-87-115.
>Is this the paper you referred to? Any other improvements to this work?


Sorry to quote so much of the prior postings, but I thought it worth
it to retain context.

I am not sure that I fully understand Francisco's question.  But I'll
answer it anyway :-)  

Essentially, what backpropagation is trying to do is to achieve a minimum
mean squared error (MSE) by following the negative gradient of the error
as a function of the weights.  The "batch" method works well because the
sum of the per-pattern gradients over all of the input-output pairs is
the true gradient of the total error.  However, as long as the
corrections go "downhill" on the total error and the steps are small
enough, we will generally still converge (possibly to a local rather
than global minimum).  Making a weight change after each training
example follows the gradient of that one example's error, not of the
total error, so an individual step can even increase the MSE.  But with
a small learning rate the weights change little during one pass, the
per-pattern steps add up to approximately the batch step, and in the
aggregate we will still be moving downhill (reducing MSE).
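
To make the comparison concrete, here is a minimal sketch of the two
update schemes on a single linear unit (the delta rule, not a full
multi-layer backprop net).  The toy target y = 2*x1 - 3*x2, the learning
rate, the epoch count and all function names are assumptions chosen for
illustration; none of them come from the postings above.

```python
import random

def make_data(n=20, seed=0):
    """Hypothetical toy training set: targets follow t = 2*x1 - 3*x2."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        data.append((x, 2.0 * x[0] - 3.0 * x[1]))
    return data

def mse(w, data):
    """Mean squared error of the linear unit over the whole training set."""
    return sum((w[0] * x[0] + w[1] * x[1] - t) ** 2 for x, t in data) / len(data)

def grad_one(w, x, t):
    """Gradient of 0.5*(output - target)^2 for ONE pattern."""
    err = w[0] * x[0] + w[1] * x[1] - t
    return (err * x[0], err * x[1])

def batch_epoch(w, data, lr):
    """Accumulate the gradient over all patterns, then take one step."""
    g0 = g1 = 0.0
    for x, t in data:
        g = grad_one(w, x, t)
        g0 += g[0]
        g1 += g[1]
    return (w[0] - lr * g0, w[1] - lr * g1)

def online_epoch(w, data, lr):
    """Take a step after every pattern (no 1/k damping of the step size)."""
    for x, t in data:
        g = grad_one(w, x, t)
        w = (w[0] - lr * g[0], w[1] - lr * g[1])
    return w

def train(epoch_fn, data, lr=0.02, epochs=200):
    """Run one update scheme from w = (0, 0); return weights and MSE per epoch."""
    w = (0.0, 0.0)
    history = [mse(w, data)]
    for _ in range(epochs):
        w = epoch_fn(w, data, lr)
        history.append(mse(w, data))
    return w, history

if __name__ == "__main__":
    data = make_data()
    for name, fn in (("batch", batch_epoch), ("online", online_epoch)):
        w, history = train(fn, data)
        print(name, "weights:", w, "final MSE:", history[-1])
```

With a small learning rate both schemes should end up near the same
weights.  A 1/k schedule could be tried by passing lr/k to online_epoch
on the k-th pass, although that damps per pass rather than per
presentation as in the quoted description.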


Dave