Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!columbia!cs!camargo
From: camargo@cs.columbia.edu (Francisco Camargo)
Newsgroups: comp.ai.neural-nets
Subject: Back Propagation question... (follow up)
Message-ID: <226@cs.columbia.edu>
Date: 30 May 89 14:18:30 GMT
Organization: Columbia University Department of Computer Science
Lines: 85

Hi there,

I'm re-posting my previous message together with a reply that I received from
Tony Plate and my reply to him. I'd really appreciate comments on this issue.
Thanks to all.
-----------------------------------------------------------------------------


|In article <224@cs.columbia.edu> you write:
||
||Can anyone shed some light on the following issue:
||
||How should one compute the weight adjustments in BackProp?
||From reading PDP, one gets the impression that the DELTAS
||should be accumulated over all INPUT PATTERNS and only then
||a STEP is taken down the gradient. Robbins and Monro suggest
||a stochastic algorithm with proven convergence if one takes one
||step at each pattern presentation, but dumps its effect by a factor
||1/k where "k" is the presentation number. Other people (judging from
||code that I've seen floating around) seem to take a STEP at each
||presentation and don't take into account any dumping factors. I've tried
||both approaches myself and they both seem to work. So which is the correct
||way of adjusting the weights? Accumulate the errors over all patterns? Or
||work towards the minimum as new patterns are presented? What are the
||implications?
||
||Any light on this issue is greatly appreciated.
||
-----------------------------------------------------------------------------
| There are two standard methods of doing the updates, sometimes called
| "batch" and "online" learning.
|
| In "batch" learning, all the changes are accumulated for one pass through
| all the examples.  At the end of the pass (or "epoch") the update is made.
| Thus, each link requires an extra storage field in which to accumulate
| the changes.
|
| In "online" learning, the change is made after seeing each example.
|
| Some people claim online is better, others claim batch is better.
|
| "dumping" (you mean "weighting") each change by 1/k, where k is the number
| of the example (?) sounds really weird. Do you mean that if you had four
| examples in your training set, changes from the fourth would be worth only
| a quarter as much as changes from the first? Surely you don't mean this!
|
| Some people use a momentum term, and some change the learning rate during
| learning.  Using momentum seems to be generally a good thing, and it's
| easy to do.  Automatically changing the learning rate is much harder.
|
| .....
| ..... Connectionist Learning Algorithms by Hinton....
| .....
|
| tony plate
------------------------------------------------------------------------------
Hi Tony,

Sorry for my previous message being so unspecific. What I meant is that
the dumping occurs after each "epoch." The idea is that the changes in
the weights tend to be of lesser and lesser importance. Actually, the way the
algorithm is stated, one should dump (I really mean dump) the step size
by a series of terms {a_k} where "sum({a_k}^2)<infinity", while sum({a_k})
itself must diverge. In any case, using {a_k}=1/k for k="epoch number" should
be enough.
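
As a quick numerical check of the step-size condition, here is a minimal
sketch (the helper name partial_sums is made up for illustration). With
a_k = 1/k the squared terms have a bounded sum, approaching pi^2/6, while
the plain sum keeps growing roughly like log(n).

```python
import math

def partial_sums(n):
    """Return (sum of a_k, sum of a_k**2) for a_k = 1/k, k = 1..n."""
    s1 = sum(1.0 / k for k in range(1, n + 1))
    s2 = sum(1.0 / (k * k) for k in range(1, n + 1))
    return s1, s2

# Print the partial sums for a few values of n to see the two behaviours.
for n in (10, 1000, 100000):
    s1, s2 = partial_sums(n)
    print(n, round(s1, 3), round(s2, 5), "bound:", round(math.pi ** 2 / 6, 5))
```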

My problem is that I cannot find any (theoretical) justification for the
"online" method other than the "Robbins-Monro algorithm" (I don't have my
references nearby, so I apologize if I have the names wrong). But then, the
"dumping" factor is required for guaranteed convergence. I tried the "online"
method and it does seem to perform better. But WHY does it work? How come it
converges so well (despite setting {a_k}=1)?
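
For concreteness, the three schemes under discussion can be sketched as
follows. This is only an illustration under assumptions that are not from
PDP or from any of the codes mentioned: a single linear unit with squared
error, four noiseless toy patterns generated from the weights (2, -1), a
learning rate of 0.1, and made-up names (PATTERNS, grad, train). One thing
it makes visible is that, for a small learning rate, an epoch of "online"
steps is close to one "batch" step, since the two differ only in terms of
second order in the learning rate.

```python
# Toy patterns (input, target); targets come from w_true = (2, -1).
PATTERNS = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
            ((1.0, 1.0), 1.0), ((1.0, -1.0), 3.0)]

def grad(w, x, t):
    """Gradient of 0.5 * (w.x - t)**2 with respect to w for one pattern."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - t
    return [err * xi for xi in x]

def train(mode, epochs=200, eta=0.1, damped=False):
    """mode is 'batch' or 'online'; damped=True scales the step by 1/epoch."""
    w = [0.0, 0.0]
    for k in range(1, epochs + 1):
        step = eta / k if damped else eta
        if mode == "batch":
            # Accumulate the gradient over all patterns, then take one step.
            total = [0.0, 0.0]
            for x, t in PATTERNS:
                total = [a + g for a, g in zip(total, grad(w, x, t))]
            w = [wi - step * gi for wi, gi in zip(w, total)]
        else:
            # Take a step immediately after each pattern presentation.
            for x, t in PATTERNS:
                w = [wi - step * gi for wi, gi in zip(w, grad(w, x, t))]
    return w

print("batch          ", train("batch"))
print("online         ", train("online"))
print("online, 1/k    ", train("online", damped=True))
```

On this particular toy problem the undamped batch and online runs both end
up at the generating weights, while the 1/k-damped run is still visibly short
of them after the same number of epochs. That says nothing general about
noisy or nonlinear problems, where the damping is what the convergence proof
relies on.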

I am familiar with the use of "momentum" in the learning process, but I
really want to better understand the theoretical reasons behind the "online"
method. Having started my studies with the "batch" mode, it seems a little
like black magic that the "online" method works.

I have the paper by Hinton, "Connectionist Learning Procedures", CMU-CS-87-115.
Is this the paper you referred to? Are there any other improvements on this work?
I appreciate your time and effort.
Thanks,


/Kiko.
camargo@cs.columbia.edu