Path: utzoo!attcan!utgpu!utstat!jarvis.csri.toronto.edu!rutgers!njin!princeton!phoenix!mbkennel
From: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Newsgroups: comp.ai.neural-nets
Subject: Re: Back Propagation question... (follow up)
Message-ID: <8795@phoenix.Princeton.EDU>
Date: 30 May 89 20:28:32 GMT
References: <226@cs.columbia.edu>
Reply-To: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Lines: 46

In article <226@cs.columbia.edu> camargo@cs.columbia.edu (Francisco Camargo) writes:
>[stuff deleted]
>
>My problem is that I can't find any (theoretical) justification for the "online"
>method other than the "Robbins-Monro algorithm" (I may have misspelled the name,
>for which I apologize, but I don't have my references nearby). But then, the
>"damping" factor is required for guaranteed convergence. I tried the "online"
>method and it does seem to perform better. But WHY does it work? How come it
>converges so well (despite setting {a_k}=1)?

>
>I am familiar with the use of "momentum" in the learning process, but I 
>really want to understand more the theoretical reasons for the "online"
>method. Having started my studies with the "batch" mode, it seems a little
>like black magic that the "online" method works.

I have an intuitive explanation.  It's not rigorous by any means, and
it could even be completely wrong, but here goes...

In most problems, there is some underlying regularity that _all_ examples
possess that you're trying to learn.  Thus, if you update the weights
after each example, every example starts from weights already improved
by the previous ones, so one pass over N examples gives you N corrective
steps.  If you only update after a whole run through the training set,
the same pass gives you a single step, so it takes much longer to learn
this regularity.
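
To make that concrete, here is a toy sketch (mine, not taken from any
real experiment): a single weight w fitted to examples that all obey
y = 2x.  The data, the learning rate of 0.1, the averaged batch gradient
and the function names are all assumptions chosen for illustration.

```python
import random

def make_data(n=50, seed=0):
    """Toy training set where every example obeys the same rule, y = 2x."""
    rng = random.Random(seed)
    return [(x, 2.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(n))]

def mse(w, data):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def batch_pass(w, data, lr):
    # One update per pass, using the gradient averaged over all examples.
    grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def online_pass(w, data, lr):
    # One update per example; each step starts from the already-improved w.
    for x, y in data:
        w -= lr * 2.0 * (w * x - y) * x
    return w

data = make_data()
w_batch = batch_pass(0.0, data, lr=0.1)
w_online = online_pass(0.0, data, lr=0.1)
print("batch  error after one pass:", mse(w_batch, data))
print("online error after one pass:", mse(w_online, data))
```

With the same learning rate, the online pass should end up much closer
to w = 2 after one sweep, simply because it took 50 small steps where
the batch pass took one.  A batch run with a larger step size would
narrow the gap, so treat this as an illustration of the intuition only.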

In my experiments, I've found that "online" learning works much better
at the beginning, when the network is completely untrained, because
presumably it's learning the general features of the whole set quickly,
but later on, when trying to learn the fine distinctions among examples,
"online" learning does worse, because it tries to "memorize" each example
in turn instead of learning the whole mapping.  In this regime, you
have to use batch learning.
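
A minimal sketch of that late-stage behaviour, under assumptions of my
own: two made-up examples with the same input but different targets, a
single weight, and a fixed step size of 0.25.  The best least-squares
compromise is w = 2.

```python
def batch_pass(w, data, lr):
    # One update per pass, using the gradient averaged over all examples.
    grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def online_pass(w, data, lr):
    # One update per example, with a fixed (undamped) step size.
    for x, y in data:
        w -= lr * 2.0 * (w * x - y) * x
    return w

# Two examples that pull in different directions; least squares wants w = 2.
data = [(1.0, 1.0), (1.0, 3.0)]
w_batch = w_online = 0.0
for _ in range(50):
    w_batch = batch_pass(w_batch, data, lr=0.25)
    w_online = online_pass(w_online, data, lr=0.25)
print("batch :", w_batch)
print("online:", w_online)
```

Here the batch weight settles on 2, while the online weight keeps
chasing whichever example it saw last and cycles between roughly 1.67
and 2.33.  Shrinking the step size over time, which is the damping
factor in the question, is the usual way to tame that oscillation.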

For many problems though, you never need this level of accuracy (I needed
continuous-valued outputs accurate to better than 1%), and so "online"
learning is good enough, and often significantly faster, especially with
momentum.  Momentum adds a fraction of the previous weight change to the
current one, which smooths out the weight changes over a few recent
examples.  (Actually, for my stuff, I like conjugate gradient on the
whole "batch" error surface.)
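
As a rough sketch of the smoothing, here is the standard momentum update
written out for one weight.  The gradient values, the learning rate of
0.1, the momentum coefficient of 0.9 and the function name are made up
for illustration.

```python
def momentum_update(velocity, grad, lr=0.1, mu=0.9):
    """Return the new weight change: a fraction mu of the previous change
    plus the usual gradient step for the current example."""
    return mu * velocity - lr * grad

grads = [1.0, -0.5, 0.8, -0.2, 0.6]   # made-up per-example gradients
velocity = 0.0
for g in grads:
    velocity = momentum_update(velocity, g)   # the weight would move by this

# The same number written as an exponentially weighted sum of the recent
# gradients: older examples count for less, but they still count.
k = len(grads)
closed_form = -0.1 * sum(0.9 ** (k - 1 - i) * g for i, g in enumerate(grads))
print(velocity, closed_form)
```

The two printed values agree up to rounding, which is the sense in which
each weight change is an average over the last few examples rather than
a reaction to the current one alone.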

>/Kiko.
>camargo@cs.columbia.edu

Matt Kennel
mbkennel@phoenix.princeton.edu (6 more days only!!! )
kennel@cognet.ucla.edu  (after that)