Path: utzoo!attcan!utgpu!utstat!jarvis.csri.toronto.edu!rutgers!njin!princeton!phoenix!mbkennel
From: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Newsgroups: comp.ai.neural-nets
Subject: Re: Back Propagation question... (follow up)
Message-ID: <8795@phoenix.Princeton.EDU>
Date: 30 May 89 20:28:32 GMT
References: <226@cs.columbia.edu>
Reply-To: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Lines: 46

In article <226@cs.columbia.edu> camargo@cs.columbia.edu (Francisco Camargo) writes:
>[stuff deleted]
>
>My problem is that I can find any (theoretical) justification for the "online"
>method other that "Robins Monroe algorithm" (I may have misspelled his name,
>for which I apologize, but I don't have my references near by). But then, the
>"dumping" factor is required for guaranteed convergence. I tried the "online"
>method and it does seem to perform better. But, WHY does it work ? How come it
>converges so well (despite of making {a_k}=1) ?
>
>I am familiar with the use of "momentum" in the learning process, but I
>really want to understand more the theoretical reasons for the "online"
>method. Having started my studies with the "batch" mode, it seems a little
>like black magic that the "online" method works.

I have an intuitive explanation, but it's not rigorous by any means,
and it could even be completely wrong, but here goes...

In most problems, there is some underlying regularity that _all_
examples possess that you're trying to learn. Thus, if you update the
weights after each example, you get the benefit of learning from the
previous examples, but if you only update after a whole run through
the training set, it takes much longer to learn this regularity.
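
To make the intuition above concrete, here is a minimal sketch. It is
not taken from the experiments described in this post. It assumes a
toy linear model y = w*x + b fit to noise-free data from y = 2x + 1,
a fixed learning rate of 0.1 in both modes, and a batch step that uses
the gradient averaged over the training set. All function names and
constants are hypothetical.

```python
# Toy comparison of "online" vs "batch" weight updates (illustrative only).
# Assumed setup: linear model y = w*x + b, squared error, fixed learning rate.

def make_data():
    # Every example shares the same underlying regularity: y = 2x + 1.
    xs = [i / 10.0 for i in range(-10, 11)]
    ys = [2.0 * x + 1.0 for x in xs]
    return xs, ys

def mean_loss(w, b, xs, ys):
    # Mean of 0.5 * (prediction - target)^2 over the whole set.
    return sum(0.5 * (w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_online(xs, ys, lr=0.1, epochs=1):
    # Update the weights after every single example.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = w * x + b - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def train_batch(xs, ys, lr=0.1, epochs=1):
    # Accumulate the gradient over the whole set, then update once per pass.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

if __name__ == "__main__":
    xs, ys = make_data()
    for name, train in (("online", train_online), ("batch", train_batch)):
        w, b = train(xs, ys)
        print(name, "loss after one pass:", mean_loss(w, b, xs, ys))
```

Under these assumptions the online version has made one small
correction per example by the end of the first pass, while the batch
version has made only one correction in total, so the online loss is
lower after that pass. The comparison depends on the step-size
convention: a batch step that sums rather than averages the gradient
takes a much larger step and would need a smaller learning rate to
stay stable.
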
In my experiments, I've found that "online" learning works much better
at the beginning, when the network is completely untrained, because
presumably it's learning the general features of the whole set
quickly, but later on, when trying to learn the fine distinctions
among examples, "online" learning does worse, because it tries to
"memorize" each example in turn instead of learning the whole mapping.
In this regime, you have to use batch learning.

For many problems though, you never need this level of accuracy (I
needed continuous-valued outputs accurate to <1%) and so "online"
learning is good enough, and often significantly faster, especially
with momentum. Momentum smooths out the weight changes from a few
recent examples.

(Actually, for my stuff, I like conjugate gradient on the whole
"batch" error surface.)

>/Kiko.
>camargo@cs.columbia.edu

Matt Kennel
mbkennel@phoenix.princeton.edu (6 more days only!!!)
kennel@cognet.ucla.edu (after that)