Path: utzoo!attcan!uunet!husc6!rutgers!njin!princeton!phoenix!mbkennel
From: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Newsgroups: comp.ai.neural-nets
Subject: Re: Training
Message-ID: <7326@phoenix.Princeton.EDU>
Date: 23 Mar 89 02:34:23 GMT
References: <2698@sun.soe.clarkson.edu> <2351@buengc.BU.EDU> <1577@vicom.COM> <769@cb.ecn.purdue.edu>
Reply-To: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel)
Organization: Princeton University, NJ
Lines: 106

In article <769@cb.ecn.purdue.edu> kavuri@cb.edn.purdue writes:
>In article <1577@vicom.COM>, hal@vicom.COM (Hal Hardenbergh) writes:
>> 
>> A colleague and I have tried several of the back-prop "speedup" methods.  All
>> that we have tried do speed up convergence, to some degree, as determined by
>> the number of epochs (training iterations).  However, none of them reliably
>> provide a speed improvement as measured by wall clock time.
>> 
>> The ones which do (sometimes) provide a slight improvement in wall clock time
>> do not do so reliably.  It's sort of like varying the convergence and momentum
>> factors.  Depending on the random initialization of the weights and biases,
>> c1 and m1 will work better than c2 and m2, and vice versa.
>> 
>> As long as we are simulating artificial neural nets in software (if simulating
>> is the right word here), does anyone know of a back-prop speedup trick which
>> reduces the wall-clock training time?
>> 
>> Hal Hardenbergh  [incl std dsclmr]  hal@vicom.com
>
>
>  This, I believe, could be tried using other inexact search methods 
>with BP. 
>REF: Conjugate-gradient methods with inexact searches , in Mathematics of
>Operations Research vol.3 

I'm using conjugate-gradient minimization in neural networks.  
Each iteration of conjugate gradient gains  significantly more than
"standard" backprop in error surface, but then again, each iteration
requires an entire direction search & minimization which may take
anywhere from 5 to 10 net evaluations (over whole training set or whatever).

It's not always clear whether or not it saves "wall-clock" time.  Sometimes,
when the function is "easy", it sniffs out the minimum very quickly.  It
has the advantage that you're not dependent at all on a "learning parameter".
But, of course, it's significantly more difficult to code than naive
gradient descent, and can't possibly have any neurological justification.

When I minimize the error surface w/ conjugate gradient, I usually take the
sum over most or all elements of the training set.  Conjugate gradient is very
good at searching out local minima, and so I think some precautions must
be noted:

	1)  Don't try to minimize on one example only, and then try
	    minimizing on the next and so on.  C.G. will very quickly
            find a minimum that "memorizes" the first example, and then
	    just as easily forget it and memorize the second, etc. with
	    very little generalization over many examples.

	2)  C.G. will still find local minima or regions where
	    the error decreases very slowly,  even training over all examples.
	    Thus I don't train over the same examples for every iteration
	    of C.G.; I train over some large random subset for a while
	    and then switch off, either training over the whole set
	    or another random subset, depending on how well I did.

It has some advantages though:  it's very dogged, and doesn't screw up
royally very often.  I often find that I can do very well with regular
BP getting down towards a minimum, and then find that it screws up and
starts to climb back up the other side, and because the step size is
too large, can't find the narrow valley at the bottom.  C.G. almost
always will minimize your function, though it can get stuck in local
minima more often.  Please note that I'm doing a "hard" task  where I need
to map continous values to continous values with as high accuracy as
possible.
	    
>
>Gradient evaluation is an expensive computation in BP.  A gradient-reuse
>alogo was suggested whose idea is the following:
>Gradients are reused several times until the resulting weight updates no
>longer lead to a reduction in error.     

In this case, you're just doing a very naive line-minimization in the
gradient direction.  You could do much better by using a good line-min
routine which can interpolate to the presumed minimum much faster.
(I use the BRENT from Numerical Recipes for my C.G.)  I actually implimented
this line-minimization alg., but was disappointed in its performance.  Regular
BP usually did better.

>
>Furthe, usage of Batching makes the serach direction more accurate.
>Batching means :
>Considering the total error (to be minimized) as the sum of squared differnces
> between the desired output and the observed, summed over all patterns.
>
Yes, but if you add on the gradient each time, later examples get the
benefit of whatever global features that have been learnt in earlier
examples.  

I usually find that standard back-prop (update after each weight) w/
momentum is faster.  Indeed, because of its simplicity & speed, I've
found it hard to beat in real wall-clock time.  90% of a back-prop
program's time is spent in the forward iteration & gradient computation
subroutines.  It's very easy to double the complexity of the inner loops
b/c back-prop w/ fixed steps is so simple, and thus really slow down
the program alot.

I use standard back-prop for a while before starting on C.G---when starting
from a random position, C.G. is usually pretty clueless and needs to
get closer in.

>
>                                                                SURYA
>                                                       (FIAT LUX)

Matt Kennel
mbkennel@phoenix.princeton.edu