Path: utzoo!attcan!uunet!husc6!rutgers!njin!princeton!phoenix!mbkennel From: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) Newsgroups: comp.ai.neural-nets Subject: Re: Training Message-ID: <7326@phoenix.Princeton.EDU> Date: 23 Mar 89 02:34:23 GMT References: <2698@sun.soe.clarkson.edu> <2351@buengc.BU.EDU> <1577@vicom.COM> <769@cb.ecn.purdue.edu> Reply-To: mbkennel@phoenix.Princeton.EDU (Matthew B. Kennel) Organization: Princeton University, NJ Lines: 106 In article <769@cb.ecn.purdue.edu> kavuri@cb.edn.purdue writes: >In article <1577@vicom.COM>, hal@vicom.COM (Hal Hardenbergh) writes: >> >> A colleague and I have tried several of the back-prop "speedup" methods. All >> that we have tried do speed up convergence, to some degree, as determined by >> the number of epochs (training iterations). However, none of them reliably >> provide a speed improvement as measured by wall clock time. >> >> The ones which do (sometimes) provide a slight improvement in wall clock time >> do not do so reliably. It's sort of like varying the convergence and momentum >> factors. Depending on the random initialization of the weights and biases, >> c1 and m1 will work better than c2 and m2, and vice versa. >> >> As long as we are simulating artificial neural nets in software (if simulating >> is the right word here), does anyone know of a back-prop speedup trick which >> reduces the wall-clock training time? >> >> Hal Hardenbergh [incl std dsclmr] hal@vicom.com > > > This, I believe, could be tried using other inexact search methods >with BP. >REF: Conjugate-gradient methods with inexact searches , in Mathematics of >Operations Research vol.3 I'm using conjugate-gradient minimization in neural networks. Each iteration of conjugate gradient gains significantly more than "standard" backprop in error surface, but then again, each iteration requires an entire direction search & minimization which may take anywhere from 5 to 10 net evaluations (over whole training set or whatever). It's not always clear whether or not it saves "wall-clock" time. Sometimes, when the function is "easy", it sniffs out the minimum very quickly. It has the advantage that you're not dependent at all on a "learning parameter". But, of course, it's significantly more difficult to code than naive gradient descent, and can't possibly have any neurological justification. When I minimize the error surface w/ conjugate gradient, I usually take the sum over most or all elements of the training set. Conjugate gradient is very good at searching out local minima, and so I think some precautions must be noted: 1) Don't try to minimize on one example only, and then try minimizing on the next and so on. C.G. will very quickly find a minimum that "memorizes" the first example, and then just as easily forget it and memorize the second, etc. with very little generalization over many examples. 2) C.G. will still find local minima or regions where the error decreases very slowly, even training over all examples. Thus I don't train over the same examples for every iteration of C.G.; I train over some large random subset for a while and then switch off, either training over the whole set or another random subset, depending on how well I did. It has some advantages though: it's very dogged, and doesn't screw up royally very often. I often find that I can do very well with regular BP getting down towards a minimum, and then find that it screws up and starts to climb back up the other side, and because the step size is too large, can't find the narrow valley at the bottom. C.G. almost always will minimize your function, though it can get stuck in local minima more often. Please note that I'm doing a "hard" task where I need to map continous values to continous values with as high accuracy as possible. > >Gradient evaluation is an expensive computation in BP. A gradient-reuse >alogo was suggested whose idea is the following: >Gradients are reused several times until the resulting weight updates no >longer lead to a reduction in error. In this case, you're just doing a very naive line-minimization in the gradient direction. You could do much better by using a good line-min routine which can interpolate to the presumed minimum much faster. (I use the BRENT from Numerical Recipes for my C.G.) I actually implimented this line-minimization alg., but was disappointed in its performance. Regular BP usually did better. > >Furthe, usage of Batching makes the serach direction more accurate. >Batching means : >Considering the total error (to be minimized) as the sum of squared differnces > between the desired output and the observed, summed over all patterns. > Yes, but if you add on the gradient each time, later examples get the benefit of whatever global features that have been learnt in earlier examples. I usually find that standard back-prop (update after each weight) w/ momentum is faster. Indeed, because of its simplicity & speed, I've found it hard to beat in real wall-clock time. 90% of a back-prop program's time is spent in the forward iteration & gradient computation subroutines. It's very easy to double the complexity of the inner loops b/c back-prop w/ fixed steps is so simple, and thus really slow down the program alot. I use standard back-prop for a while before starting on C.G---when starting from a random position, C.G. is usually pretty clueless and needs to get closer in. > > SURYA > (FIAT LUX) Matt Kennel mbkennel@phoenix.princeton.edu