Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!crdgw1!greenba
From: greenba@gambia.crd.ge.com (ben a green)
Newsgroups: comp.ai.neural-nets
Subject: Re: State of the Art Feed-Forward Network Training Algorithms
Message-ID: <GREENBA.91May17100557@gambia.crd.ge.com>
Date: 17 May 91 14:05:57 GMT
References: <AJ3U.91May17010658@opal.cs.virginia.edu>
	<1991May17.090435.9180@fwi.uva.nl>
Sender: news@crdgw1.crd.ge.com
Organization: GE Corporate Research & Development
Lines: 79
In-reply-to: smagt@fwi.uva.nl's message of 17 May 91 09:04:35 GMT

In article <1991May17.090435.9180@fwi.uva.nl> smagt@fwi.uva.nl (Patrick van der Smagt) writes:

   aj3u@opal.cs.virginia.edu (Asim Jalis) writes:

   >What is the state of the art in training feed-forward networks?
   ...
   >did not see anything that improved performance over plain
   >Backpropagation drastically for the general case.

   The answer to this question is really very, very simple.  Remember
   that teaching a feed-forward network is nothing but minimisation
   of a function

	   E = 1/2  \sum_p (\vec d_p - \vec a_p)^2

   by varying the parameters W.  This has been investigated a zillion
   times.  Look up any book on numerical analysis, e.g.,

	   %A W. H. Press
	   %A B. P. Flannery
	   %A S. A. Teukolsky
	   %A W. T. Vetterling
	   %T Numerical Recipes: The Art of Scientific Computing
	   %I Cambridge University Press
	   %C Cambridge
	   %D 1986

	   %A J. Stoer
	   %A R. Bulirsch
	   %T Introduction to Numerical Analysis
	   %I Springer-Verlag
	   %C New York--Heidelberg--Berlin
	   %D 1980

   to read about improvements to steepest descent (aka gradient descent)
   minimisation.  Amongst the best methods is conjugate gradient optimisation.

   I myself haven't used error back-propagation for over a year, but CG
   instead.  It sizzles.

More needs to be said about conjugate gradient optimisation. I agree
fully with Patrick van der Smagt, but word is going around that CG is no good.
There was a recent thread on this newsgroup titled "Are CG algorithms any good?"

Through the courtesy of a Government agency, which I do not have permission
to name, I recently ran a test, training a net on some representations of
speech sounds. They had found that backprop trained to 90% accuracy in
about 150,000 presentations of the training set, while CG could not get
past 70%. They were using a CG implementation that shall remain nameless,
but which is made available to the public by a university.

My implementation of CG trained to 90% on this problem in 1676 presentations
of the training set. That is about 89 times fewer presentations than backprop
needed (150,000/1676).

I have no idea why the other implementation failed so badly. There are
many choices to make concerning how to do linesearches, for example.
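
To make the linesearch point concrete, here is a minimal sketch of
nonlinear CG minimising E = 1/2 \sum_p (d_p - a_p)^2. It is not the GE
implementation, nor the university one. The Polak-Ribiere update with
restarts, the backtracking (Armijo) linesearch, the toy single-tanh-unit
problem, and every name below are assumptions chosen for illustration only.

```python
import math

# Hypothetical toy problem: one tanh unit, targets generated from known weights.
INPUTS = [(x1, x2) for x1 in (-1.0, 0.0, 1.0) for x2 in (-1.0, 0.0, 1.0)]
TRUE_W = [1.5, -2.0, 0.5]  # two weights and a bias (assumed values)

def output(w, x):
    return math.tanh(w[0] * x[0] + w[1] * x[1] + w[2])

TARGETS = [output(TRUE_W, x) for x in INPUTS]

def error(w):
    """E = 1/2 sum_p (d_p - a_p)^2 over the whole training set."""
    return 0.5 * sum((d - output(w, x)) ** 2 for x, d in zip(INPUTS, TARGETS))

def gradient(w):
    """Analytic dE/dw for the single tanh unit."""
    g = [0.0, 0.0, 0.0]
    for x, d in zip(INPUTS, TARGETS):
        a = output(w, x)
        delta = -(d - a) * (1.0 - a * a)
        g[0] += delta * x[0]
        g[1] += delta * x[1]
        g[2] += delta
    return g

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def line_search(f, w, direction, slope, step=1.0, shrink=0.5, c=1e-4):
    """Backtracking search: halve the step until E drops enough (Armijo)."""
    e0 = f(w)
    while step > 1e-12:
        trial = [wi + step * di for wi, di in zip(w, direction)]
        if f(trial) <= e0 + c * step * slope:
            return trial
        step *= shrink
    return w  # no acceptable step found; leave the weights unchanged

def conjugate_gradient(f, grad, w, max_iters=200, tol=1e-10):
    g = grad(w)
    d = [-gi for gi in g]  # first direction is steepest descent
    for _ in range(max_iters):
        if dot(g, g) < tol:
            break
        slope = dot(g, d)
        if slope >= 0.0:  # not a descent direction: restart
            d = [-gi for gi in g]
            slope = dot(g, d)
        w = line_search(f, w, d, slope)
        g_new = grad(w)
        # Polak-Ribiere coefficient, clipped at zero (acts as a restart).
        diff = [a - b for a, b in zip(g_new, g)]
        beta = max(0.0, dot(g_new, diff) / dot(g, g))
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return w

w_start = [0.0, 0.0, 0.0]
e_start = error(w_start)
w_fit = conjugate_gradient(error, gradient, w_start)
e_final = error(w_fit)
print("E before:", e_start, "E after:", e_final)
```

Swapping the linesearch (initial step, shrink factor, or an interpolating
search instead of backtracking) changes how many evaluations of E each
iteration costs, which is exactly the kind of choice that can make two CG
codes behave very differently on the same data.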

This is not the only example. Another criticism of CG appeared in the thesis
of a student of a well-known NN researcher in the Northeast. He said that
CG got stuck in local minima. He was kind enough to share the data with us,
and we trained on the problem very quickly with CG, with no local
minimum problem.

For an introduction to CG applied to NN training, see the excellent article
by Kramer and Sangiovanni-Vincentelli in Advances in Neural Information
Processing Systems 1, pp. 40-48, 1989. Buy the book from Morgan Kaufmann,
San Mateo, CA, USA, if you have to.

Please do not ask for my software. GE won't let me give it out.

Ben

--
Ben A. Green, Jr.              
greenba@crd.ge.com
  Speaking only for myself, of course.