Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!sun-barr!newstop!exodus!hanami.Eng.Sun.COM!landman
From: landman@hanami.Eng.Sun.COM (Howard A. Landman)
Newsgroups: comp.ai.neural-nets
Subject: Re: Are Conjugate Gradient algorithms any good?
Message-ID: <10184@exodus.Eng.Sun.COM>
Date: 21 Mar 91 02:10:09 GMT
References: <1991Mar4.142559.21857@daimi.aau.dk> <^9B&5R#@warwick.ac.uk> <pluto.668285404@cornelius> <91Mar7.145659edt.437@neuron.ai.toronto.edu> <9682@exodus.Eng.Sun.COM> <GREENBA.91Mar13081609@gambia.crd.ge.com>
Sender: news@exodus.Eng.Sun.COM
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 49

In article <9682@exodus.Eng.Sun.COM> I wrote:
>>One disadvantage of CG methods is that they often require the whole
>>training set to be memory-resident.  For gigantic training data this
>>can be a real problem.

In article <GREENBA.91Mar13081609@gambia.crd.ge.com> greenba@gambia.crd.ge.com (ben a green) writes:
>I don't understand why this is peculiar to CG methods. Any method that requires
>repeated updating of weights will want to retain the training set in memory
>just in order to avoid being IO-bound.

Assuming that the entire training set can fit into your virtual memory, that's
true, although page faults can cause that to become "IO-bound" as well.
But I had one case where the training data was over 500 MB.  Since my VM size
was less than 500 MB, the program that required the data to be memory-resident
simply DIDN'T WORK, whereas a program that reread the data on each pass would
merely have been slow.

A more subtle aspect: in some cases (e.g. mine), the "original" training
data is far more compact than the training data that has been massaged into
the input format (or memory layout) of the program.  In extreme cases (e.g.
mine :-) the difference can be greater than two orders of magnitude (2 MB vs
500 MB).  For a "one sample at a time" program, if you have source, it is
possible to embed the code to do this expansion in the program itself, so that
the entire expanded training set never exists anywhere, and all the swap & I/O
problems vanish (at the cost in CPU of reexpanding the data each time).  For
an "all data in memory at once" program, you don't have that choice; the whole
expanded data set must exist in memory even if you can avoid having it on disk.
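
The embedded-expansion idea for a "one sample at a time" program can be
sketched roughly as below.  This is only an illustration, not code from any
program mentioned here: the compact record format, the expand() function, and
the delta-rule update are all hypothetical stand-ins.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, int]  # hypothetical compact form: (feature index, label)


def expand(record: Record, width: int = 8) -> List[float]:
    """Hypothetical expansion: one-hot input vector followed by the target."""
    index, label = record
    vec = [0.0] * width
    vec[index] = 1.0
    return vec + [float(label)]


def expanded_stream(compact: Iterable[Record], width: int = 8) -> Iterator[List[float]]:
    """Yield expanded rows one at a time; the full expanded set never exists."""
    for record in compact:
        yield expand(record, width)


def train_one_pass(compact: Iterable[Record], weights: List[float],
                   lr: float = 0.1, width: int = 8) -> List[float]:
    """One pass of per-sample (delta-rule) updates over lazily expanded data."""
    for row in expanded_stream(compact, width):
        x, target = row[:-1], row[-1]
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = target - pred
        for i, xi in enumerate(x):
            weights[i] += lr * err * xi
    return weights
```

Only the compact records and one expanded row are alive at any moment, at the
price of calling expand() again on every pass.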

Even if embedded expansion is possible, it may not always make sense.  This
depends on the relative expense of expanding versus the performance cost of
doing the training with a fully expanded data set.  In my case, expanding
the data was about 1/8th the cost of doing a single CG training cycle, so
the overhead would have been quite acceptable as long as each training cycle
took at least 1/8 (12.5%) less time when the program's virtual image was 3 MB
than when it was 500 MB.  (There is also an implicit assumption here that
each datum only needs to be expanded once per training cycle.  This is clearly
true for the "one sample at a time" approaches, but may not be for CG.)  Even
if that speedup was not forthcoming, at least the program would have been able
to handle the full set of training data.
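
The break-even condition above can be worked through as simple arithmetic.
The sketch assumes (my reading, not stated exactly in the text) that the
expansion cost is measured as a fraction of the cycle time with the fully
expanded data resident; the function names are made up for illustration.

```python
def max_small_cycle_time(t_big: float, expand_fraction: float) -> float:
    """Longest cycle time with the small image at which embedded expansion
    still breaks even.

    Assumption: expansion costs expand_fraction * t_big per cycle, so we need
    t_small + expand_fraction * t_big <= t_big.
    """
    return t_big * (1.0 - expand_fraction)


def required_speed_ratio(expand_fraction: float) -> float:
    """The same condition expressed as a speed ratio t_big / t_small."""
    return 1.0 / (1.0 - expand_fraction)


# With expansion at 1/8 of a cycle, the small-image cycle may take at most
# 7/8 of the large-image cycle time, i.e. it must run 8/7 times as fast.
limit = max_small_cycle_time(1.0, 1.0 / 8.0)
ratio = required_speed_ratio(1.0 / 8.0)
```

So "12.5% less time per cycle" and "about 14% faster" describe the same
threshold under this assumption.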

You're right that this doesn't *necessarily* have anything to do with CG
per se, except that the CG program I chose to use ("opt") had the above
features.  I think that line search may more-or-less require all training
data to be present.  If anyone knows of a CG program which *doesn't* need
all data in-memory, please describe it.
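
For what it's worth, a line search only needs the full-batch loss (and
gradient) at each trial point, and those are sums over samples, so in
principle they can be accumulated in a streaming pass.  The sketch below is a
toy under stated assumptions, not a description of "opt" or any existing
program: a linear least-squares model, Armijo backtracking, a Polak-Ribiere
update, and a stream() callable (hypothetical) that re-reads or re-expands
the data on every call.

```python
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[List[float], float]
Stream = Callable[[], Iterable[Sample]]  # each call yields the whole set again


def loss_and_grad(w: List[float], stream: Stream) -> Tuple[float, List[float]]:
    """Accumulate squared-error loss and gradient in one streaming pass."""
    loss, grad = 0.0, [0.0] * len(w)
    for x, y in stream():
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        loss += 0.5 * err * err
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return loss, grad


def backtrack(w, d, loss0, slope, stream, step=1.0, shrink=0.5, c=1e-4):
    """Armijo backtracking; every trial step costs one more pass over the data."""
    while step > 1e-12:
        trial = [wi + step * di for wi, di in zip(w, d)]
        loss, grad = loss_and_grad(trial, stream)
        if loss <= loss0 + c * step * slope:
            return trial, loss, grad
        step *= shrink
    return w, loss0, None  # no acceptable step found


def streaming_cg(w: List[float], stream: Stream, iters: int = 50,
                 tol: float = 1e-10) -> Tuple[List[float], float]:
    """Nonlinear CG that never holds more than one sample in memory."""
    loss, g = loss_and_grad(w, stream)
    d = [-gi for gi in g]
    for _ in range(iters):
        slope = sum(gi * di for gi, di in zip(g, d))
        if slope >= 0.0:  # not a descent direction: restart along -gradient
            d = [-gi for gi in g]
            slope = -sum(gi * gi for gi in g)
        if -slope < tol:
            break
        w_new, loss_new, g_new = backtrack(w, d, loss, slope, stream)
        if g_new is None:
            break
        gg = sum(gi * gi for gi in g)
        beta = max(0.0, sum(n * (n - o) for n, o in zip(g_new, g)) / gg)
        d = [-n + beta * di for n, di in zip(g_new, d)]
        w, loss, g = w_new, loss_new, g_new
    return w, loss
```

The cost is one full pass over the data per loss evaluation, so a line search
with many trial steps multiplies the I/O (or re-expansion) work accordingly.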

--
	Howard A. Landman
	landman@eng.sun.com -or- sun!landman