Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!sgi!arisia!kanga!chrisley
From: chrisley@kanga.uucp (Ron Chrisley UNTIL 10/3/88)
Newsgroups: comp.ai.neural-nets
Subject: Re: : Step Function. Biases are necessary
Keywords: learning,generalization
Message-ID: <2934@arisia.Xerox.COM>
Date: 12 Sep 89 06:59:09 GMT
References: <1060@rex.cs.tulane.edu> <6980@sdcsvax.UCSD.Edu> <2795@arisia.Xerox.COM> <1829@cbnewsl.ATT.COM>
Sender: news@arisia.Xerox.COM
Reply-To: k.karn@macbeth.stanford.edu (Ron Chrisley)
Organization: Xerox Palo Alto Research Center
Lines: 78

I wrote:

> [...] I do not see how the fact that
> generalization = bias implies the optimality of learning the boundary
> conditions, and would be very interested in having you elaborate on why you
> think it might.
>

Then, Tony Russo said:

"My reply to this is to give a simplified, one-dimensional case...

A boundary is most efficiently (read: learning will be faster)
defined by its location in n-dimensional space. Since neural nets
don't learn this way, the next most efficient definition of a
boundary is obtained by giving examples of two items very (infinitesimally)
close to the boundary but on different sides of it.
In this way, in 1-D space for example, two points can define a boundary.
Those two points or examples are the most important ones to present
to the net.

If, for instance, we wanted to teach the concept of
negative and positive (zero is the boundary),
-1 and +1 (in integer space) would be a sufficient set of examples
(given, of course, some definition of bias).
Conversely, examples like -102312341 and +823456 are not very helpful."

I claim that although there might be algorithms that learn generalization
biases for which the boundary cases provide quickest learning, there are
also algorithms for which this is not the case.  For instance, some algorithms
may learn biases better if you provide exemplars.

I know this is exactly what you are claiming to not be the case, but I don't
yet see an argument.  What is the difference between -1:1 and -100000:100000?
If there is a difference in the quality of bias learning, I am sure that it
is dependent on some assumptions concerning the bias learning algorithm you
have in mind, or concerning the nature of the data.

The "boundary is best" claim does not seem to hold for arbitrary learning
algorithms, especially for particular generalization tasks.  Consider a 1D
task, where everything within distance D of the origin is in cat 1, and
all points outside of this region are in cat 2.  Now consider the following
way of learning bias:  Start with the bias that after seeing n samples, you
will categorize everything within radius r of any of the samples as the class
of those samples, r being small.  Then, r is increased in a least squares way,
until generalization error is minimized.  Clearly, it would be best to use
samples near the origin to train this task/bias learning algorithm combination.
If samples near the boundaries are used, then the estimated generalization
error will be small, resulting in only small changes to r, and the algorithm
would converge to the following classification:  cat 1 if the sample is within
epsilon (the small value of r) of +D or -D.  But if samples from the interiors
of the classes are used, estimated generalization error will better match
actual error, which will be initially high, resulting in an increase of r.
Thus we will wind up with the following classification: cat 1 if the sample is
within D of 0.
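
To make this concrete, here is a minimal Python sketch of the task and the
radius-based scheme described above.  Several details are assumptions, not
part of the description: the least-squares update of r is replaced by a plain
grid search over candidate radii, generalization error is estimated on
held-out points drawn from the same region as the training samples, only
cat-1 samples are stored (everything else defaults to cat 2), and the names
and constants (D = 1, eps = 0.01, fit_radius, and so on) are purely
illustrative.

```python
D = 1.0  # assumed true boundary: cat 1 iff |x| <= D


def true_label(x):
    return 1 if abs(x) <= D else 2


def predict(x, cat1_samples, r):
    """Cat 1 if x lies within r of any stored cat-1 sample, else cat 2."""
    return 1 if any(abs(x - s) <= r for s in cat1_samples) else 2


def error_rate(points, cat1_samples, r):
    """Fraction of points whose predicted class differs from the true class."""
    wrong = sum(predict(x, cat1_samples, r) != true_label(x) for x in points)
    return wrong / len(points)


def fit_radius(cat1_samples, held_out, candidates):
    """Grid search (stand-in for the least-squares step): pick the smallest
    r that achieves the lowest error on the held-out points."""
    return min(candidates,
               key=lambda r: (error_rate(held_out, cat1_samples, r), r))


candidates = [k / 100 for k in range(1, 201)]    # r = 0.01 .. 2.00
test_grid = [k / 100 for k in range(-200, 201)]  # uniform over [-2D, 2D]

# Regime 1: training and held-out points hug the boundary at +D and -D.
eps = 0.01
boundary_train = [-(D - eps), D - eps]
boundary_held_out = [s * v for s in (-1, 1)
                     for v in (D - eps / 2, D + eps / 2)]

# Regime 2: one training sample at the origin, held-out points spread
# through the interiors of both classes (0.05, 0.15, ..., 1.95 and mirrors).
interior_train = [0.0]
interior_held_out = [s * (2 * k + 1) / 20 for s in (-1, 1) for k in range(20)]

r_boundary = fit_radius(boundary_train, boundary_held_out, candidates)
r_interior = fit_radius(interior_train, interior_held_out, candidates)

err_boundary = error_rate(test_grid, boundary_train, r_boundary)
err_interior = error_rate(test_grid, interior_train, r_interior)

print("boundary regime: r =", r_boundary, " true error =", err_boundary)
print("interior regime: r =", r_interior, " true error =", err_interior)
```

Under these assumptions, the boundary-trained run sees no held-out error at
the smallest radius and so never grows r, labelling only a thin shell around
+D and -D as cat 1, while the interior-trained run grows r to roughly D.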

Don't get me wrong, I do think that learning near the boundaries, a la LVQ2,
is a good idea.  But I don't think it is a good idea for all tasks; I am
not convinced that it is a good way to learn 2nd-order *biases* (as opposed
to 1st-order distributions); and even if it is good for that, I question
whether it has anything to do with the fact that generalization = bias, as
opposed to the Bayesian arguments Prof. Kohonen gives.  If it were true for
Bayesian reasons, you would also probably be assuming that the bias learning
is performed after you already have a relatively good solution to the problem.

The reason it was not a good idea in the example I gave is that the bias
learning algorithm there needs information about the entire distribution.
Only looking at boundaries throws that away.

But of course, I may be off track here.  You certainly seem to hold the
gen=bias => boundary cases implication in high regard.  Please explain if I
have misunderstood.


Ron

By the way, has anybody looked at 2nd-order bias learning as I have sketched it
out here?  Thanks to Tony for pointing me in the right direction...