Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!udel!haven!uvaarpa!murdoch!helga0.acc.Virginia.EDU!aam9n
From: aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai)
Newsgroups: comp.ai.neural-nets
Subject: Re: Several hidden layers in feed-forward networks
Message-ID: <1991Jan9.183122.16175@murdoch.acc.Virginia.EDU>
Date: 9 Jan 91 18:31:22 GMT
References: <7165.27885d62@abo.fi> <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> <1991Jan8.091631.16219@warwick.ac.uk>
Sender: news@murdoch.acc.Virginia.EDU
Organization: University of Virginia
Lines: 41

In article <1991Jan8.091631.16219@warwick.ac.uk> esrmm@warwick.ac.uk (Denis Anthony) writes:
>In article <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) writes:
>>In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>>
>>As for why more layers work "better", they often don't. But when they
>>do, it is because of the greater potential "complexity" available.
>>Think of each neuron in layer k as forming a distorted linear
>>superposition of the outputs from the previous layer. If the neurons
>>in the net have monotonic activation functions, as they usually do,
>>an output layer neuron in a single hidden-layer net requires about 2n
>>hidden neurons to compose a function with n modes (peaks).
>
>Why 2n ? Is this empirical, or based on maths ? Or is it obvious,
>ie. 2n to form n peaks and n troughs. Apologies if I am being
>a bit dim.

No, you are right. This is neither mathematical nor really empirical.
It is just meant to be an approximate argument. I'll try to clarify.

Suppose we have a 1-N-1 network, where the first 1 is just an input unit.
Assuming that the output neuron uses a linear composition with a
monotonic squashing function, and that all hidden units are monotonic,
the output is a distorted linear superposition of the hidden unit
activations. Since each hidden unit can only provide one "slope"
(due to monotonicity), and each peak needs one rising and one falling
slope, about 2n hidden units will be needed to produce 2n slopes
(= n peaks). However, because the hidden unit activations are
non-linear and can have very different (albeit monotonic) shapes,
it is possible for fewer than 2n hidden units to produce n peaks, but
only if function shapes are very variable. In general, superposing
2n monotonic functions of approximately similar shape (but with negative
and positive weights) will tend to produce fewer than n peaks. My
argument was that as we increase the number of layers, we provide 
additional scope for recombination of superpositions from previous
layers. Having two hidden layers of N and 2 units is sort of like
having one hidden layer of 2N units, because each of the two units in
the second layer forms its own independent recombination of the first
layer's outputs, and both are then available to the next layer. Again,
this is not meant to be
a theorem, just an illustrative argument. Of course, I assume that
a layer takes input only from its immediate predecessor.
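
To make the 2n argument concrete, here is a rough Python sketch of a
1-N-1 net. It is only an illustration under assumptions of my own
choosing: tanh hidden units of identical shape, a tanh output unit,
and arbitrary values for the steepness, the spacing of the peaks and
the 0.5 output scale. The names net_output, count_peaks, bump_net and
sample are hypothetical helpers, not taken from any package. It pairs
one rising and one falling hidden unit per peak, and then compares
that with the same 2n units all given positive weights.

```python
import math

def net_output(x, centers, weights, steepness=4.0):
    """1-N-1 net: tanh hidden units feeding one tanh output unit."""
    total = sum(w * math.tanh(steepness * (x - c))
                for c, w in zip(centers, weights))
    return math.tanh(0.5 * total)  # monotonic squashing keeps the peaks

def count_peaks(ys, tol=1e-9):
    """Count rise-then-fall transitions, ignoring changes below tol."""
    peaks = 0
    rising = False
    for prev, cur in zip(ys, ys[1:]):
        if cur - prev > tol:
            rising = True
        elif prev - cur > tol:
            if rising:
                peaks += 1
            rising = False
    return peaks

def bump_net(n, spacing=3.0, half_width=0.5):
    """2n hidden units: a rising (+1) and a falling (-1) edge per peak."""
    centers, weights = [], []
    for k in range(n):
        centers += [k * spacing - half_width, k * spacing + half_width]
        weights += [1.0, -1.0]
    return centers, weights

def sample(centers, weights, lo, hi, step=0.01):
    """Evaluate the net on an evenly spaced grid from lo to hi."""
    n_steps = int(round((hi - lo) / step))
    return [net_output(lo + i * step, centers, weights)
            for i in range(n_steps + 1)]

n = 3
centers, weights = bump_net(n)
lo, hi = -2.0, (n - 1) * 3.0 + 2.0

# Alternating signs: each +/- pair of slopes contributes one peak.
alternating = count_peaks(sample(centers, weights, lo, hi))

# Same 2n units, all weights positive: the sum is monotonic, no peaks.
same_sign = count_peaks(sample(centers, [1.0] * len(centers), lo, hi))

print(alternating, same_sign)
```

With n = 3 the alternating arrangement of 6 hidden units gives 3
peaks, while the all-positive arrangement gives none, since a sum of
increasing functions is itself increasing. This only shows the two
extremes; it says nothing about how few units a cleverly shaped set
of activations could get away with.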

Ali Minai