Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!udel!haven!uvaarpa!murdoch!helga0.acc.Virginia.EDU!aam9n
From: aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai)
Newsgroups: comp.ai.neural-nets
Subject: Re: Several hidden layers in feed-forward networks
Message-ID: <1991Jan9.183122.16175@murdoch.acc.Virginia.EDU>
Date: 9 Jan 91 18:31:22 GMT
References: <7165.27885d62@abo.fi> <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> <1991Jan8.091631.16219@warwick.ac.uk>
Sender: news@murdoch.acc.Virginia.EDU
Organization: University of Virginia
Lines: 41

In article <1991Jan8.091631.16219@warwick.ac.uk> esrmm@warwick.ac.uk (Denis Anthony) writes:
>In article <1991Jan7.202202.12266@murdoch.acc.Virginia.EDU> aam9n@helga0.acc.Virginia.EDU (Ali Ahmad Minai) writes:
>>In article <7165.27885d62@abo.fi> vt_ai@abo.fi writes:
>>
>>As for why more layers work "better", they often don't. But when they
>>do, it is because of the greater potential "complexity" available.
>>Think of each neuron in layer k as forming a distorted linear
>>superposition of the outputs from the previous layer. If the neurons
>>in the net have monotonic activation functions, as they usually do,
>>an output layer neuron in a single hidden-layer net requires about 2n
>>hidden neurons to compose a function with n modes (peaks).
>
>Why 2n ? Is this empirical, or based on maths ? Or is it obvious,
>ie. 2n to form n peaks and n troughs. Apologies if I am being
>a bit dim.

No, you are right. This is neither mathematical nor really empirical.
It is just meant to be an approximate argument. I'll try to clarify.

Suppose we have a 1-N-1 network, where the first 1 is just an input
unit. Assuming that the output neuron uses a linear composition with a
monotonic squashing function, and that all hidden units are monotonic,
the output is a distorted linear superposition of the hidden unit
activations.
Since each hidden unit can only provide one "slope" (due to
monotonicity), about 2n will be needed to produce 2n slopes (= n
peaks). However, because the hidden unit activations are non-linear
and can have very different (albeit monotonic) shapes, it is possible
for fewer than 2n hidden units to produce n peaks, but only if the
function shapes are very variable. In general, superposing 2n
monotonic functions of approximately similar shape (but with negative
and positive weights) will tend to produce fewer than n peaks.

My argument was that as we increase the number of layers, we provide
additional scope for recombination of superpositions from previous
layers. Having two hidden layers of N and 2 units is sort of like
having one hidden layer of 2N units, because each of the two units in
the second layer forms its own independent superposition of the first
layer's outputs, and both superpositions are then available to the
next layer.

Again, this is not meant to be a theorem, just an illustrative
argument. Of course, I assume that a layer takes input only from its
immediate predecessor.

Ali Minai
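
P.S. To make the slope-counting argument above a little more concrete,
here is a minimal sketch in Python. It is only an illustration under
stated assumptions, not a proof: the hidden units are assumed to be
logistic, their output weights are fixed at +1 and -1, their centers
sit one unit apart on the input axis, and the output unit is also
logistic. The names sigmoid, net_output and count_peaks, and the gain
parameter, are invented for this example.

```python
import math


def sigmoid(z):
    """Logistic squashing function (monotonic)."""
    return 1.0 / (1.0 + math.exp(-z))


def net_output(x, n_peaks, gain):
    """Output of a 1-N-1 net with N = 2 * n_peaks hidden units.

    Assumed, illustrative weights: unit 2k rises at x = 2k with output
    weight +1, and unit 2k+1 rises at x = 2k+1 with output weight -1,
    so each pair contributes one up-slope and one down-slope.
    """
    total = 0.0
    for k in range(n_peaks):
        total += sigmoid(gain * (x - 2 * k))        # rising slope
        total -= sigmoid(gain * (x - (2 * k + 1)))  # falling slope
    # A monotonic output squashing distorts the sum but cannot add
    # or remove peaks.
    return sigmoid(total)


def count_peaks(n_peaks, gain, step=0.01):
    """Count local maxima of the net output sampled on [-1, 2 * n_peaks]."""
    lo, hi = -1.0, 2.0 * n_peaks
    count = int(round((hi - lo) / step)) + 1
    ys = [net_output(lo + i * step, n_peaks, gain) for i in range(count)]
    return sum(
        1
        for i in range(1, len(ys) - 1)
        if ys[i - 1] < ys[i] >= ys[i + 1]
    )


if __name__ == "__main__":
    for gain in (10.0, 1.0):
        print("gain", gain, "peaks", count_peaks(3, gain))
```

With 6 hidden units that are steep relative to their spacing
(gain=10), the sketch finds 3 peaks, i.e. 2n slopes giving n peaks.
With shallow units (gain=1) the same 6 slopes overlap and merge into a
single peak, which is one way of ending up with fewer than n peaks
from 2n similarly shaped units.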