The Power of Depth for Feedforward Neural Networks (1512.03965v4)

Published 12 Dec 2015 in cs.LG, cs.NE, and stat.ML

Abstract: We show that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network, to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for virtually all known activation functions, including rectified linear units, sigmoids and thresholds, and formally demonstrates that depth -- even if increased by 1 -- can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared to related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different.

Citations (713)

Summary

  • The paper demonstrates an exponential separation: a simple, approximately radial function is expressible by a small 3-layer network but cannot be approximated to constant accuracy by any 2-layer network whose width is not exponential in the dimension.
  • The authors employ Fourier analysis and high-dimensional concentration results to prove that 2-layer networks require exponential width to match 3-layer performance.
  • The findings imply that deeper networks offer significant practical advantages in high-dimensional tasks, guiding more efficient deep learning architectures.

The Power of Depth for Feedforward Neural Networks

Overview

This paper by Eldan and Shamir addresses an important theoretical question within the domain of neural networks and deep learning: the expressive power of neural networks relative to their depth. Specifically, the authors demonstrate that increasing the depth of a feedforward neural network by just one layer can lead to an exponential increase in the network’s ability to approximate certain functions.

Main Results

  1. Exponential Separation: The paper establishes that there is a simple, approximately radial function on $\mathbb{R}^d$ that can be represented by a small 3-layer feedforward neural network but cannot be approximated by any 2-layer network to more than a constant accuracy unless the 2-layer network's width is exponentially large in the dimension $d$ (a sketch of the 3-layer construction idea follows this list).
  2. Universality Assumption: The results are broad in scope, applying to standard activation functions such as ReLUs, sigmoids, and threshold functions. The key assumption is that the activation function is universal, in the sense that a sufficiently large 2-layer network can approximate any univariate Lipschitz function that is constant outside a bounded interval.
  3. Bounds on Network Width: Specifically, for every dimension $d$, there exists a function expressible by a 3-layer network of width $O(d^{19/4})$, while any 2-layer network approximating it to better than constant accuracy must have width at least $c\,e^{cd}$ for some universal constant $c > 0$.
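
To make the upper bound concrete, here is a minimal NumPy sketch of the idea behind the 3-layer construction: one hidden ReLU layer approximates the univariate map t -> t^2 coordinate-wise, a linear combination sums the results into an approximation of ||x||^2, and a second hidden ReLU layer applies a univariate function g to that scalar, yielding the radial function x -> g(||x||^2). The helper names (relu_interpolate, univariate_relu_layer, radial_3layer), the target g, and all numerical choices are illustrative assumptions, not the paper's exact construction or constants.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_interpolate(f, lo, hi, n_knots):
    """Piecewise-linear interpolation of a univariate f on [lo, hi], written as
    f(t) ~= offset + sum_j coefs[j] * relu(t - knots[j])  (one hidden ReLU layer)."""
    knots = np.linspace(lo, hi, n_knots)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)            # slope on each segment
    coefs = np.diff(np.concatenate(([0.0], slopes)))   # change of slope at each knot
    return knots[:-1], coefs, vals[0]

def univariate_relu_layer(t, knots, coefs, offset):
    """Evaluate the one-hidden-layer ReLU approximation returned by relu_interpolate."""
    return offset + relu(t[..., None] - knots) @ coefs

def radial_3layer(x, g, n_knots=200):
    """Approximate the radial function x -> g(||x||^2) with two hidden ReLU layers:
    hidden layer 1: coordinate-wise t -> t^2, summed linearly into ||x||^2,
    hidden layer 2: the univariate g applied to that scalar."""
    lim = np.abs(x).max() + 1.0
    k_sq, c_sq, o_sq = relu_interpolate(lambda t: t ** 2, -lim, lim, n_knots)
    sq = univariate_relu_layer(x, k_sq, c_sq, o_sq)    # per-coordinate squares
    r2 = sq.sum(axis=-1)                               # approximate squared norm
    k_g, c_g, o_g = relu_interpolate(g, 0.0, r2.max() + 1.0, n_knots)
    return univariate_relu_layer(r2, k_g, c_g, o_g)

# Quick check against the exact radial target (g chosen arbitrarily for the demo).
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 10))
g = lambda r2: np.sin(r2)
print("max abs error:", np.max(np.abs(radial_3layer(x, g) - g((x ** 2).sum(axis=-1)))))
```

Because the coordinate-wise sum and the inputs to the second ReLU layer are affine maps, this is genuinely a two-hidden-layer (i.e., 3-layer) architecture; the same relu_interpolate helper also illustrates the universality assumption of item 2, since a single hidden ReLU layer suffices to approximate a univariate Lipschitz function on a bounded interval.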

Implications

The results are significant both theoretically and practically. From a theoretical perspective, they provide a formal justification for the empirical success of deep learning architectures, highlighting that depth is crucial for the expressive power of neural networks. Practically, the findings imply that for high-dimensional problems, 3-layer networks can be exponentially more efficient than their 2-layer counterparts, suggesting that deeper networks are a natural choice in real-world applications involving large datasets or high-dimensional input spaces.

Proof Techniques

The core proof is built on several innovative constructions and lemmas:

  1. Radial Function Construction: The authors construct a radial function that is expressible by a 3-layer network but challenging for a 2-layer network to approximate. This function relies on properties of Bessel functions and intricate characteristics of high-dimensional geometry.
  2. Fourier Analysis: The proof employs Fourier-transform techniques, showing that the Fourier transform of any function expressible by a 2-layer network is supported on a union of tubes (thin neighborhoods of lines through the origin in frequency space), in contrast to the more spread-out Fourier support of the radial function used in the construction (see the numerical illustration after this list).
  3. Concentration Results: The analysis leverages high-dimensional concentration results, showing that, with high probability over a randomized choice in the construction, the target function has a substantial fraction of its Fourier mass outside any small union of such tubes, and is therefore hard for 2-layer networks to approximate.
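
This is not part of the paper's proof, but the "union of tubes" picture in item 2 can be visualized numerically: a single hidden unit of a 2-layer network computes a ridge function sigma(<w, x> + b), which is constant in every direction orthogonal to w, so its (windowed, discretized) Fourier transform concentrates along the line spanned by w. The grid size, Hann window, ReLU activation, and tube width below are arbitrary choices made for this sketch.

```python
import numpy as np

# A single ReLU ridge unit on a 2-D grid, and the fraction of its spectral
# energy lying in a thin tube around the line spanned by its direction w.
n = 512
xs = np.linspace(-1.0, 1.0, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")

w = np.array([np.cos(0.3), np.sin(0.3)])              # unit direction of the ridge
ridge = np.maximum(w[0] * X + w[1] * Y + 0.1, 0.0)    # sigma(<w, x> + b) with ReLU
window = np.hanning(n)[:, None] * np.hanning(n)[None, :]   # soften boundary effects

power = np.abs(np.fft.fftshift(np.fft.fft2(ridge * window))) ** 2

# Frequency coordinates and each frequency's distance from the line span{w}.
freqs = np.fft.fftshift(np.fft.fftfreq(n, d=xs[1] - xs[0]))
FX, FY = np.meshgrid(freqs, freqs, indexing="ij")
dist_to_line = np.abs(-w[1] * FX + w[0] * FY)         # component orthogonal to w

tube = dist_to_line <= 2.0                            # tube width chosen for the demo
print("fraction of spectral energy inside the tube:", power[tube].sum() / power.sum())
```

A width-k 2-layer network is a sum of k such units, so its spectrum sits on a union of k such tubes, whereas a radial function has radial Fourier mass spread over spherical shells; informally, the lower bound quantifies this mismatch in high dimension.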

Future Directions

The techniques and results in this paper open several avenues for future research:

  1. Extending the Separation: Further exploration could reveal whether the exponential separation holds for networks with more than three layers and other, potentially more complex functions.
  2. Other Architectures: The implications of such depth-related results could be extended to other neural network architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), possibly leading to similar depth-efficiency separations.
  3. Practical Applications and Algorithms: Implementing these theoretical insights into practical deep learning algorithms can optimize network structures, potentially leading to more efficient training and inference processes in high-dimensional machine learning tasks.

Conclusion

Eldan and Shamir's work compellingly argues for the importance of depth in neural networks. By formally proving that even a single additional layer can exponentially enhance a network's expressiveness, they provide a solid theoretical foundation that supports the empirical success of deep learning models. This balance of theory and practice exemplifies the ongoing advancements in understanding and leveraging neural network architectures for complex learning tasks.