Optimal approximation of continuous functions by very deep ReLU networks
(1802.03620v2)
Published 10 Feb 2018 in cs.NE
Abstract: We consider approximations of general continuous functions on finite-dimensional cubes by general deep ReLU neural networks and study the approximation rates with respect to the modulus of continuity of the function and the total number of weights $W$ in the network. We establish the complete phase diagram of feasible approximation rates and show that it includes two distinct phases. One phase corresponds to slower approximations that can be achieved with constant-depth networks and continuous weight assignments. The other phase provides faster approximations at the cost of depths necessarily growing as a power law $L\sim W^{\alpha}$, $0<\alpha\le 1$, and with necessarily discontinuous weight assignments. In particular, we prove that constant-width fully-connected networks of depth $L\sim W$ provide the fastest possible approximation rate $\|f-\widetilde f\|_\infty = O(\omega_f(O(W^{-2/\nu})))$ that cannot be achieved with less deep networks.
The paper presents two distinct approximation phases, with shallow networks achieving slower rates and very deep networks reaching optimal rates.
It employs rigorous approximation theory and the bit extraction method to establish theoretical guarantees for scaling network depth linearly with the number of weights (a toy illustration of the bit extraction idea follows below).
The findings highlight a critical trade-off between depth and approximation efficiency, offering practical insights for designing high-accuracy deep learning models.
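The bit extraction idea underlying the deep phase can be summarized arithmetically: a single real parameter can store many bits, and a chain of simple per-layer operations can read them back, one bit per constant-size block of layers. The toy Python sketch below illustrates only this arithmetic idea; the hard threshold it uses is an assumption made for illustration and is only approximately realizable with ReLU units in the paper's actual construction.

```python
def encode_bits(bits):
    """Pack a list of {0,1} bits into a single real number in [0, 1)."""
    return sum(b / 2 ** (i + 1) for i, b in enumerate(bits))

def extract_bits(theta, m):
    """Recover m bits from theta by repeated doubling and thresholding.

    Each iteration corresponds to one constant-size block of layers, so reading
    m bits costs depth proportional to m. The hard threshold below is only
    approximately realizable with ReLU units.
    """
    bits = []
    x = theta
    for _ in range(m):
        x *= 2.0
        b = 1 if x >= 1.0 else 0   # hard threshold, not an exact ReLU gate
        bits.append(b)
        x -= b
    return bits

bits = [1, 0, 1, 1, 0, 0, 1, 0]
theta = encode_bits(bits)          # one real weight encodes all 8 bits
assert extract_bits(theta, len(bits)) == bits
```

This parameter-for-depth trade is what distinguishes the deep phase: reading more information out of each weight requires proportionally more layers.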
Overview of Optimal Approximation of Continuous Functions by Very Deep ReLU Networks
This paper investigates the approximation capabilities of deep ReLU neural networks for general continuous multivariate functions on finite-dimensional cubes such as $[0,1]^\nu$. The focus is on how well these networks can approximate such functions in terms of the modulus of continuity of the function and the total number of network weights $W$. A central contribution is a complete phase diagram of feasible approximation rates as a function of network depth $L$ and total number of weights, exhibiting two primary phases distinguished by their depth scaling and by the continuity of the weight assignment.
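For reference, the modulus of continuity quantifies how much $f$ can vary over small distances, and the approximation error is measured in the uniform norm on the cube (a standard restatement; the choice of norm on $\mathbb{R}^\nu$ affects only constants):

$$\omega_f(\delta) = \sup\{\,|f(x)-f(y)| : x, y \in [0,1]^\nu,\ \|x-y\| \le \delta\,\}, \qquad \|f-\widetilde f\|_\infty = \sup_{x \in [0,1]^\nu} |f(x)-\widetilde f(x)|.$$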
Key Findings and Methodology
The research establishes two distinct approximation phases. One involves slower approximations achieved by networks with constant depth and continuous weight assignments. The other achieves faster rates, but requires the depth to grow as a power law $L\sim W^{\alpha}$ ($0<\alpha\le 1$) and requires discontinuous weight assignments.
Specifically, the fastest approximation rate $\|f-\widetilde f\|_\infty = O(\omega_f(O(W^{-2/\nu})))$ is attainable only when the network depth $L$ scales roughly linearly with the number of weights ($L\sim W$); it cannot be achieved by less deep networks. This marks a clear departure from conventional shallow (constant-depth) constructions, which can only achieve $\|f-\widetilde f\|_\infty = O(\omega_f(O(W^{-1/\nu})))$, a strictly slower rate.
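To make the gap concrete, the back-of-the-envelope comparison below (not from the paper; all constants and the inner $O(\cdot)$ factors are ignored, and a Lipschitz function with $\omega_f(\delta)\approx\delta$ is assumed) estimates how many weights each rate would need to reach a target uniform error $\varepsilon$:

```python
# Back-of-the-envelope comparison of the two rates for a Lipschitz function
# (omega_f(delta) ~ delta); constants are ignored, so the numbers only
# illustrate the scaling, not the paper's exact bounds.

def weights_needed(eps: float, nu: int, rate_exponent: float) -> float:
    """Invert eps ~ W**(-rate_exponent/nu) to get W ~ eps**(-nu/rate_exponent)."""
    return eps ** (-nu / rate_exponent)

nu = 2          # input dimension of the cube [0,1]^nu
eps = 1e-3      # target uniform approximation error

shallow = weights_needed(eps, nu, rate_exponent=1.0)  # error ~ W^(-1/nu)
deep = weights_needed(eps, nu, rate_exponent=2.0)     # error ~ W^(-2/nu)

print(f"shallow phase: W ~ {shallow:.0e}")  # ~ 1e+06
print(f"deep phase:    W ~ {deep:.0e}")     # ~ 1e+03
```

Under these assumptions, reaching $\varepsilon = 10^{-3}$ in dimension $\nu = 2$ takes on the order of $10^3$ weights in the deep phase versus $10^6$ in the shallow phase, with the gap widening as the target error shrinks.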
Detailed Contributions
Phase Separation in Approximation Rates: The paper delineates two phases in terms of network architecture and weight behavior:
Shallow Phase: Constant-depth networks achieve $\|f-\widetilde f\|_\infty = O(\omega_f(O(W^{-1/\nu})))$ with weight assignments that depend linearly on the approximated function.
Deep Phase: With deeper networks the rate $\|f-\widetilde f\|_\infty = O(\omega_f(O(W^{-2/\nu})))$ becomes feasible, surpassing the upper limits imposed on shallow networks.
Theoretical Guarantees and Optimal Network Architectures: The work bridges the gap between earlier theoretical limitations and practical network designs by showing that fully-connected networks of constant width and growing depth attain the optimal asymptotic approximation rate. The proofs combine classical approximation theory with the bit extraction method (a minimal sketch of such a constant-width architecture follows this list).
Complexity and Network Depth: The results reveal a necessary trade-off: the faster approximation rates are reachable only with very deep architectures. For rates between the two extremes, intermediate depth scalings $L\sim W^{\alpha}$ provide a compromise, extending feasible rates beyond those achievable with constant-depth, essentially parallel architectures.
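As an illustration of the architectures the deep phase points to, the sketch below builds a constant-width fully-connected ReLU network with a NumPy forward pass. It is a hypothetical stand-in (random weights, not the paper's explicit construction) meant only to show the shape of such networks and how their weight count scales with depth:

```python
import numpy as np

def make_constant_width_relu_net(nu: int, width: int, depth: int, seed: int = 0):
    """Build random weights for a constant-width fully-connected ReLU network.

    Illustrative stand-in for the constant-width architectures discussed above,
    not the specific weight assignment constructed in the paper.
    """
    rng = np.random.default_rng(seed)
    dims = [nu] + [width] * depth + [1]      # input, hidden layers, scalar output
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    """Forward pass: ReLU after every affine layer except the last."""
    h = x
    for i, (A, b) in enumerate(layers):
        h = A @ h + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)
    return h

def count_weights(layers):
    """Total number of parameters (weights and biases)."""
    return sum(A.size + b.size for A, b in layers)

net = make_constant_width_relu_net(nu=2, width=8, depth=100)
x = np.array([0.3, 0.7])                     # a point in [0,1]^2
print(forward(net, x))                       # scalar (shape-(1,)) output
print(count_weights(net))                    # grows linearly with depth at fixed width
```

For fixed width, each additional layer contributes a fixed number of weights, so the total weight count grows linearly with depth; this is the $L\sim W$ regime in which the optimal rate is attained.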
Implications and Future Directions
The findings suggest promising directions for deep learning, where the strategic choice of network depth and of how weights are assigned can substantially affect achievable accuracy. They lend theoretical support to the effectiveness of deep, very narrow networks akin to ResNets or highway networks for high-accuracy function approximation, offering insights relevant to practical settings such as image recognition.
In future work, further investigation into quantization and discrete weight strategies could improve scalability and efficiency, especially with respect to memory and computational overhead. This might also open avenues for deploying deep learning models on resource-constrained devices, helping the theoretical results translate into practical benchmarks.
Overall, the results advance the understanding of neural network capabilities in approximating complex functions, laying a foundation for architecture-driven approaches in which depth scaling and weight assignment play pivotal roles in determining the approximation rates attainable by neural models.