Expressivity Assumptions in Function Approximation
- The paper’s main contribution is showing that deep networks can realize localized and sparse approximations that shallow networks with a comparable number of neurons cannot.
- It rigorously characterizes how architectural parameters such as depth, width, and the type of nonlinearity control expressivity while keeping statistical capacity comparable to that of shallow networks.
- The work demonstrates, via empirical risk minimization (ERM) over a two-hidden-layer design, that expressive deep models attain near-optimal learning rates for smooth and sparse function classes.
Expressivity assumptions on function approximation concern the conditions under which, and the limits within which, a model class (such as deep neural networks or other parameterized architectures) can approximate target functions with specified properties. Recent research has provided rigorous foundations for understanding how expressivity is controlled by architectural parameters such as depth, width, and the type of nonlinearity, and how it interacts with desirable properties such as localization, sparsity, and learning-theoretic generalization.
1. Localized and Sparse Approximation: Deep vs. Shallow Nets
A distinguishing expressivity property of deep networks, compared to shallow networks with a comparable number of neurons, is the ability to achieve highly localized and sparse approximations. The construction presented uses a two-hidden-layer deep network to generate “indicator-like” responses on spatially localized hypercubes in the input domain. More formally, given a partition of the unit hypercube $[0,1]^d$ into small cubes $\{A_j\}_{j=1}^{N^d}$, one can design a network unit $G_j$ whose first hidden layer uses a Heaviside-type activation $\sigma_0$ and whose second hidden layer applies a smooth sigmoidal function $\sigma$. For $x$ inside the hypercube $A_j$, $G_j(x) \approx 1$; outside, it is nearly zero. No shallow network with an equivalent number of neurons can realize such localized “gates.” By assembling such units, deep networks can construct sparse approximators, turning the response on or off depending on location, a critical property for high-dimensional function classes exhibiting spatially limited support or structure.
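One concrete way to realize such a gate, written here as an illustrative sketch rather than the paper’s exact formula (the cube boundaries $A_j=\prod_{k=1}^{d}[a_{j,k},b_{j,k}]$, the gain $K>0$, and the specific thresholding are notational choices made for this summary):
$$
G_j(x) \;=\; \sigma\!\Big(K\Big(\sum_{k=1}^{d}\Big[\sigma_0\big(x^{(k)}-a_{j,k}\big)+\sigma_0\big(b_{j,k}-x^{(k)}\big)\Big]\;-\;2d+\tfrac{1}{2}\Big)\Big).
$$
With $\sigma_0$ the Heaviside step, the inner sum equals $2d$ exactly when every coordinate $x^{(k)}$ lies in $[a_{j,k},b_{j,k}]$ and is at most $2d-1$ otherwise, so for a large gain $K$ the outer sigmoid outputs a value near $1$ inside $A_j$ and near $0$ outside; this composition of a step-like first stage with a smooth second stage is precisely what a single hidden layer cannot reproduce with comparably many neurons.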
The ability to generate localized and sparse approximations directly enlarges the realizable function repertoire relative to shallow networks, whose lack of compositional structure prevents them from encoding such local “bump” or “mask” functions efficiently.
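To make the assembly concrete, the following Python sketch builds such gates numerically and sums a few of them into a sparse approximator. It is a minimal illustration under assumptions made here (a steep sigmoid stands in for the Heaviside step, and the grid size, active cubes, local heights, and `gain` parameter are arbitrary choices for the demo), not the paper’s construction:

```python
# Minimal numerical sketch (not the paper's construction): indicator-like
# "gates" over cubes of a grid partition, assembled into a sparse approximator.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gate(x, lo, hi, gain=50.0):
    """Indicator-like response for the cube prod_k [lo[k], hi[k]].

    First stage: steep sigmoids acting coordinate-wise (step-like units).
    Second stage: a sigmoid thresholding their sum; ~1 inside the cube, ~0 outside.
    """
    d = len(lo)
    first = sigmoid(gain * (x - lo)) + sigmoid(gain * (hi - x))   # shape (m, d)
    return sigmoid(gain * (first.sum(axis=-1) - 2 * d + 0.5))     # shape (m,)

# Partition [0,1]^2 into a 4x4 grid and mark s = 3 cubes as "active".
rng = np.random.default_rng(0)
N, d = 4, 2
active = [(0, 0), (2, 1), (3, 3)]                  # indices of active cubes (assumption)
heights = rng.uniform(0.5, 1.0, size=len(active))  # local values to reproduce

def sparse_approximator(x):
    """Sum of gates, one per active cube, each scaled by its local value."""
    out = np.zeros(x.shape[0])
    for (i, j), h in zip(active, heights):
        lo = np.array([i / N, j / N])
        hi = np.array([(i + 1) / N, (j + 1) / N])
        out += h * gate(x, lo, hi)
    return out

# Centers of active cubes respond ~heights; centers of inactive cubes respond ~0.
active_centers = np.array([[(i + 0.5) / N, (j + 0.5) / N] for i, j in active])
inactive_centers = np.array([[(i + 0.5) / N, (j + 0.5) / N]
                             for i in range(N) for j in range(N)
                             if (i, j) not in active][:5])
print("target heights    :", np.round(heights, 3))
print("active responses  :", np.round(sparse_approximator(active_centers), 3))
print("inactive responses:", np.round(sparse_approximator(inactive_centers), 3))
```

Evaluating the approximator at the centers of the active cubes returns values close to the chosen heights, while centers of inactive cubes return values near zero, illustrating the on/off behavior described above.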
2. Assumptions and Theoretical Framework
The function approximation analysis is established over a function class characterized by both smoothness and sparsity. The main regularity condition is Lipschitz continuity with constant $c_0$: for all $x, x' \in [0,1]^d$,
$$
|f(x) - f(x')| \;\le\; c_0\,\|x - x'\|.
$$
Additionally, a structural prior assumes $s$-sparsity with respect to the partition of $[0,1]^d$ into $N^d$ cubes $\{A_j\}$: a function is $s$-sparse if its support is contained entirely within at most $s$ of these cubes.
The hypothesis space $\mathcal{H}$ for learning is a family of two-hidden-layer deep networks with bounded parameters (all weights and thresholds bounded in magnitude by a constant $R$). The second-layer activation $\sigma$ must satisfy a Lipschitz condition, $|\sigma(t) - \sigma(t')| \le L_\sigma |t - t'|$. The capacity of $\mathcal{H}$ is measured using covering numbers in $L^1$, and the logarithm of the covering number scales essentially linearly with the number of free parameters (up to a factor logarithmic in $1/\varepsilon$). This matches the corresponding bound for shallow networks of width $n$, so the added depth does not inflate learning-theoretic capacity.
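For orientation, covering-number estimates of this kind typically take the following shape (stated here as a representative form under the boundedness assumptions above, not as the paper’s exact statement):
$$
\log \mathcal{N}\big(\varepsilon, \mathcal{H}, L^1\big) \;\le\; C \, p \, \log\frac{C'(R, L_\sigma)}{\varepsilon},
$$
where $p$ is the total number of free parameters (of order $n d$ for width $n$ in dimension $d$), and $C$, $C'(R, L_\sigma)$ are constants depending on the parameter bound and the activation’s Lipschitz constant. A bound of the same shape holds for a shallow network of width $n$, which is the precise sense in which depth adds expressivity without adding capacity.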
3. Capacity Control and Learning Rates
Despite the increased expressivity, the network capacity (as quantified by the covering number) remains on par with shallow architectures. This key observation enables trading off approximation and estimation error in empirical risk minimization (ERM). For Lipschitz targets, the resulting learning rate is of order $m^{-\frac{2}{2+d}}$ in the sample size $m$, matching the minimax rate up to logarithmic factors. If the true function is additionally $s$-sparse with respect to the partition into $N^d$ cubes, the bound improves: the error is governed by the $s$ active cubes rather than by the full partition. This demonstrates the advantage of architectures capable of exploiting sparsity, as deep nets can and shallow nets cannot.
The excess risk of the ERM estimator is decomposed into an approximation error (how well the hypothesis space $\mathcal{H}$ can match the target $f^\ast$) and a sample error (capacity-limited generalization), with both terms controlled thanks to the bounded covering number.
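A heuristic version of this trade-off, with constants and logarithmic factors suppressed (a standard balancing calculation under the Lipschitz assumption, not a reproduction of the paper’s proof): approximating a $c_0$-Lipschitz target by a value that is constant on each of the $N^d$ cubes of side $1/N$ incurs squared error of order $N^{-2}$, while the sample error of ERM over a class whose log-covering number grows with the number of units, here of order $N^d$, behaves like $N^d/m$. Balancing the two terms gives
$$
N^{-2} \;\asymp\; \frac{N^{d}}{m}
\;\;\Longrightarrow\;\;
N \asymp m^{\frac{1}{2+d}},
\qquad
\text{excess risk} \;\asymp\; m^{-\frac{2}{2+d}},
$$
recovering the near-minimax rate stated above; when only $s \ll N^d$ cubes carry the target’s support, the effective number of active units shrinks accordingly, which is the source of the improved sparse rate.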
4. Practical Implementation: ERM and Architectural Considerations
The paper’s analysis is directly instantiated in ERM learning. For i.i.d. samples $D = \{(x_i, y_i)\}_{i=1}^{m}$, the estimator is
$$
\hat f_D \;=\; \arg\min_{f \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m}\big(f(x_i) - y_i\big)^2.
$$
Outputs are projected onto $[-M, M]$ when the true target is known to be bounded by $M$, via the truncation operator $\pi_M f(x) = \max\{-M, \min\{M, f(x)\}\}$. Generalization is controlled via covering-number-based concentration inequalities (Bernstein’s inequality), allowing derivation of the learning rates described above.
The architectural template—two hidden layers, localized first-stage (step-like) activation, smooth sigmoidal second-stage—enables both the construction of spatial indicator functions and their aggregation for sparse, complex targets.
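The following Python sketch mirrors this template at a practical level. It is a minimal illustration under assumptions made in this summary (synthetic data, a steep trainable sigmoid in the first layer standing in for the step-like activation, Adam as the optimizer, and the truncation bound `M`), not a prescription from the paper:

```python
# Minimal ERM sketch: two-hidden-layer network with a steep-sigmoid first
# stage (step-like) and a smooth sigmoidal second stage, trained by
# empirical risk minimization on squared loss, with outputs truncated to
# [-M, M] as in the projection step described above.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, m, M = 2, 512, 1.0                         # input dim, sample size, output bound

class TwoHiddenLayerNet(nn.Module):
    def __init__(self, d, width1=64, width2=32, gain=25.0):
        super().__init__()
        self.fc1 = nn.Linear(d, width1)       # first stage: localized, step-like units
        self.fc2 = nn.Linear(width1, width2)  # second stage: smooth sigmoidal units
        self.out = nn.Linear(width2, 1)
        self.gain = gain

    def forward(self, x):
        h1 = torch.sigmoid(self.gain * self.fc1(x))  # steep sigmoid ~ Heaviside-type gate
        h2 = torch.sigmoid(self.fc2(h1))             # smooth sigmoidal second layer
        return self.out(h2).squeeze(-1)

# Synthetic target: supported on a single small cube (sparse, Lipschitz on its support).
def target(x):
    inside = ((x > 0.25) & (x < 0.5)).all(dim=-1).float()
    return inside * torch.sin(4.0 * x.sum(dim=-1))

x_train = torch.rand(m, d)
y_train = target(x_train) + 0.05 * torch.randn(m)    # noisy samples

net = TwoHiddenLayerNet(d)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):                             # ERM by gradient descent
    opt.zero_grad()
    loss = ((net(x_train) - y_train) ** 2).mean()    # empirical squared risk
    loss.backward()
    opt.step()

# Truncation pi_M applied at prediction time, since the target is bounded by M.
x_test = torch.rand(1024, d)
with torch.no_grad():
    pred = torch.clamp(net(x_test), -M, M)
    test_err = ((pred - target(x_test)) ** 2).mean()
print(f"test mean squared error: {test_err.item():.4f}")
```

Training minimizes the empirical squared risk over the two-hidden-layer class, and predictions are clipped to $[-M, M]$, mirroring the projection $\pi_M$ used in the analysis.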
5. Implications for Expressivity in High-Dimensional Function Spaces
The central message is that deep nets realize a qualitative expressivity advantage—only deep architectures (with at least two hidden layers) can produce localized “mask” functions or sparse aggregations, fundamental for approximating functions with localized or sparse behavior. Shallow networks with matching width and parameter budget fundamentally lack this capability.
Moreover, this expressivity does not come at the cost of increased statistical complexity; capacity, as measured by the covering number, scales essentially as it does for shallow networks. This is nontrivial, as one might otherwise expect resource-hungry deep nets to suffer degraded generalization.
Consequently, deep networks present a compelling architecture in learning settings with high-dimensional, sparse, or spatially fragmented targets, as they can match expressivity to function structure without sacrificing learning-theoretic guarantees.
6. Learning Theory Perspective: Balancing Expressivity and Capacity
From a learning-theoretic standpoint, the work demonstrates that by carefully combining depth for increased expressivity with parameter constraints for bounded capacity, one can achieve near-optimal statistical learning rates. This alignment is achieved through a network that encodes spatial structure in the first hidden layer and applies smooth, bounded nonlinearities in the second, allowing for both localized approximation and controlled variance.
In summary, the paper provides a theoretical foundation showing that deep networks, by virtue of depth-enabled localized and sparse approximations, attain both greater expressivity and near-optimal learning rates (improved further for sparse-structured targets), all while maintaining a statistical capacity comparable to shallow architectures. This justifies depth as an architectural principle for complex function approximation in high dimensions when both expressivity and generalization are required (Lin, 2018).