Expressivity Assumptions in Function Approximation
- The paper’s main contribution is showing that deep networks can realize localized and sparse approximations that shallow networks with a comparable number of neurons cannot.
- It rigorously characterizes how architectural parameters such as depth, width, and the type of nonlinearity control expressivity while keeping statistical capacity comparable to that of shallow networks.
- The work demonstrates, via empirical risk minimization (ERM) over a two-hidden-layer design, that expressive deep models attain near-optimal learning rates for smooth and sparse function classes.
Expressivity assumptions on function approximation concern the conditions under which, and the limits within which, a model class (such as deep neural networks or other parameterized architectures) can approximate target functions with specified properties. Recent research has provided rigorous foundations for understanding how expressivity is controlled by architectural parameters such as depth, width, and the type of nonlinearity, and how it interacts with desirable properties such as localization, sparsity, and learning-theoretic generalization.
1. Localized and Sparse Approximation: Deep vs. Shallow Nets
A distinguishing expressivity property of deep networks, compared to shallow networks with a comparable number of neurons, is the ability to achieve highly localized and sparse approximations. The construction presented uses a two-hidden-layer deep network to generate “indicator-like” responses on spatially localized hypercubes in the input domain. More formally, given a partition of the unit hypercube $[0,1]^d$ into small cubes $\{A_j\}_{j=1}^{N^d}$, one can design a network unit $G_j$ whose first hidden layer uses a Heaviside-type activation $\sigma_0$ and whose second hidden layer applies a smooth sigmoidal function $\sigma$. For $x$ inside the hypercube $A_j$, $G_j(x) \approx 1$; outside, it is nearly zero. No shallow network with an equivalent number of neurons can realize such localized “gates.” By assembling such units, deep networks can construct sparse approximators, turning the response on or off depending on location, a critical property for high-dimensional function classes exhibiting spatially limited support or structure.
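One concrete way to realize such a gate, written here as an illustrative sketch rather than the paper’s exact formula (the cube boundaries $A_j=\prod_{k=1}^{d}[a_{j,k},b_{j,k}]$, the gain $K>0$, and the specific thresholding are notational choices made for this summary):
$$
G_j(x) \;=\; \sigma\!\Big(K\Big(\sum_{k=1}^{d}\Big[\sigma_0\big(x^{(k)}-a_{j,k}\big)+\sigma_0\big(b_{j,k}-x^{(k)}\big)\Big]\;-\;2d+\tfrac{1}{2}\Big)\Big).
$$
With $\sigma_0$ the Heaviside step, the inner sum equals $2d$ exactly when every coordinate $x^{(k)}$ lies in $[a_{j,k},b_{j,k}]$ and is at most $2d-1$ otherwise, so for a large gain $K$ the outer sigmoid outputs a value near $1$ inside $A_j$ and near $0$ outside; this composition of a step-like first stage with a smooth second stage is precisely what a single hidden layer cannot reproduce with comparably many neurons.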
The ability to generate localized and sparse approximations directly enlarges the realizable function repertoire relative to shallow networks, whose lack of compositional structure prevents them from encoding such local “bump” or “mask” functions efficiently.
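To make the assembly concrete, the following Python sketch builds such gates numerically and sums a few of them into a sparse approximator. It is a minimal illustration under assumptions made here (a steep sigmoid stands in for the Heaviside step, and the grid size, active cubes, local heights, and `gain` parameter are arbitrary choices for the demo), not the paper’s construction:

```python
# Minimal numerical sketch (not the paper's construction): indicator-like
# "gates" over cubes of a grid partition, assembled into a sparse approximator.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gate(x, lo, hi, gain=50.0):
    """Indicator-like response for the cube prod_k [lo[k], hi[k]].

    First stage: steep sigmoids acting coordinate-wise (step-like units).
    Second stage: a sigmoid thresholding their sum; ~1 inside the cube, ~0 outside.
    """
    d = len(lo)
    first = sigmoid(gain * (x - lo)) + sigmoid(gain * (hi - x))   # shape (m, d)
    return sigmoid(gain * (first.sum(axis=-1) - 2 * d + 0.5))     # shape (m,)

# Partition [0,1]^2 into a 4x4 grid and mark s = 3 cubes as "active".
rng = np.random.default_rng(0)
N, d = 4, 2
active = [(0, 0), (2, 1), (3, 3)]                  # indices of active cubes (assumption)
heights = rng.uniform(0.5, 1.0, size=len(active))  # local values to reproduce

def sparse_approximator(x):
    """Sum of gates, one per active cube, each scaled by its local value."""
    out = np.zeros(x.shape[0])
    for (i, j), h in zip(active, heights):
        lo = np.array([i / N, j / N])
        hi = np.array([(i + 1) / N, (j + 1) / N])
        out += h * gate(x, lo, hi)
    return out

# Centers of active cubes respond ~heights; centers of inactive cubes respond ~0.
active_centers = np.array([[(i + 0.5) / N, (j + 0.5) / N] for i, j in active])
inactive_centers = np.array([[(i + 0.5) / N, (j + 0.5) / N]
                             for i in range(N) for j in range(N)
                             if (i, j) not in active][:5])
print("target heights    :", np.round(heights, 3))
print("active responses  :", np.round(sparse_approximator(active_centers), 3))
print("inactive responses:", np.round(sparse_approximator(inactive_centers), 3))
```

Evaluating the approximator at the centers of the active cubes returns values close to the chosen heights, while centers of inactive cubes return values near zero, illustrating the on/off behavior described above.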
2. Assumptions and Theoretical Framework
The function approximation analysis is established over a function class characterized by both smoothness and sparsity. The main regularity condition is Lipschitz continuity with constant $c_0$: for all $x, x' \in [0,1]^d$,
$$
|f(x) - f(x')| \;\le\; c_0\,\|x - x'\|.
$$
Additionally, a structural prior assumes $s$-sparsity with respect to the partition of $[0,1]^d$ into $N^d$ cubes $\{A_j\}$: a function is $s$-sparse if its support is contained entirely within at most $s$ of these cubes.
The hypothesis space $\mathcal{H}$ for learning is a family of two-hidden-layer deep networks with bounded parameters (all weights and thresholds bounded in magnitude by a constant $R$). The second-layer activation $\sigma$ must satisfy a Lipschitz condition, $|\sigma(t) - \sigma(t')| \le L_\sigma |t - t'|$. The capacity of $\mathcal{H}$ is measured using covering numbers in $L^1$, and the logarithm of the covering number scales essentially linearly with the number of free parameters (up to a factor logarithmic in $1/\varepsilon$). This matches the corresponding bound for shallow networks of width $n$, so the added depth does not inflate learning-theoretic capacity.
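For orientation, covering-number estimates of this kind typically take the following shape (stated here as a representative form under the boundedness assumptions above, not as the paper’s exact statement):
$$
\log \mathcal{N}\big(\varepsilon, \mathcal{H}, L^1\big) \;\le\; C \, p \, \log\frac{C'(R, L_\sigma)}{\varepsilon},
$$
where $p$ is the total number of free parameters (of order $n d$ for width $n$ in dimension $d$), and $C$, $C'(R, L_\sigma)$ are constants depending on the parameter bound and the activation’s Lipschitz constant. A bound of the same shape holds for a shallow network of width $n$, which is the precise sense in which depth adds expressivity without adding capacity.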
3. Capacity Control and Learning Rates
Despite the increased expressivity, the network capacity (as quantified by the covering number) remains on par with shallow architectures. This key observation enables trading off approximation and estimation error in empirical risk minimization (ERM). For Lipschitz targets, the resulting learning rate is of order $m^{-\frac{2}{2+d}}$ in the sample size $m$, matching the minimax rate up to logarithmic factors. If the true function is additionally $s$-sparse with respect to the partition into $N^d$ cubes, the bound improves: the error is governed by the $s$ active cubes rather than by the full partition. This demonstrates the advantage of architectures capable of exploiting sparsity, as deep nets can and shallow nets cannot.
The excess risk of the ERM estimator is decomposed into an approximation error (how well the hypothesis space $\mathcal{H}$ can match the target $f^\ast$) and a sample error (capacity-limited generalization), with both terms controlled thanks to the bounded covering number.
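A heuristic version of this trade-off, with constants and logarithmic factors suppressed (a standard balancing calculation under the Lipschitz assumption, not a reproduction of the paper’s proof): approximating a $c_0$-Lipschitz target by a value that is constant on each of the $N^d$ cubes of side $1/N$ incurs squared error of order $N^{-2}$, while the sample error of ERM over a class whose log-covering number grows with the number of units, here of order $N^d$, behaves like $N^d/m$. Balancing the two terms gives
$$
N^{-2} \;\asymp\; \frac{N^{d}}{m}
\;\;\Longrightarrow\;\;
N \asymp m^{\frac{1}{2+d}},
\qquad
\text{excess risk} \;\asymp\; m^{-\frac{2}{2+d}},
$$
recovering the near-minimax rate stated above; when only $s \ll N^d$ cubes carry the target’s support, the effective number of active units shrinks accordingly, which is the source of the improved sparse rate.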
4. Practical Implementation: ERM and Architectural Considerations
The paper’s analysis is directly instantiated in ERM learning. For i.i.d. samples $D = \{(x_i, y_i)\}_{i=1}^{m}$, the estimator is
$$
\hat f_D \;=\; \arg\min_{f \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m}\big(f(x_i) - y_i\big)^2.
$$
Outputs are projected onto $[-M, M]$ when the true target is known to be bounded by $M$, via the truncation operator $\pi_M f(x) = \max\{-M, \min\{M, f(x)\}\}$. Generalization is controlled via covering-number-based concentration inequalities (Bernstein’s inequality), allowing derivation of the learning rates described above.
The architectural template—two hidden layers, localized first-stage (step-like) activation, smooth sigmoidal second-stage—enables both the construction of spatial indicator functions and their aggregation for sparse, complex targets.
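The following Python sketch mirrors this template at a practical level. It is a minimal illustration under assumptions made in this summary (synthetic data, a steep trainable sigmoid in the first layer standing in for the step-like activation, Adam as the optimizer, and the truncation bound `M`), not a prescription from the paper:

```python
# Minimal ERM sketch: two-hidden-layer network with a steep-sigmoid first
# stage (step-like) and a smooth sigmoidal second stage, trained by
# empirical risk minimization on squared loss, with outputs truncated to
# [-M, M] as in the projection step described above.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, m, M = 2, 512, 1.0                         # input dim, sample size, output bound

class TwoHiddenLayerNet(nn.Module):
    def __init__(self, d, width1=64, width2=32, gain=25.0):
        super().__init__()
        self.fc1 = nn.Linear(d, width1)       # first stage: localized, step-like units
        self.fc2 = nn.Linear(width1, width2)  # second stage: smooth sigmoidal units
        self.out = nn.Linear(width2, 1)
        self.gain = gain

    def forward(self, x):
        h1 = torch.sigmoid(self.gain * self.fc1(x))  # steep sigmoid ~ Heaviside-type gate
        h2 = torch.sigmoid(self.fc2(h1))             # smooth sigmoidal second layer
        return self.out(h2).squeeze(-1)

# Synthetic target: supported on a single small cube (sparse, Lipschitz on its support).
def target(x):
    inside = ((x > 0.25) & (x < 0.5)).all(dim=-1).float()
    return inside * torch.sin(4.0 * x.sum(dim=-1))

x_train = torch.rand(m, d)
y_train = target(x_train) + 0.05 * torch.randn(m)    # noisy samples

net = TwoHiddenLayerNet(d)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(2000):                             # ERM by gradient descent
    opt.zero_grad()
    loss = ((net(x_train) - y_train) ** 2).mean()    # empirical squared risk
    loss.backward()
    opt.step()

# Truncation pi_M applied at prediction time, since the target is bounded by M.
x_test = torch.rand(1024, d)
with torch.no_grad():
    pred = torch.clamp(net(x_test), -M, M)
    test_err = ((pred - target(x_test)) ** 2).mean()
print(f"test mean squared error: {test_err.item():.4f}")
```

Training minimizes the empirical squared risk over the two-hidden-layer class, and predictions are clipped to $[-M, M]$, mirroring the projection $\pi_M$ used in the analysis.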
5. Implications for Expressivity in High-Dimensional Function Spaces
The central message is that deep nets realize a qualitative expressivity advantage—only deep architectures (with at least two hidden layers) can produce localized “mask” functions or sparse aggregations, fundamental for approximating functions with localized or sparse behavior. Shallow networks with matching width and parameter budget fundamentally lack this capability.
Moreover, this expressivity does not come at the cost of increased statistical complexity; capacity, as measured by the covering number, scales essentially as it does for shallow networks. This is nontrivial, as one might otherwise expect resource-hungry deep nets to suffer degraded generalization.
Consequently, deep networks present a compelling architecture in learning settings with high-dimensional, sparse, or spatially fragmented targets, as they can match expressivity to function structure without sacrificing learning-theoretic guarantees.
6. Learning Theory Perspective: Balancing Expressivity and Capacity
From a learning-theoretic standpoint, the work demonstrates that by carefully combining depth for increased expressivity with parameter constraints for bounded capacity, one can achieve near-optimal statistical learning rates. This alignment is achieved through a network that encodes spatial structure in the first hidden layer and applies smooth, bounded nonlinearities in the second, allowing for both localized approximation and controlled variance.
In summary, the paper provides a theoretical foundation showing that deep networks, by virtue of depth-enabled localized and sparse approximations, attain both greater expressivity and near-optimal learning rates (improved further for sparse-structured targets), all while maintaining a statistical capacity comparable to shallow architectures. This justifies depth as an architectural principle for complex function approximation in high dimensions when both expressivity and generalization are required (Lin, 2018).