
Expressivity Assumptions in Function Approximation

Updated 12 October 2025
  • The paper’s main contribution is showing that deep networks can realize localized, sparse approximations that shallow networks with comparable resources cannot.
  • It rigorously characterizes how architectural parameters like depth, width, and nonlinearity control expressivity while maintaining comparable statistical capacity.
  • The work demonstrates, via ERM with a two-hidden-layer design, that expressive deep models can attain near-optimal learning rates for smooth and sparse function classes.

Expressivity assumptions on function approximation concern the characterization, conditions, and limitations under which a model class (such as deep neural networks or other parameterized architectures) is able to approximate target functions with specified properties. Recent research has provided rigorous foundations for understanding how expressivity is controlled by architectural parameters such as depth, width, and type of nonlinearity, and how it interacts with desirable properties such as localization, sparsity, and learning-theoretic generalization.

1. Localized and Sparse Approximation: Deep vs. Shallow Nets

A distinguishing expressivity property of deep networks, as compared to shallow networks with a comparable number of neurons, is the ability to achieve highly localized and sparse approximations. The construction presented uses a two-hidden-layer deep network to generate “indicator-like” responses for spatially localized hypercubes in the input domain. More formally, given a partition of the unit hypercube $\mathbb{I}^d$ into small cubes indexed by $j$, one can design a network element

$$N^*_{n,j,K}(x) = \sigma \left\{ 2K \Bigl[ \sum_{\ell=1}^d \sigma_0\bigl(1/(2n) + x^{(\ell)} - \xi_j^{(\ell)}\bigr) + \sum_{\ell=1}^d \sigma_0\bigl(1/(2n) - x^{(\ell)} + \xi_j^{(\ell)}\bigr) - 2d + \tfrac{1}{2} \Bigr] \right\}$$

where $\sigma_0$ is a Heaviside-type activation and $\sigma$ is a smooth sigmoidal function. For $x$ inside the hypercube $A_{n,j}$, $|1 - N^*_{n,j,K}(x)| \leq \epsilon$; outside, it is nearly zero. No shallow network with an equivalent number of neurons can realize such localized “gates.” By assembling such units, deep networks can construct sparse approximators, turning the response on or off depending on location, a critical property for high-dimensional function classes exhibiting spatially limited support or structure.
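To make the construction concrete, here is a minimal NumPy sketch of one such unit; it uses a logistic function as a stand-in for the smooth sigmoid $\sigma$, and the names (`heaviside`, `smooth_sigmoid`, `localized_unit`) and the gain $K = 20$ are illustrative choices rather than values from the paper.

```python
import numpy as np

def heaviside(t):
    # sigma_0: Heaviside-type first-stage activation (1 for t >= 0, else 0)
    return (t >= 0).astype(float)

def smooth_sigmoid(t):
    # sigma: smooth sigmoidal second-stage activation (logistic as a stand-in)
    return 1.0 / (1.0 + np.exp(-t))

def localized_unit(x, xi_j, n, K):
    """Indicator-like response N*_{n,j,K} for the cube of side 1/n centered at xi_j."""
    x = np.atleast_2d(x)                                   # (m, d) batch of inputs
    d = x.shape[1]
    inner = (heaviside(1/(2*n) + x - xi_j).sum(axis=1)     # left-face conditions
             + heaviside(1/(2*n) - x + xi_j).sum(axis=1)   # right-face conditions
             - 2*d + 0.5)                                  # positive only if all 2d hold
    return smooth_sigmoid(2*K*inner)

# The gate is ~1 inside the cube around xi_j and ~0 outside (n = 4 partition of [0,1]^2)
xi_j = np.array([0.125, 0.125])
print(localized_unit(np.array([[0.10, 0.15], [0.80, 0.10]]), xi_j, n=4, K=20))
# -> approximately [1., 0.]
```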

The ability to generate localized and sparse approximations directly enhances the function repertoire compared to shallow networks, whose lack of composition restricts their capacity to encode these local “bump” or “mask” functions efficiently.

2. Assumptions and Theoretical Framework

The function approximation analysis is established over a function class $\mathcal{F}$ characterized by both smoothness and sparsity. The main regularity condition is $(r, c_0)$-Lipschitz continuity: for all $x, x' \in \mathbb{I}^d$,

$$|f_\rho(x) - f_\rho(x')| \leq c_0 \|x - x'\|^r$$

Additionally, a structural prior assumes $s$-sparsity in a partition of $\mathbb{I}^d$ into $N^d$ cubes: a function is $s$-sparse if its support is contained entirely within $s$ of these cubes.
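As a concrete reading of this assumption, the sketch below estimates which cubes of an $N^d$ partition carry the support of a function by sampling inside each cube; the helper `sparse_support_cubes` and the example bump are illustrative assumptions, not machinery from the paper.

```python
import numpy as np

def sparse_support_cubes(f, d, N, samples_per_cube=100, tol=1e-10, seed=0):
    """Return indices of partition cubes on which f is (numerically) nonzero.

    A target is s-sparse in the sense of the text if this list has length <= s.
    """
    rng = np.random.default_rng(seed)
    active = []
    for idx in np.ndindex(*([N] * d)):
        lower = np.array(idx) / N                          # lower corner of cube idx
        x = lower + rng.random((samples_per_cube, d)) / N  # samples inside the cube
        if np.max(np.abs(f(x))) > tol:
            active.append(idx)
    return active

# Example: a bump supported in one cube of the 4x4 partition of [0,1]^2 is 1-sparse
bump = lambda x: np.maximum(0.0, 0.1 - np.max(np.abs(x - 0.125), axis=1))
print(sparse_support_cubes(bump, d=2, N=4))   # -> [(0, 0)]
```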

The hypothesis space for learning is a family $\Phi_{n,2d}$ of two-hidden-layer deep networks with bounded parameters (bounds $\mathcal{B}_n, \mathcal{C}_n, \Xi_n$). The second-layer activation $\sigma$ must satisfy a Lipschitz condition $|\sigma(t) - \sigma(t')| \leq C_\sigma |t - t'|$. The capacity of $\Phi_{n,2d}$ is measured using covering numbers, specifically in $C(\mathbb{I}^d)$:

$$\log \mathcal{N}(\epsilon, \Phi_{n,2d}) = \mathcal{O}\bigl(n^d \log(n/\epsilon)\bigr)$$

This bound matches that for shallow networks of width $n$, so that increased depth does not inflate learning-theoretic capacity.
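For intuition about how this bound scales, the following sketch evaluates $n^d \log(n/\epsilon)$ for a few widths; constants and the precise dependence on the parameter bounds are omitted, so this is only an order-of-magnitude illustration, and the same growth applies to the shallow comparison class.

```python
import numpy as np

def log_covering_bound(n, d, eps):
    # O(n^d log(n/eps)) up to constants; identical order for shallow width-n nets
    return n**d * np.log(n / eps)

for n in (8, 16, 32):
    print(n, f"{log_covering_bound(n, d=4, eps=1e-2):.3e}")
# Doubling n multiplies the bound by roughly 2^d (here 16x), plus a slowly growing log factor.
```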

3. Capacity Control and Learning Rates

Despite the increased expressivity, the network capacity (as quantified by the covering number) remains on par with shallow architectures. This key observation enables a favorable trade-off between approximation and estimation error in empirical risk minimization (ERM) training. The resulting learning rate for $(r, c_0)$-Lipschitz (smooth) targets is

$$\mathcal{O}\bigl(m^{-2r/(2r + d)} \cdot \operatorname{polylog}(m)\bigr)$$

matching minimax rates up to logarithmic factors. If the true function is $s$-sparse in a partition into $N^d$ cubes, the rate improves to

$$\mathcal{O}\Bigl(m^{-2r/(2r + d)}\Bigl(\frac{s}{N^d}\Bigr)^{d/(2r+d)} \cdot \operatorname{polylog}(m)\Bigr)$$

This demonstrates the advantage of architectures that can exploit sparsity, as deep nets can and shallow nets cannot.
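A worked comparison of the two rates, with the polylog factors dropped and all numbers chosen purely for illustration (they are not values from the paper), shows how much the sparsity factor can help:

```python
# Illustrative arithmetic only: r, d, m, s, N below are assumed example values.
r, d, m = 1.0, 4, 1e5          # smoothness, dimension, sample size
s, N = 16, 8                   # 16 active cubes out of N^d = 4096

dense_rate  = m ** (-2*r / (2*r + d))                      # O(m^{-2r/(2r+d)})
sparse_rate = dense_rate * (s / N**d) ** (d / (2*r + d))   # extra (s/N^d)^{d/(2r+d)} factor

print(f"dense  rate ~ {dense_rate:.2e}")   # ~ 2.2e-02
print(f"sparse rate ~ {sparse_rate:.2e}")  # ~ 5.3e-04, i.e. about 40x smaller here
```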

The excess risk of the ERM estimator, $\mathcal{E}(\pi_M f_{D,n}) - \mathcal{E}(f_\rho)$, is decomposed into approximation error (how well $\Phi_{n,2d}$ can match $f_\rho$) and sample error (capacity-limited generalization), both of which are controlled thanks to the bounded covering number.
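A standard way to write such a splitting, with $f_n$ a best approximation to $f_\rho$ in $\Phi_{n,2d}$, $\mathcal{E}_D$ the empirical risk, and $|y| \leq M$ assumed, is the bound below; the paper's exact bookkeeping may differ in details:

$$\mathcal{E}(\pi_M f_{D,n}) - \mathcal{E}(f_\rho) \;\le\; \underbrace{\bigl[\mathcal{E}(\pi_M f_{D,n}) - \mathcal{E}_D(\pi_M f_{D,n})\bigr] + \bigl[\mathcal{E}_D(f_n) - \mathcal{E}(f_n)\bigr]}_{\text{sample error, bounded via the covering number}} \;+\; \underbrace{\mathcal{E}(f_n) - \mathcal{E}(f_\rho)}_{\text{approximation error}}$$

which holds because $\mathcal{E}_D(\pi_M f_{D,n}) \leq \mathcal{E}_D(f_{D,n}) \leq \mathcal{E}_D(f_n)$: ERM minimizes the empirical risk over $\Phi_{n,2d}$, and truncation cannot increase the empirical squared loss.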

4. Practical Implementation: ERM and Architectural Considerations

The paper’s analysis is directly instantiated in ERM learning. For $m$ i.i.d. samples $D_m = \{(x_i, y_i)\}_{i=1}^m$, the estimator is

$$f_{D,n} = \arg\min_{f \in \Phi_{n,2d}} \frac{1}{m} \sum_{i=1}^m \bigl[f(x_i) - y_i\bigr]^2$$

Outputs are projected onto $[-M, M]$ if the true target is known to be bounded:

$$\pi_M(t) = \begin{cases} M & t > M \\ t & |t| \leq M \\ -M & t < -M \end{cases}$$

Generalization is controlled via covering-number-based concentration inequalities (Bernstein's inequality), allowing derivation of the learning rates described above.
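The sketch below instantiates this recipe in a deliberately simplified, self-contained form: the localized gates are fixed hard cube indicators (a stand-in for the $N^*_{n,j,K}$ units) and only the outer-layer coefficients are fit by least squares, so it is a toy instance of the framework rather than the paper's full estimator; the function names and the toy target are assumptions.

```python
import numpy as np
from itertools import product

def pi_M(t, M):
    # Truncation operator pi_M: clip predictions onto [-M, M]
    return np.clip(t, -M, M)

def cube_gate(center, n):
    # Hard indicator of the cube of side 1/n around `center` (stand-in for N*_{n,j,K})
    return lambda X: np.all(np.abs(X - center) <= 1/(2*n), axis=1).astype(float)

def erm_fit(X, y, gates, M):
    """Least-squares ERM over the span of fixed localized gates, then truncation."""
    Phi = np.column_stack([g(X) for g in gates])           # (m, #gates) design matrix
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)         # empirical risk minimizer
    return lambda Xq: pi_M(np.column_stack([g(Xq) for g in gates]) @ coef, M)

# Toy regression on [0,1]^2 with a 4x4 partition
n, d, M = 4, 2, 1.0
centers = [np.array(c) for c in product([(i + 0.5)/n for i in range(n)], repeat=d)]
gates = [cube_gate(c, n) for c in centers]
rng = np.random.default_rng(0)
X = rng.random((500, d))
y = np.sin(2*np.pi*X[:, 0]) * (X[:, 1] < 0.5)              # target vanishes on half the cubes
f_hat = erm_fit(X, y, gates, M)
print(f"training MSE: {np.mean((f_hat(X) - y)**2):.3f}")
```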

The architectural template—two hidden layers, localized first-stage (step-like) activation, smooth sigmoidal second-stage—enables both the construction of spatial indicator functions and their aggregation for sparse, complex targets.
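A compact sketch of that template, assembling a few gates into one sparse approximator (again with a logistic stand-in for $\sigma$ and illustrative names and constants), might look like this:

```python
import numpy as np

def two_stage_localized_net(X, centers, coeffs, n, K=20.0):
    """Step-like first layer, sigmoidal second layer, linear aggregation of gates."""
    X = np.atleast_2d(X)
    d = X.shape[1]
    out = np.zeros(len(X))
    for center, a in zip(centers, coeffs):
        # first hidden layer: Heaviside responses to the 2d half-spaces bounding the cube
        h = ((1/(2*n) + X - center >= 0).sum(axis=1)
             + (1/(2*n) - X + center >= 0).sum(axis=1)).astype(float)
        # second hidden layer: smooth gate firing only when all 2d conditions hold
        gate = 1.0 / (1.0 + np.exp(-2*K*(h - 2*d + 0.5)))
        out += a * gate                                    # sparse aggregation over s cubes
    return out

# Two active cubes of a 4x4 partition of [0,1]^2, with coefficients 1.0 and -0.5
centers = [np.array([0.125, 0.125]), np.array([0.625, 0.875])]
print(two_stage_localized_net(np.array([[0.10, 0.15], [0.60, 0.90], [0.50, 0.50]]),
                              centers, coeffs=[1.0, -0.5], n=4))
# -> approximately [1.0, -0.5, 0.0]
```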

5. Implications for Expressivity in High-Dimensional Function Spaces

The central message is that deep nets realize a qualitative expressivity advantage—only deep architectures (with at least two hidden layers) can produce localized “mask” functions or sparse aggregations, fundamental for approximating functions with localized or sparse behavior. Shallow networks with matching width and parameter budget fundamentally lack this capability.

Moreover, this expressivity does not come at the cost of increased statistical complexity; capacity, as measured by the covering number, scales essentially the same as for shallow networks. This is nontrivial: one might otherwise expect the added depth to inflate capacity and degrade generalization.

Consequently, deep networks present a compelling architecture in learning settings with high-dimensional, sparse, or spatially fragmented targets, as they can match expressivity to function structure without sacrificing learning-theoretic guarantees.

6. Learning Theory Perspective: Balancing Expressivity and Capacity

From an architectural learning theory standpoint, the work demonstrates that by carefully combining depth for increased expressivity with parameter constraints for bounded capacity, one can achieve near-optimal statistical learning rates. This alignment is achieved through a network that encodes spatial structure in the first hidden layer and smooth, bounded nonlinearities in the second, allowing for both localized approximation and controlled variance.

In summary, the paper provides a theoretical foundation showing that deep networks—by virtue of depth-enabled localized and sparse approximations—attain both greater function expressivity and optimal (or even improved when sparse-structured) learning rates, all while maintaining a statistical capacity comparable to shallow architectures. This justifies depth as an architectural principle for complex function approximation in high dimensions when both expressivity and generalization are required (Lin, 2018).
