
Approximation and interpolation of deep neural networks (2304.10552v2)

Published 20 Apr 2023 in cs.LG, math.OC, math.PR, and stat.ML

Abstract: In this paper, we prove that in the overparametrized regime, deep neural networks provide universal approximations and can interpolate any data set, as long as the activation function is locally in $L^1(\mathbb{R})$ and not an affine function. Additionally, if the activation function is smooth and such an interpolation network exists, then the set of parameters which interpolate forms a manifold. Furthermore, we give a characterization of the Hessian of the loss function evaluated at the interpolation points. In the last section, we provide a practical probabilistic method of finding such a point under general conditions on the activation function.

Summary

  • The paper proves that deep neural networks in the overparameterized regime achieve universal approximation and interpolation with non-affine activations.
  • It characterizes the parameter solution space as an (n-d)-dimensional submanifold and links the findings to the double descent phenomenon.
  • A novel probabilistic method reduces the required hidden neurons from O(d log^2 d) to O(d log d), improving interpolation efficiency in practice.

Approximation and Interpolation of Deep Neural Networks

The paper by Vlad Raul Constantinescu and Ionel Popescu presents significant theoretical advancements in the understanding of interpolation and approximation capabilities of deep neural networks in the overparameterized regime. The authors focus on establishing conditions under which neural networks can universally approximate functions and interpolate datasets, emphasizing the role of the activation function.

The paper rigorously proves that deep neural networks, when overparameterized, are capable of universal approximation and interpolation of any dataset, assuming the activation function is locally integrable and non-affine. This fills gaps in the literature by extending previous results that considered continuous non-polynomial activations, demonstrating that under these conditions, a dataset of d distinct points can be interpolated by a neural network whose hidden layers each have width at least d.
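As a minimal numerical illustration of the width condition (assuming PyTorch; the architecture and training hyperparameters here are illustrative and not taken from the paper), one can train a two-hidden-layer network whose hidden widths equal the number of data points d and observe the training loss approach zero:

```python
# Sketch: width-d hidden layers, non-affine activation, d data points.
import torch

torch.manual_seed(0)
d, in_dim = 8, 3                      # d distinct data points in R^3
X = torch.randn(d, in_dim)
y = torch.randn(d, 1)

model = torch.nn.Sequential(
    torch.nn.Linear(in_dim, d), torch.nn.Tanh(),   # hidden width d
    torch.nn.Linear(d, d), torch.nn.Tanh(),        # hidden width d
    torch.nn.Linear(d, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(5000):
    opt.zero_grad()
    loss = torch.mean((model(X) - y) ** 2)
    loss.backward()
    opt.step()

print(f"final training MSE: {loss.item():.2e}")    # expected to be near zero
```

Gradient descent is not guaranteed to reach an exact interpolant, but in this overparameterized setting the loss typically drops to numerical zero, consistent with the existence result.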

Key results also include the characterization of the parameter space capable of interpolation, revealing that the solution set forms an (n-d)-dimensional submanifold, where n is the number of parameters in the network. This manifold characterization connects to the double descent phenomenon, offering insights into the geometry of the loss landscape at the interpolation threshold.
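The dimension count can be read off from a standard regular-value argument; the sketch below introduces the map F and parameter vector θ for illustration (this notation is not fixed in the summary above):

```latex
% Let F : \mathbb{R}^n \to \mathbb{R}^d send the parameters \theta to the
% network outputs at the d data points,
%   F(\theta) = \bigl(f(x_1;\theta), \dots, f(x_d;\theta)\bigr).
% The interpolating parameters form the preimage
\[
  \mathcal{M} \;=\; F^{-1}(y_1,\dots,y_d).
\]
% If F is smooth and its differential DF(\theta) has full rank d at every
% \theta \in \mathcal{M}, the preimage (regular value) theorem gives that
% \mathcal{M} is a smooth submanifold of \mathbb{R}^n of dimension n - d.
```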

To find interpolation points in practice, the authors introduce a methodology involving the random initialization of input-to-hidden weights and optimization over the output layer. This approach refines previous findings by reducing the needed overparameterization from O(d log^2 d) hidden neurons to O(d log d).
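A minimal sketch of this kind of construction (assuming NumPy; widths, distributions, and the tanh activation are illustrative choices, not the paper's exact setup): draw the input-to-hidden weights at random, freeze them, and solve a linear problem for the output layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, in_dim, width = 50, 10, 60              # width >= d; the paper's bound is O(d log d)
X = rng.standard_normal((d, in_dim))       # d distinct data points
y = rng.standard_normal(d)

W = rng.standard_normal((in_dim, width))   # random, frozen input-to-hidden weights
b = rng.standard_normal(width)
H = np.tanh(X @ W + b)                     # hidden-layer features, shape (d, width)

# With probability one H has full row rank d, so an exact interpolant exists;
# the minimum-norm output weights come from the pseudoinverse.
a = np.linalg.pinv(H) @ y
print("max interpolation error:", np.max(np.abs(H @ a - y)))   # ~ numerical zero
```

The point of the random first layer is precisely to make the feature matrix full rank with high probability, so the remaining problem is convex (in fact linear) in the output weights.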

Expanding on network density results, the work generalizes the uniform convergence of deep networks over compact sets, showing that deep networks are dense in the space of continuous functions if the activation function is non-affine, irrespective of depth. This aligns with established results while extending them to a broader class of function spaces and neural network architectures.

The paper notably explores the Hessian eigenspectrum at the global minima, establishing that at these interpolation points, the Hessian matrix has d positive eigenvalues and n-d zero eigenvalues, providing valuable theoretical insights into the curvature of loss landscapes within overparameterized regimes.
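One way to see where the zero eigenvalues come from is the Gauss-Newton decomposition of the Hessian at an interpolation point; the notation below (L, f, J) is introduced here for illustration.

```latex
% Squared loss over the d data points and its Hessian:
\[
  L(\theta) = \tfrac{1}{2}\sum_{i=1}^{d}\bigl(f(x_i;\theta)-y_i\bigr)^2,
  \qquad
  \nabla^2 L(\theta)
  = \underbrace{J(\theta)^{\top}J(\theta)}_{\text{rank}\,\le\, d}
  \;+\; \sum_{i=1}^{d}\bigl(f(x_i;\theta)-y_i\bigr)\,\nabla^2 f(x_i;\theta),
\]
% where J(\theta) \in \mathbb{R}^{d \times n} is the Jacobian of the outputs
% with respect to the parameters.  At an interpolation point the residuals
% vanish, so the Hessian reduces to J^\top J: at most d positive eigenvalues
% and at least n - d zero eigenvalues, matching the spectrum described above.
```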

The implications of these findings are profound for both theoretical and applied fields. The universality of deep neural networks underlined by this work lays a solid foundation for designing robust predictive models across diverse domains. Moreover, the probabilistic method introduced here for achieving full-rank interpolation matrices could inform new strategies for more efficient neural network training, particularly in non-convex settings.

Future research may explore these theoretical advancements by addressing practical considerations in network training, such as computational efficiency and robustness to noise. Extending these results to various network architectures, including recurrent and convolutional networks, could further elaborate on the robustness and versatility of overparameterized neural networks in real-world applications.