
Best k-layer neural network approximations (1907.01507v2)

Published 2 Jul 2019 in cs.LG and stat.ML

Abstract: We show that the empirical risk minimization (ERM) problem for neural networks has no solution in general. Given a training set $s_1, \dots, s_n \in \mathbb{R}^p$ with corresponding responses $t_1, \dots, t_n \in \mathbb{R}^q$, fitting a $k$-layer neural network $\nu_\theta : \mathbb{R}^p \to \mathbb{R}^q$ involves estimation of the weights $\theta \in \mathbb{R}^m$ via an ERM: \[ \inf_{\theta \in \mathbb{R}^m} \; \sum_{i=1}^n \lVert t_i - \nu_\theta(s_i) \rVert_2^2. \] We show that even for $k = 2$, this infimum is not attainable in general for common activations like ReLU, hyperbolic tangent, and sigmoid functions. A high-level explanation is like that for the nonexistence of best rank-$r$ approximations of higher-order tensors --- the set of parameters is not a closed set --- but the geometry involved for best $k$-layer neural network approximations is more subtle. In addition, we show that for the smooth activations $\sigma(x) = 1/\bigl(1 + \exp(-x)\bigr)$ and $\sigma(x) = \tanh(x)$, such failure to attain an infimum can happen on a positive-measure subset of responses. For the ReLU activation $\sigma(x) = \max(0, x)$, we completely classify the cases where the ERM for a best two-layer neural network approximation attains its infimum. As an aside, we obtain a precise description of the geometry of the space of two-layer neural networks with $d$ neurons in the hidden layer: it is the join locus of a line and the $d$-secant locus of a cone.
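
The ERM objective in the abstract is straightforward to instantiate for the two-layer ReLU case. Below is a minimal NumPy sketch, assuming the common parametrization $\nu_\theta(s) = W_2\,\mathrm{ReLU}(W_1 s + b_1) + b_2$ with $d$ hidden neurons; the names `W1`, `b1`, `W2`, `b2` and the random data are illustrative choices, not the paper's notation.

```python
import numpy as np

# Illustrative sketch (not from the paper): the ERM objective
# sum_i ||t_i - nu_theta(s_i)||_2^2 for a two-layer ReLU network.

def relu(x):
    return np.maximum(0.0, x)

def two_layer_net(s, W1, b1, W2, b2):
    """Evaluate nu_theta : R^p -> R^q at a single input s."""
    return W2 @ relu(W1 @ s + b1) + b2

def erm_objective(S, T, W1, b1, W2, b2):
    """Sum of squared errors over the training pairs (s_i, t_i)."""
    return sum(np.sum((t - two_layer_net(s, W1, b1, W2, b2)) ** 2)
               for s, t in zip(S, T))

# Usage with random data: p = 3 inputs, q = 2 outputs, d = 4 hidden neurons.
rng = np.random.default_rng(0)
p, q, d, n = 3, 2, 4, 10
S = [rng.standard_normal(p) for _ in range(n)]
T = [rng.standard_normal(q) for _ in range(n)]
W1, b1 = rng.standard_normal((d, p)), rng.standard_normal(d)
W2, b2 = rng.standard_normal((q, d)), rng.standard_normal(q)
print(erm_objective(S, T, W1, b1, W2, b2))
```

The paper's point is about minimizing this quantity over all weights: the infimum over $\theta$ can fail to be attained by any finite parameter value, even though sequences of weights drive the objective arbitrarily close to it.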

Citations (4)
