- The paper demonstrates that neural networks always learn non-trivial relationships (achieving MSE < Var(Y)) whenever any exist, unlike identifiable models, which on some distributions learn nothing beyond the mean.
- It establishes that neural networks’ non-identifiability, via flexible architectures and universal approximation, enables better adaptation to complex data patterns.
- The work provides practical guidelines for model selection and network design, highlighting trade-offs between interpretability in identifiable models and the robust learning of neural networks.
The paper "Non-identifiability distinguishes Neural Networks among Parametric Models" (2504.18017) explores a fundamental theoretical difference between feedforward neural networks and traditional smooth parametric models like linear or logistic regression, specifically concerning their ability to learn relationships in data at the population level. The core finding is that the inherent non-identifiability of neural networks distinguishes them from identifiable parametric models, allowing them to capture any existing non-trivial relationship between variables, a property not universally shared by identifiable models.
The paper establishes two main theoretical results that have significant implications for understanding why neural networks are effective regressors:
- Neural Networks Always Weakly Learn (Theorem 1): For any pair of random variables (X,Y), where X is the input and Y is the outcome, if there is any learnable relationship (i.e., $\mathbb{E}[\mathrm{Var}(Y \mid X)] < \mathrm{Var}(Y)$), then the best possible feedforward neural network with a suitable architecture achieves a mean squared error (MSE) strictly less than $\mathrm{Var}(Y)$, the MSE of the constant predictor $\mathbb{E}[Y]$. This means neural networks are guaranteed to learn some non-trivial aspect of the relationship if one exists (a numerical sketch of the learnability criterion follows this list).
- Identifiable Models Can Fail to Learn (Theorem 2): For most reasonable smooth parametric models satisfying conditions of local and strong identifiability (meaning parameters are uniquely determined, at least locally, by the function they represent), there exists a data distribution (X,Y) with a non-trivial relationship ($\mathbb{E}[\mathrm{Var}(Y \mid X)] < \mathrm{Var}(Y)$) for which the best-fitting model in that class is the constant prediction $\mathbb{E}[Y]$. This highlights a failure mode for identifiable models: on certain data they can learn absolutely nothing useful.
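The learnability criterion can be checked numerically via the law of total variance, $\mathrm{Var}(Y) = \mathbb{E}[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(\mathbb{E}[Y \mid X])$: a relationship is learnable exactly when $\mathbb{E}[Y \mid X]$ is non-constant. Below is a minimal sketch; the toy distribution ($X$ uniform, $\mathbb{E}[Y \mid X] = \cos(3X)$) and the binning estimator are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(-1.0, 1.0, size=n)
y = np.cos(3.0 * x) + rng.normal(scale=0.5, size=n)  # E[Y|X=x] = cos(3x)

var_y = y.var()
# Estimate E[Var(Y|X)] by binning X and averaging within-bin variances.
bins = np.digitize(x, np.linspace(-1, 1, 51))
labels = np.unique(bins)
within = np.array([y[bins == b].var() for b in labels])
counts = np.array([(bins == b).sum() for b in labels])
e_var_y_given_x = np.average(within, weights=counts)

print(f"Var(Y)      ~= {var_y:.3f}")
print(f"E[Var(Y|X)] ~= {e_var_y_given_x:.3f}  (< Var(Y): learnable)")
```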
The key differentiator identified is identifiability. In a parametric model $f_\theta(x)$, the parameters $\theta$ are identifiable if different parameter values always correspond to different functions, i.e., $f_{\theta_1}(x) = f_{\theta_2}(x)$ for all $x$ implies $\theta_1 = \theta_2$. Neural networks, especially those with hidden layers and standard activation functions, are notoriously non-identifiable. For example, swapping the weights and biases of two neurons in the same hidden layer, or scaling weights and biases across layers appropriately with certain activation functions, results in the same function but different parameter values.
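A minimal sketch of these symmetries for a one-hidden-layer tanh network (my own illustration, not code from the paper): permuting hidden neurons, or flipping the sign of a neuron's incoming and outgoing weights together with its bias, changes the parameters but not the function.

```python
import numpy as np

def one_hidden_layer(x, W1, b1, w2, b2):
    """f(x) = w2 . tanh(W1 * x + b1) + b2 for scalar input x."""
    return w2 @ np.tanh(W1 * x + b1) + b2

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=3), rng.normal(size=3)
w2, b2 = rng.normal(size=3), rng.normal()

xs = np.linspace(-3, 3, 7)
f = np.array([one_hidden_layer(x, W1, b1, w2, b2) for x in xs])

# Symmetry 1: permute hidden neurons 0 and 1.
p = [1, 0, 2]
f_perm = np.array([one_hidden_layer(x, W1[p], b1[p], w2[p], b2) for x in xs])

# Symmetry 2: tanh is odd, so negating one neuron's incoming weight,
# bias, and outgoing weight leaves the computed function unchanged.
s = np.array([-1.0, 1.0, 1.0])
f_sign = np.array([one_hidden_layer(x, W1 * s, b1 * s, w2 * s, b2) for x in xs])

print(np.allclose(f, f_perm), np.allclose(f, f_sign))  # True True
```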
Consider the practical example discussed in the paper:
- Logistic Regression: $f^{\sf log}_\theta(x) = \frac{e^{\alpha + \beta x}}{1+e^{\alpha + \beta x}}$ with $\theta=(\alpha,\beta)$.
- One-Layer Neural Network: $f^{\sf NN}_\theta(x) = \gamma + \delta \, \frac{e^{\alpha + \beta x}}{1+e^{\alpha+\beta x}}$ with $\theta=(\alpha,\beta,\gamma,\delta)$.
The neural network model includes the logistic function structure but adds two extra parameters, $\gamma$ and $\delta$. This seemingly small addition fundamentally changes its identifiability. As shown in the paper, parameter values like $(\alpha,\beta,\gamma,\delta)=(0,0,0,1)$ and $(\alpha',\beta',1/2,0)$, for any $\alpha', \beta'$, produce the same constant function $f^{\sf NN}_\theta(x)=1/2$. This violates strong identifiability. The local identifiability condition (an invertible Fisher information matrix) also fails for the NN model at $\beta=0$, because the gradient components with respect to $\gamma$ and $\delta$ become constant in $x$, leading to a rank-deficient gradient outer product matrix.
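Both failures are easy to verify numerically. A minimal sketch (the specific values $\alpha'=2$, $\beta'=-1$, and $\alpha=0.3$ are arbitrary illustrative choices):

```python
import numpy as np

def f_nn(x, alpha, beta, gamma, delta):
    """One-layer NN model: gamma + delta * sigmoid(alpha + beta * x)."""
    return gamma + delta / (1.0 + np.exp(-(alpha + beta * x)))

xs = np.linspace(-5, 5, 11)
f1 = f_nn(xs, 0.0, 0.0, 0.0, 1.0)   # sigmoid(0) = 1/2 everywhere
f2 = f_nn(xs, 2.0, -1.0, 0.5, 0.0)  # delta = 0 kills the x-dependence
print(np.allclose(f1, 0.5), np.allclose(f2, 0.5))  # True True: same function

def grad_f(x, alpha, beta, gamma, delta):
    """Per-sample gradient of f_nn w.r.t. (alpha, beta, gamma, delta)."""
    s = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    return np.array([delta * s * (1 - s), delta * s * (1 - s) * x, 1.0, s])

# At beta = 0 the alpha-, gamma-, and delta-components are all constant in x,
# so the gradient outer-product (Fisher-type) matrix is rank-deficient.
G = np.stack([grad_f(x, 0.3, 0.0, 0.0, 1.0) for x in xs])
print(np.linalg.matrix_rank(G.T @ G))  # 2 < 4: local identifiability fails
```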
In contrast, standard logistic regression is identifiable under mild conditions on the data X. Because logistic regression is identifiable and satisfies the conditions of Theorem 2, the paper shows there exists a distribution (X,Y) where logistic regression optimally learns the constant function $\mathbb{E}[Y]$, even though a non-trivial relationship exists. Specifically, if $\mathbb{E}[Y \mid X]$ has the form $\frac{1}{2} + \epsilon g_0(X)$ for a sufficiently small $\epsilon > 0$ and a suitable function $g_0$ (such as a bounded even function orthogonal to linear terms in X), the best logistic fit remains the constant $1/2 = \mathbb{E}[Y]$, thus learning nothing beyond the mean.
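A simulation of this failure mode (a sketch under my own choices: $g_0(x) = \tfrac{1}{2}\cos(\pi x)$, which is bounded, even, and orthogonal to linear terms for $X$ uniform on $[-1,1]$; the paper's exact construction may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 100_000, 0.2
x = rng.uniform(-1, 1, size=n)
p = 0.5 + eps * 0.5 * np.cos(np.pi * x)  # E[Y|X] = 1/2 + eps * g0(X)
y = rng.binomial(1, p).astype(float)

# Fit plain logistic regression sigmoid(a + b*x) by gradient descent on log-loss.
a, b = 0.5, -1.0  # deliberately non-zero start
for _ in range(2000):
    q = 1.0 / (1.0 + np.exp(-(a + b * x)))
    a -= 0.5 * np.mean(q - y)
    b -= 0.5 * np.mean((q - y) * x)

q = 1.0 / (1.0 + np.exp(-(a + b * x)))
print(f"fitted (a, b) ~ ({a:.3f}, {b:.3f})")  # both ~ 0: constant fit 1/2
print(f"Var(Y)        ~ {y.var():.4f}")
print(f"logistic MSE  ~ {np.mean((y - q) ** 2):.4f}")  # ~ Var(Y): no gain
print(f"E[Var(Y|X)]   ~ {np.mean(p * (1 - p)):.4f}")   # strictly below Var(Y)
```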
The mechanism behind Theorem 1 (neural networks always learn) relies on their universal approximation capabilities (Lemma 2). The paper leverages the fact that neural networks can approximate affine transformations of indicator functions of half-spaces (functions of the form $c_2 \mathbf{1}\{\alpha^\top X \le c_1\} + c_0$) arbitrarily well in $L^2$. Lemma 1 shows that if any non-trivial relationship $\mathbb{E}[\mathrm{Var}(Y \mid X)] < \mathrm{Var}(Y)$ exists, then $\mathbb{E}[Y \mid X]$ must be correlated with such a half-space indicator for some $\alpha, c_1$. Since neural networks can approximate the best linear predictor based on this indicator (Equation 6), they can achieve MSE strictly less than $\mathrm{Var}(Y)$.
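A one-dimensional sketch of this approximation for a tanh unit (my own illustration; the constants $c_0, c_1, c_2$ and the scale schedule are arbitrary): a single steep tanh neuron converges in $L^2$ to the shifted half-space indicator $c_0 + c_2 \mathbf{1}\{x \le c_1\}$.

```python
import numpy as np

def indicator_target(x, c0=0.3, c1=0.0, c2=1.5):
    """Affine-shifted half-space indicator c0 + c2 * 1{x <= c1}."""
    return c0 + c2 * (x <= c1)

def tanh_unit(x, k, c0=0.3, c1=0.0, c2=1.5):
    """0.5 * (1 + tanh(k * (c1 - x))) -> 1{x <= c1} pointwise as k grows."""
    return c0 + c2 * 0.5 * (1.0 + np.tanh(k * (c1 - x)))

x = np.random.default_rng(3).uniform(-2, 2, size=100_000)
for k in (1, 10, 100):
    l2 = np.sqrt(np.mean((tanh_unit(x, k) - indicator_target(x)) ** 2))
    print(f"k = {k:>3}: L2 error ~ {l2:.4f}")  # shrinks toward 0
```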
Practical Implementation Implications:
- Model Selection: This research provides a theoretical justification for choosing neural networks when you suspect a potentially complex or subtle relationship in the data, especially in settings where simpler, identifiable models like linear regression or GLMs might fail to capture anything due to structural constraints. If your problem domain requires capturing any signal present in the data, NNs offer a theoretical guarantee of doing so at the population level, whereas identifiable models do not.
- Architecture Choice: The universal approximation capabilities rely on sufficient network capacity (width and depth) and appropriate activation functions. Lemma 2 provides specific architectural requirements:
- Tanh-form activations (sigmoid, tanh, etc.): Require at least one hidden layer ($D \ge 1$) and work regardless of the last hidden layer width ($w_D$).
- ReLU activations: Also require $D \ge 1$ hidden layers but additionally need last hidden layer width $w_D \ge 2$. This is a practical consideration for network design when using ReLU, although typical modern architectures have much larger widths anyway (a sketch of this width requirement appears after this list).
- Data Considerations: Theorem 1's proof for tanh-form activations requires X to have a density, while the ReLU case does not. For Theorem 2 to apply to identifiable models, the support of X must be large enough (of size greater than $d+1$) and satisfy conditions for invertible Fisher information, which can fail for degenerate data distributions.
- Trade-offs: While NNs offer a theoretical guarantee of learning something, this doesn't come for free. Identifiable models still offer advantages:
- Interpretability: Parameters in identifiable models often have clear statistical interpretations (e.g., regression coefficients). NN parameters are notoriously hard to interpret due to non-identifiability.
- Computational Cost: Training and deploying complex neural networks is generally far more computationally expensive than identifiable GLMs.
- Data Efficiency: In scenarios where the true relationship is well-described by a simple identifiable model, these models might learn more efficiently from limited data compared to potentially over-parameterized NNs.
- Optimization Landscape: The paper touches upon the Fisher information matrix spectrum in neural networks, linking the theoretical failure of local identifiability to practical optimization challenges (e.g., flat minima). While the paper focuses on the population optimum, the empirical observation that NN training succeeds despite non-identifiability suggests that optimization algorithms like gradient descent can find parameters that achieve the low population MSE, even if these parameters are not unique.
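Returning to the ReLU width requirement flagged in the architecture notes above, here is a minimal sketch (my construction, not the paper's proof) of why two ReLU units in the last hidden layer suffice: their difference forms a bounded ramp that approximates a half-space indicator, whereas a single ReLU unit is unbounded and cannot.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def two_relu_indicator(x, c=0.0, width=0.1):
    """(relu(x - c) - relu(x - c - width)) / width: equals 0 for x <= c,
    1 for x >= c + width, and is linear in between -- a bounded ramp."""
    return (relu(x - c) - relu(x - c - width)) / width

x = np.random.default_rng(4).uniform(-2, 2, size=100_000)
target = (x >= 0.0).astype(float)
for width in (1.0, 0.1, 0.01):
    err = np.sqrt(np.mean((two_relu_indicator(x, width=width) - target) ** 2))
    print(f"width = {width:>4}: L2 error ~ {err:.4f}")  # -> 0 as width -> 0
```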
In summary, this research reinforces the view that neural networks' power stems partly from their flexible, non-identifiable structure. This allows them to avoid the "learn nothing" pitfall that can affect constrained, identifiable models for certain data distributions, making them robust universal learners at the population level. Practically, this supports the use of NNs when the true relationship is unknown or complex, while also reminding practitioners of the benefits (interpretability, efficiency) offered by identifiable models when their assumptions align with the problem.