Shallow ReLU Neural Denoisers
- Shallow ReLU neural network denoisers are single-hidden-layer models using ReLU activation to recover clean signals from noisy data with proven statistical error bounds.
- They employ encoder-generator architectures and optimization techniques like LASSO to achieve near-minimax error rates in both high- and low-dimensional regimes.
- Specialized strategies such as greedy node selection and structure-guided Gauss-Newton methods enhance training efficiency and robustness against outliers.
Shallow ReLU neural network denoisers are single-hidden-layer neural architectures employing the rectified linear unit (ReLU) activation, widely studied for their ability to recover clean signals from noisy data. These models have been extensively analyzed in terms of statistical recovery guarantees, expressive power, optimization landscape, and their practical limitations and strengths for denoising in both high- and low-dimensional regimes. The theory spans from representation learning and nonlinear dictionary recovery to robust compressed sensing, minimax estimation properties, and questions of finite-sample identifiability.
1. Generative Models, Denoising Architectures, and Error Bounds
A canonical shallow ReLU denoising model takes the form $y = \mathrm{ReLU}(Wx + b)$, where $y \in \mathbb{R}^d$ is the observed (possibly noisy) signal, $W \in \mathbb{R}^{d \times k}$ is an unknown (or learned) weight matrix, $x \in \mathbb{R}^k$ is a latent code, and $b$ is a bias (often random across observations) (1803.04304). For practical denoising:
- Encoder-generator architectures employ an encoder $E: \mathbb{R}^d \to \mathbb{R}^k$ and a generator $G: \mathbb{R}^k \to \mathbb{R}^d$, both using ReLU activations (1805.08855); a minimal sketch follows this list.
- Both direct feedforward (autoencoder) denoising and optimization-based approaches (projecting noisy data onto the range of a shallow generator) are provably effective.
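Both denoising modes fit in a few lines. The PyTorch sketch below is illustrative only: the class name `ShallowReLUDenoiser`, the hidden width, the dimensions, and the untrained random weights are assumptions for exposition, not the construction analyzed in (1805.08855).

```python
import torch
import torch.nn as nn

class ShallowReLUDenoiser(nn.Module):
    """Encoder E and generator G, each a single-hidden-layer ReLU network.

    Feedforward denoising is the composition x_hat = G(E(y)): E maps the noisy
    signal y in R^d to a k-dimensional code, and G maps the code back to R^d.
    """
    def __init__(self, d: int, k: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, k))
        self.generator = nn.Sequential(nn.Linear(k, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.generator(self.encoder(y))

# Feedforward (autoencoder-style) denoising of y = x + noise.
d, k, sigma = 128, 8, 0.1
model = ShallowReLUDenoiser(d, k)
x = torch.randn(16, d)                 # placeholder "clean" signals
y = x + sigma * torch.randn_like(x)    # additive Gaussian noise
x_hat = model(y)

# Optimization-based alternative: project y onto the range of the generator
# by minimizing ||G(z) - y||^2 over the latent code z.
z = torch.zeros(16, k, requires_grad=True)
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ((model.generator(z) - y) ** 2).mean()
    loss.backward()
    opt.step()
x_proj = model.generator(z).detach()
```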
Key theoretical result: for an encoder-generator pair with code dimension $k$ and output dimension $d$, the mean-squared error (MSE) of denoising under additive noise scales, up to logarithmic factors, as
$$\mathrm{MSE} \;\lesssim\; \sigma^2\,\frac{k}{d},$$
where $\sigma^2$ is the noise variance (1805.08855).
For robust denoising in the presence of outliers, shallow ReLU networks analyzed with a generalized LASSO decoder admit recovery error bounds that scale with the sparsity $s$ of the outliers (1803.04304).
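Because the model $y = \mathrm{ReLU}(Wx + b)$ is exactly linear on the coordinates where $y_i > 0$, a simple LASSO-style decoder keeps those active rows and solves an $\ell_1$-regularized least-squares problem on them. The NumPy/ISTA sketch below follows that idea under idealized assumptions (known $W$ and $b$, no outliers, arbitrary regularization weight); it is a simplification, not the exact estimator of (1803.04304).

```python
import numpy as np

def relu_lasso_decode(y, W, b, lam=0.1, n_iters=500):
    """Sketch of a LASSO-style decoder for y = ReLU(W @ x + b).

    On coordinates with y_i > 0 the model is linear: y_i = w_i^T x + b_i.
    We run ISTA (proximal gradient for l1-regularized least squares) on
    those rows only; inactive rows and outliers are ignored in this sketch.
    """
    active = y > 0
    A, t = W[active], y[active] - b[active]       # linear system on active rows
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant of gradient
    x = np.zeros(W.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - t)
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

# Illustrative usage with a sparse latent code.
rng = np.random.default_rng(0)
d, k = 200, 50
W = rng.normal(size=(d, k)) / np.sqrt(d)
x_true = np.zeros(k)
x_true[rng.choice(k, 5, replace=False)] = rng.normal(size=5)
b = rng.normal(size=d)
y = np.maximum(W @ x_true + b, 0.0)
x_hat = relu_lasso_decode(y, W, b, lam=0.01)
```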
2. Role of the ReLU Activation and Piecewise-Linear Representation
ReLU activation ($\mathrm{ReLU}(z) = \max(z, 0)$) introduces sparsity and nonlinearity by setting all negative pre-activations to zero, so the observed data are partially masked and highly nonlinear in the latent code (1803.04304). Its piecewise-linear properties:
- Enable effective convex relaxations and tractable optimization for recovery (e.g., maximum-likelihood programs, LASSO-style estimators).
- Allow shallow ReLU networks to behave as unions of low-dimensional subspace projections in each activation region (1805.08855).
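To make the locally linear picture concrete, the NumPy sketch below (random untrained weights, arbitrary sizes) extracts the affine map $W_2\,\mathrm{diag}(m)\,W_1$ that a shallow ReLU network applies inside the activation region of a given input:

```python
import numpy as np

rng = np.random.default_rng(1)
d, width = 16, 64
W1, b1 = rng.normal(size=(width, d)), rng.normal(size=width)
W2, b2 = rng.normal(size=(d, width)), rng.normal(size=d)

def shallow_relu(y):
    """Shallow ReLU network f(y) = W2 @ ReLU(W1 @ y + b1) + b2."""
    return W2 @ np.maximum(W1 @ y + b1, 0.0) + b2

y = rng.normal(size=d)
mask = (W1 @ y + b1 > 0).astype(float)   # activation pattern at y
J = W2 @ (mask[:, None] * W1)            # local linear operator W2 diag(mask) W1

# Inside the activation region of y, f is exactly affine: f(y + v) = f(y) + J @ v
# for any perturbation v that does not change the activation pattern.
v = 1e-4 * rng.normal(size=d)
if np.array_equal(mask, (W1 @ (y + v) + b1 > 0).astype(float)):
    assert np.allclose(shallow_relu(y + v), shallow_relu(y) + J @ v)
```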
Integral representations further reveal that, in the infinite-width limit, ReLU denoisers correspond to integrating ReLU ridge functions against a measure on parameter space, with the minimal-norm weight measure determined by the second derivatives (curvature) of the target function (1910.02743).
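In one dimension this link to curvature is just Taylor's theorem with integral remainder: every twice-differentiable $f$ on $[0, R]$ admits an exact "infinite-width" ReLU representation whose weight density is $f''$; the multivariate statements replace $\mathrm{ReLU}(x - b)$ by ridge functions $\mathrm{ReLU}(w^\top x - b)$.

```latex
f(x) \;=\; f(0) + f'(0)\,x \;+\; \int_0^{R} \operatorname{ReLU}(x - b)\, f''(b)\, \mathrm{d}b,
\qquad x \in [0, R].
```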
3. Optimization Landscape, Existence of Minima, and Training Stability
Analysis of the optimization landscape shows that shallow ReLU networks enjoy well-behaved, attainable minima, in contrast with smooth activations (sigmoid, tanh), for which global minimizers often do not exist and training dynamics can diverge (2303.03950, 2302.14690). Specifically:
- For strictly convex losses and continuous data distributions, the global minimum exists and is realizable by a representable ReLU network (2303.03950).
- Generalized responses (limit functions obtained along sequences of network parameters) never outperform actual representable ReLU networks, so nothing is gained by letting parameters diverge and training remains stable.
This theoretical guarantee directly supports the practical reliability of ReLU-based shallow denoisers under gradient-based training.
4. Denoising Performance: Adaptive Regimes and Minimax Optimality
Empirical and theoretical studies establish that the denoising capacity of shallow ReLU networks adapts to both data and initialization:
- Kernel regime: With large first-layer initialization or strong regularization, the network's function-space evolution approximates a cubic spline interpolator, yielding smooth, global denoising (1906.07842).
- Adaptive regime: With small initialization, shallow networks may overfit, reproducing a linear spline with knots at each data point (potentially capturing noise) (1906.07842).
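The two regimes can be probed with a short experiment. The PyTorch sketch below scales only the first-layer initialization by a factor `alpha`; the width, learning rate, iteration count, and the two `alpha` values are illustrative choices, not the precise scalings analyzed in (1906.07842).

```python
import torch
import torch.nn as nn

def train_shallow(alpha, width=128, steps=10000, lr=5e-4):
    """Fit noisy 1-D data with a shallow ReLU net whose first layer is scaled by alpha.

    Large alpha ~ kernel regime (first-layer features barely move; smooth,
    spline-like fit); small alpha ~ adaptive regime (features move and knots
    migrate toward the data, possibly fitting the noise).
    """
    torch.manual_seed(0)
    x = torch.linspace(-1, 1, 40).unsqueeze(1)
    y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)      # noisy 1-D targets

    net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
    with torch.no_grad():
        net[0].weight.mul_(alpha)                         # scale first-layer init
        net[0].bias.mul_(alpha)

    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return net

net_kernel = train_shallow(alpha=3.0)    # large init: smooth, global denoiser
net_adaptive = train_shallow(alpha=0.1)  # small init: adaptive, may interpolate noise
```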
In multivariate settings, shallow ReLU network estimators regularized with weight decay or path-norm penalties achieve near-minimax rates over the second-order Radon-domain bounded variation space, a class that includes both smooth and non-smooth functions and, crucially, overcomes the curse of dimensionality: the squared-error rate is, up to logarithmic factors,
$$n^{-\frac{d+3}{2d+3}},$$
where $n$ is the sample size and $d$ the input dimension; the exponent stays above $1/2$ for every $d$, so the rate does not collapse as the dimension grows (2109.08844).
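For a shallow network $f(x) = \sum_j v_j\,\mathrm{ReLU}(w_j^\top x + b_j)$, the regularizer behind such guarantees has the explicit form $\sum_j |v_j|\,\|w_j\|_2$ (conventions differ on biases and on whether it is imposed directly or induced by weight decay). A minimal PyTorch sketch with illustrative layer names and penalty weight:

```python
import torch
import torch.nn as nn

def shallow_path_norm(fc1: nn.Linear, fc2: nn.Linear) -> torch.Tensor:
    """Path-norm-type regularizer for f(x) = fc2(ReLU(fc1(x))).

    Computes sum_j |v_j| * ||w_j||_2 over hidden units j, where w_j is the
    j-th row of the first-layer weights and v_j the corresponding output
    weight.  Biases are ignored; conventions in the literature differ.
    """
    w = fc1.weight                      # shape (width, d)
    v = fc2.weight                      # shape (1, width)
    return (v.abs() * w.norm(dim=1)).sum()

# Illustrative use as an additive penalty during training.
d, width = 10, 200
fc1, fc2 = nn.Linear(d, width), nn.Linear(width, 1)
x = torch.randn(32, d)
y = torch.randn(32, 1)
pred = fc2(torch.relu(fc1(x)))
lam = 1e-3
loss = ((pred - y) ** 2).mean() + lam * shallow_path_norm(fc1, fc2)
loss.backward()
```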
For classes of Lipschitz or continuous piecewise linear functions, shallow ReLU networks achieve optimal rates under early-stopped gradient descent, requiring only minimal regularization (2212.13848, 2307.12461). Early stopping in gradient descent is crucial for preventing overfitting to noise, enabling practical, universal denoisers for non-smooth signals.
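Early stopping itself needs nothing beyond a held-out split and a patience rule. The routine below is a generic sketch; the optimizer, patience value, and validation split are arbitrary choices rather than the stopping times characterized in (2212.13848, 2307.12461).

```python
import copy
import torch

def fit_with_early_stopping(net, x_tr, y_tr, x_val, y_val,
                            lr=1e-3, max_steps=10000, patience=200):
    """Full-batch gradient descent stopped when validation MSE stops improving."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    best_val, best_state, since_best = float("inf"), None, 0
    for _ in range(max_steps):
        opt.zero_grad()
        loss = ((net(x_tr) - y_tr) ** 2).mean()
        loss.backward()
        opt.step()
        with torch.no_grad():
            val = ((net(x_val) - y_val) ** 2).mean().item()
        if val < best_val:
            best_val, best_state, since_best = val, copy.deepcopy(net.state_dict()), 0
        else:
            since_best += 1
            if since_best >= patience:   # stop before the net starts fitting noise
                break
    net.load_state_dict(best_state)
    return net
```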
5. Expressivity, Limitations, and Depth Separation
While shallow ReLU networks are universal approximators in theory, their expressive power is limited in high-dimensional regimes:
- For certain multivariate polynomial targets, shallow ReLU networks require a number of neurons exponential in the input dimension $d$ to achieve small approximation error; this is a manifestation of the curse of dimensionality (2303.03544).
- For suitably normalized targets, or for approximation over the unit cube, polynomial-size shallow networks suffice.
This has direct relevance for denoising complex signals: when target mappings involve high-degree interactions or exhibit sharp features not aligned with piecewise linear regions, depth is necessary for parametric efficiency and representational fidelity.
Depth separation is also encountered in the context of minima stability: only sufficiently smooth functions (in the Sobolev sense) can be reliably reproduced by stable, shallow ReLU denoisers under practical learning rates, while less smooth (e.g., pyramid-type) functions require greater depth (2306.17499).
6. Finite Sample Identifiability and Implications for Denoising Reliability
Recent advances establish a sharp distinction between ReLU and analytic activations in terms of identifiability via finite sampling:
- ReLU networks: No finite, universal sample set suffices to uniquely identify a given irreducible shallow ReLU network among others of the same width, due to the flexibility of hyperplane arrangements and the piecewise-linear construction (2503.12744).
- Analytic activations: For activations such as sigmoid or tanh, it is possible to construct universal, finite test sets that uniquely determine an irreducible network from its output values (2503.12744).
For practical denoising, this demarcates a limitation for ReLU-based networks when identifiability or verifiable uniqueness of the learned mapping is required. For tasks prioritizing interpretability or reconstructive fidelity from finite measurements, analytic activations may be preferable. For ReLU networks, network-specific distinguishing test sets can still be constructed to verify uniqueness post-training.
7. Denoising Methodologies: Greedy, Structure-Guided, and Subspace-Informed Approaches
Shallow ReLU neural network denoisers benefit from specialized algorithmic approaches:
- Greedy node selection with ridgelet-transform-based dictionary reduction yields sparse, robust networks with controlled complexity, improving both training and inference efficiency (1905.10409).
- Structure-guided Gauss-Newton methods exploit the block structure of shallow ReLU networks, alternating efficient solvers for the linear and nonlinear parameters, leading to rapid convergence and precise break placement in signals with discontinuities, a significant advantage for denoising tasks with sharp edges (2404.05064); a simplified alternating sketch follows this list.
- Reduced-basis and Grassmann-layer networks explicitly project noisy data onto subspaces aligned with dominant data directions, filtering out noise before nonlinear regression, and are particularly effective in data-scarce and high-dimensional denoising settings (2012.09940).
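The block structure exploited by the second approach is easy to see in code: for fixed hidden-layer parameters, the output weights of a shallow ReLU model enter linearly and can be solved for exactly by least squares, after which the nonlinear (inner) parameters are updated. The NumPy sketch below alternates a linear solve with a plain gradient step; it is a simplification of the structure-guided Gauss-Newton method of (2404.05064), not a reimplementation.

```python
import numpy as np

def alternating_fit(x, y, width=20, outer_iters=200, inner_lr=1e-2, seed=0):
    """Alternating solver for f(x) = sum_j v_j ReLU(w_j x + b_j) + c on 1-D data.

    Exploits the block structure: for fixed (w, b) the outputs are linear in
    (v, c), so those are obtained by an exact least-squares solve; (w, b) then
    take a gradient step on the squared loss.
    """
    rng = np.random.default_rng(seed)
    w, b = rng.normal(size=width), rng.normal(size=width)
    for _ in range(outer_iters):
        H = np.maximum(np.outer(x, w) + b, 0.0)          # (n, width) hidden features
        A = np.hstack([H, np.ones((x.size, 1))])         # append constant column for c
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # exact solve for linear params
        v, c = coef[:-1], coef[-1]
        # Gradient step on the nonlinear parameters (w, b).
        r = A @ coef - y                                 # residuals
        act = (np.outer(x, w) + b > 0).astype(float)     # ReLU derivative mask
        grad_w = (act * r[:, None]).T @ x * v * (2.0 / x.size)
        grad_b = (act * r[:, None]).sum(axis=0) * v * (2.0 / x.size)
        w -= inner_lr * grad_w
        b -= inner_lr * grad_b
    return w, b, v, c

# Illustrative usage on a noisy signal with a sharp break at the origin.
x = np.linspace(-1, 1, 200)
y = np.abs(x) + 0.05 * np.random.default_rng(1).normal(size=x.size)
w, b, v, c = alternating_fit(x, y)
y_hat = np.maximum(np.outer(x, w) + b, 0.0) @ v + c
```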
These methods leverage the intrinsic structure of shallow ReLU architectures, enabling improved denoising performance, lower risk of overfitting, and resource-efficient deployment in practice.
In summary, shallow ReLU neural network denoisers constitute a theoretically robust, practically effective class of models for signal and image denoising, supported by guarantees in regularization, error bounds, and adaptivity. While they are nearly minimax-optimal for a broad class of smooth and piecewise-linear targets, their expressivity is fundamentally limited by depth and identifiability constraints in high dimensions and under finite sampling. Nevertheless, with proper initialization, algorithmic regularization (including early stopping and structured optimization), and architectural selection, they remain central to modern denoising methodology and theory.