
Non-Penalized Deep Neural Network Estimator

Updated 30 December 2025
  • NPDNN is an empirical risk minimizer over a structurally-constrained DNN class that attains minimax-optimal rates for regression and classification.
  • It controls model complexity using explicit constraints on depth, width, sparsity, and weight norms without additional regularization penalties.
  • Theoretical analysis demonstrates its robustness across i.i.d. and dependent data, varied loss functions, and smoothness regimes.

A non-penalized deep neural network estimator (NPDNN) is an empirical risk minimization approach over a constrained class of DNNs, with explicit constraints on network architectural parameters but without additional regularization penalties. NPDNN targets minimax-optimal estimation rates in nonparametric regression, classification, and related models, with theoretical guarantees established under general dependence structures, loss functions, and smoothness regimes. NPDNNs have been rigorously analyzed for i.i.d. and dependent data, regression and classification settings, and Hellinger and $L^2$ losses, and are motivated by achieving sharp upper bounds for risk without redundant logarithmic factors (Liu et al., 2019, Kengne et al., 29 Dec 2025, Yara et al., 2024, Kurisu et al., 2022).

1. Statistical Framework and Formal Definition

Let $D_n = \{(X_i, Y_i)\}_{i=1}^n$ be a sample, either i.i.d. or from an ergodic mixing process, with $X_i \in \mathbb{R}^d$ and $Y_i$ scalar or vector-valued. The objective is to estimate a function $f_0$ (e.g., a regression mean or class-probability function) minimizing a population risk

$$R(h) = \mathbb{E}[\ell(h(X_0), Y_0)]$$

over a class of measurable functions $h:\mathcal{X}\rightarrow\mathbb{R}$ and a loss function $\ell:\mathbb{R} \times \mathcal{Y} \to [0, \infty)$. The excess risk is $\mathcal{E}(h) = R(h) - R(h^*)$ for a true minimizer $h^*$.

The NPDNN is defined as the empirical risk minimizer over a structurally-constrained DNN class:

$$\widehat h_{n,NP} = \arg\min_{h \in \mathcal{H}_\sigma(L_n, N_n, B_n, F_n, S_n)} \frac{1}{n} \sum_{i=1}^n \ell(h(X_i), Y_i)$$

where the hypothesis class is specified by:

  • Depth $L_n$
  • Maximum width $N_n$ per layer
  • Uniform sup-norm bound $B_n$ on the weights
  • Output bound $F_n$
  • Sparsity constraint $S_n$ (number of nonzero weights and biases)

No explicit penalty (e.g., $\ell^1$ or $\ell^0$) is added; overfitting is controlled solely by proper scaling of $(L_n, N_n, B_n, S_n)$ with the sample size and problem complexity (Kengne et al., 29 Dec 2025).
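
The following is a minimal sketch, not the authors' implementation, of empirical risk minimization over such a constrained ReLU class: depth $L_n$, width $N_n$, a sup-norm bound $B_n$ enforced by clipping, an output bound $F_n$ enforced by clamping, and sparsity $S_n$ enforced by hard-thresholding after each update. The PyTorch training loop, constants, and function names are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the papers' code): empirical risk minimization
# over a constrained ReLU network class H_sigma(L_n, N_n, B_n, F_n, S_n).
import torch
import torch.nn as nn

def make_relu_net(d, L_n, N_n):
    """Fully connected ReLU network with depth L_n and width N_n."""
    layers, width_in = [], d
    for _ in range(L_n):
        layers += [nn.Linear(width_in, N_n), nn.ReLU()]
        width_in = N_n
    layers.append(nn.Linear(width_in, 1))
    return nn.Sequential(*layers)

def project_constraints(net, B_n, S_n):
    """Enforce |w| <= B_n and at most S_n nonzero weights/biases
    by clipping and hard-thresholding (no penalty term is used)."""
    with torch.no_grad():
        flat = torch.cat([p.view(-1) for p in net.parameters()])
        cutoff = flat.abs().topk(min(S_n, flat.numel())).values.min()
        for p in net.parameters():
            p.clamp_(-B_n, B_n)                      # sup-norm bound B_n
            p.mul_((p.abs() >= cutoff).float())      # keep the S_n largest entries

def fit_npdnn(X, Y, L_n=3, N_n=64, B_n=10.0, F_n=5.0, S_n=2000,
              epochs=200, lr=1e-3):
    net = make_relu_net(X.shape[1], L_n, N_n)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                           # squared loss (regression case)
    for _ in range(epochs):
        opt.zero_grad()
        pred = net(X).clamp(-F_n, F_n).squeeze(-1)   # output bound F_n
        loss_fn(pred, Y).backward()
        opt.step()
        project_constraints(net, B_n, S_n)
    return net

# Toy usage on synthetic data
X = torch.rand(500, 3)
Y = torch.sin(2 * torch.pi * X[:, 0]) + 0.1 * torch.randn(500)
net = fit_npdnn(X, Y)
```

Note that the theory concerns the exact empirical risk minimizer over the constrained class; a projected gradient loop like this one is only a practical surrogate for that minimizer.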

2. Risk Bounds and Minimax-Optimal Rates

For regression, logistic regression, and general losses satisfying suitable local curvature and Lipschitz conditions, NPDNN attains (up to logarithmic factors) minimax rates of excess risk over Hölder and compositional smoothness classes (Liu et al., 2019, Kengne et al., 29 Dec 2025, Yara et al., 2024).

Pure Hölder Case

If $f_0 \in \mathcal{C}^s(\mathcal{X}, K)$ (Hölder smoothness $s$) and the loss is quadratic or satisfies a $\kappa$-local curvature condition, the rate is:

$$\mathbb{E}[\mathcal{E}(\widehat h_{n,NP})] \lesssim \frac{(\log n)^\nu}{n^{\frac{\kappa s}{\kappa s + d}}}$$

for $\kappa = 2$ (e.g., squared or logistic loss) and some $\nu > 3$ (Kengne et al., 29 Dec 2025).
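
As a quick worked example of the exponent $\kappa s/(\kappa s + d)$, the snippet below evaluates it for a few hypothetical $(s, d)$ pairs, chosen only to illustrate the effect of dimension.

```python
# Worked example: the Hölder-case rate exponent kappa*s / (kappa*s + d).
# The (s, d) values are illustrative, not taken from the cited papers.
def holder_exponent(s, d, kappa=2):
    return kappa * s / (kappa * s + d)

for s, d in [(1, 1), (2, 3), (2, 10)]:
    print(f"s={s}, d={d}: excess risk ~ n^(-{holder_exponent(s, d):.3f}) up to logs")
```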

Compositional Smoothness

If $f_0$ belongs to a multi-layer compositional class of depth $q$, with layerwise smoothness $\boldsymbol{\beta}$ and sparsity $\mathbf{t}$, the critical rate is

$$\phi_n = \max_{0 \leq i \leq q} n^{-2 \beta^*_i/(2\beta^*_i + t_i)}$$

where $\beta^*_i = \beta_i \prod_{j=i+1}^q (\beta_j \wedge 1)$ (Kengne et al., 29 Dec 2025, Yara et al., 2024). The risk converges as

$$\mathbb{E}[\mathcal{E}(\widehat h_{n,NP})] \lesssim (\phi_n^{\kappa/2} \vee \phi_n) (\log n)^\nu.$$
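
The critical rate can be evaluated directly from the layerwise parameters; in the sketch below the smoothness vector, active dimensions, and sample size are hypothetical values used only for illustration.

```python
# Sketch: phi_n = max_i n^{-2 beta*_i / (2 beta*_i + t_i)}, with
# beta*_i = beta_i * prod_{j > i} min(beta_j, 1). Inputs are illustrative.
def effective_smoothness(beta, i):
    prod = 1.0
    for b in beta[i + 1:]:
        prod *= min(b, 1.0)
    return beta[i] * prod

def phi_n(n, beta, t):
    rates = []
    for i in range(len(beta)):
        b_star = effective_smoothness(beta, i)
        rates.append(n ** (-2 * b_star / (2 * b_star + t[i])))
    return max(rates)

print(phi_n(10_000, beta=[2.0, 0.5, 3.0], t=[3, 2, 1]))  # hypothetical q = 2 composition
```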

Precise $L^2$ Minimax in Nonparametric Regression

With an explicit B-spline construction, for $f_0$ in a Hölder (Nikolskii) ball and an appropriate network architecture, the minimax-optimal rate

$$\sup_{f_0 \in \Lambda^\beta(F)} \mathbb{E}\|\widehat f_n - f_0\|_{L^2}^2 = O(n^{-2\beta/(2\beta + d)})$$

is achieved with no extraneous logarithmic term (Liu et al., 2019).

3. Key Methodologies and Neural Network Architecture

NPDNN imposes architectural constraints:

  • Depth $L_n$ and width $N_n$ scale logarithmically/polynomially with the sample size, depending on smoothness and dimension.
  • The sparsity parameter $S_n$ and sup-norm bound $B_n$ prevent uncontrolled model complexity.
  • Approximation properties of DNNs (e.g., ReLU networks for Hölder or compositional classes) guarantee that, for sufficiently large $(L_n, N_n, S_n)$, the approximation error matches the minimax risk rate.
  • For regression, an explicit construction that embeds a B-spline basis into DNNs yields the estimator

$$\widehat f_n(x) = \sum_{j \in \Gamma} \widehat b_j \widetilde D_{j,k}(x)$$

where each $\widetilde D_{j,k}$ is a DNN approximation of the B-spline basis function $D_{j,k}$ (Liu et al., 2019); a minimal univariate sketch (without the DNN embedding step) appears after this list.

  • In multi-class logistic regression, the empirical cross-entropy loss is minimized over sparse ReLU networks with softmax output. Theoretical covering number bounds for these networks underpin the rate analysis (Yara et al., 2024).
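
As a rough companion to the B-spline construction above, the univariate sketch below fits coefficients $\widehat b_j$ by least squares on an ordinary B-spline basis; the step of replacing each basis function $D_{j,k}$ by a ReLU-network approximation $\widetilde D_{j,k}$ is omitted. It assumes SciPy ≥ 1.8 for `BSpline.design_matrix`, and the knot count and degree are arbitrary illustrative choices.

```python
# Univariate sketch: least-squares fit on a B-spline basis (the DNN
# approximation of each basis function used in Liu et al., 2019 is omitted).
import numpy as np
from scipy.interpolate import BSpline

def fit_spline_regression(x, y, degree=3, n_knots=12):
    # Clamped knot vector on [0, 1]
    interior = np.linspace(0, 1, n_knots)[1:-1]
    t = np.r_[[0.0] * (degree + 1), interior, [1.0] * (degree + 1)]
    # Design matrix Phi with Phi[i, j] = D_{j,k}(x_i)
    Phi = BSpline.design_matrix(x, t, degree).toarray()
    b_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # coefficients b_j
    return lambda x_new: BSpline(t, b_hat, degree)(x_new)

# Toy usage
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 400))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=400)
f_hat = fit_spline_regression(x, y)
```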

4. Statistical Properties: Adaptivity, Generalization, and Dependence Structure

NPDNNs require explicit selection of architecture parameters based on problem smoothness and sample size. The estimator is not adaptive: achieving minimaxity depends on tuning $S_n$, $L_n$, $N_n$ at the scale implied by the target smoothness and composition structure (Kengne et al., 29 Dec 2025, Kurisu et al., 2022). In contrast, sparse-penalized DNNs (SPDNN) incorporating $\ell^0$-type penalties can adapt automatically.

The guarantees of NPDNN extend across a spectrum of dependence models, owing to the proof’s reliance on generalized Bernstein-type inequalities:

  • i.i.d. and $\phi$-mixing data: rates as above.
  • Exponentially and subexponentially $\alpha$-mixing, geometric and polynomial $\mathcal{C}$-mixing data: the effective sample size is reduced (e.g., to $n/(\log n)^2$) and the rates adjust accordingly (Kengne et al., 29 Dec 2025, Kurisu et al., 2022).

5. Extension to Sup-Norm and Adversarial Risk

Standard NPDNN with least-squares loss and plain empirical training does not guarantee minimax-optimal sup-norm error, owing to possible local overfitting or oscillations. However, an adversarially trained variant with a correction step achieves the minimax sup-norm rate, provided label preprocessing is used:

$$\mathbb{E}[\|\hat f - f^*\|_\infty^2] \leq C h^{-d} \biggl\{ \frac{(LW)^2\log(LW)\log n}{n} + (LW)^{-4\beta/d} + h^{-d}\zeta_n^2 \biggr\}$$

where $\zeta_n^2$ reflects the bias from label preprocessing, $h$ is the adversarial perturbation radius, and $L$, $W$ are the network size parameters (depth and width) (Imaizumi, 2023).

Optimization proceeds as an alternating minimax game with adversarial data augmentation and SGD-like updates, and a suitable choice of $(L, W)$ attains the optimal sup-norm rate $n^{-2\beta/(2\beta+d)}$ up to log factors (Imaizumi, 2023).
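
The sketch below shows only the generic alternating min-max structure: a few gradient-ascent steps on an $\ell^\infty$ input perturbation of radius $h$, followed by an SGD-like update of the network. The label-preprocessing and correction steps of (Imaizumi, 2023) are not reproduced, and all hyperparameters are illustrative assumptions.

```python
# Generic adversarial-training sketch for regression (not the paper's algorithm):
# inner maximization over an L-infinity perturbation of radius h, outer
# minimization of the resulting empirical risk.
import torch
import torch.nn as nn

def adversarial_risk(net, X, Y, h, inner_steps=3):
    """Approximate max over |delta| <= h of the empirical squared loss."""
    loss_fn = nn.MSELoss()
    delta = torch.zeros_like(X, requires_grad=True)
    step = h / inner_steps
    for _ in range(inner_steps):                         # inner maximization
        loss = loss_fn(net(X + delta).squeeze(-1), Y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + step * grad.sign()).clamp(-h, h).detach().requires_grad_(True)
    return loss_fn(net(X + delta.detach()).squeeze(-1), Y)

def train_adversarial(net, X, Y, h=0.05, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):                              # outer minimization
        opt.zero_grad()
        adversarial_risk(net, X, Y, h).backward()
        opt.step()
    return net
```

Any ReLU architecture can be plugged in for `net`, for instance one built like the earlier `make_relu_net` sketch.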

6. Hypothesis Testing and Asymptotic Inference

For nonparametric $L^2$ regression, NPDNNs allow construction of minimax-optimal hypothesis tests for $H_0: f_0 = 0$. The test statistic

$$T_n = \frac{1}{n}\sum_{i=1}^n \widehat f_{\text{net}}(X_i)^2$$

standardized appropriately, converges in distribution to a standard normal under the null. The test achieves the minimax separation rate at the Ingster boundary $n^{-2\beta/(4\beta + d)}$ (Liu et al., 2019).
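
For concreteness, the unstandardized statistic is simply an empirical mean of squared fitted values; the null centering and variance normalization used for the asymptotic calibration in (Liu et al., 2019) are not reproduced in this sketch, and `f_hat` is a hypothetical fitted-network callable.

```python
# Sketch: unstandardized T_n = (1/n) * sum_i f_hat(X_i)^2 for testing H0: f0 = 0.
import numpy as np

def test_statistic(f_hat, X):
    """f_hat: callable mapping a feature vector to a scalar; X: array of shape (n, d)."""
    preds = np.asarray([float(f_hat(x)) for x in X])
    return np.mean(preds ** 2)
```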

Furthermore, under mild additional regularity, the network estimator is pointwise asymptotically normal:

$$\frac{\widehat f_{\rm net}(x) - f_0(x)}{\sqrt{D_k(x)^T(\Phi^T\Phi)^{-1}D_k(x)}} \stackrel{D}{\rightarrow} N(0,1)$$

with the approximation error negligible relative to the stochastic term (Liu et al., 2019).

7. Practical Considerations and Empirical Guidelines

  • The depth of an NPDNN grows only logarithmically with the sample size, while width and sparsity scale as mild polynomials in $n$ and the dimension (Kengne et al., 29 Dec 2025); a hedged rule-of-thumb sketch follows this list.
  • Explicit sup-norm or weight-size constraints $B_n$ prevent parameter blowup.
  • In the absence of structural adaptivity (which penalized methods provide), cross-validation or a theoretical pilot estimate of the smoothness is needed to select the architecture.
  • Even for nonstationary or highly dependent time series, as long as the mixing or dependence coefficients are quantifiable, rescaling the effective sample size in the rate formulas preserves the theoretical guarantees (Kurisu et al., 2022).
  • In multivariate and logistic regression, achieving minimax Hellinger risk rates requires sparse, deep ReLU architectures with softmax output, and the approach is robust against the divergence of the Kullback–Leibler excess risk in excessively flexible models (Yara et al., 2024).
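
The sketch below turns these guidelines into a rule of thumb for choosing $(L_n, N_n, S_n)$ from $(n, s, d)$, following scalings commonly used in the sparse-DNN approximation literature (depth of order $\log n$, width and sparsity of order $n^{d/(2s+d)}$ up to logs); the constants and exact exponents are assumptions for illustration, not values taken from the cited papers.

```python
# Hedged rule of thumb for (L_n, N_n, S_n) given (n, s, d): depth ~ log n,
# width and sparsity ~ n^{d/(2s+d)} up to logs. Constants and exact exponents
# are assumptions for illustration, not values taken from the cited papers.
import math

def suggest_architecture(n, s, d, c_depth=2.0, c_width=1.0, c_sparse=1.0):
    rate = d / (2 * s + d)                                     # nonparametric exponent
    L_n = max(2, int(c_depth * math.log(n)))                   # depth: logarithmic in n
    N_n = max(8, int(c_width * n ** rate))                     # width: mild polynomial in n
    S_n = max(N_n, int(c_sparse * (n ** rate) * math.log(n)))  # sparsity budget
    return L_n, N_n, S_n

print(suggest_architecture(n=10_000, s=2, d=5))
```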
