
Non-Penalized Deep Neural Network Estimator

Updated 30 December 2025
  • NPDNN is an empirical risk minimizer over a structurally-constrained DNN class that attains minimax-optimal rates for regression and classification.
  • It controls model complexity using explicit constraints on depth, width, sparsity, and weight norms without additional regularization penalties.
  • Theoretical analysis demonstrates its robustness across i.i.d. and dependent data, varied loss functions, and smoothness regimes.

A non-penalized deep neural network estimator (NPDNN) is an empirical risk minimization approach over a constrained class of DNNs, with explicit constraints on network architectural parameters but without additional regularization penalties. NPDNN targets minimax-optimal estimation rates in nonparametric regression, classification, and related models, with theoretical guarantees established under general dependence structures, loss functions, and smoothness regimes. NPDNNs have been rigorously analyzed for i.i.d. and dependent data, regression and classification settings, and Hellinger and $L^2$ losses, and are motivated by achieving sharp upper bounds for risk without redundant logarithmic factors (Liu et al., 2019, Kengne et al., 29 Dec 2025, Yara et al., 2024, Kurisu et al., 2022).

1. Statistical Framework and Formal Definition

Let $D_n = \{(X_i, Y_i)\}_{i=1}^n$ be a sample, either i.i.d. or from an ergodic mixing process, with $X_i \in \mathbb{R}^d$ and $Y_i$ scalar or vector-valued. The objective is to estimate a function $f_0$ (e.g., a regression mean or class-probability function) minimizing a population risk

$$R(h) = \mathbb{E}[\ell(h(X_0), Y_0)]$$

over a class of measurable functions $h:\mathcal{X}\rightarrow\mathbb{R}$ and a loss function $\ell:\mathbb{R} \times \mathcal{Y} \to [0, \infty)$. The excess risk is $\mathcal{E}(h) = R(h) - R(h^*)$ for a true minimizer $h^*$.

The NPDNN is defined as the empirical risk minimizer over a structurally-constrained DNN class:

$$\widehat h_{n,NP} = \arg\min_{h \in \mathcal{H}_\sigma(L_n, N_n, B_n, F_n, S_n)} \frac{1}{n} \sum_{i=1}^n \ell(h(X_i), Y_i)$$

where the hypothesis class is specified by:

  • Depth $L_n$
  • Maximum width $N_n$ per layer
  • Uniform sup-norm bound $B_n$ on the weights
  • Output bound $F_n$
  • Sparsity constraint $S_n$ (number of nonzero weights and biases)

No explicit penalty (e.g., $\ell^1$ or $\ell^0$) is added; overfitting is controlled solely by proper scaling of $(L_n, N_n, B_n, S_n)$ with the sample size and problem complexity (Kengne et al., 29 Dec 2025).
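
The following is a minimal sketch, not the authors' implementation, of empirical risk minimization over such a constrained ReLU class: depth $L_n$, width $N_n$, a sup-norm bound $B_n$ enforced by clipping, an output bound $F_n$ enforced by clamping, and sparsity $S_n$ enforced by hard-thresholding after each update. The PyTorch training loop, constants, and function names are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the papers' code): empirical risk minimization
# over a constrained ReLU network class H_sigma(L_n, N_n, B_n, F_n, S_n).
import torch
import torch.nn as nn

def make_relu_net(d, L_n, N_n):
    """Fully connected ReLU network with depth L_n and width N_n."""
    layers, width_in = [], d
    for _ in range(L_n):
        layers += [nn.Linear(width_in, N_n), nn.ReLU()]
        width_in = N_n
    layers.append(nn.Linear(width_in, 1))
    return nn.Sequential(*layers)

def project_constraints(net, B_n, S_n):
    """Enforce |w| <= B_n and at most S_n nonzero weights/biases
    by clipping and hard-thresholding (no penalty term is used)."""
    with torch.no_grad():
        flat = torch.cat([p.view(-1) for p in net.parameters()])
        cutoff = flat.abs().topk(min(S_n, flat.numel())).values.min()
        for p in net.parameters():
            p.clamp_(-B_n, B_n)                      # sup-norm bound B_n
            p.mul_((p.abs() >= cutoff).float())      # keep the S_n largest entries

def fit_npdnn(X, Y, L_n=3, N_n=64, B_n=10.0, F_n=5.0, S_n=2000,
              epochs=200, lr=1e-3):
    net = make_relu_net(X.shape[1], L_n, N_n)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                           # squared loss (regression case)
    for _ in range(epochs):
        opt.zero_grad()
        pred = net(X).clamp(-F_n, F_n).squeeze(-1)   # output bound F_n
        loss_fn(pred, Y).backward()
        opt.step()
        project_constraints(net, B_n, S_n)
    return net

# Toy usage on synthetic data
X = torch.rand(500, 3)
Y = torch.sin(2 * torch.pi * X[:, 0]) + 0.1 * torch.randn(500)
net = fit_npdnn(X, Y)
```

Note that the theory concerns the exact empirical risk minimizer over the constrained class; a projected gradient loop like this one is only a practical surrogate for that minimizer.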

2. Risk Bounds and Minimax-Optimal Rates

For regression, logistic regression, and general losses satisfying suitable local curvature and Lipschitz conditions, NPDNN attains (up to logarithmic factors) minimax rates of excess risk over Hölder and compositional smoothness classes (Liu et al., 2019, Kengne et al., 29 Dec 2025, Yara et al., 2024).

Pure Hölder Case

If $f_0 \in \mathcal{C}^s(\mathcal{X}, K)$ (Hölder smoothness $s$) and the loss is quadratic or satisfies a $\kappa$-local curvature condition, the rate is:

$$\mathbb{E}[\mathcal{E}(\widehat h_{n,NP})] \lesssim \frac{(\log n)^\nu}{n^{\frac{\kappa s}{\kappa s + d}}}$$

for $\kappa = 2$ (e.g., squared or logistic loss) and some $\nu > 3$ (Kengne et al., 29 Dec 2025).
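
As a quick worked example of the exponent $\kappa s/(\kappa s + d)$, the snippet below evaluates it for a few hypothetical $(s, d)$ pairs, chosen only to illustrate the effect of dimension.

```python
# Worked example: the Hölder-case rate exponent kappa*s / (kappa*s + d).
# The (s, d) values are illustrative, not taken from the cited papers.
def holder_exponent(s, d, kappa=2):
    return kappa * s / (kappa * s + d)

for s, d in [(1, 1), (2, 3), (2, 10)]:
    print(f"s={s}, d={d}: excess risk ~ n^(-{holder_exponent(s, d):.3f}) up to logs")
```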

Compositional Smoothness

If $f_0$ belongs to a multi-layer compositional class of depth $q$, with layerwise smoothness $\boldsymbol{\beta}$ and sparsity $\mathbf{t}$, the critical rate is

$$\phi_n = \max_{0 \leq i \leq q} n^{-2 \beta^*_i/(2\beta^*_i + t_i)}$$

where $\beta^*_i = \beta_i \prod_{j=i+1}^q (\beta_j \wedge 1)$ (Kengne et al., 29 Dec 2025, Yara et al., 2024). The risk converges as

$$\mathbb{E}[\mathcal{E}(\widehat h_{n,NP})] \lesssim (\phi_n^{\kappa/2} \vee \phi_n) (\log n)^\nu.$$
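
The critical rate can be evaluated directly from the layerwise parameters; in the sketch below the smoothness vector, active dimensions, and sample size are hypothetical values used only for illustration.

```python
# Sketch: phi_n = max_i n^{-2 beta*_i / (2 beta*_i + t_i)}, with
# beta*_i = beta_i * prod_{j > i} min(beta_j, 1). Inputs are illustrative.
def effective_smoothness(beta, i):
    prod = 1.0
    for b in beta[i + 1:]:
        prod *= min(b, 1.0)
    return beta[i] * prod

def phi_n(n, beta, t):
    rates = []
    for i in range(len(beta)):
        b_star = effective_smoothness(beta, i)
        rates.append(n ** (-2 * b_star / (2 * b_star + t[i])))
    return max(rates)

print(phi_n(10_000, beta=[2.0, 0.5, 3.0], t=[3, 2, 1]))  # hypothetical q = 2 composition
```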

Precise $L^2$ Minimax in Nonparametric Regression

With an explicit B-spline construction, for $f_0$ in a Hölder (Nikolskii) ball and an appropriate network architecture, the minimax-optimal rate

$$\sup_{f_0 \in \Lambda^\beta(F)} \mathbb{E}\|\widehat f_n - f_0\|_{L^2}^2 = O(n^{-2\beta/(2\beta + d)})$$

is achieved with no extraneous logarithmic term (Liu et al., 2019).

3. Key Methodologies and Neural Network Architecture

NPDNN imposes architectural constraints:

  • Depth $L_n$ and width $N_n$ scale logarithmically/polynomially with the sample size, depending on smoothness and dimension.
  • The sparsity parameter $S_n$ and sup-norm bound $B_n$ prevent uncontrolled model complexity.
  • Approximation properties of DNNs (e.g., ReLU networks for Hölder or compositional classes) guarantee that, for sufficiently large $(L_n, N_n, S_n)$, the approximation error matches the minimax risk rate.
  • For regression, an explicit construction that embeds a B-spline basis into DNNs yields the estimator

$$\widehat f_n(x) = \sum_{j \in \Gamma} \widehat b_j \widetilde D_{j,k}(x)$$

where each $\widetilde D_{j,k}$ is a DNN approximation of the B-spline basis function $D_{j,k}$ (Liu et al., 2019); a minimal univariate sketch (without the DNN embedding step) appears after this list.

  • In multi-class logistic regression, the empirical cross-entropy loss is minimized over sparse ReLU networks with softmax output. Theoretical covering number bounds for these networks underpin the rate analysis (Yara et al., 2024).
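
As a rough companion to the B-spline construction above, the univariate sketch below fits coefficients $\widehat b_j$ by least squares on an ordinary B-spline basis; the step of replacing each basis function $D_{j,k}$ by a ReLU-network approximation $\widetilde D_{j,k}$ is omitted. It assumes SciPy ≥ 1.8 for `BSpline.design_matrix`, and the knot count and degree are arbitrary illustrative choices.

```python
# Univariate sketch: least-squares fit on a B-spline basis (the DNN
# approximation of each basis function used in Liu et al., 2019 is omitted).
import numpy as np
from scipy.interpolate import BSpline

def fit_spline_regression(x, y, degree=3, n_knots=12):
    # Clamped knot vector on [0, 1]
    interior = np.linspace(0, 1, n_knots)[1:-1]
    t = np.r_[[0.0] * (degree + 1), interior, [1.0] * (degree + 1)]
    # Design matrix Phi with Phi[i, j] = D_{j,k}(x_i)
    Phi = BSpline.design_matrix(x, t, degree).toarray()
    b_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # coefficients b_j
    return lambda x_new: BSpline(t, b_hat, degree)(x_new)

# Toy usage
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 400))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=400)
f_hat = fit_spline_regression(x, y)
```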

4. Statistical Properties: Adaptivity, Generalization, and Dependence Structure

NPDNNs require explicit selection of architecture parameters based on problem smoothness and sample size. The estimator is not adaptive: achieving minimaxity depends on tuning $S_n$, $L_n$, $N_n$ at the scale implied by the target smoothness and composition structure (Kengne et al., 29 Dec 2025, Kurisu et al., 2022). In contrast, sparse-penalized DNNs (SPDNN) incorporating $\ell^0$-type penalties can adapt automatically.

The guarantees of NPDNN extend across a spectrum of dependence models, owing to the proof’s reliance on generalized Bernstein-type inequalities:

  • i.i.d. and $\phi$-mixing data: rates as above.
  • Exponentially and subexponentially $\alpha$-mixing, geometric and polynomial $\mathcal{C}$-mixing data: the effective sample size is reduced (e.g., to $n/(\log n)^2$) and the rates adjust accordingly (Kengne et al., 29 Dec 2025, Kurisu et al., 2022).

5. Extension to Sup-Norm and Adversarial Risk

Standard NPDNN with least-squares loss and plain empirical training does not guarantee minimax-optimal sup-norm error, owing to possible local overfitting or oscillations. However, an adversarially trained variant with a correction step achieves the minimax sup-norm rate, provided label preprocessing is used:

$$\mathbb{E}[\|\hat f - f^*\|_\infty^2] \leq C h^{-d} \biggl\{ \frac{(LW)^2\log(LW)\log n}{n} + (LW)^{-4\beta/d} + h^{-d}\zeta_n^2 \biggr\}$$

where $\zeta_n^2$ reflects the bias from label preprocessing, $h$ is the adversarial perturbation radius, and $L$, $W$ are the network size parameters (depth and width) (Imaizumi, 2023).

Optimization proceeds as an alternating minimax game with adversarial data augmentation and SGD-like updates, and a suitable choice of $(L, W)$ attains the optimal sup-norm rate $n^{-2\beta/(2\beta+d)}$ up to log factors (Imaizumi, 2023).
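
The sketch below shows only the generic alternating min-max structure: a few gradient-ascent steps on an $\ell^\infty$ input perturbation of radius $h$, followed by an SGD-like update of the network. The label-preprocessing and correction steps of (Imaizumi, 2023) are not reproduced, and all hyperparameters are illustrative assumptions.

```python
# Generic adversarial-training sketch for regression (not the paper's algorithm):
# inner maximization over an L-infinity perturbation of radius h, outer
# minimization of the resulting empirical risk.
import torch
import torch.nn as nn

def adversarial_risk(net, X, Y, h, inner_steps=3):
    """Approximate max over |delta| <= h of the empirical squared loss."""
    loss_fn = nn.MSELoss()
    delta = torch.zeros_like(X, requires_grad=True)
    step = h / inner_steps
    for _ in range(inner_steps):                         # inner maximization
        loss = loss_fn(net(X + delta).squeeze(-1), Y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + step * grad.sign()).clamp(-h, h).detach().requires_grad_(True)
    return loss_fn(net(X + delta.detach()).squeeze(-1), Y)

def train_adversarial(net, X, Y, h=0.05, epochs=200, lr=1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):                              # outer minimization
        opt.zero_grad()
        adversarial_risk(net, X, Y, h).backward()
        opt.step()
    return net
```

Any ReLU architecture can be plugged in for `net`, for instance one built like the earlier `make_relu_net` sketch.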

6. Hypothesis Testing and Asymptotic Inference

For nonparametric $L^2$ regression, NPDNNs allow construction of minimax-optimal hypothesis tests for $H_0: f_0 = 0$. The test statistic

$$T_n = \frac{1}{n}\sum_{i=1}^n \widehat f_{\text{net}}(X_i)^2$$

standardized appropriately, converges in distribution to a standard normal under the null. The test achieves the minimax separation rate at the Ingster boundary $n^{-2\beta/(4\beta + d)}$ (Liu et al., 2019).
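
For concreteness, the unstandardized statistic is simply an empirical mean of squared fitted values; the null centering and variance normalization used for the asymptotic calibration in (Liu et al., 2019) are not reproduced in this sketch, and `f_hat` is a hypothetical fitted-network callable.

```python
# Sketch: unstandardized T_n = (1/n) * sum_i f_hat(X_i)^2 for testing H0: f0 = 0.
import numpy as np

def test_statistic(f_hat, X):
    """f_hat: callable mapping a feature vector to a scalar; X: array of shape (n, d)."""
    preds = np.asarray([float(f_hat(x)) for x in X])
    return np.mean(preds ** 2)
```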

Furthermore, under mild additional regularity, the network estimator is pointwise asymptotically normal:

$$\frac{\widehat f_{\rm net}(x) - f_0(x)}{\sqrt{D_k(x)^T(\Phi^T\Phi)^{-1}D_k(x)}} \stackrel{D}{\rightarrow} N(0,1)$$

with the approximation error negligible relative to the stochastic term (Liu et al., 2019).

7. Practical Considerations and Empirical Guidelines

  • The depth of an NPDNN grows only logarithmically with the sample size, while width and sparsity scale as mild polynomials in $n$ and the dimension (Kengne et al., 29 Dec 2025); a hedged rule-of-thumb sketch follows this list.
  • Explicit sup-norm or weight-size constraints $B_n$ prevent parameter blowup.
  • In the absence of structural adaptivity (which penalized methods provide), cross-validation or a theoretical pilot estimate of the smoothness is needed to select the architecture.
  • Even for nonstationary or highly dependent time series, as long as the mixing or dependence coefficients are quantifiable, rescaling the effective sample size in the rate formulas preserves the theoretical guarantees (Kurisu et al., 2022).
  • In multivariate and logistic regression, achieving minimax Hellinger risk rates requires sparse, deep ReLU architectures with softmax output, and the approach is robust against the divergence of the Kullback–Leibler excess risk in excessively flexible models (Yara et al., 2024).
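
The sketch below turns these guidelines into a rule of thumb for choosing $(L_n, N_n, S_n)$ from $(n, s, d)$, following scalings commonly used in the sparse-DNN approximation literature (depth of order $\log n$, width and sparsity of order $n^{d/(2s+d)}$ up to logs); the constants and exact exponents are assumptions for illustration, not values taken from the cited papers.

```python
# Hedged rule of thumb for (L_n, N_n, S_n) given (n, s, d): depth ~ log n,
# width and sparsity ~ n^{d/(2s+d)} up to logs. Constants and exact exponents
# are assumptions for illustration, not values taken from the cited papers.
import math

def suggest_architecture(n, s, d, c_depth=2.0, c_width=1.0, c_sparse=1.0):
    rate = d / (2 * s + d)                                     # nonparametric exponent
    L_n = max(2, int(c_depth * math.log(n)))                   # depth: logarithmic in n
    N_n = max(8, int(c_width * n ** rate))                     # width: mild polynomial in n
    S_n = max(N_n, int(c_sparse * (n ** rate) * math.log(n)))  # sparsity budget
    return L_n, N_n, S_n

print(suggest_architecture(n=10_000, s=2, d=5))
```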
