Non-Penalized Deep Neural Network Estimator
- NPDNN is an empirical risk minimizer over a structurally-constrained DNN class that attains minimax-optimal rates for regression and classification.
- It controls model complexity using explicit constraints on depth, width, sparsity, and weight norms without additional regularization penalties.
- Theoretical analysis demonstrates its robustness across i.i.d. and dependent data, varied loss functions, and smoothness regimes.
A non-penalized deep neural network estimator (NPDNN) is an empirical risk minimization approach over a constrained class of DNNs, with explicit constraints on network architectural parameters but without additional regularization penalties. NPDNN targets minimax-optimal estimation rates in nonparametric regression, classification, and related models, with theoretical guarantees established under general dependence structures, loss functions, and smoothness regimes. NPDNNs have been rigorously analyzed for i.i.d. and dependent data, regression and classification settings, and Hellinger and Kullback–Leibler losses, and are motivated by achieving sharp upper bounds for risk without redundant logarithmic factors (Liu et al., 2019, Kengne et al., 29 Dec 2025, Yara et al., 2024, Kurisu et al., 2022).
1. Statistical Framework and Formal Definition
Let $D_n = \{(X_i, Y_i)\}_{i=1}^{n}$ be a sample, either i.i.d. or from an ergodic mixing process, with $X_i \in \mathcal{X} \subseteq \mathbb{R}^d$ and $Y_i$ scalar or vector-valued. The objective is to estimate a function $f^*$ (e.g., regression mean or class-probability) minimizing a population risk
$$R(f) = \mathbb{E}_{(X,Y)}\big[\ell(f(X), Y)\big]$$
over a class of measurable functions $\mathcal{F}$ and loss function $\ell$. The excess risk is $\mathcal{E}(f) = R(f) - R(f^*)$ for a true minimizer $f^*$.
The NPDNN is defined as the empirical risk minimizer over a structurally-constrained DNN class:
$$\widehat{f}_n \in \arg\min_{f \in \mathcal{F}(L, N, B, F, S)} \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(X_i), Y_i\big),$$
where the hypothesis class $\mathcal{F}(L, N, B, F, S)$ is specified by:
- Depth $L$
- Maximum width $N$ per layer
- Uniform sup-norm bound $B$ on weights and biases
- Output bound $F$
- Sparsity constraint $S$ (number of nonzero weights/biases)
No explicit penalty (e.g., $\ell_1$ or $\ell_2$) is added; control of overfitting is achieved solely by proper scaling of $(L, N, B, F, S)$ with sample size and problem complexity (Kengne et al., 29 Dec 2025).
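To make the definition concrete, here is a minimal sketch of ERM over such a constrained ReLU class with squared loss. It is not an implementation from the cited papers: the sup-norm clipping and hard-thresholding steps are practical surrogates for the exact $(B, S)$ constraints, and all function names and default values are illustrative.

```python
# Illustrative sketch only: ERM over a depth/width-constrained ReLU network with
# squared loss. Clipping weights to [-B, B] and hard-thresholding to at most S
# nonzeros are practical surrogates for the exact constraints in the theory.
import torch
import torch.nn as nn

def make_relu_net(d_in, width, depth, F):
    layers, d = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers += [nn.Linear(d, 1), nn.Hardtanh(-F, F)]   # clip output to [-F, F]
    return nn.Sequential(*layers)

def project_constraints(net, B, S):
    """Clip weights to the sup-norm ball of radius B; keep at most the S largest."""
    with torch.no_grad():
        params = torch.cat([p.view(-1) for p in net.parameters()])
        if params.numel() > S:
            thresh = params.abs().kthvalue(params.numel() - S).values
        else:
            thresh = torch.tensor(0.0)
        for p in net.parameters():
            p.clamp_(-B, B)
            p.mul_((p.abs() > thresh).float())

def fit_npdnn(X, Y, width=32, depth=3, B=10.0, F=5.0, S=2000,
              epochs=500, lr=1e-3):
    net = make_relu_net(X.shape[1], width, depth, F)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((net(X).squeeze(-1) - Y) ** 2).mean()   # empirical risk
        loss.backward()
        opt.step()
        project_constraints(net, B, S)
    return net

# toy usage: regression with d = 2
X = torch.rand(500, 2)
Y = torch.sin(4 * X[:, 0]) * X[:, 1] + 0.1 * torch.randn(500)
f_hat = fit_npdnn(X, Y)
```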
2. Risk Bounds and Minimax-Optimal Rates
For regression, logistic regression, and general losses satisfying suitable local curvature and Lipschitz conditions, NPDNN attains (up to logarithmic factors) minimax rates of excess risk over Hölder and compositional smoothness classes (Liu et al., 2019, Kengne et al., 29 Dec 2025, Yara et al., 2024).
Pure Hölder Case
If $f^* \in \mathcal{C}^{\beta}([0,1]^d)$ (Hölder smoothness $\beta > 0$) and the loss is quadratic or satisfies a quadratic local curvature condition, the rate is:
$$\mathbb{E}\big[\mathcal{E}(\widehat{f}_n)\big] \lesssim n^{-\frac{2\beta}{2\beta + d}} (\log n)^{c}$$
for such losses (e.g., squared or logistic loss) and some $c > 0$ (Kengne et al., 29 Dec 2025).
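As a concrete instance of this rate, take Hölder smoothness $\beta = 2$ in dimension $d = 3$ (values chosen purely for illustration): the exponent is
$$\frac{2\beta}{2\beta + d} = \frac{4}{7}, \qquad \mathbb{E}\big[\mathcal{E}(\widehat{f}_n)\big] \lesssim n^{-4/7}(\log n)^{c},$$
so for $n = 10^6$ the leading factor $n^{-4/7}$ is approximately $3.7 \times 10^{-4}$.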
Compositional Smoothness
If $f^*$ belongs to a multi-layer compositional class of depth $q$, with layerwise smoothness $\beta_i$ and sparsity $t_i$, the critical rate is
$$\phi_n = \max_{0 \le i \le q} n^{-\frac{2\beta_i^*}{2\beta_i^* + t_i}},$$
where $\beta_i^* = \beta_i \prod_{j = i+1}^{q} \min(\beta_j, 1)$ (Kengne et al., 29 Dec 2025, Yara et al., 2024). The excess risk converges as
$$\mathbb{E}\big[\mathcal{E}(\widehat{f}_n)\big] \lesssim \phi_n (\log n)^{c}.$$
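As an illustration of how the critical rate is computed, the following minimal sketch (with made-up layerwise parameters, not taken from the cited papers) evaluates $\phi_n$ from the formula above:

```python
# Illustrative computation of the compositional critical rate phi_n from the
# layerwise smoothness beta_i and active dimensions t_i; the inputs below are
# made-up values for demonstration only.
import math

def critical_rate(n, betas, ts):
    """phi_n = max_i n^{-2*beta_i^*/(2*beta_i^* + t_i)} with
    beta_i^* = beta_i * prod_{j > i} min(beta_j, 1)."""
    rates = []
    for i in range(len(betas)):
        eff = betas[i] * math.prod(min(b, 1.0) for b in betas[i + 1:])
        rates.append(n ** (-2 * eff / (2 * eff + ts[i])))
    return max(rates)

# e.g. a two-layer composition: inner smoothness 1.5 on 3 variables,
# outer smoothness 2 on 1 variable
print(critical_rate(10**5, betas=[1.5, 2.0], ts=[3, 1]))
```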
Precise Minimax in Nonparametric Regression
With explicit B-spline construction, for $f_0$ in a Hölder (Nikolskii) ball of smoothness $\beta$ and appropriate network architecture, the minimax-optimal rate
$$\mathbb{E}\,\|\widehat{f}_n - f_0\|_{L_2}^2 \asymp n^{-\frac{2\beta}{2\beta + d}}$$
is achieved with no extraneous logarithmic term (Liu et al., 2019).
3. Key Methodologies and Neural Network Architecture
NPDNN imposes architectural constraints:
- Depth $L$ and width $N$ scale logarithmically/polynomially with sample size, depending on smoothness and dimension.
- Sparsity parameter $S$ and sup-norm bound $B$ prevent uncontrolled model complexity.
- Approximation properties of DNNs (e.g., ReLU networks for Hölder or compositional classes) guarantee that, for sufficiently large $L$, $N$, and $S$, the approximation error matches the minimax risk rate.
- For regression, an explicit construction via embedding a B-spline basis into DNNs allows the estimator
$$\widehat{f}_n(x) = \sum_{j} \widehat{\theta}_j \widetilde{B}_j(x),$$
where each $\widetilde{B}_j$ approximates the $j$-th B-spline basis function by a DNN (Liu et al., 2019); see the sketch after this list.
- In multi-class logistic regression, the empirical cross-entropy loss is minimized over sparse ReLU networks with softmax output. Theoretical covering number bounds for these networks underpin the rate analysis (Yara et al., 2024).
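A minimal sketch of the B-spline-expansion estimator follows; here scipy's exact B-spline basis stands in for the DNN approximations $\widetilde{B}_j$, and the design, knots, and degree are illustrative rather than taken from Liu et al. (2019).

```python
# Minimal sketch of f_hat(x) = sum_j theta_j B_j(x), with the exact scipy
# B-spline basis standing in for the DNN approximations of each basis function.
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, n_basis=12, degree=3):
    """Evaluate a clamped cubic B-spline basis on [0, 1] at the points x."""
    inner = np.linspace(0, 1, n_basis - degree + 1)
    knots = np.concatenate([[0] * degree, inner, [1] * degree])
    eye = np.eye(n_basis)
    return np.column_stack([BSpline(knots, eye[j], degree)(x) for j in range(n_basis)])

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 400)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=400)

Phi = bspline_design(x)                           # n x J basis matrix
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares coefficients
f_hat = lambda t: bspline_design(np.atleast_1d(t)) @ theta
print(f_hat(np.array([0.25, 0.5, 0.75])))
```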
4. Statistical Properties: Adaptivity, Generalization, and Dependence Structure
NPDNNs require explicit selection of architecture parameters based on problem smoothness and sample size. The estimator is not adaptive: achieving minimaxity depends on tuning $L$, $N$, and $S$ at the scale implied by the target smoothness and composition structure (Kengne et al., 29 Dec 2025, Kurisu et al., 2022). In contrast, sparse-penalized DNNs (SPDNN) incorporating clipped-$\ell_1$-type penalties can adapt automatically.
The guarantees of NPDNN extend across a spectrum of dependence models, owing to the proof’s reliance on generalized Bernstein-type inequalities:
- i.i.d. data and sufficiently fast mixing processes: rates as above.
- Exponentially and subexponentially mixing processes, and geometrically and polynomially mixing regimes: the effective sample size is reduced and the rates adjust accordingly (Kengne et al., 29 Dec 2025, Kurisu et al., 2022); a short illustration follows.
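To illustrate how dependence enters the rates, the sketch below simply re-evaluates the Hölder-class rate at a user-supplied effective sample size; the mapping from mixing coefficients to $n_{\mathrm{eff}}$ is dictated by the relevant theorem and is deliberately left as an input here.

```python
# Illustrative only: the Hölder-class rate evaluated at an effective sample
# size n_eff supplied by the user; how n_eff depends on the mixing
# coefficients must come from the relevant theorem and is not encoded here.
def holder_rate(n_eff, beta, d):
    return n_eff ** (-2 * beta / (2 * beta + d))

n = 10_000
print(holder_rate(n, beta=2, d=3))        # i.i.d. benchmark
print(holder_rate(n / 10, beta=2, d=3))   # same formula at a reduced n_eff
```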
5. Extension to Sup-Norm and Adversarial Risk
Standard NPDNN with least-squares loss and standard empirical training does not guarantee minimax-optimal sup-norm error, due to possible local overfitting or oscillations. However, an adversarially-trained variant with a correction term achieves the minimax sup-norm rate of order $(\log n / n)^{\beta/(2\beta + d)}$, provided label preprocessing is used; the resulting error bound combines a bias term from label preprocessing, the adversarial perturbation radius, and the network size parameters (Imaizumi, 2023).
Optimization proceeds as an alternating minimax game, with adversarial data augmentation and SGD-like updates; suitable selection of the perturbation radius and network size attains the optimal sup-norm rate up to log factors (Imaizumi, 2023).
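The alternating scheme can be sketched as follows (a simplified illustration, not Imaizumi's exact algorithm: the label preprocessing and correction terms are omitted, and all hyperparameters are placeholders). An inner gradient-ascent step perturbs the inputs within an $\ell_\infty$ ball of radius $h$, and an outer SGD step updates the network on the perturbed batch.

```python
# Minimal sketch of adversarial training for regression: inner ascent over an
# l_inf perturbation of radius h, outer SGD on the perturbed squared loss.
# This only illustrates the alternating minimax scheme; Imaizumi (2023)'s
# label preprocessing and correction terms are omitted.
import torch
import torch.nn as nn

def adversarial_fit(net, X, Y, h=0.05, inner_steps=5, epochs=300, lr=1e-3):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(epochs):
        # inner maximization: worst-case perturbation delta with |delta|_inf <= h
        delta = torch.zeros_like(X, requires_grad=True)
        for _ in range(inner_steps):
            loss = ((net(X + delta).squeeze(-1) - Y) ** 2).mean()
            grad, = torch.autograd.grad(loss, delta)
            with torch.no_grad():
                delta += h * grad.sign()      # ascent step
                delta.clamp_(-h, h)           # project back onto the l_inf ball
        # outer minimization: SGD step on the adversarially perturbed batch
        opt.zero_grad()
        ((net(X + delta.detach()).squeeze(-1) - Y) ** 2).mean().backward()
        opt.step()
    return net

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
X = torch.rand(500, 2)
Y = torch.sin(4 * X[:, 0]) * X[:, 1] + 0.1 * torch.randn(500)
adversarial_fit(net, X, Y)
```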
6. Hypothesis Testing and Asymptotic Inference
For nonparametric regression, NPDNNs allow construction of minimax-optimal hypothesis tests on the regression function (e.g., $H_0\colon f_0 = f^{(0)}$ for a hypothesized $f^{(0)}$). The corresponding test statistic, standardized appropriately, converges in distribution to a standard normal under the null. The test achieves minimax separation rates at the Ingster boundary (Liu et al., 2019).
Furthermore, under mild additional regularity, the network estimator is pointwise asymptotically normal:
$$\frac{\widehat{f}_n(x) - f_0(x)}{\sqrt{\operatorname{Var}\big(\widehat{f}_n(x)\big)}} \;\xrightarrow{\ d\ }\; \mathcal{N}(0, 1),$$
with the approximation error negligible relative to the stochastic term (Liu et al., 2019).
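In practice, this asymptotic normality yields pointwise tests and confidence intervals. The sketch below assumes a point estimate $\widehat{f}_n(x_0)$ and a standard-error estimate are already available (their construction follows Liu et al. (2019) and is not reproduced here); the numeric inputs are placeholders.

```python
# Illustrative use of pointwise asymptotic normality: a two-sided z-test and a
# 95% confidence interval at a point x0, given an estimate f_hat_x0 and a
# standard-error estimate se_hat_x0 obtained elsewhere.
from scipy.stats import norm

def pointwise_inference(f_hat_x0, se_hat_x0, f_null_x0, alpha=0.05):
    z = (f_hat_x0 - f_null_x0) / se_hat_x0          # studentized statistic
    p_value = 2 * (1 - norm.cdf(abs(z)))            # two-sided p-value
    half_width = norm.ppf(1 - alpha / 2) * se_hat_x0
    ci = (f_hat_x0 - half_width, f_hat_x0 + half_width)
    return z, p_value, ci

print(pointwise_inference(f_hat_x0=0.42, se_hat_x0=0.05, f_null_x0=0.30))
```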
7. Practical Considerations and Empirical Guidelines
- The depth of NPDNN grows only logarithmically with sample size, while width and sparsity scale as mild polynomials in $n$ and the dimension (Kengne et al., 29 Dec 2025).
- Explicit sup-norm or weight-size constraints prevent parameter blowup.
- In the absence of structural adaptivity (as in penalized methods), cross-validation or theoretical pilot estimation of smoothness is necessary to select the architecture; a cross-validation sketch follows this list.
- Even for nonstationary or highly-dependent time series, as long as the mixing or dependence function is quantifiable, scaling the effective sample size in the rate formulas retains theoretical guarantees (Kurisu et al., 2022).
- In multi-class logistic regression, achieving minimax Hellinger risk rates requires sparse, deep ReLU architectures with softmax output, and the approach is robust against the divergence of the Kullback–Leibler excess risk in excessively flexible models (Yara et al., 2024).
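A minimal cross-validation sketch for architecture selection, as referenced above; the candidate grid, optimizer, and epoch budget are illustrative, and the sparsity and norm constraints are omitted for brevity.

```python
# Minimal sketch of selecting (depth, width) for an NPDNN by K-fold
# cross-validation on squared loss; grid, optimizer, and epochs are illustrative.
import torch
import torch.nn as nn

def train_relu_net(Xtr, Ytr, depth, width, epochs=300, lr=1e-2):
    layers, d = [], Xtr.shape[1]
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    net = nn.Sequential(*layers, nn.Linear(d, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        ((net(Xtr).squeeze(-1) - Ytr) ** 2).mean().backward()
        opt.step()
    return net

def cv_select(X, Y, grid, k=5):
    folds = torch.randperm(len(X)).chunk(k)
    best, best_err = None, float("inf")
    for depth, width in grid:
        err = 0.0
        for i in range(k):
            val = folds[i]
            tr = torch.cat([folds[j] for j in range(k) if j != i])
            net = train_relu_net(X[tr], Y[tr], depth, width)
            with torch.no_grad():
                err += ((net(X[val]).squeeze(-1) - Y[val]) ** 2).mean().item()
        if err < best_err:
            best, best_err = (depth, width), err
    return best

X = torch.rand(400, 2)
Y = torch.sin(4 * X[:, 0]) * X[:, 1] + 0.1 * torch.randn(400)
print(cv_select(X, Y, grid=[(2, 16), (3, 32), (4, 64)]))
```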
References:
- "Optimal Nonparametric Inference via Deep Neural Network" (Liu et al., 2019)
- "A general framework for deep learning" (Kengne et al., 29 Dec 2025)
- "Nonparametric logistic regression with deep learning" (Yara et al., 2024)
- "Adaptive deep learning for nonlinear time series models" (Kurisu et al., 2022)
- "Sup-Norm Convergence of Deep Neural Network Estimator for Nonparametric Regression by Adversarial Training" (Imaizumi, 2023)