- The paper introduces a method that derives hyperparameter choices from statistical theory to optimize nonparametric regression performance.
- It introduces a data-dependent procedure that selects the learning rate and the number of gradient descent steps to ensure efficient convergence.
- The approach establishes unified theoretical bounds on approximation, generalization, and optimization errors, achieving near-optimal minimax rates.
This paper, "Statistically guided deep learning" (2504.08489), presents a deep learning algorithm for nonparametric regression that is grounded in statistical theory. The goal is to move beyond trial-and-error hyperparameter tuning by proposing specific choices for network topology, weight initialization, learning rate, and the number of gradient descent steps, derived from a theoretical analysis aiming for optimal convergence rates.
Core Problem Addressed:
Standard deep learning practices often lack clear theoretical guidance on selecting hyperparameters like network architecture, initialization schemes, and optimization parameters (learning rate, number of steps). This can lead to suboptimal performance or require extensive empirical tuning, as illustrated by the paper's introductory example (Figure \ref{fig_int_1}) where common initializations fail. The paper aims to provide a theoretically justified approach to constructing and training deep neural networks for regression.
Proposed Method: Statistically Guided Deep Learning Estimate
The proposed method uses over-parametrized deep neural networks and gradient descent, but with specific configurations:
- Network Architecture:
- Uses the logistic squasher activation function σ(x) = 1/(1 + e^(−x)).
- The network computes a linear combination of K_n smaller, parallel, fully connected networks. Each sub-network has L hidden layers and r neurons per layer (Section \ref{se2sub1}, Eqs. \ref{se2eq1}-\ref{se2eq3}); a code sketch of this topology and its initialization appears after this list.
```
Overall network output:  f_w(x) = sum_{j=1}^{K_n} w_{j,1,1}^{(L)} * (output of j-th sub-network)(x)
Each sub-network:        standard feedforward structure with L hidden layers, r neurons per layer
```
- Weight Initialization (Section \ref{se2sub2}):
- Output Weights: Initialized to zero: w_{k,1,1}^{(L)} = 0. This is crucial for the theoretical analysis: gradient descent starts from a simple function (identically zero).
- Inner Weights: Weights connecting hidden layers (l = 1, …, L−1) are drawn independently from a uniform distribution U[−B_n, B_n].
- Input Weights: Weights connecting the input layer to the first hidden layer (l = 0) are drawn independently from U[−A_n, A_n].
- Hyperparameters: A_n and B_n are key parameters. The theory suggests A_n should grow polynomially with n (e.g., A_n ∝ n^{1/(2p+d)} · log n) and B_n logarithmically (B_n ∝ log n) for optimal rates under smoothness assumptions (Corollary \ref{co1}); a small helper computing these scalings is sketched after this list. The simulations explore treating A_n, B_n as tunable hyperparameters.
- Optimization: Standard gradient descent is used to minimize the empirical L2 risk F_n(w) = (1/n) ∑_{i=1}^{n} |f_w(X_i) − Y_i|² (Eq. \ref{se2eq4}, \ref{se2eq5}).
- Adaptive Stepsize and Number of Steps (Algorithm \ref{alg1}): This is a key practical contribution. Instead of fixing the learning rate λ_n and the number of steps t_n based on potentially impractical theoretical bounds, the paper proposes a data-dependent procedure (a sketch appears after this list):
- It searches over a sequence of candidate total steps t̂_n ∈ {2^i · t_min : i ∈ ℕ_0}.
- For each candidate t̂_n, it sets λ_n = 1/t̂_n and runs gradient descent for t_n = min(t̂_n, t_{max,1}) steps, where t_{max,1} is a large upper bound (e.g., ⌈(log n)^{c_8} · K_n^3⌉).
- The algorithm stops and selects the first t̂_n (giving the corresponding t_n, λ_n) that satisfies three conditions simultaneously (with high probability):
- (1) Average squared gradient norm is small: (1/t_n) ∑_{t=0}^{t_n−1} λ_n ‖∇_w F_n(w^{(t)})‖² ≤ c_9/n (checks if optimization is progressing).
- (2) Final empirical risk is close to the average: F_n(w^{(t_n)}) ≤ (1/t_n) ∑_{t=0}^{t_n−1} F_n(w^{(t)}) + c_9/n (ensures stability).
- (3) Weights stay close to initialization: max_{t=1,…,t_n} ‖w^{(0)} − w^{(t)}‖² ≤ c_9 · log n / n (key for the theoretical generalization bounds).
- An inner loop within Algorithm 1 can terminate the gradient descent run for a specific t̂_n early if condition (1) or (3) seems likely to be violated, saving computation.
- Final Estimate: The output of the network after t_n steps is truncated: m_n(x) = T_{β_n}(f_{w^{(t_n)}}(x)), where T_{β_n}(z) = max(−β_n, min(z, β_n)) and β_n ∝ log n. This truncation controls outliers and is used in the theoretical analysis.
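As a concrete illustration of the topology and initialization described above, here is a minimal PyTorch-style sketch. The class name, the bias initialization, and the use of the first top-layer neuron as each sub-network's scalar output are assumptions made for this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ParallelLogisticNet(nn.Module):
    """K_n parallel fully connected sub-networks with L hidden layers of r
    logistic neurons each; the overall output is a linear combination of the
    sub-network outputs with coefficients w_{j,1,1}^{(L)} (initialized to zero)."""

    def __init__(self, d, K_n, L, r, A_n, B_n):
        super().__init__()
        self.subnets = nn.ModuleList()
        for _ in range(K_n):
            layers = nn.ModuleList()
            for l in range(L):
                lin = nn.Linear(d if l == 0 else r, r)
                bound = A_n if l == 0 else B_n        # input weights ~ U[-A_n, A_n], inner weights ~ U[-B_n, B_n]
                nn.init.uniform_(lin.weight, -bound, bound)
                nn.init.uniform_(lin.bias, -bound, bound)   # assumption: biases initialized like the weights
                layers.append(lin)
            self.subnets.append(layers)
        # output weights, initialized to zero so that f_w is identically zero at the start
        self.out = nn.Parameter(torch.zeros(K_n))

    def forward(self, x):
        outs = []
        for layers in self.subnets:
            h = x
            for lin in layers:
                h = torch.sigmoid(lin(h))             # logistic squasher sigma(x) = 1/(1 + exp(-x))
            outs.append(h[:, 0])                      # simplification: one top-layer neuron as the sub-network output
        return torch.stack(outs, dim=1) @ self.out    # f_w(x) = sum_j w_j * (output of j-th sub-network)(x)
```

Because the output weights start at zero, the network is identically zero at initialization, which is exactly the property the analysis relies on.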
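The polynomial and logarithmic scalings suggested for A_n and B_n can be made concrete with a small helper; the constants c_A and c_B are unspecified here and would in practice be tuned (e.g., by the sample splitting discussed later).

```python
import math

def theoretical_scalings(n, p, d, c_A=1.0, c_B=1.0):
    """Scalings suggested by the theory for a (p, C)-smooth regression function:
    A_n grows polynomially in n, B_n logarithmically.  c_A, c_B are placeholder constants."""
    A_n = c_A * n ** (1.0 / (2 * p + d)) * math.log(n)
    B_n = c_B * math.log(n)
    return A_n, B_n
```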
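The following is a minimal sketch of the data-dependent step-size and step-count selection (Algorithm \ref{alg1}), assuming a model like the ParallelLogisticNet sketch; the early-termination inner loop is omitted, and the handling of the constants c_8, c_9, t_min follows the simplified reading given above rather than the paper's exact formulation.

```python
import copy
import math
import torch

def empirical_l2_risk(model, X, Y):
    # F_n(w) = (1/n) * sum_i |f_w(X_i) - Y_i|^2, with X of shape (n, d) and Y of shape (n,)
    return ((model(X) - Y) ** 2).mean()

def adaptive_gradient_descent(model0, X, Y, K_n, t_min=50, c_8=1.0, c_9=10.0, max_doublings=20):
    """Search over candidate step counts t_hat in {t_min, 2*t_min, 4*t_min, ...},
    run plain gradient descent with step size 1/t_hat from the same initialization,
    and accept the first candidate for which the three stopping conditions hold."""
    n = X.shape[0]
    t_max_1 = math.ceil(math.log(n) ** c_8 * K_n ** 3)        # upper bound on the number of steps
    w0 = torch.cat([p.detach().flatten() for p in model0.parameters()])
    for i in range(max_doublings):
        t_hat = (2 ** i) * t_min
        lam = 1.0 / t_hat                                      # step size lambda_n = 1 / t_hat
        t_n = min(t_hat, t_max_1)
        model = copy.deepcopy(model0)                          # restart from the same initialization w^(0)
        grad_term, risk_sum, max_dist = 0.0, 0.0, 0.0
        for _ in range(t_n):
            risk = empirical_l2_risk(model, X, Y)              # F_n(w^(t))
            risk_sum += risk.item()
            model.zero_grad()
            risk.backward()
            grad_sq = sum((p.grad ** 2).sum().item() for p in model.parameters())
            grad_term += lam * grad_sq
            with torch.no_grad():
                for p in model.parameters():
                    p -= lam * p.grad                          # plain gradient descent step
            w = torch.cat([p.detach().flatten() for p in model.parameters()])
            max_dist = max(max_dist, ((w - w0) ** 2).sum().item())
        final_risk = empirical_l2_risk(model, X, Y).item()     # F_n(w^(t_n))
        cond1 = grad_term / t_n <= c_9 / n                     # (1) average squared gradient is small
        cond2 = final_risk <= risk_sum / t_n + c_9 / n         # (2) final risk close to the running average
        cond3 = max_dist <= c_9 * math.log(n) / n              # (3) weights stayed close to the initialization
        if cond1 and cond2 and cond3:
            break
    return model, lam, t_n

def truncate(z, beta_n):
    # final estimate m_n(x) = T_{beta_n}(f_{w^(t_n)}(x)) with T_beta(z) = max(-beta, min(z, beta))
    return torch.clamp(z, -beta_n, beta_n)
```

A full fit would chain these pieces: construct the network, run adaptive_gradient_descent on the training data, and apply truncate to its predictions with β_n on the order of log n.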
Theoretical Contributions:
- The paper provides a unified theoretical analysis considering approximation (can the network represent the true function?), generalization (does low training error imply low test error?), and optimization (does gradient descent find a good solution?).
- It leverages the specific initialization and the adaptive step size rule to argue that the weights w^{(t)} remain in a neighborhood of the initial weights w^{(0)}. This allows bounding the complexity (covering number) of the function class explored during training, leading to generalization bounds even for over-parametrized networks.
- Theorem 1: Gives a general bound on the expected L2 error, separating it into an approximation error term (dependent on how well the best network with weights near w^{(0)} fits the true function) and an estimation error term O(A_n^d · B_n^{(L−1)d} / n^{1−ε}); a schematic display of this decomposition appears after this list.
- Corollary 1: Shows that if the true regression function is (p,C)-smooth and the parameters L, r, K_n, A_n, B_n are chosen appropriately based on p and d, the estimate achieves the near-optimal minimax convergence rate n^{−2p/(2p+d)+ε}.
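Schematically (constants, the precise neighborhood of w^{(0)}, and the technical conditions are omitted), the Theorem 1 bound has the structure sketched below; this is a paraphrase of the decomposition described above, not the theorem's exact statement.

```latex
\mathbf{E} \int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx)
  \;\lesssim\; \underbrace{\inf_{w \,\approx\, w^{(0)}} \int |f_w(x) - m(x)|^2 \, \mathbf{P}_X(dx)}_{\text{approximation error}}
  \;+\; \underbrace{\frac{A_n^d \, B_n^{(L-1)d}}{n^{1-\epsilon}}}_{\text{estimation error}}
```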
Practical Implementation Insights and Simulation Results:
- Architecture: The parallel structure of K_n small networks is straightforward to implement. K_n becomes a key hyperparameter controlling capacity.
- Initialization: The U[−A_n, A_n] / U[−B_n, B_n] initialization with zero output weights is simple to implement.
- Adaptive Algorithm (Alg 1): The simulations show this algorithm is practically useful:
- It leads to reasonable, finite numbers of gradient descent steps, avoiding the extremely large theoretical values.
- It improves performance compared to a fixed heuristic for t_n, especially for smaller network capacities (K_n).
- It adapts naturally: for large K_n, it often selects the same parameters as the fixed heuristic in the simulations.
- Hyperparameter Tuning (A_n, B_n):
- Simulations confirm that A_n and B_n influence performance, acting as smoothing parameters (Table \ref{se5tab2}).
- Section \ref{se4bsub4} demonstrates a practical approach: use sample splitting (train/validation) to select the best (A, B) pair from a predefined grid (see the sketch after this list). This worked well, improving results for smaller K_n and maintaining performance for larger K_n.
- Generalization: The method demonstrates good generalization even with significant over-parametrization (number of weights >> n), supporting the theoretical claims (Table \ref{se5tab1}, Figure \ref{fig1new}).
- Performance: On the simulated 1D regression task, the proposed method (especially with adaptive (A,B) selection) significantly outperformed standard deep learning approaches (different architecture, initialization, ADAM optimizer) and achieved results comparable or slightly better than a smoothing spline baseline (Tables \ref{se5tab4}, \ref{se5tab5}, Figures \ref{fig5}, \ref{fig6}).
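A minimal sketch of the sample-splitting selection of (A, B) mentioned above: fit the estimate on a training part for each pair in a grid and keep the pair with the smallest validation error. The helper fit_estimate, the grid, and the splitting fraction are placeholders; fit_estimate would wrap the network and Algorithm 1 sketches given earlier.

```python
import itertools

def select_A_B(X, Y, A_grid, B_grid, fit_estimate, val_fraction=0.5):
    """Sample-splitting grid search over the initialization ranges (A, B).
    fit_estimate(X_train, Y_train, A, B) is assumed to return a prediction function."""
    n = X.shape[0]                         # assumes the sample is already in random order
    n_train = int((1 - val_fraction) * n)
    X_tr, Y_tr = X[:n_train], Y[:n_train]
    X_val, Y_val = X[n_train:], Y[n_train:]
    best = None
    for A, B in itertools.product(A_grid, B_grid):
        predict = fit_estimate(X_tr, Y_tr, A, B)
        val_err = ((predict(X_val) - Y_val) ** 2).mean().item()   # empirical L2 risk on the validation part
        if best is None or val_err < best[0]:
            best = (val_err, A, B)
    return best[1], best[2]
```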
Implementation Considerations:
- Computational Cost: Training involves running gradient descent potentially multiple times within the search loop of Algorithm 1. The cost scales with K_n, L, r, d, and the number of steps t_n. The parallel architecture might allow for parallel computation of sub-network gradients.
- Hyperparameter Selection: While A_n, B_n can be tuned via sample splitting, K_n, L, r still need to be chosen. The theory gives guidance based on the smoothness p, but in practice these might also be tuned using validation data. Constants in Algorithm 1 (c_8, c_9, t_min) also need setting (the paper uses t_min = 50, c_9 = 10).
- Dimensionality: The theory covers general d, but simulations are only for d=1. Performance in high dimensions needs further investigation. The paper suggests this approach might be particularly beneficial there.
- Optimizer: The theory and main implementation use basic gradient descent. Adapting the theory or implementation to ADAM or other optimizers would require further work.
- Activation Function: The specific choice of the logistic squasher is tied to the theoretical analysis (smoothness). Using ReLU would require modifications based on related work cited (e.g., Kohler & Krzyżak (2023)).
Conclusion:
This paper presents a theoretically motivated deep learning algorithm for regression. By carefully specifying the network topology and initialization (with A_n, B_n as key parameters) and proposing a novel data-dependent algorithm (Algorithm 1) for choosing the learning rate and number of gradient descent steps, the authors demonstrate improved performance and good generalization on simulated data compared to standard approaches. The key practical takeaways are the specific initialization strategy combined with Algorithm 1 for step-size and step-count selection, and the use of sample splitting to tune the initialization parameters A and B. While demonstrated only in 1D, the approach provides a promising direction for designing more principled deep learning methods.