
Remove the ε slack from the optimal rate of convergence of the expected L2 error

Establish that the truncated deep neural network regression estimator defined in Section 2 (constructed as a linear combination of K_n fully connected logistic-squasher networks of depth L and width r, with uniformly bounded input and inner weights, zero-initialized output weights, and trained via gradient descent with stepsize and iteration count chosen to satisfy conditions (6)–(8)) achieves the optimal expected L2 error rate n^{-2p/(2p+d)} for (p,C)-smooth regression functions without the ε>0 slack in the exponent. Specifically, prove that E∫|m_n(x)−m(x)|^2 dP_X ≤ c·n^{-2p/(2p+d)} under the same assumptions: bounded support of X, finite exponential moment E[exp(c_7 Y^2)]<∞, and the specified parameter choices for K_n, L, r, A_n, B_n, λ_n, t_n, and the truncation level β_n.
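
To make the construction in the statement concrete, the following is a minimal PyTorch sketch of such a truncated estimator: a linear combination of K_n small fully connected sigmoid ("logistic squasher") networks with uniformly bounded random inner weights, zero-initialized output weights, plain gradient descent on the empirical L2 risk, and truncation at ±β_n. All concrete numerical values (K, width, depth, weight_bound, stepsize, steps, beta) and the choice to update all parameters during gradient descent are illustrative assumptions; they do not encode the scaling conditions (6)–(8) of the paper, which is exactly what the corollary requires.

```python
# Hedged sketch of an over-parameterized regression estimator of the kind
# described above. Parameter values are placeholders, NOT the paper's choices.
import torch


def make_subnetwork(d, width, depth, weight_bound):
    """One fully connected network with sigmoid (logistic squasher) activations;
    all input/inner weights drawn uniformly from [-weight_bound, weight_bound]."""
    layers = []
    in_dim = d
    for _ in range(depth):
        lin = torch.nn.Linear(in_dim, width)
        torch.nn.init.uniform_(lin.weight, -weight_bound, weight_bound)
        torch.nn.init.uniform_(lin.bias, -weight_bound, weight_bound)
        layers += [lin, torch.nn.Sigmoid()]
        in_dim = width
    head = torch.nn.Linear(in_dim, 1)
    torch.nn.init.uniform_(head.weight, -weight_bound, weight_bound)
    torch.nn.init.uniform_(head.bias, -weight_bound, weight_bound)
    layers.append(head)
    return torch.nn.Sequential(*layers)


class LinearCombinationOfNets(torch.nn.Module):
    """Linear combination of K subnetworks; the outer weights start at zero."""

    def __init__(self, d, K, width, depth, weight_bound):
        super().__init__()
        self.subnets = torch.nn.ModuleList(
            [make_subnetwork(d, width, depth, weight_bound) for _ in range(K)]
        )
        self.output_weights = torch.nn.Parameter(torch.zeros(K))

    def forward(self, x):
        outs = torch.cat([net(x) for net in self.subnets], dim=1)  # shape (n, K)
        return outs @ self.output_weights                          # shape (n,)


def fit_truncated_estimator(x, y, K=200, width=8, depth=2,
                            weight_bound=10.0, stepsize=1e-3,
                            steps=1000, beta=10.0):
    """Plain gradient descent on the empirical L2 risk, then truncation at +/- beta.
    For simplicity this sketch updates all parameters; the paper specifies which
    weights are trained and how stepsize/steps must scale with n."""
    model = LinearCombinationOfNets(x.shape[1], K, width, depth, weight_bound)
    opt = torch.optim.SGD(model.parameters(), lr=stepsize)  # full-batch GD
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((model(x) - y) ** 2)  # empirical L2 risk
        loss.backward()
        opt.step()

    def m_n(x_new):
        with torch.no_grad():
            return torch.clamp(model(x_new), -beta, beta)  # truncation T_{beta_n}

    return m_n
```

A call such as m_n = fit_truncated_estimator(x_train, y_train) then returns a function that can be evaluated at new design points; the open problem concerns whether, with the parameter scalings of conditions (6)–(8), this type of estimator attains the rate n^{-2p/(2p+d)} exactly rather than up to n^{ε}.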


Background

Corollary 1 shows that, for (p,C)-smooth regression functions, the estimator attains the rate E∫|m_n(x)−m(x)|^2 dP_X ≤ c·n^{-2p/(2p+d)+ε} for any ε>0 with appropriately chosen architecture, initialization bounds, and gradient-descent parameters. This matches Stone's minimax rate up to the ε slack.
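
For reference, the minimax benchmark can be written in the standard form used, e.g., in Chapter 3 of Györfi et al. (2002); the exact distribution class in the paper may differ in minor details:

\[
\liminf_{n \to \infty}\; \inf_{\hat m_n}\; \sup_{(X,Y) \in \mathcal{D}^{(p,C)}}\;
\frac{\mathbf{E}\int |\hat m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx)}{n^{-2p/(2p+d)}} \;>\; 0,
\]

where the infimum ranges over all estimators and \(\mathcal{D}^{(p,C)}\) denotes a class of distributions of (X,Y) whose regression function m is (p,C)-smooth.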

The authors note that the ε term emerges from metric entropy bounds used to control the complexity of the over-parameterized network class. Eliminating this slack would yield a fully optimal rate without logarithmic or ε overheads, aligning the result exactly with the classical minimax benchmark.
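
One standard route by which such a slack arises (sketched here only generically; the precise decomposition and entropy bound used in the paper may differ) is an error decomposition of the form

\[
\mathbf{E}\int |m_n(x) - m(x)|^2 \, \mathbf{P}_X(dx)
\;\lesssim\;
\frac{\sup_{x_1^n} \log \mathcal{N}_1\!\big(\delta_n, \mathcal{F}_n, x_1^n\big)}{n}
\;+\;
\inf_{f \in \mathcal{F}_n} \int |f(x) - m(x)|^2 \, \mathbf{P}_X(dx)
\;+\; \text{(optimization error)},
\]

where \(\mathcal{N}_1(\delta_n, \mathcal{F}_n, x_1^n)\) is an \(L_1\)-covering number of the (truncated) network class on the data. For such classes one typically has \(\log \mathcal{N}_1 \lesssim W_n \log n\), with \(W_n\) an effective parameter count, so balancing the estimation and approximation terms yields \(n^{-2p/(2p+d)}\) times polylogarithmic factors, which are then absorbed into \(n^{\varepsilon}\) for arbitrarily small \(\varepsilon > 0\). Removing the slack would require sharper complexity control of the over-parameterized class.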

References

According to Stone (1982), the optimal minimax rate of convergence of the expected $L_2$ error in case of a $(p,C)$-smooth regression function is $n^{-2p/(2p+d)}$ (cf., e.g., Chapter 3 in Györfi et al. (2002)), so the rate of convergence above is optimal up to the arbitrarily small $\epsilon>0$ in the exponent. It is an open problem whether a corresponding result can also be shown with $\epsilon=0$. In our proof this $\epsilon$ appears due to our use of the metric entropy bounds for bounding the complexity of our over-parametrized space of deep neural networks.

Statistically guided deep learning (Kohler et al., arXiv:2504.08489, 11 Apr 2025), Remark 2, Section 3.2