Near-optimal estimates for the $\ell^p$-Lipschitz constants of deep random ReLU neural networks
(2506.19695v1)
Published 24 Jun 2025 in stat.ML, cs.LG, and math.PR
Abstract: This paper studies the $\ell^p$-Lipschitz constants of ReLU neural networks $\Phi: \mathbb{R}^d \to \mathbb{R}$ with random parameters for $p \in [1,\infty]$. The distribution of the weights follows a variant of the He initialization and the biases are drawn from symmetric distributions. We derive high probability upper and lower bounds for wide networks that differ at most by a factor that is logarithmic in the network's width and linear in its depth. In the special case of shallow networks, we obtain matching bounds. Remarkably, the behavior of the $\ell^p$-Lipschitz constant varies significantly between the regimes $p \in [1,2)$ and $p \in [2,\infty]$. For $p \in [2,\infty]$, the $\ell^p$-Lipschitz constant behaves similarly to $\Vert g\Vert_{p'}$, where $g \in \mathbb{R}^d$ is a $d$-dimensional standard Gaussian vector and $1/p + 1/p' = 1$. In contrast, for $p \in [1,2)$, the $\ell^p$-Lipschitz constant aligns more closely to $\Vert g \Vert_{2}$.
Summary
The paper provides near-matching high-probability upper and lower bounds for ℓp-Lipschitz constants in deep random ReLU networks using advanced probabilistic and geometric techniques.
It reveals a dichotomy in scaling behavior between p in [1,2) and p in [2,∞], which informs the design of adversarially robust architectures and effective initialization strategies.
The analysis employs tools such as random hyperplane tessellations, gradient concentration, and covering arguments to deliver actionable insights for network verification and safety certification.
Near-optimal Estimates for the ℓp-Lipschitz Constants of Deep Random ReLU Neural Networks
This paper provides a comprehensive probabilistic analysis of the ℓp-Lipschitz constants of deep, fully-connected ReLU neural networks with random weights and biases. The paper is motivated by the central role of Lipschitz constants in quantifying adversarial robustness and generalization properties of neural networks, as well as their relevance in theoretical frameworks such as the Neural Tangent Kernel (NTK) regime.
Problem Setting and Main Results
The authors consider deep ReLU networks $\Phi: \mathbb{R}^d \to \mathbb{R}$ with L hidden layers of width N, initialized with a variant of He initialization: weights are i.i.d. Gaussian with variance $2/N$ (except for the final layer), and biases are drawn from symmetric or more general distributions. The focus is on the ℓp-Lipschitz constant
$$\mathrm{Lip}_p(\Phi) = \sup_{x \neq y} \frac{|\Phi(x) - \Phi(y)|}{\lVert x - y \rVert_p}$$
for p∈[1,∞], which measures the network's sensitivity to input perturbations in the ℓp norm.
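Since a ReLU network is piecewise linear, $\mathrm{Lip}_p(\Phi)$ equals the supremum of the dual gradient norm $\lVert\nabla\Phi(x)\rVert_{p'}$ over points of differentiability. The NumPy sketch below instantiates such a random network under a He-style initialization (the output-layer variance $1/N$ and all function names are assumptions of this sketch, not the paper's exact variant) and computes a Monte-Carlo lower estimate of $\mathrm{Lip}_p$ by maximizing the dual gradient norm over random inputs:

```python
import numpy as np

def init_relu_net(d, N, L, rng, bias_scale=0.0):
    """He-style random init: hidden weights ~ N(0, 2/fan_in), output weights ~ N(0, 1/N).

    The output-layer variance is an assumption of this sketch; the paper uses a
    variant of He initialization whose last-layer scaling may differ.
    """
    widths = [d] + [N] * L
    Ws = [rng.normal(0.0, np.sqrt(2.0 / widths[i]), size=(widths[i + 1], widths[i]))
          for i in range(L)]
    bs = [bias_scale * rng.normal(size=widths[i + 1]) for i in range(L)]  # symmetric biases
    w_out = rng.normal(0.0, np.sqrt(1.0 / N), size=N)                     # final layer
    return Ws, bs, w_out

def grad_at(x, Ws, bs, w_out):
    """Gradient of the scalar ReLU network at x (valid off the non-differentiable set)."""
    acts, h = [], x
    for W, b in zip(Ws, bs):
        z = W @ h + b
        acts.append(z > 0)           # activation pattern of this layer
        h = np.maximum(z, 0.0)
    g = w_out.copy()
    for W, a in zip(reversed(Ws), reversed(acts)):
        g = (g * a) @ W              # backprop through diag(pattern) and W
    return g

def lip_p_lower_estimate(Ws, bs, w_out, p, n_samples=2000, radius=1.0, rng=None):
    """Monte-Carlo lower estimate of Lip_p: max over sampled x of ||grad||_{p'}."""
    rng = rng if rng is not None else np.random.default_rng(0)
    d = Ws[0].shape[1]
    q = np.inf if p == 1 else (1.0 if p == np.inf else p / (p - 1.0))  # dual exponent p'
    best = 0.0
    for _ in range(n_samples):
        x = radius * rng.normal(size=d)
        best = max(best, np.linalg.norm(grad_at(x, Ws, bs, w_out), ord=q))
    return best

rng = np.random.default_rng(1)
net = init_relu_net(d=64, N=1024, L=2, rng=rng)
print(lip_p_lower_estimate(*net, p=2, rng=rng))
```

Because the extremal inputs are rare, random sampling typically undershoots the true supremum; this is a sanity check and a lower estimate only, whereas the paper's upper bounds require uniform control over all inputs via the covering and tessellation arguments discussed below.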
The main contributions are high-probability upper and lower bounds for Lipp(Φ), which are near-matching (up to logarithmic and depth-dependent factors) in the regime of wide networks. The results reveal a sharp dichotomy in the behavior of the Lipschitz constant between the regimes p∈[1,2) and p∈[2,∞]:
For p∈[2,∞], the Lipschitz constant scales as $d^{1-1/p}$ (up to logarithmic factors), mirroring the behavior of the $\ell_{p'}$-norm of a standard Gaussian vector, where $1/p + 1/p' = 1$.
For p∈[1,2), the scaling is $\sqrt{d}$, reflecting the ℓ2-norm of a Gaussian vector, with an additional 1/L factor in the lower bound for deep networks.
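For orientation, the comparison norms of a standard Gaussian vector $g \in \mathbb{R}^d$ behave as follows (standard estimates for Gaussian norms, with constants depending only on $p$):
$$\mathbb{E}\,\lVert g\rVert_{p'} \asymp d^{1/p'} = d^{1-1/p} \quad \text{for } p\in[2,\infty] \ (\text{so } p'\in[1,2]), \qquad \mathbb{E}\,\lVert g\rVert_{2} \asymp \sqrt{d}.$$
For $p\in[1,2)$ one has $p'\in(2,\infty]$, where $\mathbb{E}\,\lVert g\rVert_{p'} \asymp d^{1/p'}$ (of order $\sqrt{\ln d}$ when $p'=\infty$) is strictly smaller than $\sqrt{d}$; the dichotomy is that in this regime the Lipschitz constant nevertheless scales like $\lVert g\rVert_2$ rather than $\lVert g\rVert_{p'}$.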
The results are summarized in the following table (zero-bias case; rates up to absolute constants):

| p range | Lower Bound | Upper Bound |
| --- | --- | --- |
| [1,2) | $\sqrt{d}/L$ | $\sqrt{d}\cdot\ln(N/d)$ |
| [2,∞] | $d^{1-1/p}$ | $d^{1-1/p}\cdot\ln(N/d)$ |
For networks with nonzero biases, similar bounds are established, with an additional L factor in the upper bound, reflecting the increased complexity due to bias-induced inhomogeneity.
Technical Approach
The analysis leverages several advanced probabilistic and geometric tools:
Random Hyperplane Tessellations: The number of neurons at which activation patterns differ between two inputs is controlled using results on random hyperplane tessellations, allowing precise estimates of how the ReLU nonlinearity fragments the input space (a minimal first-layer illustration follows this list).
Pointwise Gradient Estimates: The authors derive sharp concentration bounds for the dual $\ell_{p'}$-norm of the gradient at a fixed input, using a randomized gradient formalism that decouples dependencies between layers.
Covering Arguments: Uniform control over the input space is achieved via ε-net arguments, with careful management of covering numbers and the associated union bounds.
Sudakov Minoration and Decoupling: For lower bounds, especially in the p∈[1,2) regime, the authors employ Sudakov's minoration and decoupling techniques to relate the supremum of the gradient norm over the input space to the geometry of Gaussian processes.
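To make the tessellation idea concrete: for a single layer with Gaussian rows and zero biases, the probability that two inputs x and y receive different signs from a given row equals their angle divided by π, so the fraction of first-layer neurons whose activation pattern distinguishes x from y concentrates around that value. The snippet below checks this standard fact numerically; it illustrates the first-layer picture only and is not the paper's proof machinery:

```python
import numpy as np

def pattern_disagreement_fraction(x, y, N, rng):
    """Fraction of N random Gaussian hyperplanes (zero bias) separating x and y,
    i.e. neurons whose ReLU activation pattern differs between the two inputs."""
    W = rng.normal(size=(N, x.size))
    return np.mean((W @ x > 0) != (W @ y > 0))

rng = np.random.default_rng(0)
d, N = 50, 10_000
x = rng.normal(size=d)
y = x + 0.1 * rng.normal(size=d)   # a nearby input: small perturbation of x

angle = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
emp = pattern_disagreement_fraction(x, y, N, rng)
print(f"angle/pi = {angle/np.pi:.4f}, empirical disagreement = {emp:.4f}")
```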
Numerical and Asymptotic Implications
Tightness: For shallow networks (L=1), the bounds are tight up to constants for all p∈[1,∞] and arbitrary bias distributions, matching previous results for p=2 and extending them to all p.
Depth and Width Dependence: The upper bounds exhibit only logarithmic dependence on the width N and, in the case of nonzero biases, an at most linear dependence on the depth L, which is a significant improvement over prior exponential-in-depth bounds.
Bias Robustness: The lower bounds are robust to the choice of bias distribution, while the upper bounds require only mild regularity (bounded density) and symmetry assumptions.
Practical Implications
Adversarial Robustness: The results provide explicit, high-probability estimates for the worst-case sensitivity of randomly initialized deep ReLU networks to input perturbations, directly informing the design and analysis of robust architectures.
Initialization Theory: The analysis clarifies how initialization schemes and network width/depth interact to control the Lipschitz constant, with implications for trainability and stability in deep learning.
Verification and Certification: The derived bounds can be used to certify upper bounds on the Lipschitz constant of randomly initialized networks, which is a key step in formal verification pipelines for neural network safety.
Implementation Considerations
Computational Feasibility: The bounds are non-asymptotic and can be instantiated for concrete network sizes, making them directly applicable in practice for network verification and robustness estimation; a small helper that evaluates the predicted rates is sketched after this list.
Extension to Other Architectures: While the analysis is for fully-connected ReLU networks, the techniques (especially those based on random tessellations and covering arguments) are adaptable to other architectures and activation functions, provided similar geometric properties hold.
Scaling to Large Networks: The logarithmic dependence on width and linear (or sublinear) dependence on depth ensure that the bounds remain meaningful even for very large networks, as commonly encountered in modern deep learning.
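A hypothetical helper (not from the paper) that evaluates the rates from the table above, with all absolute constants omitted, gives a sense of how the bounds instantiate for concrete sizes:

```python
import math

def lip_p_scaling(d, N, L, p, zero_bias=True):
    """Predicted scaling of Lip_p for a wide random ReLU network, following the
    zero-bias table above, up to unspecified absolute constants. Illustrative
    only: the paper's theorems carry explicit constants, failure probabilities,
    and width assumptions (roughly N much larger than d) that are omitted here."""
    log_term = math.log(max(N / d, math.e))
    if 1 <= p < 2:
        lower, upper = math.sqrt(d) / L, math.sqrt(d) * log_term
    else:  # p in [2, inf]; p = math.inf is allowed
        alpha = 1.0 if p == math.inf else 1.0 - 1.0 / p
        lower, upper = d ** alpha, d ** alpha * log_term
    if not zero_bias:
        upper *= L  # extra depth factor reported above for nonzero biases
    return lower, upper

# e.g. a width-4096, depth-8 network on 3*32*32-dimensional inputs, p = infinity:
print(lip_p_scaling(d=3 * 32 * 32, N=4096, L=8, p=math.inf))
```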
Theoretical Implications and Future Directions
Sharpness and Gaps: The remaining gaps between the upper and lower bounds for deep, wide networks (logarithmic in the width and up to linear in the depth L) suggest potential for further tightening, possibly via refined geometric or probabilistic arguments.
Beyond Random Initialization: Extending the analysis to trained networks, or to other initialization schemes, is a natural next step, with potential to inform both theory and practice in deep learning.
Generalization and Expressivity: The connection between Lipschitz constants, adversarial robustness, and generalization remains a fertile area for further exploration, particularly in understanding the trade-offs imposed by network architecture and initialization.
Conclusion
This work provides a rigorous, near-optimal characterization of the ℓp-Lipschitz constants of deep random ReLU networks, with strong implications for both the theory and practice of deep learning. The probabilistic and geometric techniques developed here set a new standard for the analysis of neural network sensitivity and robustness, and offer a foundation for future advances in the mathematical understanding of deep architectures.