
Near-optimal estimates for the $\ell^p$-Lipschitz constants of deep random ReLU neural networks (2506.19695v1)

Published 24 Jun 2025 in stat.ML, cs.LG, and math.PR

Abstract: This paper studies the $\ell^p$-Lipschitz constants of ReLU neural networks $\Phi: \mathbb{R}^d \to \mathbb{R}$ with random parameters for $p \in [1,\infty]$. The distribution of the weights follows a variant of the He initialization and the biases are drawn from symmetric distributions. We derive high probability upper and lower bounds for wide networks that differ at most by a factor that is logarithmic in the network's width and linear in its depth. In the special case of shallow networks, we obtain matching bounds. Remarkably, the behavior of the $\ell^p$-Lipschitz constant varies significantly between the regimes $p \in [1,2)$ and $p \in [2,\infty]$. For $p \in [2,\infty]$, the $\ell^p$-Lipschitz constant behaves similarly to $\Vert g\Vert_{p'}$, where $g \in \mathbb{R}^d$ is a $d$-dimensional standard Gaussian vector and $1/p + 1/p' = 1$. In contrast, for $p \in [1,2)$, the $\ell^p$-Lipschitz constant aligns more closely to $\Vert g \Vert_{2}$.

Summary

  • The paper provides near-matching high-probability upper and lower bounds for the $\ell^p$-Lipschitz constants of deep random ReLU networks using advanced probabilistic and geometric techniques.
  • It reveals a dichotomy in scaling behavior between $p \in [1,2)$ and $p \in [2,\infty]$, which informs the design of adversarially robust architectures and effective initialization strategies.
  • The analysis employs tools such as random hyperplane tessellations, gradient concentration, and covering arguments to deliver actionable insights for network verification and safety certification.

Near-optimal Estimates for the $\ell^p$-Lipschitz Constants of Deep Random ReLU Neural Networks

This paper provides a comprehensive probabilistic analysis of the $\ell^p$-Lipschitz constants of deep, fully-connected ReLU neural networks with random weights and biases. The paper is motivated by the central role of Lipschitz constants in quantifying adversarial robustness and generalization properties of neural networks, as well as their relevance in theoretical frameworks such as the Neural Tangent Kernel (NTK) regime.

Problem Setting and Main Results

The authors consider deep ReLU networks $\Phi: \mathbb{R}^d \to \mathbb{R}$ with $L$ hidden layers of width $N$, initialized with a variant of He initialization: weights are i.i.d. Gaussian with variance $2/N$ (except for the final layer), and biases are drawn from symmetric or more general distributions. The focus is on the $\ell^p$-Lipschitz constant $\mathrm{Lip}_p(\Phi) = \sup_{x \neq y} \frac{|\Phi(x) - \Phi(y)|}{\|x-y\|_p}$ for $p \in [1, \infty]$, which measures the network's sensitivity to input perturbations in the $\ell^p$ norm.
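The following minimal sketch (not code from the paper) draws a random network of this form and evaluates the dual gradient norm $\|\nabla\Phi(x)\|_{p'}$ at random inputs, which yields a simple empirical lower bound on $\mathrm{Lip}_p(\Phi)$. The hidden-layer variance $2/N$ follows the description above; the standard-normal final layer and the zero biases are assumptions chosen so that the gradient at a fixed input is on the scale of a $d$-dimensional standard Gaussian vector, matching the abstract's description.

```python
import numpy as np

def random_relu_net(d, N, L, rng, out_std=1.0):
    """Hidden-layer weights are i.i.d. N(0, 2/N); the final-layer std is an assumption."""
    dims = [d] + [N] * L
    hidden = [rng.normal(0.0, np.sqrt(2.0 / N), size=(dims[i + 1], dims[i]))
              for i in range(L)]
    w_out = rng.normal(0.0, out_std, size=(1, N))
    return hidden, w_out

def grad_dual_norm(hidden, w_out, x, p):
    """Return ||grad Phi(x)||_{p'} with 1/p + 1/p' = 1 (zero biases)."""
    h, masks = x, []
    for W in hidden:                          # forward pass, record activation patterns
        pre = W @ h
        masks.append((pre > 0).astype(float))
        h = np.maximum(pre, 0.0)
    g = w_out                                 # backward pass: grad^T = w_out^T D_L W_L ... D_1 W_1
    for W, m in zip(reversed(hidden), reversed(masks)):
        g = (g * m) @ W
    p_dual = 1.0 if np.isinf(p) else (np.inf if p == 1 else p / (p - 1.0))
    return np.linalg.norm(g.ravel(), ord=p_dual)

rng = np.random.default_rng(0)
d, N, L, p = 64, 1024, 3, 2.0
hidden, w_out = random_relu_net(d, N, L, rng)

# Every gradient norm is a valid lower bound on Lip_p; take the max over random inputs.
emp = max(grad_dual_norm(hidden, w_out, rng.standard_normal(d), p) for _ in range(500))
# Reference scale from the abstract: the dual norm of a standard Gaussian vector in R^d.
p_dual = 1.0 if np.isinf(p) else (np.inf if p == 1 else p / (p - 1.0))
gauss = np.mean([np.linalg.norm(rng.standard_normal(d), ord=p_dual) for _ in range(500)])
print(f"empirical lower bound on Lip_p: {emp:.1f}   E||g||_p': {gauss:.1f}")
```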

The main contributions are high-probability upper and lower bounds for $\mathrm{Lip}_p(\Phi)$, which are near-matching (up to logarithmic and depth-dependent factors) in the regime of wide networks. The results reveal a sharp dichotomy in the behavior of the Lipschitz constant between the regimes $p \in [1,2)$ and $p \in [2,\infty]$:

  • For $p \in [2, \infty]$, the Lipschitz constant scales as $d^{1-1/p}$ (up to logarithmic factors), mirroring the behavior of the $\ell^{p'}$-norm of a standard Gaussian vector, where $1/p + 1/p' = 1$ (a heuristic derivation follows this list).
  • For $p \in [1,2)$, the scaling is $\sqrt{d}$, reflecting the $\ell^2$-norm of a Gaussian vector, with an additional $1/\sqrt{L}$ factor in the lower bound for deep networks.
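Heuristically, the dual exponent can be traced as follows (this is a reading of the abstract, not a reproduction of the paper's proof). For the piecewise-linear network $\Phi$, the Lipschitz constant is a supremum of dual gradient norms,
\[
\mathrm{Lip}_p(\Phi) = \sup_{x} \|\nabla \Phi(x)\|_{p'}, \qquad \frac{1}{p} + \frac{1}{p'} = 1,
\]
where the supremum runs over points of differentiability. At a fixed input the gradient behaves like a standard Gaussian vector $g \in \mathbb{R}^d$. For $p \in [2,\infty]$ the dual exponent satisfies $p' \in [1,2]$, and concentration of the coordinates gives
\[
\|g\|_{p'} = \Big(\sum_{i=1}^{d} |g_i|^{p'}\Big)^{1/p'} \approx \big(d\,\mathbb{E}|g_1|^{p'}\big)^{1/p'} \asymp d^{1/p'} = d^{1-1/p}.
\]
For $p \in [1,2)$, taking the supremum over inputs inflates the pointwise estimate up to the order of $\|g\|_2 \asymp \sqrt{d}$, which is the source of the dichotomy.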

The results are summarized in the following table (zero-bias case):

$p$ range      | Lower Bound    | Upper Bound
$[1,2)$        | $\sqrt{d/L}$   | $\sqrt{d \cdot \ln(N/d)}$
$[2,\infty]$   | $d^{1-1/p}$    | $d^{1-1/p} \sqrt{\ln(N/d)}$

For networks with nonzero biases, similar bounds are established, with an additional $\sqrt{L}$ factor in the upper bound, reflecting the increased complexity due to bias-induced inhomogeneity.

Technical Approach

The analysis leverages several advanced probabilistic and geometric tools:

  • Random Hyperplane Tessellations: The number of neurons at which activation patterns differ between two inputs is controlled using results on random hyperplane tessellations, allowing precise estimates of how the ReLU nonlinearity fragments the input space (a toy numerical illustration follows this list).
  • Pointwise Gradient Estimates: The authors derive sharp concentration bounds for the dual $\ell^{p'}$-norm of the gradient at a fixed input, using a randomized gradient formalism that decouples dependencies between layers.
  • Covering Arguments: Uniform control over the input space is achieved via $\varepsilon$-net arguments, with careful management of covering numbers and the associated union bounds.
  • Sudakov Minoration and Decoupling: For lower bounds, especially in the $p \in [1,2)$ regime, the authors employ Sudakov's minoration and decoupling techniques to relate the supremum of the gradient norm over the input space to the geometry of Gaussian processes.
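A toy numerical check of the tessellation idea from the first bullet above (an illustration, not the paper's argument): for a single layer of random ReLU neurons, each weight row defines a hyperplane through the origin, and the probability that it separates two inputs $x$ and $y$ equals their angle divided by $\pi$, so the expected fraction of differing activation patterns is exactly that angular distance.

```python
import numpy as np

# One random ReLU layer: each row of W is a random hyperplane through the origin.
# The fraction of neurons whose activation pattern differs between x and y should
# concentrate around angle(x, y) / pi; the weight scale is irrelevant for the signs.
rng = np.random.default_rng(1)
d, N = 64, 100_000
W = rng.normal(0.0, np.sqrt(2.0 / N), size=(N, d))

x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)      # a nearby, perturbed input

differs = np.mean((W @ x > 0) != (W @ y > 0))
cos = np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1.0, 1.0)
print(f"fraction of sign flips: {differs:.4f}   angle/pi: {np.arccos(cos) / np.pi:.4f}")
```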

Numerical and Asymptotic Implications

  • Tightness: For shallow networks ($L=1$), the bounds are tight up to constants for all $p \in [1,\infty]$ and arbitrary bias distributions, matching previous results for $p=2$ and extending them to all $p$.
  • Depth and Width Dependence: The upper bounds exhibit only logarithmic dependence on the width $N$ and, in the case of nonzero biases, a $\sqrt{L}$ dependence on depth, which is a significant improvement over prior exponential-in-depth bounds.
  • Bias Robustness: The lower bounds are robust to the choice of bias distribution, while the upper bounds require only mild regularity (bounded density) and symmetry assumptions.

Practical Implications

  • Adversarial Robustness: The results provide explicit, high-probability estimates for the worst-case sensitivity of randomly initialized deep ReLU networks to input perturbations, directly informing the design and analysis of robust architectures.
  • Initialization Theory: The analysis clarifies how initialization schemes and network width/depth interact to control the Lipschitz constant, with implications for trainability and stability in deep learning.
  • Verification and Certification: The derived bounds can be used to certify upper bounds on the Lipschitz constant of randomly initialized networks, which is a key step in formal verification pipelines for neural network safety.

Implementation Considerations

  • Computational Feasibility: The bounds are non-asymptotic and can be instantiated for concrete network sizes, making them directly applicable in practice for network verification and robustness estimation (a minimal instantiation is sketched after this list).
  • Extension to Other Architectures: While the analysis is for fully-connected ReLU networks, the techniques (especially those based on random tessellations and covering arguments) are adaptable to other architectures and activation functions, provided similar geometric properties hold.
  • Scaling to Large Networks: The logarithmic dependence on width and linear (or sublinear) dependence on depth ensure that the bounds remain meaningful even for very large networks, as commonly encountered in modern deep learning.
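As a minimal sketch of such an instantiation, the function below evaluates the constant-free rates from the summary table for a concrete architecture; the absolute constants and failure probabilities from the paper's theorems are omitted, so the outputs should be read as orders of magnitude only.

```python
import math

def lipschitz_rate_bounds(d, N, L, p, zero_bias=True):
    """Order-of-magnitude rates from the summary table (absolute constants omitted).

    Returns (lower, upper) for the zero-bias case; with nonzero biases the upper
    bound picks up an extra sqrt(L) factor, as discussed above.
    """
    log_term = math.sqrt(math.log(N / d))
    if 1 <= p < 2:
        lower = math.sqrt(d / L)
        upper = math.sqrt(d) * log_term
    else:  # p in [2, infinity]
        lower = d ** (1 - 1 / p)
        upper = d ** (1 - 1 / p) * log_term
    if not zero_bias:
        upper *= math.sqrt(L)
    return lower, upper

# Example: a wide, moderately deep network at p = 2 and p = infinity.
print(lipschitz_rate_bounds(d=128, N=8192, L=5, p=2))
print(lipschitz_rate_bounds(d=128, N=8192, L=5, p=float("inf")))
```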

Theoretical Implications and Future Directions

  • Sharpness and Gaps: The remaining logarithmic and $\sqrt{L}$ gaps in the upper and lower bounds (for deep, wide networks) suggest potential for further tightening, possibly via refined geometric or probabilistic arguments.
  • Beyond Random Initialization: Extending the analysis to trained networks, or to other initialization schemes, is a natural next step, with potential to inform both theory and practice in deep learning.
  • Generalization and Expressivity: The connection between Lipschitz constants, adversarial robustness, and generalization remains a fertile area for further exploration, particularly in understanding the trade-offs imposed by network architecture and initialization.

Conclusion

This work provides a rigorous, near-optimal characterization of the $\ell^p$-Lipschitz constants of deep random ReLU networks, with strong implications for both the theory and practice of deep learning. The probabilistic and geometric techniques developed here set a new standard for the analysis of neural network sensitivity and robustness, and offer a foundation for future advances in the mathematical understanding of deep architectures.
