
A Distributional View of High Dimensional Optimization (2507.16315v1)

Published 22 Jul 2025 in math.OC, math.PR, and stat.ML

Abstract: This PhD thesis presents a distributional view of optimization in place of a worst-case perspective. We motivate this view with an investigation of the failure point of classical optimization. Subsequently we consider the optimization of a randomly drawn objective function. This is the setting of Bayesian Optimization. After a review of Bayesian optimization we outline how such a distributional view may explain predictable progress of optimization in high dimension. It further turns out that this distributional view provides insights into optimal step size control of gradient descent. To enable these results, we develop mathematical tools to deal with random input to random functions and a characterization of non-stationary isotropic covariance kernels. Finally, we outline how assumptions about the data, specifically exchangeability, can lead to random objective functions in machine learning and analyze their landscape.

Summary

  • The paper presents a novel distributional framework to overcome the curse of dimensionality inherent in worst-case optimization.
  • It rigorously develops measure-theoretic foundations and characterizes invariant kernels for modeling random functions in Bayesian optimization.
  • The introduced Random Function Descent (RFD) algorithm, with covariance-driven step sizes, achieves performance competitive with tuned gradient descent methods.

A Distributional View of High Dimensional Optimization

This dissertation presents a comprehensive and rigorous analysis of high-dimensional optimization from a distributional perspective, challenging the classical worst-case paradigm and providing a mathematical foundation for average-case and Bayesian approaches. The work is structured in two main parts: (1) a critical examination of worst-case black-box optimization and the curse of dimensionality, and (2) the development of a distributional (random function) framework for optimization, with applications to machine learning and neural network loss landscapes.

Worst-Case Black-Box Optimization and the Curse of Dimensionality

The initial chapters formalize the limitations of worst-case global optimization in high dimensions. Under minimal regularity assumptions (e.g., continuity or Lipschitz continuity), the number of function evaluations required to guarantee an ϵ-optimal solution grows exponentially with the dimension d of the domain. This is established via covering number arguments, showing that even with L-Lipschitz or L-smoothness assumptions, the optimal evaluation complexity scales as O(ϵ^{-d}) or O(ϵ^{-d/2}), respectively. The analysis is extended to general moduli of continuity, providing tight lower and upper bounds for arbitrary continuity classes.
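As a back-of-the-envelope illustration of this scaling (with hypothetical constants, not the thesis's exact bounds): for an L-Lipschitz objective on the unit cube [0,1]^d, a uniform grid with spacing about 2ϵ/L per axis guarantees that some evaluation point is ϵ-optimal, so the required budget grows like ⌈L/(2ϵ)⌉^d:

```python
import math

def lipschitz_grid_size(L: float, eps: float, d: int) -> int:
    """Grid evaluations sufficient (and, up to constants, necessary) to
    guarantee an eps-optimal point of an L-Lipschitz function on [0,1]^d:
    spacing ~2*eps/L per axis, i.e. ceil(L / (2*eps)) points per axis."""
    return math.ceil(L / (2 * eps)) ** d

# The budget is exponential in the dimension d (here L = 1, eps = 0.05):
for d in (1, 2, 5, 10):
    print(d, lipschitz_grid_size(1.0, 0.05, d))
```

Already at d = 10 with these mild constants the guarantee costs 10^10 evaluations, which is the curse of dimensionality in its simplest form.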

A key insight is that higher-order oracles (e.g., access to gradients or Hessians) do not fundamentally alter the exponential scaling in the worst case, as adversarially constructed functions can withhold all useful information from the optimizer. The only way to circumvent this is to impose strong structural assumptions, such as the Polyak-Łojasiewicz (PL) inequality or strong convexity, which are not satisfied in most machine learning settings due to the prevalence of saddle points and non-convexity.

Distributional (Average-Case) Optimization: Bayesian and Random Function Approaches

Recognizing the disconnect between worst-case theory and empirical success in high-dimensional machine learning, the dissertation advocates for a distributional view of optimization. The central thesis is that real-world objective functions are not adversarially chosen; instead, they can be modeled as random functions drawn from appropriate distributions. This motivates the study of Bayesian optimization, Kriging, and related frameworks, where the optimizer leverages probabilistic models (typically Gaussian processes) to guide the search.

Measure-Theoretic Foundations

A significant technical contribution is the rigorous measure-theoretic treatment of random function optimization. The work addresses subtle issues of measurability, conditional distributions, and the evaluation of random functions at random (possibly previsible or conditionally independent) locations. Theorems are provided to justify the common practice of treating previsible evaluation points as deterministic when computing conditional distributions, especially in the Gaussian case. This ensures that Bayesian optimization algorithms are mathematically well-founded even in infinite-dimensional settings.

Invariant Priors and Covariance Kernel Characterization

The dissertation systematically develops the theory of invariant priors over function spaces, focusing on isotropy and stationarity as natural uniformity assumptions. It provides a detailed characterization of positive definite kernels invariant under translation, rotation, and their combinations, extending classical results (Bochner, Schoenberg) to non-stationary isotropic kernels. The main result is a series representation of isotropic kernels on ℝ^d (and ℓ²), parameterized by positive definite coefficient kernels and normalized Gegenbauer polynomials. This unifies and generalizes the kernel families used in Gaussian process modeling, including those arising from infinite-width neural networks.
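For orientation, the classical background result being extended is Schoenberg's theorem: a continuous g defines a positive definite kernel k(x, y) = g(⟨x, y⟩) on the sphere S^{d-1} exactly when it admits a nonnegative Gegenbauer expansion,

$$g(t) \;=\; \sum_{n=0}^{\infty} b_n\, C_n^{(d-2)/2}(t), \qquad b_n \ge 0.$$

The thesis's series representation follows this pattern but, roughly speaking, replaces the scalar coefficients b_n by positive definite coefficient kernels (in the radial arguments), which is what admits non-stationary isotropic kernels on ℝ^d and ℓ².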

Covariance of Derivatives and Strict Positive Definiteness

The work computes explicit formulas for the covariance of derivatives of isotropic random functions, which is essential for analyzing the behavior of gradient-based optimizers in the random function setting. It also establishes conditions under which the joint distribution of function values and derivatives is strictly positive definite, ensuring invertibility of covariance matrices and well-posedness of Bayesian updates.
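These explicit formulas specialize the standard Gaussian process identities for derivative observations: for a sufficiently smooth f ~ GP(m, k),

$$\operatorname{Cov}\big(\partial_{x_i} f(x),\, f(y)\big) = \partial_{x_i} k(x, y), \qquad \operatorname{Cov}\big(\partial_{x_i} f(x),\, \partial_{y_j} f(y)\big) = \partial_{x_i}\partial_{y_j} k(x, y),$$

with the isotropy of k yielding the closed forms in terms of the radial covariance function and its derivatives.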

Random Function Descent (RFD): A Scalable Bayesian Optimization Algorithm

A central practical contribution is the introduction and analysis of the Random Function Descent (RFD) algorithm. RFD is derived as the minimizer of the conditional expectation (the "stochastic Taylor approximation") of the objective, given current function and gradient information. Under isotropic Gaussian process priors, RFD is shown to coincide with gradient descent using a specific, theoretically justified step size schedule. The step size is determined by the covariance structure and adapts naturally to the local geometry of the function, providing scale invariance and equivariance properties absent in classical methods.
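A minimal sketch of this construction, under an illustrative squared-exponential prior k(x, y) = σ² exp(−‖x − y‖²/(2s²)) with constant mean μ and zero-mean gradients (the kernel choice, lengthscale s, and resulting closed form are assumptions for illustration, not the thesis's general schedule): conditioning on f(x) = μ + a and ∇f(x) = g, the conditional expectation a distance t along −g/‖g‖ reduces to exp(−t²/(2s²))·(a − t‖g‖) above μ, and minimizing over t gives a covariance-driven step size in closed form.

```python
import math

def cond_expectation(t: float, a: float, grad_norm: float, s: float) -> float:
    """Conditional mean (minus mu) of the objective t units downhill,
    under the illustrative squared-exponential prior described above."""
    return math.exp(-t * t / (2 * s * s)) * (a - t * grad_norm)

def rfd_step(a: float, grad_norm: float, s: float) -> float:
    """Minimizer of cond_expectation over t >= 0: the positive root of
    the quadratic grad_norm*t^2 - a*t - grad_norm*s^2 = 0."""
    return (a + math.sqrt(a * a + 4 * grad_norm ** 2 * s * s)) / (2 * grad_norm)

# At the prior mean (a = 0) the step is exactly the lengthscale s.
print(rfd_step(0.0, 2.0, 0.5))  # prints 0.5
```

Note how the step grows with a = f(x) − μ (large far above the mean, small below it), which is the mechanism behind the warmup/clipping interpolation discussed in the next section.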

Explicit Step Size Schedules and Connection to Heuristics

The dissertation provides closed-form expressions for the RFD step size for common covariance models (squared exponential, Matérn, rational quadratic), and analyzes their asymptotic behavior. It is shown that RFD step sizes interpolate between "warmup" and "clipping" regimes, offering a principled explanation for empirically successful heuristics such as learning rate warmup and gradient clipping in deep learning. The asymptotic learning rate is directly linked to the covariance kernel and the difference between the current function value and the mean.

Covariance Estimation and Mini-Batch Loss

The work addresses the practical issue of estimating the covariance kernel from noisy, mini-batch-based function evaluations. It develops a nonparametric variance estimation procedure using weighted least squares regression over varying batch sizes, and proposes an entropy-maximizing batch size distribution to optimize the estimation process under computational constraints.
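A minimal sketch of the idea behind such an estimator (the scaling model Var(L_b) = σ²/b for the loss of a size-b batch and the count-based weights are illustrative simplifications, not the thesis's exact procedure):

```python
import random
import statistics

def estimate_sigma2(batch_losses: dict) -> float:
    """Weighted least squares fit of the model Var(L_b) = sigma^2 / b to
    empirical mini-batch loss variances observed at several batch sizes.
    batch_losses maps batch size b -> list of observed mini-batch losses;
    each batch size is weighted by its observation count (illustrative)."""
    num = den = 0.0
    for b, losses in batch_losses.items():
        v_hat = statistics.variance(losses)  # empirical variance at size b
        w = len(losses)
        num += w * v_hat / b
        den += w / (b * b)
    return num / den

# Synthetic check: per-sample losses with variance sigma^2 = 4.0; a batch of
# size b averages b of them, so Var(L_b) = 4.0 / b by construction.
rng = random.Random(0)
data = {
    b: [statistics.fmean(rng.gauss(0.0, 2.0) for _ in range(b)) for _ in range(400)]
    for b in (1, 4, 16, 64)
}
est = estimate_sigma2(data)
print(round(est, 2))  # should land near 4.0
```

Spreading observations over several batch sizes is what makes the regression identifiable, which is also why the choice of batch size distribution (entropy-maximizing in the thesis) matters under a fixed compute budget.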

Empirical Validation

RFD is empirically validated on standard benchmarks (e.g., MNIST, Fashion-MNIST) using state-of-the-art convolutional networks. The results demonstrate that RFD, with covariance parameters estimated from a single epoch of data, achieves performance competitive with or superior to tuned SGD and Adam optimizers, without the need for extensive hyperparameter tuning. The step size schedule adapts automatically, and the method is robust to batch size and noise.

Theoretical and Practical Implications

The dissertation establishes that, in high dimensions, the progress of gradient-based optimizers on isotropic random functions concentrates tightly around the average-case trajectory, leading to deterministic optimization dynamics in the limit. This provides a theoretical justification for the empirical reliability of gradient descent in large-scale machine learning, despite the pessimism of worst-case analysis.

On the practical side, the RFD framework offers a scalable, theoretically grounded alternative to classical Bayesian optimization, with computational complexity matching that of gradient descent. The explicit connection between covariance structure and step size schedule enables principled design of optimizers and demystifies common heuristics.

Future Directions

The work identifies several avenues for further research:

  • Relaxing Distributional Assumptions: Extending RFD to non-Gaussian, non-isotropic, or non-stationary priors, motivated by the observation that real-world objectives (e.g., in deep learning) often violate stationarity.
  • Adaptive and Momentum Methods: Incorporating memory and adaptivity (as in Adam or momentum SGD) into the RFD framework, potentially via local or online covariance estimation.
  • Loss Landscape Analysis: Applying the distributional framework to analyze the geometry of neural network loss landscapes, including the prevalence and structure of saddle points and local minima.
  • Efficient Implementation: Optimizing the computational aspects of covariance estimation and RFD step size computation for large-scale, distributed, or online settings.

Summary Table: Key Theoretical Results

  • Worst-case complexity: Exponential in dimension for general continuous/Lipschitz/smooth functions
  • Measure-theoretic foundation: Rigorous justification for previsible/conditionally independent sampling in random function optimization
  • Invariant kernel characterization: Complete series representation for isotropic and stationary kernels on ℝ^d and ℓ²
  • Covariance of derivatives: Explicit formulas for gradient/Hessian covariances under isotropy
  • Strict positive definiteness: Sufficient conditions for invertibility of joint covariance of function values and derivatives
  • RFD algorithm: Gradient descent with theoretically optimal, covariance-driven step size schedule
  • Covariance estimation: Nonparametric, sample-efficient estimation via mini-batch variance regression
  • Empirical validation: RFD matches or outperforms tuned SGD/Adam on standard deep learning tasks with minimal tuning

Conclusion

This dissertation rigorously demonstrates that a distributional (random function) perspective resolves the apparent paradox between the intractability of worst-case high-dimensional optimization and the empirical success of gradient-based methods in machine learning. By providing both theoretical foundations and practical algorithms, it bridges the gap between Bayesian optimization and scalable first-order methods, and offers a principled framework for future developments in high-dimensional optimization and deep learning.
