A Distributional View of High Dimensional Optimization (2507.16315v1)
Abstract: This PhD thesis presents a distributional view of optimization in place of a worst-case perspective. We motivate this view with an investigation of the failure point of classical optimization. Subsequently we consider the optimization of a randomly drawn objective function. This is the setting of Bayesian Optimization. After a review of Bayesian optimization we outline how such a distributional view may explain predictable progress of optimization in high dimension. It further turns out that this distributional view provides insights into optimal step size control of gradient descent. To enable these results, we develop mathematical tools to deal with random input to random functions and a characterization of non-stationary isotropic covariance kernels. Finally, we outline how assumptions about the data, specifically exchangeability, can lead to random objective functions in machine learning and analyze their landscape.
Summary
- The paper presents a novel distributional framework to overcome the curse of dimensionality inherent in worst-case optimization.
- It rigorously develops measure-theoretic foundations and characterizes invariant kernels for modeling random functions in Bayesian optimization.
- The introduced Random Function Descent (RFD) algorithm, with covariance-driven step sizes, achieves performance competitive with tuned gradient descent methods.
A Distributional View of High Dimensional Optimization
This dissertation presents a comprehensive and rigorous analysis of high-dimensional optimization from a distributional perspective, challenging the classical worst-case paradigm and providing a mathematical foundation for average-case and Bayesian approaches. The work is structured in two main parts: (1) a critical examination of worst-case black-box optimization and the curse of dimensionality, and (2) the development of a distributional (random function) framework for optimization, with applications to machine learning and neural network loss landscapes.
Worst-Case Black-Box Optimization and the Curse of Dimensionality
The initial chapters formalize the limitations of worst-case global optimization in high dimensions. Under minimal regularity assumptions (e.g., continuity or Lipschitz continuity), the number of function evaluations required to guarantee an ϵ-optimal solution grows exponentially with the dimension d of the domain. This is established via covering number arguments, showing that even with L-Lipschitz or L-smoothness assumptions, the optimal evaluation complexity scales as O(ϵ^{-d}) or O(ϵ^{-d/2}), respectively. The analysis is extended to general moduli of continuity, providing tight lower and upper bounds for arbitrary continuity classes.
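As a rough illustration of this scaling (a back-of-the-envelope sketch assuming Lipschitz continuity in the sup norm on the unit cube, not the thesis's exact construction), a naive grid search already exhibits the exponential blow-up:

```python
import math

def grid_evaluations(L: float, eps: float, d: int) -> int:
    """Evaluations needed by a naive grid search to certify an eps-optimal point
    of an L-Lipschitz (sup-norm) function on [0, 1]^d: grid spacing 2*eps/L
    guarantees every point lies within sup-distance eps/L of an evaluated one."""
    per_axis = math.ceil(L / (2 * eps))
    return per_axis ** d

for d in (1, 2, 5, 10, 20):
    print(d, grid_evaluations(L=1.0, eps=0.01, d=d))
# d = 1 needs 50 evaluations; d = 20 already needs 50**20, on the order of 1e34.
```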
A key insight is that higher-order oracles (e.g., access to gradients or Hessians) do not fundamentally alter the exponential scaling in the worst case, as adversarially constructed functions can withhold all useful information from the optimizer. The only way to circumvent this is to impose strong structural assumptions, such as the Polyak-Łojasiewicz (PL) inequality or strong convexity, which are not satisfied in most machine learning settings due to the prevalence of saddle points and non-convexity.
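The PL condition itself is not restated in the summary; for reference, its standard gradient-domination form is the following, under which gradient descent regains linear convergence even without convexity:

```latex
% Polyak-Łojasiewicz (gradient-domination) inequality: for some \mu > 0 and all x,
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\geq\; \mu \bigl( f(x) - f^{\ast} \bigr).
```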
Distributional (Average-Case) Optimization: Bayesian and Random Function Approaches
Recognizing the disconnect between worst-case theory and empirical success in high-dimensional machine learning, the dissertation advocates for a distributional view of optimization. The central thesis is that real-world objective functions are not adversarially chosen; instead, they can be modeled as random functions drawn from appropriate distributions. This motivates the study of Bayesian optimization, Kriging, and related frameworks, where the optimizer leverages probabilistic models (typically Gaussian processes) to guide the search.
Measure-Theoretic Foundations
A significant technical contribution is the rigorous measure-theoretic treatment of random function optimization. The work addresses subtle issues of measurability, conditional distributions, and the evaluation of random functions at random (possibly previsible or conditionally independent) locations. Theorems are provided to justify the common practice of treating previsible evaluation points as deterministic when computing conditional distributions, especially in the Gaussian case. This ensures that Bayesian optimization algorithms are mathematically well-founded even in infinite-dimensional settings.
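As a concrete finite-dimensional instance of the conditioning that this machinery justifies, the sketch below computes a Gaussian posterior at a query point given evaluations at previously chosen locations, treating those locations as deterministic; the squared-exponential kernel and the data are illustrative only.

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    """Squared-exponential kernel matrix between row-stacked point sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * ell ** 2))

def gp_posterior(X_obs, y_obs, x_query, noise=1e-8):
    """Gaussian conditional mean/variance at x_query given f(X_obs) = y_obs
    (zero prior mean), with the observed locations treated as fixed."""
    K = se_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    k_star = se_kernel(x_query[None, :], X_obs)            # shape (1, n)
    alpha = np.linalg.solve(K, y_obs)
    mean = k_star @ alpha
    var = se_kernel(x_query[None, :], x_query[None, :]) - k_star @ np.linalg.solve(K, k_star.T)
    return mean.item(), var.item()

X_obs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_obs = np.array([0.3, -0.1, 0.5])
print(gp_posterior(X_obs, y_obs, np.array([0.5, 0.5])))
```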
Invariant Priors and Covariance Kernel Characterization
The dissertation systematically develops the theory of invariant priors over function spaces, focusing on isotropy and stationarity as natural uniformity assumptions. It provides a detailed characterization of positive definite kernels invariant under translation, rotation, and their combinations, extending classical results (Bochner, Schoenberg) to non-stationary isotropic kernels. The main result is a series representation of isotropic kernels on R^d (and ℓ^2), parameterized by positive definite coefficient kernels and normalized Gegenbauer polynomials. This unifies and generalizes the kernel families used in Gaussian process modeling, including those arising from infinite-width neural networks.
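As a point of orientation only (not the thesis's result), the sketch below evaluates a truncated Schoenberg-style expansion of a zonal kernel on the sphere S^{d-1} using normalized Gegenbauer polynomials; in the non-stationary extension summarized above, the scalar coefficients a_n would be replaced by positive definite coefficient kernels in the radii ‖x‖ and ‖y‖.

```python
import numpy as np
from scipy.special import eval_gegenbauer

def zonal_kernel(x, y, coeffs, d):
    """Truncated Schoenberg-type expansion on the sphere S^{d-1}:
    k(x, y) = sum_n a_n * C_n^alpha(cos angle) / C_n^alpha(1) with alpha = (d - 2) / 2,
    positive definite whenever all coefficients a_n are nonnegative."""
    alpha = (d - 2) / 2
    t = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine of the angle
    return sum(a * eval_gegenbauer(n, alpha, t) / eval_gegenbauer(n, alpha, 1.0)
               for n, a in enumerate(coeffs))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
print(zonal_kernel(x, y, coeffs=[1.0, 0.5, 0.25, 0.1], d=3))
```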
Covariance of Derivatives and Strict Positive Definiteness
The work computes explicit formulas for the covariance of derivatives of isotropic random functions, which is essential for analyzing the behavior of gradient-based optimizers in the random function setting. It also establishes conditions under which the joint distribution of function values and derivatives is strictly positive definite, ensuring invertibility of covariance matrices and well-posedness of Bayesian updates.
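To make the flavor of these formulas concrete, here is a minimal sketch for the squared-exponential kernel, used only as a familiar special case of the isotropic family; the thesis's formulas cover general isotropic kernels.

```python
import numpy as np

def se_kernel(x, y, ell=1.0):
    """Squared-exponential covariance k(x, y) = exp(-||x - y||^2 / (2 ell^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * ell ** 2))

def cov_grad_value(x, y, ell=1.0):
    """Cov(grad f(x), f(y)) = d/dx k(x, y) = -(x - y) / ell^2 * k(x, y)."""
    return -(x - y) / ell ** 2 * se_kernel(x, y, ell)

def cov_grad_grad(x, y, ell=1.0):
    """Cov(grad f(x), grad f(y)) = (I / ell^2 - (x - y)(x - y)^T / ell^4) * k(x, y)."""
    diff = x - y
    return (np.eye(len(x)) / ell ** 2 - np.outer(diff, diff) / ell ** 4) * se_kernel(x, y, ell)

x, y = np.array([0.0, 0.0]), np.array([0.5, -0.3])
print(cov_grad_value(x, y))
print(cov_grad_grad(x, x))  # at x = y this equals I / ell^2: gradient components are uncorrelated
```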
Random Function Descent (RFD): A Scalable Bayesian Optimization Algorithm
A central practical contribution is the introduction and analysis of the Random Function Descent (RFD) algorithm. RFD is derived as the minimizer of the conditional expectation (the "stochastic Taylor approximation") of the objective, given current function and gradient information. Under isotropic Gaussian process priors, RFD is shown to coincide with gradient descent using a specific, theoretically justified step size schedule. The step size is determined by the covariance structure and adapts naturally to the local geometry of the function, providing scale invariance and equivariance properties absent in classical methods.
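A minimal sketch of this construction, assuming the isotropic covariance is written as C(‖x − y‖²/2) with a squared-exponential choice of C; the helper names are hypothetical, the search is restricted to the negative gradient direction, and the step size is found by 1D numerical minimization rather than the closed-form schedules derived in the thesis.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative parameterization: isotropic covariance k(x, y) = C(||x - y||^2 / 2)
# with a squared-exponential C(s) = sigma2 * exp(-s / ell^2); mu is the prior mean.
def C(s, sigma2=1.0, ell=1.0):
    return sigma2 * np.exp(-s / ell ** 2)

def dC(s, sigma2=1.0, ell=1.0):
    return -sigma2 / ell ** 2 * np.exp(-s / ell ** 2)

def rfd_step_size(f_x, grad_norm, mu=0.0):
    """Step length chosen by minimizing the conditional expectation ('stochastic
    Taylor approximation') of f(x + d) given f(x) and grad f(x), restricted to
    directions d = -eta * grad f(x) / ||grad f(x)||."""
    def cond_exp(eta):
        s = eta ** 2 / 2
        return (mu
                + C(s) / C(0.0) * (f_x - mu)
                - dC(s) / dC(0.0) * eta * grad_norm)
    return minimize_scalar(cond_exp, bounds=(0.0, 10.0), method="bounded").x

print(rfd_step_size(f_x=0.5, grad_norm=1.0))   # ~1.28 with these toy parameters
print(rfd_step_size(f_x=0.5, grad_norm=10.0))  # ~1.03: the step length saturates near ell
```

In this toy setting the implied step length saturates near the kernel length scale as the gradient norm grows (a clipping-like effective learning rate) and varies with f(x) − μ, mirroring the warmup/clipping discussion below.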
Explicit Step Size Schedules and Connection to Heuristics
The dissertation provides closed-form expressions for the RFD step size for common covariance models (squared exponential, Matérn, rational quadratic), and analyzes their asymptotic behavior. It is shown that RFD step sizes interpolate between "warmup" and "clipping" regimes, offering a principled explanation for empirically successful heuristics such as learning rate warmup and gradient clipping in deep learning. The asymptotic learning rate is directly linked to the covariance kernel and the difference between the current function value and the mean.
Covariance Estimation and Mini-Batch Loss
The work addresses the practical issue of estimating the covariance kernel from noisy, mini-batch-based function evaluations. It develops a nonparametric variance estimation procedure using weighted least squares regression over varying batch sizes, and proposes an entropy-maximizing batch size distribution to optimize the estimation process under computational constraints.
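The exact estimator and the entropy-maximizing batch size distribution are not reproduced here; the toy sketch below only illustrates the weighted-least-squares step, under the common (assumed) model that the mini-batch loss variance behaves like c0 + c1/b in the batch size b.

```python
import numpy as np

# Toy sketch: fit Var(L_b) ~ c0 + c1 / b by weighted least squares across batch sizes b,
# weighting each batch size by the number of mini-batch evaluations behind its estimate.
rng = np.random.default_rng(0)
batch_sizes = np.array([8, 16, 32, 64, 128, 256])
n_samples = np.array([200, 100, 50, 25, 12, 6])     # fewer large batches under a fixed compute budget
true_c0, true_c1 = 0.02, 1.5
empirical_var = true_c0 + true_c1 / batch_sizes + rng.normal(0.0, 0.01, size=batch_sizes.size)

X = np.column_stack([np.ones(batch_sizes.size), 1.0 / batch_sizes])
W = np.diag(n_samples.astype(float))                # weight matrix for the regression
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ empirical_var)
print("estimated (c0, c1):", beta)
```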
Empirical Validation
RFD is empirically validated on standard benchmarks (e.g., MNIST, Fashion-MNIST) using state-of-the-art convolutional networks. The results demonstrate that RFD, with covariance parameters estimated from a single epoch of data, achieves performance competitive with or superior to tuned SGD and Adam optimizers, without the need for extensive hyperparameter tuning. The step size schedule adapts automatically, and the method is robust to batch size and noise.
Theoretical and Practical Implications
The dissertation establishes that, in high dimensions, the progress of gradient-based optimizers on isotropic random functions concentrates tightly around the average-case trajectory, leading to deterministic optimization dynamics in the limit. This provides a theoretical justification for the empirical reliability of gradient descent in large-scale machine learning, despite the pessimism of worst-case analysis.
On the practical side, the RFD framework offers a scalable, theoretically grounded alternative to classical Bayesian optimization, with computational complexity matching that of gradient descent. The explicit connection between covariance structure and step size schedule enables principled design of optimizers and demystifies common heuristics.
Future Directions
The work identifies several avenues for further research:
- Relaxing Distributional Assumptions: Extending RFD to non-Gaussian, non-isotropic, or non-stationary priors, motivated by the observation that real-world objectives (e.g., in deep learning) often violate stationarity.
- Adaptive and Momentum Methods: Incorporating memory and adaptivity (as in Adam or momentum SGD) into the RFD framework, potentially via local or online covariance estimation.
- Loss Landscape Analysis: Applying the distributional framework to analyze the geometry of neural network loss landscapes, including the prevalence and structure of saddle points and local minima.
- Efficient Implementation: Optimizing the computational aspects of covariance estimation and RFD step size computation for large-scale, distributed, or online settings.
Summary Table: Key Theoretical Results
Topic | Main Result/Contribution |
---|---|
Worst-case complexity | Exponential in dimension for general continuous/Lipschitz/smooth functions |
Measure-theoretic foundation | Rigorous justification for previsible/conditionally independent sampling in random function optimization |
Invariant kernel characterization | Complete series representation for isotropic and stationary kernels on R^d and ℓ^2
Covariance of derivatives | Explicit formulas for gradient/Hessian covariances under isotropy |
Strict positive definiteness | Sufficient conditions for invertibility of joint covariance of function and derivatives |
RFD algorithm | Gradient descent with theoretically optimal, covariance-driven step size schedule |
Covariance estimation | Nonparametric, sample-efficient estimation via mini-batch variance regression |
Empirical validation | RFD matches or outperforms tuned SGD/Adam on standard deep learning tasks with minimal tuning |
Conclusion
This dissertation rigorously demonstrates that a distributional (random function) perspective resolves the apparent paradox between the intractability of worst-case high-dimensional optimization and the empirical success of gradient-based methods in machine learning. By providing both theoretical foundations and practical algorithms, it bridges the gap between Bayesian optimization and scalable first-order methods, and offers a principled framework for future developments in high-dimensional optimization and deep learning.
Follow-up Questions
- How does the distributional framework improve practical optimization compared to worst-case approaches?
- What measure-theoretic challenges are addressed in establishing the framework for random function evaluation?
- How do invariant priors and isotropic kernels influence the behavior of Bayesian optimization algorithms?
- In what ways does the RFD algorithm justify and explain popular heuristics like learning rate warmup and gradient clipping?
Related Papers
- A Tutorial on Bayesian Optimization (2018)
- Are Random Decompositions all we need in High Dimensional Bayesian Optimisation? (2023)
- Scalable Bayesian Optimization Using Deep Neural Networks (2015)
- Scalable Bayesian Inference in the Era of Deep Learning: From Gaussian Processes to Deep Neural Networks (2024)
- Random Function Descent (2023)
- Large-Scale Methods for Distributionally Robust Optimization (2020)
- Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning (2019)
- A Mean Field View of the Landscape of Two-Layers Neural Networks (2018)
- What is the long-run distribution of stochastic gradient descent? A large deviations analysis (2024)
- Gradient-free optimization via integration (2024)