
Riemannian Bilevel Optimization (RBLO)

Updated 20 October 2025
  • RBLO is a framework that extends bilevel optimization to Riemannian manifolds, addressing hierarchical problems with intrinsic geometric constraints such as orthogonality and positive definiteness.
  • Key methodologies include hypergradient computation through Hessian inversion, tangent-space conjugate gradient, and truncated Neumann series, which effectively manage manifold curvature and complexity.
  • RBLO algorithms exhibit robust convergence guarantees and have been successfully applied in areas like robust covariance estimation, meta-learning on the Stiefel manifold, and hyper-representation learning.

Riemannian bilevel optimization (RBLO) generalizes classical bilevel optimization by incorporating variables constrained to Riemannian manifolds. This extension is motivated by modern machine learning and signal processing tasks, where constraints such as orthogonality, positive definiteness, or low-rankness naturally equip variables with manifold structures. Formally, RBLO addresses hierarchical problems of the form

$$\min_{x \in \mathcal{M}_x} F(x) = f(x, y^*(x)) \quad \text{where} \quad y^*(x) = \arg\min_{y \in \mathcal{M}_y} g(x, y),$$

where $\mathcal{M}_x$ and $\mathcal{M}_y$ are differentiable manifolds equipped with Riemannian metrics, and $f, g$ are smooth functions. This framework requires novel algorithmic tools for hypergradient computation, manifold-adapted optimization, and convergence analysis to manage the geometric and bilevel structure simultaneously.

1. Mathematical Formulation and Problem Structure

RBLO formalizes a two-level optimization problem where both the upper-level ($x$) and lower-level ($y$) variables are elements of possibly distinct Riemannian manifolds. The lower-level objective $g(x, y)$ is typically assumed to be geodesically strongly convex in $y$, ensuring uniqueness and continuous dependence of $y^*(x)$ on $x$. The upper-level problem then becomes minimizing the composite function $F(x) = f(x, y^*(x))$. The implicit dependence of $F$ on $x$ through $y^*(x)$, together with the non-Euclidean geometry, necessitates careful analysis of optimality conditions and differentiability.
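For reference, geodesic strong convexity of $g(x, \cdot)$ is commonly stated as follows (one standard formulation; the precise assumption may vary slightly across the cited works): for every geodesic $\gamma : [0,1] \to \mathcal{M}_y$ and all $t \in [0,1]$,

$$g(x, \gamma(t)) \leq (1-t)\, g(x, \gamma(0)) + t\, g(x, \gamma(1)) - \frac{\mu}{2}\, t(1-t)\, d^2(\gamma(0), \gamma(1)),$$

where $d(\cdot, \cdot)$ is the Riemannian distance on $\mathcal{M}_y$ and $\mu > 0$ is the strong convexity modulus. This condition guarantees that the lower-level minimizer $y^*(x)$ is unique.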

The key quantity for first-order RBLO algorithms is the Riemannian hypergradient of $F(x)$. By applying the Riemannian implicit function theorem and chain rule, it is established (see, e.g., (Li et al., 3 Feb 2024, Han et al., 6 Feb 2024)) that

$$\operatorname{grad}_x F(x) = \operatorname{grad}_x f(x, y^*(x)) - \nabla^2_{y,x} g(x, y^*(x)) \Big[ H_y\big(g(x, y^*(x))\big)^{-1} \operatorname{grad}_y f(x, y^*(x)) \Big],$$

where $H_y(g)$ is the Riemannian Hessian of $g$ in $y$, and $\nabla^2_{y,x} g$ denotes the Riemannian cross-derivative operator. These differential geometric objects capture parallel transport, projections to tangent spaces, and the local curvature of the manifold.
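The formula follows from the lower-level stationarity condition by a standard implicit-differentiation argument; a brief sketch (suppressing the parallel-transport details handled carefully in the cited papers) is as follows. Since $\operatorname{grad}_y g(x, y^*(x)) = 0$ for all $x$, differentiating this identity in $x$ gives

$$\nabla^2_{x,y} g + H_y(g)\, \mathrm{D} y^*(x) = 0 \quad \Longrightarrow \quad \mathrm{D} y^*(x) = - H_y(g)^{-1} \nabla^2_{x,y} g,$$

and the chain rule $\operatorname{grad}_x F = \operatorname{grad}_x f + \mathrm{D} y^*(x)^{*}\, \operatorname{grad}_y f$, with $\mathrm{D} y^*(x)^{*}$ the adjoint of the differential, then yields the displayed hypergradient, with $\nabla^2_{y,x} g$ acting as that adjoint cross-derivative.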

2. Hypergradient Estimation and Algorithmic Strategies

Efficient computation of the hypergradient is central to RBLO. Due to the lack of closed-form solutions for $y^*(x)$ and the computational cost of Hessian inverses, several practical strategies have been proposed (Li et al., 3 Feb 2024, Han et al., 6 Feb 2024, Shi et al., 8 Apr 2025). These include:

  • The Hessian Inverse (HINV) approach computes the hypergradient using an explicit or iteratively approximated inversion of the lower-level Hessian.
  • Tangent-space Conjugate Gradient (CG) solves the required linear systems implicitly, relying on Hessian–vector products without forming the full Hessian.
  • Truncated Neumann Series (NS) approximates the inverse Hessian via a power series summing terms of the form $(I - \gamma H)^k$, controlled by the Lipschitz constant and step size; a minimal sketch of this idea appears directly after this list.
  • Automatic Differentiation (AD), using unrolled lower-level iterations, enables gradient computation by differentiating through the sequence of manifold operations such as retractions or exponential maps.
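To make the Neumann-series strategy concrete, the following is a minimal NumPy sketch of approximating an inverse-Hessian-vector product $H^{-1} b$ by the truncated series $\gamma \sum_{k=0}^{K} (I - \gamma H)^k b$. The dense toy Hessian and function names are illustrative assumptions, not the implementation of the cited papers.

    import numpy as np

    def neumann_inverse_hvp(hess_vec, b, gamma, num_terms):
        """Approximate H^{-1} b via gamma * sum_{k=0}^{K} (I - gamma*H)^k b.
        Valid when H is symmetric positive definite and 0 < gamma < 1/L,
        with L an upper bound on the largest eigenvalue of H."""
        term = b.copy()          # the k = 0 term, (I - gamma*H)^0 b
        total = b.copy()         # running sum of the series
        for _ in range(num_terms):
            term = term - gamma * hess_vec(term)   # multiply by (I - gamma*H)
            total += term
        return gamma * total

    # Toy check against a direct solve, with a dense SPD matrix standing in
    # for the lower-level Hessian.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    H = A @ A.T + 5.0 * np.eye(5)
    b = rng.standard_normal(5)
    v = neumann_inverse_hvp(lambda u: H @ u, b, gamma=1.0 / np.linalg.norm(H, 2), num_terms=200)
    print(np.linalg.norm(v - np.linalg.solve(H, b)))   # should be close to zero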

These methods involve a trade-off between computational efficiency and hypergradient estimation fidelity. For instance, CG-based estimators are scalable and accurate when equipped with proper stopping criteria on inner loops, while AD is generally simpler to implement but potentially suffers from curvature-induced errors.
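As an illustration of the matrix-free CG strategy, the sketch below solves the linear system $H_y v = \operatorname{grad}_y f$ using only Hessian-vector products and then assembles the hypergradient in the flat (Euclidean) special case. The callables and names are assumptions for illustration; the manifold versions additionally involve tangent-space projections and transports.

    import numpy as np

    def cg_solve(hess_vec, b, tol=1e-8, max_iter=100):
        """Solve H v = b by conjugate gradients, using only products
        hess_vec(p) = H p; H is assumed symmetric positive definite."""
        v = np.zeros_like(b)
        r = b - hess_vec(v)          # residual
        p = r.copy()                 # search direction
        rs = r @ r
        for _ in range(max_iter):
            Hp = hess_vec(p)
            alpha = rs / (p @ Hp)
            v += alpha * p
            r -= alpha * Hp
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:    # inner-loop stopping criterion
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return v

    def hypergradient(grad_x_f, grad_y_f, cross_vec, hess_vec):
        """Euclidean analogue of the RBLO hypergradient:
        grad_x F = grad_x f - cross(H^{-1} grad_y f), where cross_vec(u)
        applies the cross-derivative of g to the vector u."""
        v = cg_solve(hess_vec, grad_y_f)
        return grad_x_f - cross_vec(v)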

A spectrum of RBLO algorithms leverages these estimators, including deterministic hypergradient-descent schemes (e.g., RHGD, AdaRHD) and stochastic variants (e.g., RieSBO).

These algorithms utilize retractions (as computationally efficient surrogates for exponential maps), parallel transport for tangent space alignment, and hybrid step size strategies (e.g., Barzilai–Borwein followed by diminishing steps) to respect manifold geometry.
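For concreteness, the sketch below implements one standard retraction, the QR-based retraction on the Stiefel manifold of orthonormal frames (a common choice in the Riemannian optimization literature; the cited papers may employ other retractions or the exponential map).

    import numpy as np

    def qr_retraction(X, xi):
        """QR-based retraction on the Stiefel manifold St(n, p): map the
        tangent vector xi at X (orthonormal columns) to the Q factor of
        X + xi, with signs fixed so the factorization is unique."""
        Q, R = np.linalg.qr(X + xi)
        signs = np.sign(np.diag(R))
        signs[signs == 0] = 1.0
        return Q * signs                     # flip column signs so diag(R) > 0

    # Usage: project an ambient gradient to the tangent space, then retract.
    rng = np.random.default_rng(1)
    X, _ = np.linalg.qr(rng.standard_normal((8, 3)))   # a point on St(8, 3)
    G = rng.standard_normal((8, 3))                    # ambient (Euclidean) gradient
    xi = G - X @ (0.5 * (X.T @ G + G.T @ X))           # tangent-space projection
    Y = qr_retraction(X, -0.1 * xi)                    # one descent step
    print(np.linalg.norm(Y.T @ Y - np.eye(3)))         # ~ 0: Y stays on the manifold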

3. Convergence Analysis and Oracle Complexity

Convergence guarantees for RBLO algorithms require blending Riemannian optimization theory with bilevel-specific complexity analysis. Under standard assumptions (geodesic strong convexity and Lipschitz smoothness of the objectives, bounded sectional curvature, and a positive lower bound on the injectivity radius), it is established that:

  • For deterministic schemes (e.g., RHGD, AdaRHD), an $\epsilon$-stationary point (defined by $\|\operatorname{grad} F(x)\| \leq \epsilon$) is attainable in $\mathcal{O}(1/\epsilon)$ outer iterations, with total gradient complexity for the upper-level function scaling as $\mathcal{O}(1/\epsilon)$ and that for the lower-level function as $\mathcal{O}(1/\epsilon^2)$ (Shi et al., 8 Apr 2025).
  • CG-based hypergradient computation further refines the second-order complexity to nearly logarithmic dependence on $1/\epsilon$ by exploiting the fast linear (geometric) convergence of conjugate gradient methods on manifolds (Shi et al., 8 Apr 2025).
  • Stochastic variants (e.g., RieSBO, stochastic RHGD) admit complexity bounds with variance-dependent terms; e.g., $\mathcal{O}(\kappa^5/\epsilon^2)$ for upper-level gradient evaluations, where $\kappa$ denotes the condition number of the lower-level problem (Han et al., 6 Feb 2024).
  • The use of general retraction mappings, as opposed to exponential maps, does not affect the overall convergence rates but influences constants in the Lyapunov/complexity bounds, provided the approximation error is suitably controlled (Han et al., 6 Feb 2024, Shi et al., 8 Apr 2025).

Convergence is further supported by Lyapunov-based arguments and stationarity/optimality gap analyses. These results are robust to manifold curvature up to moderate levels, and additional terms in complexity (e.g., factors involving curvature constants $\zeta$) are often explicit in the derived bounds.

4. Adaptive and First-Order Approaches

The development of parameter-free and fully first-order methods has removed restrictive requirements on prior knowledge of problem constants or explicit calculation of second-order derivatives. AdaRHD (Shi et al., 8 Apr 2025) is notable in this context: each loop adapts its step size using an "inverse cumulative gradient norm" rule, yielding sufficient descent without explicit curvature or Lipschitz constants. This adaptive methodology retains $\mathcal{O}(1/\epsilon)$ complexity and supports retraction-based manifold updates for scalability.
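One way to picture an "inverse cumulative gradient norm" rule is the AdaGrad-norm-style schedule sketched below. The exact update used by AdaRHD may differ, so this should be read as an illustrative assumption rather than a reproduction of the paper's rule.

    import numpy as np

    def adaptive_stepsizes(grad_norms, eta0=1.0, delta=1e-12):
        """Illustrative AdaGrad-norm-style schedule: the step size at iteration t
        is eta0 divided by the square root of the cumulative squared Riemannian
        gradient norms seen so far, so no Lipschitz or curvature constants are needed."""
        cum = np.cumsum(np.asarray(grad_norms, dtype=float) ** 2)
        return eta0 / np.sqrt(cum + delta)

    # Example: decaying gradient norms induce an automatically decaying step size.
    print(adaptive_stepsizes([4.0, 2.0, 1.0, 0.5, 0.25]))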

Additionally, fully first-order schemes such as RF²SA (Dutta et al., 22 May 2024) leverage Lagrangian reformulations with a ramping multiplier. The main insight is that, for sufficiently large multipliers, the solution to the penalized Lagrangian approaches that of the original bilevel problem, and the gradient discrepancy shrinks at the rate $\mathcal{O}(1/\lambda)$. The entire procedure circumvents Hessian computation, and convergence rates range from $\tilde{\mathcal{O}}(K^{-2/7})$ to $\tilde{\mathcal{O}}(K^{-2/3})$ depending on gradient noise levels.
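A common way to write such a reformulation, stated here as a plausible form consistent with fully first-order bilevel methods (the precise RF²SA construction may differ in details), is the penalized value-function problem

$$\min_{x \in \mathcal{M}_x,\, y \in \mathcal{M}_y} \; \mathcal{L}_\lambda(x, y) = f(x, y) + \lambda \Big( g(x, y) - \min_{z \in \mathcal{M}_y} g(x, z) \Big),$$

whose gradients require only first-order information about $f$ and $g$, and whose stationary points approach those of the original bilevel problem as the multiplier $\lambda$ grows.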

Bilevel descent aggregation (BDA; Chen et al., 17 Oct 2025) co-optimizes upper- and lower-level variables in tandem, using a convex combination of their Riemannian gradients. This strategy accelerates objective descent, especially in practical problems with complex manifold geometry.
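Schematically, the aggregated lower-level update can be written as a retraction along a convex combination of the two Riemannian gradients (an illustrative form; the exact weights and transport operations follow the cited paper):

$$y_{k+1} = \mathcal{R}_{y_k}\!\Big( -\eta_k \big[ \alpha_k \operatorname{grad}_y f(x_k, y_k) + (1 - \alpha_k) \operatorname{grad}_y g(x_k, y_k) \big] \Big), \qquad \alpha_k \in [0, 1],$$

where $\mathcal{R}$ denotes a retraction on $\mathcal{M}_y$.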

5. Applications and Empirical Results

RBLO has seen successful application in varied domains, leveraging the ability to model hierarchical problems with intrinsic geometric constraints. Key applications from recent literature include:

  • Robust Karcher mean estimation on $\mathrm{SPD}(n)$: the lower level solves for a robust mean of symmetric positive definite matrices; the upper level adapts sample weights. Both RieBO and AdaRHD demonstrate decreasing objective value and stationarity metrics (Li et al., 3 Feb 2024, Shi et al., 8 Apr 2025). A simplified sketch of the lower-level solve appears after this list.
  • Hyper-representation learning: bilevel structure for learning discriminative embeddings of SPD matrices, crucial for downstream regression/classification; empirical results corroborate improved generalization from manifold-aware bilevel optimization (Han et al., 6 Feb 2024, Shi et al., 8 Apr 2025).
  • Meta-learning on the Stiefel manifold: common base parameters are optimized on orthogonality-constrained spaces; RBLO methods surpass extrinsic (project-and-correct) baselines (Han et al., 6 Feb 2024).
  • Multi-view hypergraph spectral clustering: using the Grassmannian to encode subspace constraints, the FBDA algorithm achieves improved clustering accuracy, NMI, ARI, and F1 scores compared to Euclidean and naive RBLO approaches on the 3sources dataset (Chen et al., 17 Oct 2025).
  • Unsupervised domain adaptation via optimal transport: a Riemannian bilevel formulation over doubly stochastic and SPD manifolds enhances domain transfer performance (Han et al., 6 Feb 2024).
  • Robust MLE of covariance matrices and data hypercleaning tasks: several works report superior convergence speed, robustness, and final accuracy using the proposed Riemannian techniques, both in synthetic and real data regimes.
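To ground the first application above, the sketch below runs Riemannian gradient steps for the plain (unweighted) Karcher mean of SPD matrices under the affine-invariant metric; the robustness terms, sample weighting, and upper-level coupling of the cited works are omitted, so this is a simplified illustration of the lower-level solve only.

    import numpy as np
    from scipy.linalg import expm, logm, sqrtm, inv

    def karcher_mean(mats, num_iters=20, step=1.0):
        """Riemannian gradient descent for the Karcher mean of SPD matrices
        under the affine-invariant metric; step=1.0 recovers the classical
        fixed-point iteration. `mats` is a sequence of SPD arrays of equal size."""
        X = np.mean(mats, axis=0)                    # initialize at the arithmetic mean
        for _ in range(num_iters):
            Xh = np.real(sqrtm(X))                   # X^{1/2}
            Xh_inv = inv(Xh)
            # Average of the Riemannian logarithms of the data points at X.
            T = np.mean([np.real(logm(Xh_inv @ A @ Xh_inv)) for A in mats], axis=0)
            X = Xh @ np.real(expm(step * T)) @ Xh    # exponential-map update
        return X

    # Toy usage on three random SPD matrices.
    rng = np.random.default_rng(2)
    mats = [B @ B.T + 4.0 * np.eye(4) for B in rng.standard_normal((3, 4, 4))]
    M = karcher_mean(mats)
    print(np.allclose(M, M.T), np.all(np.linalg.eigvalsh(M) > 0))   # SPD check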

6. Theoretical and Practical Implications

The extension of bilevel optimization to Riemannian manifolds significantly broadens the scope of hierarchical modeling. Several implications are evident:

  • Transferability of Euclidean results: Many key complexity bounds and convergence rates port from flat to curved spaces, often "almost without loss," provided appropriate manifold conditions hold (Li et al., 3 Feb 2024, Han et al., 6 Feb 2024).
  • Curvature adaptivity: Modern algorithms either explicitly adapt to or are robust against manifold curvature, and the use of retractions makes the approach practical even for large-scale, high-dimensional problems (Shi et al., 8 Apr 2025, Han et al., 6 Feb 2024).
  • Coordination of upper/lower levels: Gradient aggregation mechanisms and hybrid step sizes provide effective means to leverage structural information from both problem levels, improving both convergence and empirical accuracy (Chen et al., 17 Oct 2025).
  • Scalability: Single-loop, first-order, and inversion-free variants make RBLO feasible for real-world data and computational constraints (Dutta et al., 22 May 2024).
  • Robustness to parameter mis-specification: Adaptive step size frameworks obviate the need for problem-specific tuning, enabling black-box deployment of RBLO solvers in diverse contexts.

On a broader scientific level, RBLO provides a natural language for meta-learning with geometric priors, hierarchical statistical inference, and physics-based modeling where coordinate invariance is requisite.

7. Future Directions

Open challenges and prospective extensions in RBLO include:

  • Curvature-independent convergence: Efforts to further attenuate or eliminate curvature-dependent terms in complexity bounds would enhance generality and performance on highly curved manifolds.
  • Non-smooth and non-strongly convex lower-level problems: Extensions to settings where geodesic convexity fails would broaden the utility of RBLO, though foundational results highlight increased difficulty in these regimes.
  • Stochastic and distributed RBLO: Large-scale learning demands algorithms robust to mini-batch noise and suitable for parallel architectures, motivating further algorithmic and theoretical work (Han et al., 6 Feb 2024, Dutta et al., 22 May 2024).
  • Unifying hypergradient computation: Developing more unified automatic differentiation frameworks for Riemannian settings remains an active area, aiming at seamless integration with deep learning toolchains.
  • Application expansion: The RBLO paradigm is poised for impact in graph neural networks with geometric constraints, decentralized multi-agent systems, and scientific computing where manifold-valued data abound.

In summary, recent advances in RBLO have produced a mature suite of algorithms, supported by rigorous convergence analysis and robust empirical validations. By faithfully accounting for geometric structure at both problem levels, RBLO serves as a foundation for hierarchical decision-making in modern geometric machine learning and allied disciplines.
