
Budget Constraints as Riemannian Manifolds

Published 1 May 2026 in cs.LG | (2605.00649v1)

Abstract: Assigning one of K options to each of N groups under a total cost budget is a recurring problem in machine learning, appearing in mixed-precision quantization, non-uniform pruning, and expert selection. The objective (model loss) depends jointly on all assignments and does not decompose across groups, which prevents combinatorial solvers from optimizing the true objective directly and limits them to proxy objectives. Evolutionary search evaluates the actual loss but lacks gradient information, while penalty-based methods provide gradients but enforce the budget only approximately and require sensitive hyperparameter tuning. We observe that under softmax relaxation, the budget constraint defines a smooth Riemannian manifold in logit space with particularly simple geometry: the normal vector is available in closed form, shifting logits along the cost vector changes expected cost monotonically, allowing binary-search retraction, and vector transport reduces to a single inner product. Building on this structure, we propose Riemannian Constrained Optimization (RCO), which augments a standard Adam update with tangent projection, binary-search retraction, and momentum transport. Combined with Gumbel straight-through estimation and budget-constrained dynamic programming for discrete feasibility, RCO enables first-order optimization of the true objective under exact budget enforcement, without introducing constraint hyperparameters. On synthetic knapsack problems with known optima, the manifold-based constraint handling recovers optimal solutions, whereas penalty methods plateau at 83% of optimal. On LLM compression tasks, including mixed-precision quantization and MoE expert pruning, RCO matches or exceeds evolutionary search methods while requiring 3x to 16x lower wall-clock cost on the evaluated configurations.

Authors (2)

Summary

  • The paper introduces a novel geometric formulation by modeling budget constraints as a Riemannian manifold in logit space to ensure exact constraint enforcement.
  • It employs tangent projection and binary search retraction, eliminating hyperparameter tuning while maintaining feasibility at every optimization step.
  • Empirical results on knapsack problems and LLM compression demonstrate superior precision and efficiency over traditional penalty and Lagrangian methods.

Budget Constraints as Riemannian Manifolds: A Formal and Technical Overview

Introduction and Motivation

Optimizing discrete assignments under a total cost constraint is a recurring problem in machine learning, notably in mixed-precision quantization, non-uniform pruning, and expert selection within large models. Because the objective does not decompose across assignment groups, existing approaches each give something up: combinatorial methods enforce the budget exactly but can only optimize proxy objectives, while penalty-based gradient methods supply gradients but enforce the budget only approximately and require sensitive hyperparameter tuning. This paper introduces a geometric framework that parameterizes assignments with a softmax relaxation and treats the budget constraint as a smooth Riemannian manifold in logit space. This formulation enables exact budget enforcement with first-order optimization via Riemannian Constrained Optimization (RCO), without constraint hyperparameters, and yields efficient, scalable algorithms for model compression tasks (2605.00649).

Geometry of the Budget Manifold

The core theoretical contribution establishes that the level set $M = \{\boldsymbol{\alpha} : C(\boldsymbol{\alpha}) = B\}$, where $C(\boldsymbol{\alpha})$ is the softmax-parameterized expected cost, forms a well-behaved Riemannian submanifold in logit space. The normal vector at each point is closed-form: $(\nabla C)_{ik} = w_i\, p_{ik}(c_k - E_{p_i}[c])$. This geometry ensures that tangent projection, monotonic retraction (via binary search), and vector transport reduce to computationally trivial steps, each requiring only inner products or scalar root-finding. These operations enable the optimizer to maintain feasibility after every iteration, in sharp contrast to penalty/Lagrangian schemes.

Figure 1: $M$ is preserved at each step by tangent projection, Adam update, and retraction via binary search along cost vector directions.
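
The closed-form normal can be checked numerically. The sketch below is a minimal illustration, not the authors' code: the toy logits, weights `w`, and costs `c` are assumptions chosen for the example, and the helper names (`softmax`, `expected_cost`, `cost_gradient`) are hypothetical. It compares the stated formula against central finite differences of the expected cost.

```python
import math

def softmax(row):
    """Numerically stable softmax over one group's logits."""
    m = max(row)
    exps = [math.exp(a - m) for a in row]
    z = sum(exps)
    return [e / z for e in exps]

def expected_cost(alpha, w, c):
    """C(alpha) = sum_i w_i * E_{p_i}[c], with p_i = softmax(alpha_i)."""
    total = 0.0
    for row, w_i in zip(alpha, w):
        p = softmax(row)
        total += w_i * sum(pk * ck for pk, ck in zip(p, c))
    return total

def cost_gradient(alpha, w, c):
    """Closed-form normal: (grad C)_{ik} = w_i * p_ik * (c_k - E_{p_i}[c])."""
    grad = []
    for row, w_i in zip(alpha, w):
        p = softmax(row)
        mean_c = sum(pk * ck for pk, ck in zip(p, c))
        grad.append([w_i * pk * (ck - mean_c) for pk, ck in zip(p, c)])
    return grad

# Toy instance: every number here is an illustrative assumption.
alpha = [[0.3, -1.2, 0.5], [1.0, 0.0, -0.7]]  # N=2 groups, K=3 options
w = [2.0, 5.0]                                 # per-group weights
c = [2.0, 4.0, 8.0]                            # per-option costs

analytic = cost_gradient(alpha, w, c)

# Central finite differences as an independent check of the formula.
eps = 1e-6
max_err = 0.0
for i in range(len(alpha)):
    for k in range(len(c)):
        plus = [row[:] for row in alpha]
        minus = [row[:] for row in alpha]
        plus[i][k] += eps
        minus[i][k] -= eps
        fd = (expected_cost(plus, w, c) - expected_cost(minus, w, c)) / (2 * eps)
        max_err = max(max_err, abs(fd - analytic[i][k]))

print("largest finite-difference mismatch:", max_err)
```

Each group's slice of the normal sums to zero, since $\sum_k p_{ik}(c_k - E_{p_i}[c]) = 0$; adding a constant to all logits of a group changes neither the probabilities nor the cost.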

The monotonicity of the expected cost when shifting along the cost vector simplifies retraction to a scalar monotone root-finding problem, permitting highly efficient budget corrections. Extensions are provided for inequality constraints (via slack variables) and multiple simultaneous constraints (projecting out multiple normals), with the same geometric rationale and computational simplicity.
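
A minimal sketch of the scalar retraction, assuming non-negative group weights, a shared cost vector `c`, and a budget strictly between the minimum and maximum achievable cost. The function names, the bracket-doubling strategy, and the tolerance are illustrative choices, not details taken from the paper.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(a - m) for a in row]
    z = sum(exps)
    return [e / z for e in exps]

def expected_cost(alpha, w, c):
    """C(alpha) = sum_i w_i * E_{p_i}[c]."""
    return sum(
        w_i * sum(p * ck for p, ck in zip(softmax(row), c))
        for row, w_i in zip(alpha, w)
    )

def shift(alpha, c, t):
    """Move every group's logits by t along the cost vector c."""
    return [[a + t * ck for a, ck in zip(row, c)] for row in alpha]

def retract(alpha, w, c, budget, tol=1e-12, max_doublings=60):
    """Bisection on t so that C(alpha + t*c) = budget (C is monotone in t)."""
    if not sum(w) * min(c) < budget < sum(w) * max(c):
        raise ValueError("budget must lie strictly between min and max cost")
    lo, hi = -1.0, 1.0
    for _ in range(max_doublings):          # grow the bracket downwards
        if expected_cost(shift(alpha, c, lo), w, c) <= budget:
            break
        lo *= 2.0
    for _ in range(max_doublings):          # grow the bracket upwards
        if expected_cost(shift(alpha, c, hi), w, c) >= budget:
            break
        hi *= 2.0
    for _ in range(200):                    # plain bisection
        if hi - lo <= tol:
            break
        mid = 0.5 * (lo + hi)
        if expected_cost(shift(alpha, c, mid), w, c) < budget:
            lo = mid
        else:
            hi = mid
    return shift(alpha, c, 0.5 * (lo + hi))

# Toy instance (illustrative numbers only).
alpha = [[0.3, -1.2, 0.5], [1.0, 0.0, -0.7]]
w = [2.0, 5.0]
c = [2.0, 4.0, 8.0]
budget = 30.0

on_manifold = retract(alpha, w, c, budget)
residual = abs(expected_cost(on_manifold, w, c) - budget)
print("budget residual after retraction:", residual)
```

Bisection is used here because it needs only the monotonicity property; it never evaluates a derivative, so it stays well-defined even when the cost curve becomes flat.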

Algorithmic Framework: Riemannian Constrained Optimization (RCO)

RCO wraps tangent projection, monotonic retraction, and momentum transport around standard Adam steps. Discrete assignments are produced via Gumbel-STE sampling and budget-constrained dynamic programming (DP), ensuring discrete feasibility in the forward pass. The backward pass projects gradients onto the budget manifold, ensuring continuous feasibility and eliminating constraint-gradient bias before it can accumulate in optimizer momentum.
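
One way to picture the forward pass is a multiple-choice knapsack DP run on Gumbel-perturbed logits. The sketch below is an assumption about how the two pieces could fit together, not the paper's implementation: it uses integer costs, treats the budget as an upper bound, and the names `gumbel` and `mckp_dp` are hypothetical.

```python
import math
import random

def gumbel(rng):
    """Standard Gumbel noise: -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def mckp_dp(scores, costs, budget):
    """Pick one option per group maximizing total score with cost <= budget.

    scores[i][k]: float score of option k in group i.
    costs[i][k]:  non-negative integer cost of option k in group i.
    Returns (best_score, picks); raises ValueError if nothing fits.
    """
    neg = float("-inf")
    best = [0.0] + [neg] * budget   # best[b]: max score with total cost exactly b
    back = []                       # back[i][b]: option of group i reaching cost b
    for s_i, c_i in zip(scores, costs):
        new = [neg] * (budget + 1)
        choice = [-1] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == neg:
                continue
            for k, (s, cost) in enumerate(zip(s_i, c_i)):
                nb = b + cost
                if nb <= budget and best[b] + s > new[nb]:
                    new[nb] = best[b] + s
                    choice[nb] = k
        best = new
        back.append(choice)
    b_star = max(range(budget + 1), key=lambda b: best[b])
    if best[b_star] == neg:
        raise ValueError("no assignment fits the budget")
    picks, b = [], b_star
    for i in range(len(scores) - 1, -1, -1):   # walk the choices backwards
        k = back[i][b]
        picks.append(k)
        b -= costs[i][k]
    picks.reverse()
    return best[b_star], picks

# Toy instance (illustrative numbers only).
rng = random.Random(0)
alpha = [[0.3, -1.2, 0.5], [1.0, 0.0, -0.7], [0.0, 0.2, 0.1]]
bits = [2, 4, 8]            # option costs
sizes = [1, 2, 1]           # integer group weights
costs = [[s * b for b in bits] for s in sizes]
budget = 18

perturbed = [[a + gumbel(rng) for a in row] for row in alpha]
score, picks = mckp_dp(perturbed, costs, budget)
total = sum(costs[i][k] for i, k in enumerate(picks))
print("picks:", picks, "total cost:", total)
```

The DP table has one entry per budget unit, so its size grows with how finely the budget is discretized.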

Specifically, each step:

  • Projects the loss gradient onto the tangent space of the manifold.
  • Applies an Adam update, which may leave the iterate off-manifold due to curvature or non-uniform scaling.
  • Retracts back to the manifold via binary search along the cost vector.
  • Transports optimizer momentum to the new tangent space.
  • Handles discrete assignment feasibility using DP and differentiable STE surrogates.
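
The continuous part of the steps above (projection, Adam update, retraction, momentum transport) can be sketched as follows. This is a simplified illustration under stated assumptions: logits are stored as one flat list, the loss gradient is supplied by the caller, the toy objective is a quadratic in logit space, and the discrete DP/STE part is omitted. All function names and default hyperparameters are hypothetical.

```python
import math

def cost_and_normal(alpha, w, c):
    """Expected cost C(alpha) and its gradient w_i * p_ik * (c_k - E_{p_i}[c])."""
    K = len(c)
    total, normal = 0.0, []
    for i, w_i in enumerate(w):
        row = alpha[i * K:(i + 1) * K]
        m = max(row)
        e = [math.exp(a - m) for a in row]
        z = sum(e)
        p = [x / z for x in e]
        mean_c = sum(pk * ck for pk, ck in zip(p, c))
        total += w_i * mean_c
        normal.extend(w_i * pk * (ck - mean_c) for pk, ck in zip(p, c))
    return total, normal

def project(v, n):
    """Remove the component of v along n (assumes n is not the zero vector)."""
    coef = sum(a * b for a, b in zip(v, n)) / sum(b * b for b in n)
    return [a - coef * b for a, b in zip(v, n)]

def retract(alpha, w, c, budget, iters=100):
    """Bisection on t so that C(alpha + t*c) = budget.

    Assumes the budget lies strictly between the min and max achievable cost.
    """
    K = len(c)
    def shifted(t):
        return [a + t * c[j % K] for j, a in enumerate(alpha)]
    lo, hi = -1.0, 1.0
    while cost_and_normal(shifted(lo), w, c)[0] > budget:
        lo *= 2.0
    while cost_and_normal(shifted(hi), w, c)[0] < budget:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cost_and_normal(shifted(mid), w, c)[0] < budget:
            lo = mid
        else:
            hi = mid
    return shifted(0.5 * (lo + hi))

def rco_step(alpha, m, v, step, loss_grad, w, c, budget,
             lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    _, n = cost_and_normal(alpha, w, c)
    g = project(loss_grad, n)                          # 1. tangent projection
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - b1 ** step) for mi in m]
    v_hat = [vi / (1 - b2 ** step) for vi in v]
    alpha = [a - lr * mh / (math.sqrt(vh) + eps)       # 2. Adam update
             for a, mh, vh in zip(alpha, m_hat, v_hat)]
    alpha = retract(alpha, w, c, budget)               # 3. retraction
    _, n_new = cost_and_normal(alpha, w, c)
    m = project(m, n_new)                              # 4. momentum transport
    return alpha, m, v

# Toy run: pull logits toward a target while staying on the budget manifold.
w, c, budget = [2.0, 5.0], [2.0, 4.0, 8.0], 30.0
target = [1.0, 0.0, -1.0, -1.0, 0.0, 1.0]
alpha = retract([0.0] * 6, w, c, budget)               # start on the manifold
m, v = [0.0] * 6, [0.0] * 6
for step in range(1, 51):
    grad = [a - t for a, t in zip(alpha, target)]
    alpha, m, v = rco_step(alpha, m, v, step, grad, w, c, budget)
final_cost, final_normal = cost_and_normal(alpha, w, c)
print("final cost:", final_cost)
```

Adam's per-coordinate scaling means the update direction is generally not tangent even though the projected gradient is, which is why the retraction runs after every update in this sketch.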

This framework exactly enforces the budget at every step, with no hyperparameters controlling constraint strength or forgiveness, and negligible computational overhead.

Empirical Validation

The paper benchmarks manifold constraint handling against Lagrangian and penalty methods on synthetic multiple-choice knapsack problems (MCKP) with known optima. Manifold projection achieves exact constraint satisfaction ($|C(\boldsymbol{\alpha}) - B| < 10^{-8}$) throughout optimization, while Lagrangian methods exhibit persistent violations (up to $10^{-1}$). On hard instances, manifold optimization converges up to $8\times$ closer to the DP optimum than augmented Lagrangian approaches.

Figure 2: Manifold optimization outperforms penalty approaches in gap to DP optimum and exact constraint satisfaction on large knapsack scenarios ($N=1000, K=32$).

Figure 3: Per-constraint violation traces for $m=16$ simultaneous constraints highlight orders-of-magnitude superiority in feasibility for RCO versus Lagrangian baselines.
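
For several simultaneous constraints, "projecting out multiple normals" can be illustrated with a Gram-Schmidt pass over the constraint normals. This is a swapped-in, generic linear-algebra technique used for illustration; the summary does not say how the paper computes the multi-constraint projection. The two cost vectors (`c_size`, `c_flops`) and all helper names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normal(alpha, w, c):
    """Closed-form gradient of the expected cost for cost vector c (flat logits)."""
    K = len(c)
    out = []
    for i, w_i in enumerate(w):
        row = alpha[i * K:(i + 1) * K]
        mx = max(row)
        e = [math.exp(a - mx) for a in row]
        z = sum(e)
        p = [x / z for x in e]
        mean_c = dot(p, c)
        out.extend(w_i * pk * (ck - mean_c) for pk, ck in zip(p, c))
    return out

def project_out(v, normals, tol=1e-12):
    """Project v onto the orthogonal complement of span(normals)."""
    basis = []
    for n in normals:
        r = list(n)
        for q in basis:                      # modified Gram-Schmidt
            coef = dot(r, q)
            r = [ri - coef * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:                       # skip (near-)dependent normals
            basis.append([ri / norm for ri in r])
    out = list(v)
    for q in basis:
        coef = dot(out, q)
        out = [oi - coef * qi for oi, qi in zip(out, q)]
    return out

# Toy instance with two linear budgets on the same logits (illustrative only).
alpha = [0.3, -1.2, 0.5, 1.0, 0.0, -0.7]   # N=2 groups, K=3 options, flattened
w = [2.0, 5.0]
c_size = [2.0, 4.0, 8.0]                    # hypothetical size cost per option
c_flops = [1.0, 3.0, 4.0]                   # hypothetical compute cost per option
normals = [normal(alpha, w, c_size), normal(alpha, w, c_flops)]

loss_grad = [0.4, -0.1, 0.3, -0.2, 0.5, 0.1]
tangent = project_out(loss_grad, normals)
print([round(dot(tangent, n), 12) for n in normals])
```

Nearly parallel normals make the orthogonalization ill-conditioned, so the absolute tolerance used here to drop dependent directions is a choice that would need tuning in practice.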

Figure 4: Adversarial MCKP scenario demonstrates robustness of manifold projection under challenging cost structures.

Figure 5: Boundary-clustered scenario shows RCO attaining low-budget solutions where Lagrangian methods plateau.

In LLM compression (mixed-precision quantization, MoE expert pruning), RCO matches or exceeds evolutionary search baselines (e.g., EvoESAP, EvoPress) at $3\times$–$16\times$ lower wall-clock cost on the evaluated configurations. For example, at high compression on Qwen3-8B, RCO improves FineWeb perplexity by 36% over HIGGS and 4% over EvoPress, with marked efficiency gains. In MoE expert pruning, nonuniform RCO allocation recovers up to 97% of HumanEval as compared to only 55% for uniform allocation, with sharp trade-offs observed depending on calibration domain.

Technical Implications and Claims

Key claims and results substantiated in the paper include:

  • Exact budget enforcement: Unlike Lagrangian methods, RCO maintains constraint satisfaction to within floating-point precision throughout optimization, independent of gradient estimator bias or optimizer state.
  • Superior performance on non-decomposable objectives: On MCKP and LLM compression problems, RCO achieves closer convergence to combinatorial optima and outperforms surrogate methods, particularly at extreme compression where layer interactions are nontrivial.
  • Hyperparameter-free constraint handling: The manifold projection does not require penalty-coefficient or schedule tuning, which is an Achilles heel of penalty approaches.
  • Efficient scaling to large parameter spaces: The modularity and trivial computational cost of the geometric operations enable scaling to hundreds or thousands of groups or constraint dimensions.
  • Robustness to discrete and combinatorial assignment variance: Averaging Gumbel samples per step improves optimization stability and solution quality, a benefit that the manifold structure makes possible.

Theoretical and Practical Implications

Practically, the approach enables efficient, scalable, and exact optimization in large-scale model compression and resource allocation tasks. Theoretically, it demonstrates the power of exploiting manifold geometry for constraint satisfaction, providing a blueprint for extending constrained optimization to other domains by identifying or constructing well-behaved constraint manifolds. In settings with multiple resource constraints or nonconvex discrete assignment landscapes, the geometric approach offers substantial improvement in feasibility, robustness, and solution quality.

The separation of discrete feasibility (via DP with Gumbel sampling) from continuous feasibility (via manifold projection) yields optimization architectures that are both modular and analytically tractable. The method's robustness to gradient bias and potential for modular composability with modern optimizers provides avenues for extension to constrained fine-tuning, reinforcement learning with hard constraints, and beyond.

Future work may include manifold-based enforcement of nonlinear constraints (e.g., inference latency), combinatorial optimization in stochastic settings, and rigorous convergence proofs in the presence of biased gradient estimators such as STE.

Conclusion

The analysis and empirical results in "Budget Constraints as Riemannian Manifolds" (2605.00649) rigorously establish that the expected-cost level set under softmax is a structurally clean Riemannian manifold enabling exact and efficient budget enforcement. The RCO algorithm leverages this geometry, delivering strong empirical performance and scalability without the limitations of proxy objectives or hyperparameter dependency endemic to penalty methods. It sets a technical foundation for manifold-based constraint handling in large-scale optimization problems in machine learning and opens further avenues into combinatorial and geometric optimization for practical AI deployments.

Explain it Like I'm 14

What this paper is about (big picture)

Imagine you’re packing a backpack with a strict weight limit. You have lots of items (like snacks, clothes, tools), and for each item you must choose one version (light, medium, or heavy). You want the best overall trip experience, but you cannot go over the weight limit.

This paper tackles the same kind of problem in machine learning: choosing, for each part of a big model, one option (like how many bits to use or whether to keep/prune a component) so that:

  • the model stays within a total “budget” (like size, speed, or memory), and
  • the model’s performance stays as good as possible.

Their core idea is to treat the exact budget limit as a smooth surface you can “walk on” while optimizing, so the budget is always exactly respected—no guessing or penalty tuning needed.


What questions the authors ask

  • Can we pick options for many parts of a model so we hit an exact total budget (not just approximately) while still using fast, gradient-based training?
  • Can we avoid slow, guess-and-check methods (like evolutionary search) and also avoid penalty tricks that often overshoot or undershoot the budget?
  • Can we make this work on real tasks like compressing LLMs without losing too much quality?

How their method works (in everyday terms)

Think of all possible option choices as a big control panel with many sliders. Each slider controls how likely you are to pick a certain option for a given part (e.g., a layer’s bitwidth). These sliders produce:

  • probabilities for each option (using a standard tool called “softmax”), and
  • an expected total “cost” (like total model size).
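
Here is a tiny example of what the sliders do, using made-up numbers for one layer with three bitwidth options. The option list and slider values are assumptions for illustration only.

```python
import math

def softmax(sliders):
    """Turn slider values into probabilities that sum to 1."""
    m = max(sliders)
    exps = [math.exp(s - m) for s in sliders]
    total = sum(exps)
    return [e / total for e in exps]

bit_options = [2, 4, 8]        # hypothetical bitwidth choices for one layer

# All sliders equal: every option is equally likely.
even_probs = softmax([0.0, 0.0, 0.0])
even_bits = sum(p * b for p, b in zip(even_probs, bit_options))

# Push the 8-bit slider up: that option becomes more likely,
# so the expected cost goes up too.
skewed_probs = softmax([0.0, 0.0, 2.0])
skewed_bits = sum(p * b for p, b in zip(skewed_probs, bit_options))

print(even_bits)    # about 4.67, i.e. (2 + 4 + 8) / 3
print(skewed_bits)  # larger than even_bits
```

The expected cost is a blend of the option costs, weighted by how likely each option is.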

Now, picture all slider settings that exactly meet the budget as forming a smooth “surface” in space. The authors:

  1. Move along this surface in directions that improve the model without changing the total cost (this is like sliding sideways along a hill at the same height—staying on the same “budget level”).
  2. If a move drifts slightly off the surface (because optimizers like Adam rescale steps), they nudge back to the surface by turning one special knob that raises or lowers the overall cost in a predictable way. They use binary search—like moving left or right until the cost matches the budget exactly.
  3. They also adjust the optimizer’s “momentum” so it keeps pointing along the surface after each nudge.

To actually test the model with real, discrete choices (not just probabilities), they do two complementary things:

  • Forward pass (testing): Use a clever randomizer (Gumbel) and a fast “smart packing” algorithm (dynamic programming for knapsack) to pick a concrete set of options that fits the budget exactly.
  • Backward pass (learning): Use a “straight-through estimator” to send gradient information back through those discrete choices, then project that gradient so it never pushes away from the budget.

Why this is neat:

  • The “budget surface” has a very clean geometry. The direction that changes total cost (the “normal”) is simple to compute, and the “retraction” (nudging back onto the surface) is just a quick binary search along a single direction.
  • No extra penalty knobs to tune. The budget is satisfied exactly at every step.

What they found and why it matters

On toy problems where the best answer is known (multiple-choice knapsack):

  • Their method hits the exact budget (down to tiny numerical error) and finds optimal or near-optimal solutions.
  • Standard penalty or Lagrangian methods often keep violating the budget and get stuck at only about 83% of the best possible score on tough cases.

On real model compression tasks:

  • Mixed-precision quantization (choosing bits per layer) and MoE expert pruning (choosing which experts to keep/drop):
    • Their method matches or beats evolutionary search methods while being 3–16× faster in the tests they ran.
    • At very high compression (very tight budgets), it outperforms methods that rely on simple per-layer scores or surrogates, which can break down.

Why it’s important:

  • Exact budget control is valuable for deploying AI on devices with strict limits (phones, edge devices, specific servers).
  • Saving time (fewer model evaluations) means cheaper, faster iteration and tuning.
  • Avoiding fragile penalty settings makes the process more reliable.

What this could change (impact and limitations)

  • Impact:
    • Provides a practical way to make large models smaller or faster without guessing penalty strengths or running slow search for hours.
    • Offers precise control: you decide the budget, and the optimizer respects it at every step.
    • Can be combined with different optimizers and tasks since it’s a “wrapper” around standard gradient methods.
  • Limitations to keep in mind:
    • The forward “smart packing” step (dynamic programming) depends on budget discretization and the number of options; with huge option sets, it could become a bottleneck.
    • The technique relies on costs being a simple linear combination of choices. If the true cost is nonlinear (like complicated latency), the “simple nudge back” trick may need adjustments.
    • The gradient through discrete choices (straight-through estimator) is an approximation; while the method works well in practice, full theoretical guarantees for the whole system are limited.

In one sentence

The paper shows how to turn the “stay under budget” rule into a smooth surface you can safely walk along while training, letting you choose the best options across a model exactly within budget—fast, reliable, and often better than common alternatives.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper proposes a Riemannian treatment of budget constraints and demonstrates promising empirical results. The following concrete gaps and open problems remain for future work:

  • Convergence with biased gradients: Provide theoretical guarantees (or counterexamples) for RCO when gradients come from Gumbel-STE and Adam with temperature annealing on the budget manifold; current convergence claims apply only to exact gradients.
  • Bias after projection: Quantify how much STE bias remains after tangent projection (beyond the normal component) and how this residual bias affects convergence rate and solution quality.
  • Retraction near saturation: Analyze numerical stability and bracketing strategies for binary-search retraction when $p_i$ concentrate (variance $\to 0$), which makes $dC/dt=\sum_i w_i \mathrm{Var}_{p_i}[c]$ arbitrarily small and can slow root finding.
  • Beyond linear expected-cost constraints: Extend the geometric framework to nonlinear or non-additive costs (e.g., latency, energy, or bandwidth models) where shifting logits by $t\mathbf{c}$ no longer yields monotonic $C(\boldsymbol{\alpha}+t\mathbf{c})$; develop valid retraction schemes for such constraints.
  • Per-group costs and heterogeneous option sets: Generalize the manifold and retraction to costs $c_{ik}$ (option costs varying by group), or to group-specific option sets $K_i$, and characterize when the monotonic retraction still holds.
  • Multiple constraints in practice: The multi-constraint projection is outlined, but there is no empirical evaluation; study numerical conditioning when normals are nearly colinear, retraction strategies for $m>1$, and effects on optimizer stability.
  • Inequality constraints via slack: The slack-variable construction is introduced but not validated on LLM tasks; evaluate how often optimal solutions leave slack $s>0$, how to schedule/regularize $s$, and whether slack induces undesirable degeneracies.
  • DP scalability and discretization: Assess forward-pass DP cost for large $K$ or fine budget discretization $B'$; evaluate approximation-quality vs. speed trade-offs for coarser discretization or alternative solvers (e.g., FPTAS, Lagrangian DP, or learned surrogates).
  • Discretization error and manifold mismatch: Study the impact of discretizing the budget in DP while enforcing exact expected cost in the backward pass, including conditions under which expected-cost feasibility misaligns with discrete feasibility and how to mitigate.
  • Alternative differentiable combinatorial solvers: Compare STE to differentiable optimization approaches (e.g., perturbation-based, KKT-based, or implicit differentiation through knapsack/MCKP) for bias–variance trade-offs and sample efficiency.
  • Sample efficiency and variance reduction: Systematically analyze and design variance-reduction methods (e.g., control variates, antithetic Gumbels, Rao–Blackwellization) that reduce the need for large Gumbel sample counts per step without degrading convergence.
  • Temperature schedules: Provide principled guidelines (or adaptive schemes) for temperature annealing to balance exploration and stability, and study failure modes from overly aggressive or conservative schedules.
  • Initialization sensitivity: Evaluate how initialization of logits (e.g., from REAP or uniform) affects convergence and local minima, and whether warm starts or meta-initialization improve outcomes across budgets.
  • Equal or nearly equal costs: Investigate behavior and regularization strategies when many options share identical or nearly identical $c_k$, which can lead to small or ill-conditioned normals and slow manifold updates.
  • Cost scaling and conditioning: Examine sensitivity to the scale of $c_k$ and $w_i$ (e.g., whether normalization improves conditioning of projection and retraction, or interacts adversely with Adam’s adaptive scaling).
  • Retraction step improvements: Explore Newton or safeguarded Newton methods for scalar retraction with analytic derivatives (and possibly second derivatives), characterize basin sizes, and compare to binary search in speed and robustness.
  • Momentum transport choices: Compare projection-based vector transport to alternative transports (e.g., pole ladder) and assess impacts on optimizer momentum and convergence, especially under curvature.
  • Second-order methods on the manifold: Investigate Riemannian quasi-Newton or natural-gradient methods tailored to the budget manifold and compare their convergence speed and stability to Adam-based RCO.
  • Interaction with adaptive optimizers: Characterize how per-coordinate scaling (Adam, Adagrad) interacts with projection/retraction and whether manifold-aware preconditioning yields consistent improvements.
  • Memory overhead for quantization: For mixed-precision quantization, precomputing and storing per-layer weights/residuals for all bitwidths is memory-intensive; study incremental/on-the-fly quantization or low-rank residual parameterizations that reduce memory while preserving differentiability.
  • Generalization to training-time constraints: Evaluate RCO during fine-tuning or training from scratch (not just post-training calibration) under dynamic budgets, and examine stability and generalization compared to post-training application.
  • Multi-objective and compound constraints: Extend and empirically validate RCO with simultaneous constraints (e.g., model size, FLOPs, latency) and study trade-off surfaces and Pareto-front exploration on real hardware.
  • Hardware-grounded costs: Replace proxy costs (bits/parameters) with measured latency/energy on specific devices and quantify how non-additivities (caches, operator fusion) affect feasibility and the viability of monotonic retraction.
  • Broader domains: Test RCO beyond LLM compression (e.g., NAS, pipeline budgets, RL constraints) to validate generality and identify domain-specific adaptations required.
  • Evaluation breadth and robustness: Expand empirical evaluation to more models, datasets, and higher compression regimes; report sensitivity analyses across seeds, calibration set sizes, and budget levels.
  • Fair comparisons and compute accounting: Provide standardized budgets for wall-clock comparisons (same number of forward/backward passes, hardware) and ablate the DP/discretization overhead to isolate manifold vs. search contributions.
  • Theoretical link between calibration KL and task metrics: Formalize when minimizing calibration KL (Eq. $L$) guarantees improvements in downstream metrics (accuracy, perplexity, coding pass@1), especially under significant compression.
  • Guarantees for discrete final solutions: Establish conditions under which optimizing expected-cost-constrained logits via RCO yields discrete assignments close to the true constrained optimum, and bound the optimality gap induced by STE and DP discretization.
  • Robustness to cost/model mismatch: Analyze sensitivity when the assumed cost model ($w_i, c_k$) is mis-specified relative to actual deployment costs, and develop robust or adaptive cost-estimation procedures within RCO.
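
The derivative quoted in the retraction-near-saturation item follows from the softmax derivative. A short derivation, assuming a shared cost vector $c$, non-negative weights $w_i$, and logits shifted as $\alpha_{ik} + t c_k$:

```latex
\begin{aligned}
p_{ik}(t) &= \frac{\exp(\alpha_{ik} + t c_k)}{\sum_{j} \exp(\alpha_{ij} + t c_j)},
\qquad
C(t) = \sum_i w_i \sum_k p_{ik}(t)\, c_k, \\[4pt]
\frac{d p_{ik}}{dt} &= p_{ik}\bigl(c_k - E_{p_i}[c]\bigr), \\[4pt]
\frac{dC}{dt} &= \sum_i w_i \sum_k c_k\, p_{ik}\bigl(c_k - E_{p_i}[c]\bigr)
  = \sum_i w_i \bigl(E_{p_i}[c^2] - E_{p_i}[c]^2\bigr)
  = \sum_i w_i\, \mathrm{Var}_{p_i}[c] \;\ge\; 0 .
\end{aligned}
```

The derivative is a weighted sum of variances, so the cost is non-decreasing in $t$. It also shrinks toward zero as each $p_i$ concentrates on a single option, which is the saturation regime the item refers to.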

Practical Applications

Overview

The paper introduces a geometric approach to budget-constrained discrete assignment problems—common in machine learning model compression—by treating the expected-cost level set under softmax as a smooth Riemannian manifold. The proposed Riemannian Constrained Optimization (RCO) algorithm integrates: (1) tangent projection to remove budget-violating gradient components, (2) a monotonic binary-search retraction that guarantees exact budget enforcement, and (3) momentum vector transport. In the forward pass, discrete feasibility is enforced via a budget-constrained multiple-choice knapsack dynamic program (DP) combined with Gumbel Straight-Through Estimation (STE) for differentiability; in the backward pass, the manifold ensures exact budget satisfaction with no constraint hyperparameters. Empirically, RCO reaches or exceeds the quality of evolutionary search with significantly lower wall-clock time on LLM compression and matches DP optima on synthetic knapsack benchmarks where penalty methods stall.

Below are practical, real-world applications derived from these findings, grouped by deployment horizon.

Immediate Applications

These applications can be deployed now with modest engineering effort, leveraging the paper’s assumptions (linear cost in selection probabilities, known discrete options and costs, feasible DP scale, availability of a calibration set).

  • LLM post-training mixed-precision quantization under strict memory/size budgets (Software, Cloud/Edge)
    • Use RCO to assign per-layer bitwidths (e.g., 2–8 bits) that minimize calibration loss while exactly meeting an average-bit or parameter-size budget; integrates with GPTQ/other PTQ backends.
    • Tools/workflows: “RCO-Quantizer” module for PyTorch/Hugging Face Optimum; CI pipelines that auto-tune per-device budgets (A/B test across devices).
    • Assumptions/Dependencies: Pre-quantized weight candidates per bitwidth; calibration dataset; linear cost proxy (bits/parameter) and known per-layer weights; DP discretization B′ manageable.
  • MoE expert pruning with guaranteed global budgets (Software, Cloud Serving)
    • Allocate prune/keep decisions across layers to hit global expert-count or memory/compute budgets, improving inference cost without violating constraints; RCO can deviate from per-layer heuristics (e.g., REAP) when the global loss favors it.
    • Tools/workflows: “RCO-Pruner” for MoE; serving-time profiles to select budget-aware expert subsets for SKU tiers.
    • Assumptions/Dependencies: Binary per-expert costs or known per-expert cost vector; calibration data; stable STE training; DP feasibility at expert counts.
  • Non-uniform structured pruning under parameter or FLOPs budgets (Software, Mobile/Embedded)
    • Select layer-wise sparsity levels from discrete menus while minimizing end-to-end loss; preserves budget exactly at every optimizer step.
    • Tools/workflows: Integration with pruning frameworks (e.g., SPDY-like pipelines), export to ONNX/TensorRT with budget-compliant artifacts.
    • Assumptions/Dependencies: Predefined candidate sparsity levels per layer; reliable on-the-fly loss measurement; linear budget proxy (parameters/FLOPs).
  • Budget-compliant deployment across heterogeneous devices (MLOps, Edge AI)
    • Generate model variants for tiered memory/compute budgets (e.g., 1/2/4 GB) with exact budget adherence; automate SKU-specific compression profiles.
    • Tools/workflows: MLOps dashboard that runs RCO per device class; artifact registry tagging by verified budget.
    • Assumptions/Dependencies: Device budget targets and cost proxies defined; per-device calibration or transferability of calibrations.
  • Feature acquisition under fixed inference budgets (Healthcare, Finance, IoT)
    • Select subsets of features/tests with known acquisition costs to minimize task loss (e.g., risk prediction, diagnostics) under a per-query budget.
    • Tools/workflows: Online scoring services with static budgeted feature sets; offline batch planning using RCO to determine per-segment feature menus.
    • Assumptions/Dependencies: Differentiable end-task loss; known per-feature costs; discrete options per “group” (e.g., alternative measures); DP scale acceptable; ethical and regulatory checks for sensitive domains.
  • Black-box discrete configuration selection with exact resource caps (AutoML, Recsys, A/B infra)
    • Choose one configuration per component (e.g., model submodules, ensemble members, data augmenters) from discrete options to maximize validation performance under a memory or parameter budget.
    • Tools/workflows: AutoML plugins that wrap existing training/evaluation with RCO-based selection; reproducible budget compliance for experiments.
    • Assumptions/Dependencies: Linear budget proxies; discrete option catalogs; evaluation cost amortized with minibatch calibration.
  • Multi-constraint budget enforcement when costs remain linear (e.g., parameters and activation memory) (Software Systems)
    • Use the paper’s multi-constraint extension (projecting out multiple normals) to enforce several linear constraints simultaneously.
    • Tools/workflows: Compression pipelines with joint param + activation size caps for batch and sequence-length targets.
    • Assumptions/Dependencies: Each constraint linear in selection probabilities; small number of constraints; Newton retraction with closed-form derivatives.
  • Research and teaching utilities for constrained optimization (Academia)
    • Demonstrate manifold projection vs. penalties/Lagrangians; benchmark exact-budget methods on MCKP/LLM compression tasks.
    • Tools/workflows: Open-source “RCO-Lab” notebooks illustrating tangent projection, binary-search retraction, vector transport.
    • Assumptions/Dependencies: Standard ML stacks (PyTorch/JAX), moderate problem sizes for interactive demos.
  • Compliance-friendly model packaging and quota governance (Policy/Enterprise IT)
    • Produce artifacts that provably adhere to resource caps (exact budget compliance to floating-point precision), simplifying internal governance and external audits for AI resource usage.
    • Tools/workflows: Build-time reports with budget proof (residual < 1e−8), auto-blocking of non-compliant artifacts.
    • Assumptions/Dependencies: Budgets expressible via linear proxies (size/parameters); governance accepts these proxies.
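
Several of the items above assume a linear cost proxy. As a concrete reading of that assumption for mixed-precision quantization, the sketch below takes each layer's parameter count as its weight and the chosen bitwidth as the option cost; the layer sizes, bitwidths, and function names are hypothetical.

```python
def total_bits(param_counts, bitwidths):
    """Linear cost proxy: sum over layers of (parameters * bits per parameter)."""
    return sum(n * b for n, b in zip(param_counts, bitwidths))

def average_bits(param_counts, bitwidths):
    """Average bits per parameter across the whole model."""
    return total_bits(param_counts, bitwidths) / sum(param_counts)

def within_budget(param_counts, bitwidths, target_avg_bits):
    """True if the assignment does not exceed the average-bit budget."""
    return total_bits(param_counts, bitwidths) <= target_avg_bits * sum(param_counts)

# Hypothetical four-layer model and one candidate bitwidth assignment.
param_counts = [1_000_000, 4_000_000, 4_000_000, 1_000_000]
bitwidths = [8, 3, 2, 4]

avg = average_bits(param_counts, bitwidths)
print(avg)                                            # 3.2
print(within_budget(param_counts, bitwidths, 3.25))   # True
print(within_budget(param_counts, bitwidths, 3.0))    # False
```

Under this proxy, an average-bit target and a total-size target are the same constraint up to a constant factor (the total parameter count).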

Long-Term Applications

These applications require additional research, improved cost modeling (especially for nonlinear costs like latency/energy), scaling, or domain-specific validation.

  • Latency- and energy-aware optimization with realistic hardware cost models (Mobile, Cloud, Robotics)
    • Extend RCO to non-linear, hardware-dependent costs (e.g., layer latency on specific accelerators); derive or learn monotone retraction directions or employ Newton-style multi-constraint solvers with accurate gradient estimates of cost models.
    • Dependencies: Differentiable or reliably approximated latency/energy predictors; retraction for non-linear costs; validation across devices.
  • Real-time adaptive compression under dynamic budgets (Edge, On-device AI)
    • Adjust bitwidths/sparsities on the fly in response to thermal or power changes; warm-start RCO with previous logits and re-optimize quickly.
    • Dependencies: Fast calibration or proxy losses; incremental DP/reoptimization; stability under frequent updates.
  • Training-time budget control (Compute-governed training, MoE routing budgets)
    • Enforce compute/activation budgets during training by selecting layer precisions or gating decisions under a total compute budget; adapt MoE routing with hard budget constraints.
    • Dependencies: Differentiable training objectives with STE; stability of budget projection through long training; compute-cost models compatible with linear constraints or suitable surrogates.
  • Multi-objective, multi-constraint optimization (Quality–cost–fairness trade-offs)
    • Optimize for composite objectives (e.g., loss + robustness) under multiple budgets (memory, latency, energy), leveraging the manifold framework for equality and inequality constraints with slacks.
    • Dependencies: Accurate multi-objective weighting, well-conditioned multi-constraint projections, validated fairness metrics.
  • Sequential feature acquisition and decision-making under budgets (Healthcare diagnostics, Fraud detection)
    • Move from one-shot selection to sequential policies that decide which test/feature to acquire next under a cumulative budget, marrying RCO with decision processes.
    • Dependencies: Extensions to Markov decision processes or differentiable planners; safety and regulatory approval; careful evaluation of decision latency.
  • Federated and distributed resource allocation (Edge/cloud splits, Bandwidth budgets)
    • Allocate precision/pruning across client and server partitions to meet bandwidth and on-device memory budgets; select clients under communication caps.
    • Dependencies: Distributed DP or decomposed solvers; privacy constraints; synchronization/latency overheads.
  • Compiler and toolchain integration for automatic budgeted code generation (Systems, ML Compilers)
    • Integrate RCO into graph compilers (e.g., TVM, TensorRT) to automatically assign precisions and sparsities respecting per-operator and global constraints during compilation.
    • Dependencies: IR-level cost models, per-op candidate catalogs, stable integration with scheduling and kernel selection.
  • Portfolio and campaign optimization with discrete choices under spend caps (Finance, Marketing) where objectives are learned black-boxes
    • Select one tactic per segment (e.g., campaign creatives, allocation rules) to maximize predicted KPI under a budget cap while accounting for cross-effects captured by a learned model.
    • Dependencies: Differentiable surrogate of KPI; reliable mapping from selections to spend (linear or calibrated); acceptance of model-driven decisions.
  • Policy planning with discrete program menus and exact resource adherence (Public sector)
    • Choose program variants (e.g., training modules, service packages) per region under strict budget ceilings, using a learned, non-decomposable outcome model.
    • Dependencies: Trustworthy, audited predictive models; transparent cost accounting; stakeholder alignment and governance.

Notes on Key Assumptions and Dependencies

  • Linear, known costs per option and group: The budget must be expressible as $C(\alpha)=\sum_i w_i \mathbb{E}_{p_i}[c]$ with non-equal option costs; this underpins the closed-form normal, tangent projection, and monotonic retraction.
  • DP scalability: The forward pass solves a multiple-choice knapsack via DP with budget discretization $B'$. Very large $K$ or very fine-grained budgets may require approximations or coarser discretization.
  • Differentiability and STE bias: End-to-end loss must be differentiable w.r.t. logits via STE; while projection mitigates bias accumulation in momentum, full convergence guarantees with STE remain empirical.
  • Calibration data availability: RCO needs a representative calibration set for loss evaluation during search; quality of the final configuration depends on calibration-data representativeness.
  • Multi-constraint feasibility: The multiple-constraint extension assumes each constraint is linear in selection probabilities or that suitable Newton-style solvers with reliable derivatives are available.
  • Hardware cost realism: For latency/energy constraints, linear proxies may be insufficient; integrating accurate, differentiable cost models is essential for robust deployment in those regimes.
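
The first assumption above can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes a cost vector `c` shared by all groups, positive weights `w`, and a budget strictly between the minimum and maximum achievable cost. The function names, the search bracket `[-50, 50]`, and the iteration count are hypothetical choices made for this example. It computes the expected cost, the closed-form normal, the tangent projection, and a binary-search retraction that shifts logits along the cost vector.

```python
import math

def softmax(row):
    m = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in row]
    s = sum(exps)
    return [e / s for e in exps]

def expected_cost(alpha, w, c):
    """C(alpha) = sum_i w_i * E_{p_i}[c], with p_i = softmax(alpha_i)."""
    total = 0.0
    for a_i, w_i in zip(alpha, w):
        p = softmax(a_i)
        total += w_i * sum(pk * ck for pk, ck in zip(p, c))
    return total

def cost_normal(alpha, w, c):
    """Closed-form normal: (grad C)_{ik} = w_i * p_ik * (c_k - E_{p_i}[c])."""
    normal = []
    for a_i, w_i in zip(alpha, w):
        p = softmax(a_i)
        mean = sum(pk * ck for pk, ck in zip(p, c))
        normal.append([w_i * pk * (ck - mean) for pk, ck in zip(p, c)])
    return normal

def inner(u, v):
    return sum(a * b for ur, vr in zip(u, v) for a, b in zip(ur, vr))

def project_to_tangent(g, n):
    """Remove the component of g along the normal n."""
    coef = inner(g, n) / inner(n, n)
    return [[gk - coef * nk for gk, nk in zip(gr, nr)] for gr, nr in zip(g, n)]

def retract(alpha, w, c, budget, lo=-50.0, hi=50.0, iters=100):
    """Binary search for t with C(alpha + t * c) = budget.

    C is non-decreasing in t because dC/dt = sum_i w_i * Var_{p_i}(c) >= 0.
    """
    def shifted(t):
        return [[a + t * ck for a, ck in zip(a_i, c)] for a_i in alpha]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_cost(shifted(mid), w, c) < budget:
            lo = mid                  # cost too low: shift further toward costly options
        else:
            hi = mid
    return shifted(0.5 * (lo + hi))

# Toy instance (all values are made up for illustration).
alpha = [[0.0, 0.0, 0.0], [0.5, -0.2, 0.1]]
w = [1.0, 2.0]
c = [2.0, 4.0, 8.0]
budget = 12.0
alpha_on = retract(alpha, w, c, budget)
n = cost_normal(alpha_on, w, c)
g = [[0.3, -0.1, 0.2], [0.0, 0.4, -0.5]]
g_tan = project_to_tangent(g, n)
```

In this sketch, the same `project_to_tangent` call could be applied to an optimizer's momentum after retraction, which is one way to read the single-inner-product vector transport described in the paper.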

Glossary

  • Adam: A popular adaptive gradient-based optimization algorithm used for training neural networks. "Building on this, we propose Riemannian Constrained Optimization (RCO), which wraps tangent projection, binary-search retraction, and momentum transport around a standard Adam step."
  • augmented Lagrangian: A constraint-handling method that augments the Lagrangian with penalty terms to better enforce constraints. "We can compare four constraint handling methods (manifold equality, manifold with slack variable, Lagrangian, augmented Lagrangian) on the same gradient and optimizer, varying only how they enforce the budget."
  • binary-search retraction: A retraction method that uses binary search to return a point to the constraint manifold by exploiting monotonicity. "We show that the level set $\{C(\boldsymbol{\alpha}) = B\}$ of softmax expected cost is a smooth Riemannian submanifold with closed-form normals, monotonic binary-search retraction (Proposition~\ref{prop:retraction}), and cheap vector transport (a single inner product per step)."
  • budget manifold: The constraint set defined by a fixed expected cost under softmax, viewed as a Riemannian manifold in logit space. "We call this the budget manifold and show that one can optimize directly on it, projecting gradients onto its tangent plane (the subspace of budget-preserving directions) and retracting (projecting back) onto its surface after each step."
  • codimension: The difference between the dimension of the ambient space and a submanifold; here the manifold has one fewer dimension than the ambient space. "Since $M$ has codimension one, vector transport (adjusting the optimizer's momentum to the new tangent plane) after retraction is another inner product."
  • Frank-Wolfe: A projection-free first-order method for constrained convex optimization that operates via linear minimization oracles. "Projection-free methods such as Frank-Wolfe can maintain feasibility by construction, but restrict the optimizer to their own update rule and do not transport adaptive state across iterates."
  • GPTQ: A post-training weight quantization method for transformers that minimizes quantization error efficiently. "we pre-quantize each layer at every candidate bitwidth via GPTQ~\citep{frantar2023gptq} and store weight residuals"
  • Gumbel noise: Random noise from the Gumbel distribution used to sample from categorical distributions via the Gumbel trick. "Concretely, each forward pass draws Gumbel noise~\citep{jang2017categorical,maddison2017concrete} $G_{ik}$, forms perturbed logits $\hat{\alpha}_{ik} = \alpha_{ik}/\tau + G_{ik}$"
  • Gumbel-STE: The Gumbel Straight-Through Estimator, combining Gumbel sampling with a straight-through gradient for discrete choices. "Gumbel-Straight-Through-Estimator (Gumbel-STE) with budget-constrained DP handles discrete feasibility in the forward pass"
  • Hessian: The matrix of second-order partial derivatives of a function, used to capture curvature; expensive to compute in high dimensions. "Obtaining the gradient $(\nabla C)_{ik} = w_i\, p_{ik}(c_k - E_{p_i}[c])$ (Eq.~\ref{eq:normal}) requires no Hessian or matrix inversion"
  • ILP: Integer Linear Programming, an optimization framework where variables are integers and constraints/objective are linear. "Sensitivity methods... allocate via DP or ILP"
  • KL divergence: Kullback–Leibler divergence; a measure of how one probability distribution diverges from another. "The loss $L(\mathbf{z}^*)$ in Algorithm~\ref{alg:rco} is the KL divergence between the full-precision model and the model under assignment $\mathbf{z}^*$"
  • Lagrangian methods: Techniques that enforce constraints by introducing Lagrange multipliers into the objective. "On hard instances, as we show in Section~\ref{sec:mckp}, Lagrangian methods oscillate."
  • level set: The set of points where a function takes on a constant value. "Therefore, by the regular value theorem, the level set $M = \{C(\boldsymbol{\alpha}) = B\}$ is a smooth $(NK{-}1)$-dimensional Riemannian submanifold of $R^{NK}$."
  • MILP: Mixed-Integer Linear Programming, where some variables are constrained to be integer while others are continuous in a linear optimization problem. "IMPQ~\citep{zhao2025impq} (Shapley-based surrogate with pairwise layer interactions, solved via MILP)"
  • momentum transport: Moving optimizer momentum vectors from one tangent space to another after retraction on a manifold. "we propose Riemannian Constrained Optimization (RCO), which wraps tangent projection, binary-search retraction, and momentum transport around a standard Adam step."
  • monotonic retraction: A retraction whose target function varies monotonically with the retraction parameter, enabling efficient root finding. "The closed-form derivative~\eqref{eq:retraction_deriv} also enables Newton retraction in 2--3 iterations; we use binary search for simplicity, but the multi-constraint extension... builds on Newton root-finding using the same derivative." (See Proposition title: "Monotonic retraction")
  • multiple-choice knapsack problem (MCKP): A combinatorial optimization problem where exactly one item must be chosen from each group to maximize utility under a budget. "The multiple-choice knapsack problem (MCKP) admits closed-form gradients and an exact DP solution"
  • Newton retraction: A retraction method using Newton’s method to solve for the retraction parameter using derivative information. "The closed-form derivative~\eqref{eq:retraction_deriv} also enables Newton retraction in 2--3 iterations"
  • regular value theorem: A result ensuring that preimages of regular values are smooth submanifolds. "Therefore, by the regular value theorem, the level set $M = \{C(\boldsymbol{\alpha}) = B\}$ is a smooth $(NK{-}1)$-dimensional Riemannian submanifold of $R^{NK}$."
  • retraction: In Riemannian optimization, a map that brings an off-manifold point back onto the manifold along a feasible path. "Optimizing on a manifold requires three operations: tangent projection (restricting gradients to the tangent plane), retraction (mapping an iterate that has drifted off the surface back onto it), and vector transport"
  • Riemannian gradient: The gradient vector field defined with respect to the manifold’s metric, i.e., projected onto the tangent space. "we verify in the appendix that... the projected gradient recovers the Riemannian gradient"
  • Riemannian manifold: A smooth manifold equipped with an inner product (metric) on each tangent space, enabling geometric notions like lengths and angles. "we observe that under softmax relaxation, the budget constraint defines a smooth Riemannian manifold in logit space"
  • Sinkhorn projections: Iterative normalization procedures to project matrices onto the set of doubly stochastic matrices. "doubly stochastic matrices require iterative Sinkhorn projections~\citep{douik2019manifold}"
  • slack variable: An auxiliary variable that converts an inequality constraint into an equality by absorbing excess budget. "A slack variable $s$ with $C(\boldsymbol{\alpha}) + s^2 = B$ extends the manifold to inequality constraints"
  • softmax Jacobian: The matrix of derivatives of the softmax function, characterizing how logit perturbations change probabilities. "the softmax Jacobian interacting with linear cost gives the level set $M$ a 'clean' geometry."
  • Stiefel manifold: The set of matrices with orthonormal columns; common constraint set in optimization. "By comparison, retraction on the Stiefel manifold requires QR or polar decomposition"
  • straight-through estimator (STE): A gradient estimator that bypasses non-differentiable operations by using surrogate gradients. "the STE~\citep{bengio2013estimating} replaces the non-differentiable $\arg\max$ with soft probabilities"
  • tangent plane: The linear space of directions tangent to a manifold at a point. "At each point, the tangent plane is the subspace of directions tangent to the surface"
  • tangent projection: Projecting a vector (e.g., a gradient) onto the manifold’s tangent space to respect constraints. "Optimizing on a manifold requires three operations: tangent projection (restricting gradients to the tangent plane), retraction..., and vector transport"
  • vector transport: A rule to move tangent vectors from one point’s tangent space to another’s on a manifold. "vector transport (moving vectors such as optimizer momentum from one tangent plane to another as the iterate moves along the surface)"
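
The Gumbel noise, Gumbel-STE, and MCKP entries above describe a forward pass that perturbs logits and then picks a budget-feasible discrete assignment via DP. The following is a minimal sketch of that idea under stated assumptions, not the paper's algorithm: it assumes integer per-group option costs (a stand-in for the discretized budget), and it assumes the DP maximizes the sum of perturbed logits. The names `gumbel_perturb` and `mckp_dp` are hypothetical.

```python
import math
import random

def gumbel_perturb(alpha, tau, rng):
    """alpha_hat_ik = alpha_ik / tau + G_ik, with G_ik ~ Gumbel(0, 1)."""
    out = []
    for row in alpha:
        new_row = []
        for a in row:
            u = max(rng.random(), 1e-12)      # guard against log(0)
            g = -math.log(-math.log(u))       # inverse-CDF Gumbel sample
            new_row.append(a / tau + g)
        out.append(new_row)
    return out

def mckp_dp(scores, costs, budget):
    """Choose one option per group, maximizing total score with total cost <= budget.

    scores[i][k]: score of option k in group i (here: perturbed logits, an assumption).
    costs[i][k]:  non-negative integer cost of option k in group i.
    Returns the chosen option index for each group.
    """
    neg = float("-inf")
    best = [0.0] + [neg] * budget             # best[b]: best score at exact cost b
    choice = []                               # choice[i][b]: option picked for group i
    for s_i, c_i in zip(scores, costs):
        new = [neg] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            for k, (s, ck) in enumerate(zip(s_i, c_i)):
                if ck <= b and best[b - ck] > neg and best[b - ck] + s > new[b]:
                    new[b] = best[b - ck] + s
                    pick[b] = k
        best = new
        choice.append(pick)
    b = max(range(budget + 1), key=lambda x: best[x])
    if best[b] == neg:
        raise ValueError("no feasible assignment under this budget")
    z = [0] * len(costs)
    for i in range(len(costs) - 1, -1, -1):   # backtrack from the last group
        z[i] = choice[i][b]
        b -= costs[i][z[i]]
    return z

# Toy instance (values are made up for illustration).
rng = random.Random(0)
alpha = [[0.2, 1.0], [0.0, 0.5], [-0.3, 0.8]]
costs = [[1, 3], [1, 3], [1, 3]]
z = mckp_dp(gumbel_perturb(alpha, tau=1.0, rng=rng), costs, budget=7)
```

The table has one row per group and one column per budget unit, so its cost grows with both the number of options and the budget resolution, which is consistent with the DP scalability note above.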
