Abstract: We consider the problem of Cost-Aware Learning, where sampling different component functions of a finite-sum objective incurs different costs. The objective is to reach a target error while minimizing the total cost. First, we propose the Cost-Aware Stochastic Gradient Descent algorithm for convex functions, and derive its cost complexity to attain an error of $ε$. Furthermore, we establish a lower bound for this setting and provide a subset selection algorithm to further reduce the cost of training. We apply our theoretical insights to reinforcement learning with LLMs, where the computational cost of policy gradients varies with sequence length. To this end, we introduce Cost-Aware GRPO, an algorithm designed to reduce the cost of policy optimization while preserving performance. Empirical results on 1.5B and 8B LLMs demonstrate that our approach reduces the tokens used in policy optimization by up to about 30% while matching or exceeding baseline accuracy.
The paper introduces Cost-Aware SGD, a variant of projected SGD that explicitly minimizes training cost by accounting for non-uniform per-sample computational expenses.
It derives an optimal sampling distribution based on a gradient proxy and evaluation cost, yielding explicit cost-dependent convergence bounds that improve on uniform sampling and cost-agnostic importance sampling.
The research proposes a Min-Cost Knapsack subset selection strategy to reduce token usage with minimal accuracy loss, demonstrating robust efficiency gains in RLHF settings.
Cost-Aware Learning: Theory and Practice
Motivation and Problem Formulation
The paper "Cost-Aware Learning" (2604.28020) addresses supervised and reinforcement learning settings in which the computational expense of querying individual samples is non-uniform and often known in advance. Traditional SGD analyses assume that every sample costs the same, but application domains such as RL with LLMs show substantial variance in per-example cost (e.g., FLOPs scaling with sequence length). The central problem is to minimize the total computational cost incurred to reach an $\epsilon$-accurate solution of a finite-sum convex objective, explicitly accounting for sample-level cost heterogeneity.
Cost-Aware SGD: Theoretical Foundations
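The authors introduce Cost-Aware SGD, a variant of projected SGD designed to minimize the expected training cost for convex and strongly convex objectives under non-uniform per-sample costs. The analysis derives the optimal sampling distribution $p^*$:
$p_i^* \propto G_i / \sqrt{c_i}$
where $G_i$ is a Lipschitz constant (an upper bound on the per-sample gradient norm) and $c_i$ is the evaluation cost of sample $i$. The optimality criterion is to minimize the product of the gradient estimator's variance and the expected cost per iteration. The second-moment bound of an importance-weighted estimator scales as $\sum_i G_i^2 / p_i$ and the expected cost per iteration is $\sum_i p_i c_i$, so by the Cauchy-Schwarz inequality their product is at least $\left(\sum_i G_i \sqrt{c_i}\right)^2$, with equality at $p^*$. This leads to explicit cost-dependent convergence bounds.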
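The sampling rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the product-minimizing distribution $p_i \propto G_i / \sqrt{c_i}$, a Euclidean ball as the feasible set, a constant step size, and iterate averaging. The names `cost_aware_probs`, `cost_aware_sgd`, `grad_fn`, and `radius` are hypothetical.

```python
import numpy as np

def cost_aware_probs(G, c):
    """Sampling distribution with p_i proportional to G_i / sqrt(c_i)."""
    G = np.asarray(G, dtype=float)
    c = np.asarray(c, dtype=float)
    w = G / np.sqrt(c)
    return w / w.sum()

def cost_aware_sgd(grad_fn, x0, G, c, step_size, num_steps, radius, seed=0):
    """Projected SGD on f(x) = (1/n) * sum_i f_i(x) with cost-aware sampling.

    grad_fn(i, x) is a hypothetical callback returning the gradient of f_i at x.
    The feasible set is assumed to be the Euclidean ball of the given radius.
    Returns the averaged iterate and the total cost paid for gradient queries.
    """
    rng = np.random.default_rng(seed)
    c = np.asarray(c, dtype=float)
    p = cost_aware_probs(G, c)
    n = len(p)
    x = np.asarray(x0, dtype=float).copy()
    x_avg = np.zeros_like(x)
    total_cost = 0.0
    for t in range(num_steps):
        i = rng.choice(n, p=p)
        # Dividing by n * p_i keeps the estimate unbiased for the full gradient.
        g = grad_fn(i, x) / (n * p[i])
        x = x - step_size * g
        # Project back onto the ball if the step left the feasible set.
        norm = np.linalg.norm(x)
        if norm > radius:
            x = x * (radius / norm)
        x_avg += (x - x_avg) / (t + 1)  # running average of the iterates
        total_cost += c[i]
    return x_avg, total_cost
```

A toy usage, with quadratic components $f_i(x) = \tfrac{1}{2}(x - b_i)^2$ chosen purely for illustration, would be `cost_aware_sgd(lambda i, x: x - b[i], np.zeros(1), G, c, 0.01, 5000, 10.0)`.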
Cost Complexity and Lower Bound
The paper proves that the minimum expected cost to reach $\epsilon$ error is proportional to $\left(\sum_i G_i \sqrt{c_i}\right)^2$, which improves on uniform sampling and on variance-optimal importance sampling, both of which ignore cost information. The lower-bound construction shows that the dependence on this term is fundamental: no unbiased estimator can asymptotically improve on this cost scaling. Additionally, classical importance sampling (proportional to $G_i$) may be strictly sub-optimal when gradient norms and costs are highly correlated.
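The gap between the three sampling schemes can be checked numerically. The sketch below uses made-up values of $G_i$ and $c_i$ (not taken from the paper) in which expensive samples also have large gradients, and compares the product of the second-moment bound and the expected per-iteration cost. The helper name `cost_variance_product` is hypothetical.

```python
import numpy as np

def cost_variance_product(p, G, c):
    """Second-moment bound of the importance-weighted estimator times expected cost."""
    n = len(G)
    second_moment = np.sum(G**2 / p) / n**2
    expected_cost = np.sum(p * c)
    return second_moment * expected_cost

# Synthetic instance: costly samples also carry large gradient bounds.
G = np.array([1.0, 1.0, 4.0, 4.0])
c = np.array([1.0, 1.0, 100.0, 100.0])
n = len(G)

uniform = np.full(n, 1.0 / n)
importance = G / G.sum()                               # classical, ignores cost
cost_aware = (G / np.sqrt(c)) / np.sum(G / np.sqrt(c))

products = {
    "uniform": cost_variance_product(uniform, G, c),
    "importance": cost_variance_product(importance, G, c),
    "cost_aware": cost_variance_product(cost_aware, G, c),
}
# Cauchy-Schwarz value that the cost-aware distribution should attain.
bound = np.sum(G * np.sqrt(c)) ** 2 / n**2

for name, value in products.items():
    print(f"{name:>10}: {value:.2f}")
print(f"{'bound':>10}: {bound:.2f}")
```

On this instance the cost-aware distribution attains the bound, and gradient-proportional importance sampling does worse than uniform sampling because it concentrates on the most expensive samples.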
Practical Extensions: Subset Selection
To further reduce training cost, the paper introduces a controlled-bias subset selection strategy, formalized as a Min-Cost Knapsack problem: select the subset that minimizes aggregate cost subject to a tunable bias constraint on the excluded gradients. A greedy approximation that selects the samples with the lowest $c_i$ runs efficiently and guarantees a $2$-approximation of the optimal coverage-cost tradeoff. Empirical validation demonstrates significant cost reductions with minimal accuracy loss for reasonable bias budgets.
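A cheapest-first greedy selection in the spirit of this description might look as follows. The exact constraint used in the paper is not given here, so the sketch assumes the bias constraint takes the form "the sum of $G_i$ over excluded samples is at most `bias_budget`". The function name and that parameter are hypothetical, and no approximation guarantee is claimed for this particular sketch.

```python
import numpy as np

def greedy_min_cost_subset(G, c, bias_budget):
    """Keep the cheapest samples until the excluded gradient mass fits the budget.

    Assumed constraint: sum of G_i over samples NOT selected <= bias_budget.
    Returns the sorted indices of the selected samples.
    """
    G = np.asarray(G, dtype=float)
    c = np.asarray(c, dtype=float)
    order = np.argsort(c, kind="stable")   # cheapest samples first
    excluded_mass = G.sum()                # initially nothing is selected
    selected = []
    for i in order:
        if excluded_mass <= bias_budget + 1e-12:
            break                          # bias constraint already satisfied
        selected.append(int(i))
        excluded_mass -= G[i]
    return sorted(selected)

# Toy example: equal gradient bounds, varying costs.
G = [1.0, 1.0, 1.0, 1.0]
c = [5.0, 1.0, 3.0, 2.0]
subset = greedy_min_cost_subset(G, c, bias_budget=2.0)
subset_cost = sum(c[i] for i in subset)
```

Setting `bias_budget=0` keeps every sample, while a larger budget allows the most expensive samples to be dropped.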
Cost-Aware Policy Optimization for LLMs
The theoretical insights are operationalized in the RLHF setting for LLM post-training, focusing on the policy gradient step. The authors augment GRPO with cost-aware importance sampling, defining the cost of a sample as its total prompt and response token length (a proxy for FLOPs) and using the advantage magnitude as an estimate of the gradient-norm bound $G_i$. Constructing the empirical sampling distribution and applying importance weights fits into standard GRPO pipelines, preserving performance while reducing optimization cost.
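The selection step could be sketched as below. This is an illustrative reading, not the paper's code: it assumes group-normalized advantages as in standard GRPO, sampling with replacement, weights of $1/(n p_i)$, and the $|A_i| / \sqrt{c_i}$ form of the distribution. The function names and the `num_samples` parameter are hypothetical.

```python
import numpy as np

def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std, or zeros if all rewards tie."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std < 1e-8:
        return np.zeros_like(r)
    return (r - r.mean()) / std

def cost_aware_selection(advantages, token_counts, num_samples, seed=0):
    """Pick responses for the policy-gradient step with cost-aware probabilities.

    |advantage| stands in for the gradient-norm bound and the token count for
    the cost. Returns (indices, importance_weights); both are empty when every
    advantage is zero, since such a group carries no policy-gradient signal.
    """
    rng = np.random.default_rng(seed)
    a = np.abs(np.asarray(advantages, dtype=float))
    c = np.asarray(token_counts, dtype=float)
    w = a / np.sqrt(c)
    if w.sum() == 0.0:
        return np.array([], dtype=int), np.array([], dtype=float)
    p = w / w.sum()
    idx = rng.choice(len(p), size=num_samples, replace=True, p=p)
    # Weighting each drawn sample by 1 / (n * p_i) and averaging over draws
    # keeps the estimate of the group-mean gradient term unbiased.
    weights = 1.0 / (len(p) * p[idx])
    return idx, weights

adv = group_advantages([1.0, 0.0, 0.0, 1.0])
idx, weights = cost_aware_selection(adv, [100, 400, 100, 400], num_samples=1000)
```

In a training loop, each selected response's per-sample loss would be multiplied by its weight before averaging, so shorter responses are visited more often while long ones are up-weighted when they are drawn.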
Empirical Results
Cost-Aware GRPO exhibits strong empirical results on Qwen2.5-Math-1.5B-Instruct and Qwen3-8B models across MATH500, AMC, GSM8K, and AIME benchmarks. Notably:
Performance: In multiple settings, cost-aware variants attain equivalent or superior final accuracy compared to standard GRPO and GRPO+ZVF, demonstrating robustness and efficacy.
Proxy Fidelity: A high Pearson correlation between advantage magnitude and gradient norm, together with a near-zero cost-biased divergence, supports using the advantage as a practical proxy for the optimal theoretical distribution.
Generalizability: Cost-aware sampling strategies are robust to objective variations (CISPO, ZVF), noise in gradient proxy estimation, and smoothed sampling distributions.
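The summary does not say how the sampling distribution is smoothed. One common choice, shown here purely as an assumption, is to mix the cost-aware distribution with the uniform one, which bounds the importance weights by $1/\alpha$. The function name and the mixing parameter `alpha` are hypothetical.

```python
import numpy as np

def smooth_distribution(p, alpha):
    """Mix a sampling distribution with uniform: (1 - alpha) * p + alpha / n."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    return (1.0 - alpha) * p + alpha / n

p = np.array([0.7, 0.2, 0.1, 0.0])
alpha = 0.2
p_smooth = smooth_distribution(p, alpha)

# Every sample now has probability at least alpha / n, so the importance
# weight 1 / (n * p_i) can never exceed 1 / alpha.
max_weight = np.max(1.0 / (len(p) * p_smooth))
```

Smoothing of this kind trades some of the cost saving for lower variance in the importance weights, and it keeps samples with a zero proxy value reachable.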
Implications, Limitations, and Future Directions
The implications are substantial for large-scale foundation model training and data-efficient RL algorithms. The methodology directly addresses the scalability bottleneck imposed by sequence length heterogeneity in LLM RL, providing an actionable recipe to minimize hardware cost without sacrificing convergence guarantees or performance. The theoretical analysis extends with minimal loss in generality to non-convex regimes, adaptive optimizers, and actively evolving datasets.
Practically, this opens avenues for:
Integration into RLHF and RLVR Algorithms: The cost-aware framework is compatible with direct preference optimization, PPO, CISPO, and zero-variance filtering, allowing seamless adoption in state-of-the-art RL for LLMs.
Proxy Optimization: Identification and validation of alternative proxies for gradient norm estimation may further enhance sampling fidelity, especially as LLM architectures evolve.
Theoretical Generalization: Extending the finite-sum analysis to more general stochastic regimes and non-convex objectives remains a meaningful direction.
Conclusion
"Cost-Aware Learning" (2604.28020) formalizes and solves the problem of efficient learning under non-uniform sample-level computational costs, both theoretically and empirically. The proposed Cost-Aware SGD and downstream policy optimization algorithms exhibit substantial improvements in training efficiency, aligning variance reduction with computational constraints. The results demonstrate robust, generalizable gains in token/FLOPs efficiency for post-train LLM RL, with maintained or improved downstream accuracy, motivating broader adoption and further theoretical exploration.