Efficiency-Aware Reward: Success vs. Cost
- Efficiency-aware reward is a design principle that evaluates task success jointly with the resources expended, such as compute, samples, or tokens.
- It employs methods like compute-aware tree search, Beta-distributed reward shaping, and conservative alignment to balance performance and resource cost.
- Its applications range from test-time reasoning to embodied control, improving sample complexity and operational efficiency, often without sacrificing task quality.
Efficiency-aware reward denotes reward design or reward use in which task success is evaluated jointly with the resources expended to obtain, estimate, or optimize that success. In recent work, the term appears in external test-time reasoning as accuracy under a fixed inference budget, in sparse-reward reinforcement learning as sample-efficient and computationally efficient shaping, in generative alignment as reward-aware but conservative preference optimization, and in stochastic control as an explicit efficiency–reward trade-off between operating reward and congestion (Song et al., 23 May 2025, Ma et al., 2024, Wang et al., 26 May 2026, Qu et al., 30 Jan 2026). Across these settings, the common target is not reward in isolation but reward relative to compute, samples, tokens, queried feedback, tool calls, or queue length; the theoretical backdrop includes reward shaping for improved sample complexity and the observation that, in linear MDPs, reward-free RL can be no harder than reward-aware RL in dimension dependence (Gupta et al., 2022, Wagenmaker et al., 2022).
1. Conceptual scope
The literature uses efficiency-aware reward in multiple but related senses. In external test-time reasoning, the reward signal is required to reflect not only whether a reasoning path is correct but also the compute it costs to obtain, with efficiency realized by explicitly penalizing compute and favoring reward separation between the best and second-best path (Song et al., 23 May 2025). In sparse-reward RL, efficiency-aware reward shaping densifies otherwise delayed feedback while remaining computationally cheap at scale, as in self-adaptive success-rate shaping with KDE and Random Fourier Features (Ma et al., 2024). In human-in-the-loop or expensive-reward environments, efficiency-aware reward instead means reducing dependence on costly reward queries by substituting a learned proxy whenever confidence is high (Satici et al., 28 Feb 2025).
A second strand treats reward computation itself as an efficiency bottleneck. Diffusion alignment methods such as listwise reward-aware preference optimization, offline Arena-based fine-grained rewards, and reward-aware trajectory shaping aim to preserve rich preference information while avoiding the online cost of RLHF or the rigidity of pairwise binary supervision (Wang et al., 26 May 2026, Li et al., 7 May 2026, Li et al., 16 Apr 2026). Large-reasoning-model training extends the same principle to token efficiency: efficiency rewards target shorter reasoning traces, but only under conditions that prevent collapse, reward hacking, or brevity bias (Lee et al., 21 Jun 2026, Xie et al., 5 Jun 2026, Liu et al., 18 Jun 2026).
This breadth implies that efficiency-aware reward is better understood as a design principle than as a single objective. The shared premise is that reward should encode not just task success, but the marginal value of the resources consumed while pursuing that success.
2. Formal objectives and efficiency criteria
A canonical formulation appears in compute-aware test-time reasoning. Given a frozen policy and a compute budget $B$, the objective is to choose the search and sampling hyperparameters $\psi$ that maximize the probability of returning the correct answer $a^*(q)$ to a question $q$:
$\psi^*(B) = \arg\max_{\psi}\,\mathbb{E}_{a\sim\mathrm{Target}(\psi,B,q)}\big[\mathbbm{1}\{a=a^*(q)\}\big].$
In this setting, efficiency metrics include accuracy at fixed $B$, accuracy per sample or token, expected samples to reach a target accuracy, and time-to-accuracy; the budget may be modeled as the total number of candidate paths, decoding tokens, or cumulative tree expansions (Song et al., 23 May 2025).
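The budgeted selection above can be illustrated with a small grid search. This is only a sketch under stated assumptions: the `(width, depth)` configuration, the `cost` model (width times depth), and the `is_correct` callback are hypothetical stand-ins, not the procedure of Song et al.

```python
from typing import Callable, Iterable, Sequence, Tuple

Config = Tuple[int, int]  # hypothetical (search width, search depth)


def cost(config: Config) -> int:
    """Assumed cost model: candidate paths explored = width * depth."""
    width, depth = config
    return width * depth


def select_config(
    configs: Iterable[Config],
    budget: int,
    questions: Sequence[str],
    is_correct: Callable[[Config, str], bool],
) -> Tuple[Config, float]:
    """Return the config within budget that has the highest empirical accuracy."""
    best, best_acc = None, -1.0
    for cfg in configs:
        if cost(cfg) > budget:
            continue  # infeasible under the compute budget B
        # Empirical estimate of E[1{a = a*(q)}] over the question set.
        acc = sum(is_correct(cfg, q) for q in questions) / len(questions)
        if acc > best_acc:
            best, best_acc = cfg, acc
    if best is None:
        raise ValueError("no configuration fits the budget")
    return best, best_acc
```

In practice `is_correct` would run the frozen policy with the given search settings and compare its answer to the reference; here it is left abstract so that the selection logic stays visible.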
A different formalization appears when reward acquisition itself is costly. Adaptive confidence discounting introduces a query variable, which decides whether the true reward is requested, and a proxy reward that is used otherwise, with the implicit efficiency objective of maintaining return while reducing the number of queried rewards (Satici et al., 28 Feb 2025). In this view, efficiency is measured in feedback budget rather than inference budget.
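A confidence-gated proxy reward of this kind might look like the sketch below. It assumes a hypothetical proxy model that outputs a categorical distribution over discrete reward values and uses predictive entropy as the confidence signal; the threshold `max_entropy` and the function names are illustrative, not taken from Satici et al.

```python
import math
from typing import Callable, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def gated_reward(
    proxy_probs: Sequence[float],
    reward_values: Sequence[float],
    query_true_reward: Callable[[], float],
    max_entropy: float,
) -> Tuple[float, bool]:
    """Return (reward, queried).

    If the proxy is confident (low entropy), use its most likely reward and
    skip the costly query; otherwise pay for the true reward.
    """
    if entropy(proxy_probs) <= max_entropy:
        idx = max(range(len(proxy_probs)), key=lambda i: proxy_probs[i])
        return reward_values[idx], False
    return query_true_reward(), True
```

Counting how often the second return value is `True` over training gives the feedback budget that this formulation tries to reduce.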
Queueing control makes the trade-off explicit as a constrained optimization problem. The operator minimizes steady-state congestion subject to near-optimal reward, where near-optimality is expressed as a bound on the reward regret under a state-dependent arrival policy (Qu et al., 30 Jan 2026). This formalism is notable because efficiency is not a proxy metric but the primary constrained quantity.
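A toy version of this constrained problem can be worked through on a single-server queue. The sketch assumes a hypothetical threshold admission policy (arrivals are admitted only while fewer than `K` jobs are present), Poisson arrivals at rate `lam`, exponential service at rate `mu`, and a reward of `r` per admitted job; none of these modelling choices come from Qu et al.

```python
from typing import List, Tuple


def stationary(lam: float, mu: float, K: int) -> List[float]:
    """Stationary distribution of a birth-death queue truncated at K jobs."""
    rho = lam / mu
    weights = [rho ** k for k in range(K + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def evaluate(lam: float, mu: float, K: int, r: float = 1.0) -> Tuple[float, float]:
    """Return (reward rate, mean queue length) for admission threshold K."""
    pi = stationary(lam, mu, K)
    throughput = lam * (1.0 - pi[K])  # arrivals are blocked only in state K
    congestion = sum(k * p for k, p in enumerate(pi))
    return r * throughput, congestion


def min_congestion_policy(
    lam: float, mu: float, eps: float, max_K: int = 200, r: float = 1.0
) -> Tuple[int, float]:
    """Minimize mean queue length subject to reward regret <= eps."""
    best_reward = r * min(lam, mu)  # throughput cannot exceed either rate
    feasible = []
    for K in range(1, max_K + 1):
        reward, congestion = evaluate(lam, mu, K, r)
        if best_reward - reward <= eps:
            feasible.append((congestion, K))
    if not feasible:
        raise ValueError("no threshold meets the regret bound")
    congestion, K = min(feasible)
    return K, congestion
```

Tightening `eps` forces a larger threshold and therefore a longer queue, which is the efficiency-reward trade-off in miniature.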
In token-efficient multimodal reasoning, efficiency is sometimes evaluated directly as an explicit token-efficiency metric, which makes the trade-off operational during reinforcement learning rather than only at deployment time (Liu et al., 18 Jun 2026). Earlier RL theory gives a complementary perspective: UCBVI-Shaped improves sample complexity through bonus scaling and value projection, whereas in linear MDPs the worst-case separation between reward-aware and reward-free exploration disappears in dimension dependence (Gupta et al., 2022, Wagenmaker et al., 2022).
3. Core design patterns
A first pattern is direct compute-aware reward shaping. In Compute-Aware Tree Search (CATS), the per-step reward explicitly balances compute cost, a margin term estimated by a process reward model (PRM), and the maximum PRM score. The associated analysis ties efficiency to reward model generalization: lower PRM generalization error and larger reward margin reduce mis-ranking risk and therefore the number of required samples or expansions (Song et al., 23 May 2025).
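To make the three ingredients concrete, the sketch below combines them linearly. The linear form and the weights `cost_weight` and `margin_weight` are assumptions chosen for illustration; the exact functional form used by CATS is not reproduced here.

```python
from typing import Sequence


def compute_aware_reward(
    prm_scores: Sequence[float],
    compute_cost: float,
    cost_weight: float = 0.01,
    margin_weight: float = 0.5,
) -> float:
    """Illustrative per-step reward: max PRM score + margin bonus - compute penalty."""
    if len(prm_scores) < 2:
        raise ValueError("need at least two candidate paths to compute a margin")
    ranked = sorted(prm_scores, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = best - runner_up  # separation between best and second-best path
    return best + margin_weight * margin - cost_weight * compute_cost
```

A large margin suggests the ranking is unlikely to flip with more samples, so further expansions are worth less; the penalty term then makes stopping early the higher-reward choice.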
A second pattern is dense shaping under explicit uncertainty. SASR defines an additive shaped reward in which a Beta-distributed success-rate term is added to the environment reward. The Beta variance is high early and low later, yielding exploration first and exploitation later. The success-rate estimates are computed non-parametrically through KDE accelerated by Random Fourier Features, so the shaping is intended to be both sample-efficient and computationally efficient (Ma et al., 2024).
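The exploration-then-exploitation behaviour follows from how Beta variance shrinks as evidence accumulates. The sketch below assumes plain success and failure counts with a uniform `+1` prior and a fixed shaping weight; SASR itself estimates these counts with KDE and Random Fourier Features, which is omitted here.

```python
import random


def beta_variance(n_success: float, n_failure: float) -> float:
    """Variance of Beta(n_success + 1, n_failure + 1)."""
    a, b = n_success + 1.0, n_failure + 1.0
    return a * b / ((a + b) ** 2 * (a + b + 1.0))


def shaped_reward(
    env_reward: float,
    n_success: float,
    n_failure: float,
    weight: float = 0.5,
    rng: random.Random = random.Random(),
) -> float:
    """Environment reward plus a weighted sample from the success-rate Beta."""
    sampled_rate = rng.betavariate(n_success + 1.0, n_failure + 1.0)
    return env_reward + weight * sampled_rate
```

With no evidence the sampled term is uniform on [0, 1] and acts like exploration noise; after many visits it concentrates near the empirical success rate. Because the term is additive rather than potential-based, it does not carry a policy-invariance guarantee.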
A third pattern is conservative reward-aware alignment. Diffusion LAIR converts continuous groupwise rewards into centered advantage weights and optimizes a listwise objective with a bounded optimum. A hyperparameter of that objective controls the magnitude of the implicit reward and the conservativeness of the induced preference shift (Wang et al., 26 May 2026). ArenaPO follows a related efficiency logic but computes a fine-grained pairwise gap offline from an Arena and truncated-normal latent-variable inference, thereby avoiding reward-model training and additional online overhead (Li et al., 7 May 2026).
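The centering step can be shown in isolation. This sketch assumes simple mean-centering within each group of samples; whether LAIR additionally rescales the advantages is not stated in the text, so no rescaling is applied, and the listwise objective itself is not reproduced.

```python
from typing import List, Sequence


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract the group mean so that the weights sum to zero."""
    if not rewards:
        raise ValueError("group must contain at least one reward")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Because the weights sum to zero, samples are pushed up or down only relative to their own group, which removes any dependence on the absolute scale offset of the reward model.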
A fourth pattern is lightweight or internal reward modeling. ELHSR scores a path by a gated linear readout over hidden states. Because the features are already produced during decoding, reward evaluation becomes a near-free add-on relative to text-output reward models (Guo et al., 18 May 2025).
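One plausible shape for such a readout is a gate-weighted average of per-token linear scores. The specific form below (sigmoid gate, normalization by total gate mass) and the parameter names are assumptions for illustration, not the published ELHSR definition.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def path_score(
    hidden_states: Sequence[Sequence[float]],
    w_gate: Sequence[float],
    w_value: Sequence[float],
    b_gate: float = 0.0,
    b_value: float = 0.0,
) -> float:
    """Gate-weighted average of linear per-token scores over a reasoning path."""
    if not hidden_states:
        raise ValueError("path must contain at least one hidden state")
    gates = [sigmoid(dot(w_gate, h) + b_gate) for h in hidden_states]
    values = [dot(w_value, h) + b_value for h in hidden_states]
    return sum(g * v for g, v in zip(gates, values)) / sum(gates)
```

The only learned parameters are two vectors and two biases per readout, and the hidden states come for free from decoding, which is why the marginal cost of scoring is small.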
A fifth pattern is gated efficiency pressure. ACOER applies efficiency rewards only to correct completions in order to eliminate the structural collapse induced by continuous length penalties on incorrect answers under GRPO (Lee et al., 21 Jun 2026). SlimSearcher uses a strict correctness gate together with cohort-relative tool and token efficiency, whereas CARE modulates reasoning-length rewards by competence stage and batch-relative effort, thereby shifting from exploration-oriented long reasoning to efficiency-oriented concise reasoning as competence rises (Xie et al., 5 Jun 2026, Liu et al., 18 Jun 2026). A common implication is that efficiency-aware reward is not equivalent to a raw length penalty.
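The difference between a gated efficiency reward and a raw length penalty is easy to see in code. The sketch below assumes a binary correctness signal, a hypothetical `token_budget`, and a linear bonus for tokens saved; it illustrates the gating idea only and is not the ACOER, SlimSearcher, or CARE reward.

```python
def gated_efficiency_reward(
    is_correct: bool,
    num_tokens: int,
    token_budget: int,
    bonus_weight: float = 0.5,
) -> float:
    """Correctness reward plus a length bonus that applies only when correct."""
    if not is_correct:
        # Incorrect answers all receive the same reward, so getting shorter
        # never pays off for a wrong completion.
        return 0.0
    saved_fraction = max(0.0, 1.0 - num_tokens / token_budget)
    return 1.0 + bonus_weight * saved_fraction
```

Under a raw penalty, a short wrong answer would outscore a long wrong answer, which gives the policy a gradient toward brevity that is unrelated to correctness; the gate removes that gradient.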
4. Representative instantiations across domains
| Setting | Reward construction | Exemplars |
|---|---|---|
| External test-time reasoning | Compute-aware search reward, PRM generalization, margin-aware allocation | CATS (Song et al., 23 May 2025) |
| Sparse-reward RL and costly feedback | Additive Beta-shaped rewards; confidence-gated proxy rewards | SASR (Ma et al., 2024), adaptive confidence discounting (Satici et al., 28 Feb 2025) |
| Diffusion and generative alignment | Listwise centered advantages, offline fine-grained rewards, reward-aware gates | LAIR (Wang et al., 26 May 2026), ArenaPO (Li et al., 7 May 2026), RATS (Li et al., 16 Apr 2026) |
| Reward model efficiency | Hidden-state or logits-based linear scoring | ELHSR (Guo et al., 18 May 2025) |
| Web, reasoning, and video agents | Correctness-gated tool/token efficiency; competence-aware length shaping; dual semantic-temporal rewards | SlimSearcher (Xie et al., 5 Jun 2026), ACOER (Lee et al., 21 Jun 2026), CARE (Liu et al., 18 Jun 2026), VideoLLM RLT (Li et al., 2 Jun 2025) |
| Embodied and multi-agent control | Context-conditioned reward weights, motion-aware reward selection, differentiated state-transition reward | VL-PR (Tian et al., 29 Jun 2026), MA-ROESL (Wang et al., 13 May 2025), differentiated reward for cooperative driving (Han et al., 1 Feb 2025) |
The embodied-control variants show how efficiency-aware reward extends beyond token or sample budgets. In robotic endovascular navigation, a multimodal LLM infers procedural context and computes a posterior over phases; reward is then a context-averaged Frobenius inner product between a component matrix and phase-specific weight matrices, so that traversal efficiency, safety, deviation, and bending are reweighted online rather than globally fixed (Tian et al., 29 Jun 2026). In locomotion-from-video, MA-ROESL improves reward quality indirectly through motion-aware frame selection and then evaluates candidate rewards offline with IQL before final online fine-tuning, thereby shifting much of reward optimization away from repeated simulator interaction (Wang et al., 13 May 2025).
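The context-averaged reward described for endovascular navigation can be written as a posterior-weighted sum of Frobenius inner products. The sketch below uses plain nested lists for matrices; the matrix shapes, the phase labels, and the example values are hypothetical.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def frobenius_inner(a: Matrix, b: Matrix) -> float:
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def context_averaged_reward(
    components: Matrix,
    phase_weights: Sequence[Matrix],
    phase_posterior: Sequence[float],
) -> float:
    """Expected reward over phases: sum_k p(k) * <components, W_k>_F."""
    if len(phase_weights) != len(phase_posterior):
        raise ValueError("one weight matrix is required per phase")
    return sum(
        p * frobenius_inner(components, w)
        for p, w in zip(phase_posterior, phase_weights)
    )
```

As the inferred posterior shifts from one procedural phase to another, the same component values (for example traversal progress and safety margins) are reweighted smoothly instead of being switched by a hard rule.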
Multi-agent traffic control and hierarchical sparse-reward RL demonstrate yet another usage. The differentiated reward method for cooperative driving introduces a position reward weighted by instantaneous velocity components, so that small but meaningful state-transition gradients become visible even in near steady-state traffic (Han et al., 1 Feb 2025). ALCS instead structures reward hierarchically: low-level intrinsic rewards are subtask-specific and dense, whereas the high-level policy is updated on extrinsic task reward using the subtask actually achieved, which is intended to improve sample efficiency without manually designing a full reward machine (Han et al., 2024).
5. Empirical behavior
| Paper | Representative reported outcome | Efficiency modality |
|---|---|---|
| CATS (Song et al., 23 May 2025) | With Qwen2.5-7B, approximately 98.0% on GSM8K, 89.4% on MATH, and 58.4% on OlympiadBench | Accuracy-per-compute |
| SASR (Ma et al., 2024) | AntStand 39.12 ± 2.86 vs ReLara 28.66 ± 1.82; RobotPush 137.06 ± 12.66 vs ReLara 58.71 ± 6.98 | Sample efficiency and convergence stability |
| Adaptive confidence discounting (Satici et al., 28 Feb 2025) | Fetch-Push reaches similar or better success with ~216,000 rewards vs HER baseline ~961,000 rewards | Reward-query efficiency |
| VL-PR (Tian et al., 29 Jun 2026) | Coronary: 100% success (30/30), 41.6±2.1 steps vs SAC 70% (21/30), 80.6±3.5 steps | Navigational efficiency and reliability |
| MA-ROESL (Wang et al., 13 May 2025) | Average training time 14.49 h to 4.54 h, a 68.67% reduction | Wall-clock efficiency |
| ACOER (Lee et al., 21 Jun 2026) | MATH-500 88.4% with 2,134 tokens vs base model 88.8% with 5,553 tokens | Token efficiency |
| SlimSearcher (Xie et al., 5 Jun 2026) | GAIA rounds 20.56 to 10.61 and accuracy 0.682 to 0.709 | Tool-call and token efficiency |
These results share two notable properties. First, efficiency-aware reward often improves or preserves task quality rather than merely trading it away. In video reasoning, CARE improves scores on VSI-Bench, VideoMMMU, MVBench, TempCompass, and VideoMME while also exhibiting an inverted-U trajectory of reasoning length and higher token efficiency (Liu et al., 18 Jun 2026). In video reinforcement learning tuning, a variance-aware 32k curated subset outperforms larger-data baselines across temporal grounding and video QA tasks, indicating that reward informativeness can substitute for raw dataset scale (Li et al., 2 Jun 2025).
Second, several systems report that stability depends as much on how efficiency enters the reward as on the magnitude of the efficiency signal. RATS improves the efficiency–quality trade-off in few-step generation without adding inference-time overhead, but does so through reward-aware gating rather than a fixed distillation target (Li et al., 16 Apr 2026). Conversely, ACOER shows that continuous length penalties on incorrect answers under GRPO can collapse performance even when the penalty is very small, so the empirical gain is inseparable from the gating structure itself (Lee et al., 21 Jun 2026).
6. Limitations, misconceptions, and open problems
A persistent misconception is that efficiency-aware reward is simply reward minus a length or compute penalty. The recent literature does not support that reduction. Some methods use explicit penalties, but many instead rely on correctness gates, confidence thresholds, context posteriors, batch-relative normalization, or offline reward surrogates. This is partly because the underlying assumptions are fragile: PRM-based bounds for test-time reasoning depend on bounded deviations, independence assumptions, and prior choice in PAC-Bayes analysis; adaptive confidence discounting relies on entropy as a heuristic confidence proxy; and queueing guarantees change sharply between small-market and large-market regimes and between concave-like and non-concave-like reward functions (Song et al., 23 May 2025, Satici et al., 28 Feb 2025, Qu et al., 30 Jan 2026).
A second misconception is that efficiency shaping is automatically policy-preserving. SASR explicitly uses additive shaping rather than potential-based shaping and therefore does not guarantee policy invariance (Ma et al., 2024). Diffusion alignment methods remain dependent on the quality of offline rewards or reward surrogates: biased reward scores, imperfect capability estimates, or noisy listwise rewards can misdirect optimization even when the update is conservative (Wang et al., 26 May 2026, Li et al., 7 May 2026). In embodied settings, context misclassification or inference latency can transiently misprioritize reward components, and motion-aware or VLM-based reward generation remains vulnerable to perception noise and domain shift (Tian et al., 29 Jun 2026, Wang et al., 13 May 2025).
The most acute current controversy concerns stability under RL-style normalization. ACOER isolates a structural failure mode in GRPO: continuous penalties on incorrect answers create a collapse loop, whereas correct-only efficiency rewards remove that primary mechanism but still require dynamic budget normalization and control-loop adaptation to avoid stochastic over-compression (Lee et al., 21 Jun 2026). CARE reaches a similar conclusion from the multimodal side: fixed normalization confounds verbosity with task complexity, so competence-aware, batch-relative control is required for stable length adaptation (Liu et al., 18 Jun 2026).
Open problems follow directly from these limitations. External test-time reasoning calls for tighter bounds that connect PRM generalization error, reward margin, and token-level budgets, including ranking-specific or uncertainty-aware variants (Song et al., 23 May 2025). Sparse-reward shaping still lacks formal convergence guarantees in the self-adaptive, additive setting and invites richer uncertainty models than a Beta distribution over KDE-based counts (Ma et al., 2024). Reward-aware and reward-free exploration in linear MDPs remain separated by horizon factors rather than dimension, leaving the full efficiency picture incomplete (Wagenmaker et al., 2022). Taken together, these results suggest that efficiency-aware reward remains an active synthesis problem: it is simultaneously about what to optimize, when to apply that pressure, and how to ensure that the reward itself is cheap, calibrated, and structurally safe.