
Reward-Based Scaling (RBS) Overview

Updated 4 July 2026
  • Reward-Based Scaling (RBS) is a framework that redefines reward optimization by dynamically adjusting magnitudes, compositions, and contrasts.
  • It employs techniques such as dynamic reweighting, pair-specific scaling, and rubric-based adjustments to address signal saturation and optimize learning.
  • RBS methods enhance performance by adaptively emphasizing informative signals and reducing failure modes like reward overoptimization and hacking.

Reward-Based Scaling (RBS) is a family of methods that modifies reward magnitudes, reward composition, reward contrast, or the effect of reward on optimization, rather than treating reward as a fixed scalar objective. In the literature represented here, RBS includes direct scalar rescaling, dynamic re-weighting of extrinsic and intrinsic terms, pair-specific preference scaling, criterion-wise rubric reallocation, reward-model scaling, and inference-time reward-guided search. The common objective is to reshape the reward landscape so that optimization emphasizes the signals that are most informative, least saturated, or most reliable for the task and stage of learning (Golchin et al., 25 Aug 2025, Huang et al., 25 Jun 2026).

1. Conceptual scope and formal patterns

A minimal formulation of RBS writes the optimized reward as a weighted composition of multiple signals,

$$R_{\text{total}}(s,a) = R_{\text{task}}(s,a) + \sum_k \lambda_k(t)\,R_{\text{intrinsic}}^{(k)}(s,a),$$

with time-varying coefficients $\lambda_k(t)$ that determine how much each component contributes to learning. In DRTA, this appears as

$$R_{\text{total}}(s_t,a_t)=R_1(s_t,a_t)+\lambda(t)R_2(s_t,a_t),$$

where $R_1$ is a classification reward and $R_2$ is a VAE reconstruction-error term; the coefficient is updated per episode by

$$\lambda_{t+1}=\mathrm{clip}\bigl(\lambda_t+\alpha(R_{\text{target}}-R_{\text{episode}}),\lambda_{\min},\lambda_{\max}\bigr),$$

so low episodic return increases the weight on the intrinsic signal and high episodic return decreases it (Golchin et al., 25 Aug 2025).
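The per-episode update can be sketched in a few lines of Python. This is a minimal illustration of the clipped update rule, not DRTA's released code; the function names and every numeric value (step size, target return, bounds, episodic returns) are assumptions chosen for the example.

```python
def update_lambda(lam, episode_return, target_return, alpha, lam_min, lam_max):
    """One per-episode step: lambda <- clip(lambda + alpha * (R_target - R_episode))."""
    proposed = lam + alpha * (target_return - episode_return)
    return max(lam_min, min(lam_max, proposed))


def total_reward(r_task, r_intrinsic, lam):
    """Composite reward R_1 + lambda * R_2 for a single step."""
    return r_task + lam * r_intrinsic


# Hypothetical episodic returns: two episodes below the target, one above it.
lam = 1.0
history = []
for episode_return in [2.0, 6.0, 12.0]:
    lam = update_lambda(lam, episode_return, target_return=10.0,
                        alpha=0.1, lam_min=0.0, lam_max=2.0)
    history.append(lam)
```

In this toy run the coefficient rises while returns sit below the target, is held at the upper bound by the clip, and starts to fall once the return exceeds the target.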

A second pattern scales rewards at the level of individual preference pairs. Adaptive Preference Scaling introduces a pair-specific variable $\tau_i$ and optimizes losses of the form

$$-\tau_i \log \sigma\!\Big(\frac{r(z_{w,i})-r(z_{l,i})}{\tau_i}\Big)+\rho\,\tau_i,$$

derived from a KL-constrained distributionally robust optimization objective. The learned $\tau_i$ is small for ambiguous pairs and large for clear preferences, so weak evidence is compressed and strong evidence is amplified (Hong et al., 2024).
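To see why the scale changes how hard a pair is pushed, the per-pair loss and its gradient with respect to the reward margin can be evaluated directly. The sketch below assumes the loss form stated in this section; the function names and the sample values of the margin, $\tau$, and $\rho$ are illustrative assumptions, and the joint optimization over $\tau_i$ is not shown.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def aps_loss(margin, tau, rho):
    """Per-pair loss: -tau * log(sigmoid(margin / tau)) + rho * tau."""
    return -tau * log_sigmoid(margin / tau) + rho * tau


def margin_gradient(margin, tau):
    """d(loss)/d(margin) = -sigmoid(-margin / tau)."""
    return -math.exp(log_sigmoid(-margin / tau))


# Same reward margin, two hypothetical scales.
grad_small_tau = margin_gradient(1.0, tau=0.1)
grad_large_tau = margin_gradient(1.0, tau=5.0)
```

With a small $\tau$ the gradient on an already-positive margin is close to zero, so the reward gap stops growing early; with a large $\tau$ the gradient stays near $-1/2$ and keeps widening the gap.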

A third pattern scales reward dimensions rather than examples. Under rubric-based rewards, Focal Reward computes a saturation estimate $P^{(k)}$ for each criterion and rescales the criterion weights according to that estimate, thereby reallocating optimization pressure toward criteria with more headroom (Huang et al., 26 May 2026).

These patterns show that RBS is broader than simple multiplication by a scalar. It includes magnitude rescaling, reward shaping, adaptive reweighting, contrast enhancement, and reward-dependent importance weighting (Golchin et al., 25 Aug 2025, Huang et al., 25 Jun 2026).

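| Mode of RBS | Representative formulation | Representative papers |
|---|---|---|
| Dynamic component weighting | $R_1 + \lambda(t)\,R_2$ with a clipped per-episode update of $\lambda$ | (Golchin et al., 25 Aug 2025) |
| Pair-specific preference scaling | Pair-specific scale $\tau_i$ inside the preference loss | (Hong et al., 2024) |
| Criterion-wise rubric scaling | Criterion weights rescaled by saturation estimates $P^{(k)}$ | (Huang et al., 26 May 2026) |
| Reward-dependent group weighting | Group-mean-reward weights with sigmoid reward sharpening | (Huang et al., 25 Jun 2026) |
| Generative reward scoring | Probability of a "yes" token given the context | (Wu et al., 10 Sep 2025) |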

2. Dynamic reweighting of exploration, difficulty, and rubric criteria

In sequential decision problems, RBS is often used to balance exploration against exploitation. DRTA formulates time-series anomaly detection as DQN-based binary classification over sliding-window states, with $R_1$ defined by asymmetric TP/TN/FP/FN rewards and $R_2$ given by VAE reconstruction error. The fixed classification magnitudes differ across the four outcomes, which explicitly encodes that missing anomalies is far worse than false alarms. The dynamic coefficient $\lambda(t)$ starts high, encouraging exploration of windows with high reconstruction error, and decreases as episodic reward approaches the target, shifting emphasis toward classification correctness (Golchin et al., 25 Aug 2025).

DyRef applies RBS to grouped policy optimization for multi-reference image generation. Its Difficulty-aware Advantage Reweighting (DAR) uses the group mean reward as a difficulty proxy and assigns weights so that hard, low-reward groups receive higher loss weight. Its Discriminative Reward Scaling (DRS) then transforms sample rewards with a sigmoid sharpening function, which enlarges intra-group reward differences. On OmniRef-Bench, the full method reaches a higher average MLLM score than the variants without DAR or without DRS, indicating that inter-group reweighting and intra-group contrast address distinct failure modes (Huang et al., 25 Jun 2026).
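The two DyRef mechanisms can be illustrated with a toy sketch. The functional forms below (a linear difficulty weight and a mean-centred sigmoid) are assumptions made for illustration and are not the paper's formulas; `group_weight`, `sharpen`, `beta`, and `slope` are hypothetical names, and rewards are assumed to lie in [0, 1].

```python
import math


def group_weight(rewards, beta=1.0):
    """Hypothetical difficulty weight: larger when the group mean reward is lower."""
    mean = sum(rewards) / len(rewards)
    return 1.0 + beta * (1.0 - mean)


def sharpen(rewards, slope=10.0):
    """Hypothetical sigmoid sharpening centred on the group mean reward."""
    mean = sum(rewards) / len(rewards)
    return [1.0 / (1.0 + math.exp(-slope * (r - mean))) for r in rewards]


hard_group = [0.10, 0.20, 0.15]   # low mean reward, treated as difficult
easy_group = [0.80, 0.90, 0.85]   # high mean reward, treated as easy
close_rewards = [0.45, 0.55]      # two samples the raw reward barely separates
sharpened = sharpen(close_rewards)
```

Under this assumed form the sigmoid only enlarges gaps near the group mean when its slope exceeds 4, because the sigmoid's maximum derivative is 1/4; a smaller slope would compress the rewards instead.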

Focal Reward addresses a different pathology: static scalarization of rubric-based rewards causes “imbalanced reward polarization,” in which easy criteria saturate and continue to dominate the scalar reward while hard criteria remain under-optimized. Its inverse reward projection estimates criterion saturation on the reward frontier, and focal weights then shift training toward under-satisfied criteria. Across three model scales and six benchmarks, Focal Reward outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons, and the paper attributes the gains to online, saturation-aware reallocation toward rubric dimensions that still have room for improvement (Huang et al., 26 May 2026).
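A saturation-aware reallocation of this kind can be sketched as follows. The weight form used here, proportional to $(1 - P^{(k)})^{\gamma}$ and borrowed from focal loss, is an assumption for illustration and not the formula from the Focal Reward paper; the saturation values and `gamma` are hypothetical.

```python
def focal_weights(saturation, gamma=2.0):
    """Hypothetical focal-style weights: w_k proportional to (1 - P_k) ** gamma."""
    raw = [(1.0 - p) ** gamma for p in saturation]
    total = sum(raw)
    if total == 0.0:
        # Every criterion is fully saturated: fall back to uniform weights.
        return [1.0 / len(saturation)] * len(saturation)
    return [w / total for w in raw]


def scalar_reward(criterion_scores, weights):
    """Weighted aggregation of per-criterion rubric scores."""
    return sum(w * s for w, s in zip(weights, criterion_scores))


# Three rubric criteria: nearly saturated, partly satisfied, mostly unmet.
saturation = [0.95, 0.60, 0.20]
weights = focal_weights(saturation)
```

Here the nearly saturated criterion receives almost no weight, so further gains on it barely move the scalar reward, while the mostly unmet criterion dominates.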

A plausible implication is that dynamic reward scaling is most useful when the information value of different reward channels changes during training. This is explicit in DRTA’s shift from unsupervised reconstruction signals to labeled classification reward, in DyRef’s shift toward hard groups and more discriminative intra-group gradients, and in Focal Reward’s shift away from saturated rubric dimensions (Golchin et al., 25 Aug 2025, Huang et al., 25 Jun 2026, Huang et al., 26 May 2026).

3. Reward models as scaling objects

Another major strand of RBS treats the reward model itself as the object being scaled. RewardDance replaces regression-style scalar heads with a generative reward defined as the probability of a “yes” token given the context, where the context includes prompt, images, task-aware instructions, and optionally chain-of-thought. This creates two scaling axes: model scaling, with reward models from 1B to 26B parameters, and context scaling, through instructions, reference examples, and CoT. On text-to-image RL fine-tuning, Seedream-3.0-SFT improves with the 26B reward model relative to training with no RM, and OOD reward-model accuracy rises from the 1B to the 26B model. The paper further argues that larger reward models maintain higher reward variance during RL, which it interprets as resistance to reward hacking and mode collapse (Wu et al., 10 Sep 2025).
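Reading a reward off a "yes" token amounts to a softmax over next-token logits. The sketch below is a generic illustration under that assumption, not RewardDance's implementation; the four-token vocabulary, the logit values, and the index of the "yes" token are all hypothetical.

```python
import math


def yes_probability(logits, yes_id):
    """Softmax probability assigned to the 'yes' token."""
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[yes_id] / sum(exps)


# Toy next-token logits; index 2 is assumed to be the "yes" token.
logits = [0.1, -1.0, 2.3, 0.4]
reward = yes_probability(logits, yes_id=2)
```

Because the reward is a probability, it is bounded in (0, 1) and comes from the same next-token head the model uses for generation.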

ESFP-RM advances a different scaling thesis: LM-based judging reward modeling is formally consistent with natural language inference, so reward-model quality scales with “comprehension boundaries.” It defines a scoring function and combines explanation generation with masked slot prediction. On eSNLI, the paper reports separate accuracy ranges for autoregressive models without explanations, autoregressive models with explanations, and MLMs with explanations, and contrasts the two autoregressive settings with the MLM setting. The same framework yields more stable and generalizable reward signals in RLHF and OOD evaluation, with DeBERTa-large ESFP-RM evaluated on RMB and SHP (Ning et al., 25 Aug 2025).

CodeScaler demonstrates that reward-model scaling need not depend on explicit execution at deployment time. It trains an execution-free reward model on carefully curated code preferences and uses validity-preserving reward shaping, so invalid code is always worse than valid code. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base on average and outperforms binary execution-based RL. At inference time it matches or exceeds strong unit-test-based baselines with a reduction in latency (Zhu et al., 4 Feb 2026).
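One way to guarantee that invalid code never outranks valid code is to map valid scores into a bounded range and place invalid samples strictly below it. The sketch below is an assumed construction for illustration, not CodeScaler's shaping function; `INVALID_REWARD`, `shaped_reward`, and the `is_valid` flag are hypothetical, and how validity is decided is left open.

```python
import math

INVALID_REWARD = -1.0  # assumed floor, strictly below every valid reward


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def shaped_reward(rm_score, is_valid):
    """Valid code is squashed into [0, 1]; invalid code gets the fixed floor."""
    if not is_valid:
        return INVALID_REWARD
    return sigmoid(rm_score)


worst_valid = shaped_reward(-1000.0, is_valid=True)
best_invalid = shaped_reward(1000.0, is_valid=False)
```

The ordering holds regardless of the raw reward-model score, which keeps a miscalibrated score on invalid code from being exploited during RL.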

These results support a broad claim: RBS is not limited to modifying downstream scalar rewards. It also includes scaling the representational capacity, contextual bandwidth, and structural alignment of reward models themselves (Wu et al., 10 Sep 2025, Ning et al., 25 Aug 2025, Zhu et al., 4 Feb 2026).

4. Inference-time scaling, personalization, and rubric systems

RBS also appears at inference time, where reward functions guide search without updating model parameters. CARINOX optimizes and explores diffusion-model initial noise using a composite reward that combines HPS, ImageReward, DA Score, and VQA Score. It combines multi-start search with gradient ascent in noise space and reports average alignment gains on T2I-CompBench++ and on HRS. For SD-Turbo on T2I-CompBench++, the mean score rises under CARINOX (Kasaei et al., 22 Sep 2025).
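The combination of multi-start exploration and gradient ascent in noise space can be sketched on a toy problem. In the sketch below each "reward model" is replaced by a concave quadratic with an analytic gradient; in a real diffusion pipeline the gradient would come from backpropagation through the generator and the reward models. All names, weights, and hyperparameters are assumptions for illustration.

```python
import random


def composite_reward(z, weights, centers):
    """Toy composite: weighted sum of concave quadratics, one per reward term."""
    return sum(-w * sum((zi - ci) ** 2 for zi, ci in zip(z, c))
               for w, c in zip(weights, centers))


def composite_gradient(z, weights, centers):
    """Analytic gradient of the toy composite reward with respect to z."""
    return [sum(-2.0 * w * (z[i] - c[i]) for w, c in zip(weights, centers))
            for i in range(len(z))]


def optimize_noise(weights, centers, dim=2, starts=4, steps=50, lr=0.05, seed=0):
    """Multi-start search: sample several initial noises, ascend each, keep the best."""
    rng = random.Random(seed)
    best_z, best_r = None, float("-inf")
    for _ in range(starts):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]       # exploration
        for _ in range(steps):                               # optimization
            g = composite_gradient(z, weights, centers)
            z = [zi + lr * gi for zi, gi in zip(z, g)]
        r = composite_reward(z, weights, centers)
        if r > best_r:
            best_z, best_r = z, r
    return best_z, best_r


weights = [0.5, 0.5]
centers = [[1.0, 0.0], [0.0, 1.0]]   # each toy reward term prefers a different point
best_z, best_r = optimize_noise(weights, centers)
```

The selected noise settles between the two preferred points, which is the compromise a weighted composite reward encodes. On this concave toy every start converges to the same optimum; with non-concave rewards the starts can end in different local optima, which is where multi-start search matters.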

P-GenRM extends RBS to personalized alignment. It generates structured evaluation chains containing persona descriptions and scoring rubrics, then applies dual-granularity test-time user-based scaling with a two-term score. The first term averages multiple chains for the same user; the second imports signals from similar users via learned user prototypes. The system reports an average improvement on personalized reward-model benchmarks, and test-time user-based scaling provides an additional boost (Zhang et al., 12 Feb 2026).

Rubric systems generalize this idea by replacing opaque scalar judgments with explicit criteria. OpenRubrics constructs synthetic rubric data through Contrastive Rubric Generation and preference-label consistency filtering, and Rubric-RM surpasses strong size-matched baselines on reward-modeling benchmarks (Liu et al., 9 Oct 2025). OpenRS goes further by using Pairwise Adaptive Meta-Rubrics and Pointwise Verifiable Rubrics. It computes a pairwise adaptive score, and its RL reward combines subjective pairwise judgment with verifiable components. As reward supervision for pairwise RL training, OpenRS improves the average across five benchmarks over a scalar RM (Jia et al., 15 Feb 2026).

A plausible implication is that inference-time RBS is especially effective when the underlying generator is frozen or expensive to retrain, and when multiple candidates, user contexts, or rubric dimensions can be cheaply re-scored relative to one another (Kasaei et al., 22 Sep 2025, Zhang et al., 12 Feb 2026, Jia et al., 15 Feb 2026).

5. Overoptimization, calibration, and evaluation

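RBS also includes the study of how reward optimization fails. “Scaling Laws for Reward Model Overoptimization” formalizes the distinction between a gold reward model and a proxy reward model, and shows that optimizing the proxy eventually degrades gold performance. For Best-of-$n$ sampling, gold reward follows

$$R_{\text{bon}}(d) = d\,(\alpha_{\text{bon}} - \beta_{\text{bon}}\,d),$$

where $d = \sqrt{D_{\text{KL}}(\pi \,\|\, \pi_{\text{init}})}$ is the square root of the KL divergence between the optimized policy and the initial policy. For RL, the empirical law is

$$R_{\text{RL}}(d) = d\,(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d).$$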

Larger and better-trained reward models increase the safe optimization region, but overoptimization does not disappear (Gao et al., 2022).
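The size of the safe optimization region follows from where each law peaks. The sketch below assumes the functional forms reported by Gao et al., $R_{\text{bon}}(d) = d(\alpha - \beta d)$ and $R_{\text{RL}}(d) = d(\alpha - \beta \log d)$ with $d$ the square root of the KL divergence from the initial policy; the coefficient values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def gold_reward_bon(d, alpha, beta):
    """Best-of-n form: d * (alpha - beta * d)."""
    return d * (alpha - beta * d)


def gold_reward_rl(d, alpha, beta):
    """RL form: d * (alpha - beta * log d), defined for d > 0."""
    return d * (alpha - beta * math.log(d))


def bon_peak(alpha, beta):
    """Setting alpha - 2 * beta * d = 0 gives d* = alpha / (2 * beta)."""
    return alpha / (2.0 * beta)


def rl_peak(alpha, beta):
    """Setting alpha - beta * log d - beta = 0 gives d* = exp(alpha / beta - 1)."""
    return math.exp(alpha / beta - 1.0)


# Hypothetical coefficients, not fitted values from the paper.
d_bon = bon_peak(alpha=2.0, beta=0.5)
d_rl = rl_peak(alpha=1.0, beta=0.5)
```

In both forms a smaller $\beta$ moves the peak to a larger $d$, so the policy can move further from its initialization before gold reward starts to fall.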

RewardBench 2 provides a benchmark-level perspective on this problem. It reports that models score about 20 points lower on RewardBench 2 than on the first RewardBench, and that the RewardBench 2 average correlates strongly, by Pearson correlation, with best-of-$N$ performance across 113 reward models. The same paper finds that high RewardBench 2 accuracy is necessary but not sufficient for PPO success: once reward models are “good enough,” PPO performance saturates for on-policy, in-distribution RMs, while lineage mismatches and prompt-distribution mismatches can degrade RLHF even for reward models with high benchmark scores (Malik et al., 2 Jun 2025).

Classical deep RL work shows that even simple scalar reward rescaling materially affects optimization. ANS studies reward scaling in ReLU-based RL and argues that reward scaling is not equivalent to learning-rate tuning. It proposes Adaptive Network Scaling to search for a suitable reward scale and transfer the critic via layer-wise rescaling. On MuJoCo, the paper compares A2C + ANS against vanilla A2C with ReLU and DDPG + ANS against vanilla DDPG with ReLU on HalfCheetah-v2 (Wu et al., 2018).

These results correct a common misconception: better reward signals, larger reward models, or larger reward magnitudes do not simply monotonically improve optimization. They improve some regimes, alter stability conditions, and widen safe operating regions, but they also create new failure modes centered on Goodhart effects, calibration drift, reward hacking, and distribution mismatch (Gao et al., 2022, Malik et al., 2 Jun 2025, Wu et al., 2018).

6. Limitations and current research directions

Several limitations recur across the literature. Dynamic weighting methods often expose new hyperparameters, such as the step size $\alpha$, the target return $R_{\text{target}}$, and the clipping bounds $\lambda_{\min}$ and $\lambda_{\max}$ in DRTA, or the weighting and sharpening parameters in DyRef and Focal Reward; systematic sensitivity analyses are often limited (Golchin et al., 25 Aug 2025, Huang et al., 25 Jun 2026, Huang et al., 26 May 2026). Non-potential-based shaping, as in DRTA, does not preserve policy invariance guarantees (Golchin et al., 25 Aug 2025). Large reward models and test-time scaling mechanisms remain computationally expensive, as illustrated by RewardDance’s 26B reward models and CARINOX’s substantial runtime and VRAM costs (Wu et al., 10 Sep 2025, Kasaei et al., 22 Sep 2025). Rubric systems and personalized judges improve transparency and controllability, but they also introduce new questions about fairness, prototype bias, rubric misspecification, and the possibility of optimizing the surface form of criteria rather than the intended latent preference (Zhang et al., 12 Feb 2026, Jia et al., 15 Feb 2026, Liu et al., 9 Oct 2025).

An important new direction is reward modeling without explicit human supervision. “Scaling Reward Modeling without Human Supervision” defines RBS as preference learning over document prefixes and suffixes from raw web corpora and trains reward models with in-batch Bradley–Terry losses plus a centering regularizer. Training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2, including gains on the RewardBench v2 average and on in-domain math subsets, and consistent gains on out-of-domain safety and general subsets. When used for best-of-$N$ selection and GRPO-style policy optimization, these unsupervised reward models substantially improve downstream math performance and can match or exceed strong supervised baselines of similar size (Fan et al., 11 Feb 2026).
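An in-batch Bradley–Terry objective with a centering term can be sketched as follows. The pairing scheme (each prefix's true suffix preferred over the other suffixes in the batch) and the regularizer (squared mean score) are assumptions for illustration, not the paper's exact construction; `scores`, `center_coef`, and the toy matrices are hypothetical.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x)), which equals -log(sigmoid(-x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def in_batch_bt_loss(scores, center_coef=0.01):
    """scores[i][j] is the reward of suffix j given prefix i.

    The diagonal holds each prefix's true continuation; every off-diagonal
    entry in the same row serves as an in-batch negative. Needs a batch of
    at least two prefixes.
    """
    n = len(scores)
    pair_losses = [softplus(scores[i][j] - scores[i][i])   # -log sigmoid(s_ii - s_ij)
                   for i in range(n) for j in range(n) if j != i]
    bt = sum(pair_losses) / len(pair_losses)
    mean_score = sum(sum(row) for row in scores) / (n * n)
    return bt + center_coef * mean_score ** 2              # keep scores centred near 0


good = [[2.0, 0.0], [0.0, 2.0]]   # true suffixes score highest
bad = [[0.0, 2.0], [2.0, 0.0]]    # mismatched suffixes score highest
loss_good = in_batch_bt_loss(good)
loss_bad = in_batch_bt_loss(bad)
```

The Bradley–Terry term depends only on score differences, so without some centering term the absolute scale of the reward is unconstrained; the squared-mean penalty is one simple way to pin it down.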

Taken together, the literature suggests that RBS has become a general design principle rather than a single technique. It now spans scalar reward tuning, adaptive shaping, contrast amplification, rubric-conditioned judging, reward-model scaling, user-conditioned aggregation, inference-time search, and unsupervised preference induction. The central technical question is no longer whether reward should be scaled, but which aspect of the reward pipeline—magnitude, composition, contrast, representation, context, or evaluation protocol—should be scaled for a given optimization regime and failure model (Golchin et al., 25 Aug 2025, Wu et al., 10 Sep 2025, Fan et al., 11 Feb 2026).
