Reward-Based Scaling (RBS) Overview

Updated 4 July 2026

Reward-Based Scaling (RBS) is a framework that treats reward as an adjustable quantity, dynamically changing reward magnitudes, composition, and contrast during optimization.
It employs techniques such as dynamic reweighting, pair-specific scaling, and rubric-based adjustments to address signal saturation.
RBS methods aim to improve performance by adaptively emphasizing informative signals and reducing failure modes such as reward overoptimization and reward hacking.

Reward-Based Scaling (RBS) is a family of methods that modifies reward magnitudes, reward composition, reward contrast, or the effect of reward on optimization, rather than treating reward as a fixed scalar objective. In the literature represented here, RBS includes direct scalar rescaling, dynamic re-weighting of extrinsic and intrinsic terms, pair-specific preference scaling, criterion-wise rubric reallocation, reward-model scaling, and inference-time reward-guided search. The common objective is to reshape the reward landscape so that optimization emphasizes the signals that are most informative, least saturated, or most reliable for the task and stage of learning (Golchin et al., 25 Aug 2025, Huang et al., 25 Jun 2026).

1. Conceptual scope and formal patterns

A minimal formulation of RBS writes the optimized reward as a weighted composition of multiple signals,

$R_{\text{total}}(s,a) = R_{\text{task}}(s,a) + \sum_k \lambda_k(t)\,R_{\text{intrinsic}}^{(k)}(s,a),$

with time-varying coefficients $\lambda_k(t)$ that determine how much each component contributes to learning. In DRTA, this appears as

$R_{\text{total}}(s_t,a_t)=R_1(s_t,a_t)+\lambda(t)R_2(s_t,a_t),$

where $R_1$ is a classification reward and $R_2$ is a VAE reconstruction-error term; the coefficient is updated per episode by

$\lambda_{t+1}=\mathrm{clip}\bigl(\lambda_t+\alpha(R_{\text{target}}-R_{\text{episode}}),\lambda_{\min},\lambda_{\max}\bigr),$

so low episodic return increases the weight on the intrinsic signal and high episodic return decreases it (Golchin et al., 25 Aug 2025).
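The per-episode update of $\lambda$ can be sketched in a few lines. This is a minimal illustration of the clipped rule stated above, not DRTA's implementation; the function names and all numeric values (step size, target return, bounds, episode returns) are assumptions chosen for the example.

```python
def update_lambda(lam, episode_return, target_return, alpha, lam_min, lam_max):
    """One per-episode step: clip(lam + alpha * (R_target - R_episode), lam_min, lam_max)."""
    proposed = lam + alpha * (target_return - episode_return)
    return max(lam_min, min(lam_max, proposed))


def total_reward(r_task, r_intrinsic, lam):
    """Composite reward R_total = R_1 + lambda * R_2."""
    return r_task + lam * r_intrinsic


# Illustrative values only (not taken from the paper).
lam = 0.5
for episode_return in [2.0, 6.0, 12.0]:
    # Returns below the target raise lambda; returns above it lower lambda.
    lam = update_lambda(lam, episode_return, target_return=10.0,
                        alpha=0.01, lam_min=0.0, lam_max=1.0)
```

With these made-up numbers, the first two episodes fall short of the target and push $\lambda$ up, while the third exceeds it and pulls $\lambda$ back down; the clip keeps the coefficient inside $[\lambda_{\min}, \lambda_{\max}]$ regardless of how far the return is from the target.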

A second pattern scales rewards at the level of individual preference pairs. Adaptive Preference Scaling introduces a pair-specific variable $\tau_i$ and optimizes losses of the form

$-\tau_i \log \sigma\!\Big(\frac{r(z_{w,i})-r(z_{l,i})}{\tau_i}\Big)+\rho\,\tau_i,$

derived from a KL-constrained distributionally robust optimization objective. The learned $\tau_i$ is small for ambiguous pairs and large for clear preferences, so weak evidence is compressed and strong evidence is amplified (Hong et al., 2024).
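For concreteness, the per-pair loss above can be evaluated directly. The sketch below only computes the stated expression for a given $\tau_i$ and $\rho$; it does not reproduce how Adaptive Preference Scaling optimizes $\tau_i$, and the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def scaled_pair_loss(r_win, r_lose, tau, rho):
    """Evaluate -tau * log sigmoid((r_win - r_lose) / tau) + rho * tau for one pair."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    margin = r_win - r_lose
    return -tau * log_sigmoid(margin / tau) + rho * tau
```

Setting `tau=1` and `rho=0` recovers the standard Bradley–Terry logistic loss, which is a convenient sanity check when wiring a pair-specific scale into an existing reward-model training loop.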

A third pattern scales reward dimensions rather than examples. Under rubric-based rewards, Focal Reward computes a saturation estimate $P^{(k)}$ for each criterion and then rescales criterion weights as

[…]

thereby reallocating optimization pressure toward criteria with more headroom (Huang et al., 26 May 2026).

These patterns show that RBS is broader than simple multiplication by a scalar. It includes magnitude rescaling, reward shaping, adaptive reweighting, contrast enhancement, and reward-dependent importance weighting (Golchin et al., 25 Aug 2025, Huang et al., 25 Jun 2026).

Mode of RBS	Representative formulation	Representative papers
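Dynamic component weighting	$R_{\text{total}}=R_1+\lambda(t)R_2$	(Golchin et al., 25 Aug 2025)
Pair-specific preference scaling	$-\tau_i\log\sigma\big((r(z_{w,i})-r(z_{l,i}))/\tau_i\big)+\rho\,\tau_i$	(Hong et al., 2024)
Criterion-wise rubric scaling	criterion weights rescaled from the saturation estimate $P^{(k)}$	(Huang et al., 26 May 2026)
Reward-dependent group weighting	loss weights derived from the group mean reward	(Huang et al., 25 Jun 2026)
Generative reward scoring	reward defined as the probability of a “yes” token	(Wu et al., 10 Sep 2025)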

2. Dynamic reweighting of exploration, difficulty, and rubric criteria

In sequential decision problems, RBS is often used to balance exploration against exploitation. DRTA formulates time-series anomaly detection as DQN-based binary classification over sliding-window states, with $R_1$ defined by asymmetric TP/TN/FP/FN rewards and $R_2$ given by VAE reconstruction error. The fixed classification magnitudes are […], […], […], and […], which explicitly encode that missing an anomaly is far worse than raising a false alarm. The dynamic coefficient $\lambda(t)$ starts high, encouraging exploration of windows with high reconstruction error, and decreases as episodic reward approaches the target, shifting emphasis toward classification correctness (Golchin et al., 25 Aug 2025).

DyRef applies RBS to grouped policy optimization for multi-reference image generation. Its Difficulty-aware Advantage Reweighting (DAR) uses the group mean reward

[…]

as a difficulty proxy and assigns weights

[…]

so hard, low-reward groups receive higher loss weight. Its Discriminative Reward Scaling (DRS) then transforms sample rewards with a sigmoid sharpening function,

[…]

which enlarges intra-group reward differences. On OmniRef-Bench, the full method reaches an average MLLM score of […], compared with […] without DAR and […] without DRS, indicating that inter-group reweighting and intra-group contrast address distinct failure modes (Huang et al., 25 Jun 2026).

Focal Reward addresses a different pathology: static scalarization of rubric-based rewards causes “imbalanced reward polarization,” in which easy criteria saturate and continue to dominate the scalar reward while hard criteria remain under-optimized. Its inverse reward projection estimates criterion saturation on the reward frontier, and focal weights then shift training toward under-satisfied criteria. Across three model scales and six benchmarks, Focal Reward outperforms the strongest static aggregation baseline in all 18 model-benchmark comparisons, and the paper attributes the gains to online, saturation-aware reallocation toward rubric dimensions that still have room for improvement (Huang et al., 26 May 2026).

A plausible implication is that dynamic reward scaling is most useful when the information value of different reward channels changes during training. This is explicit in DRTA’s shift from unsupervised reconstruction signals to labeled classification reward, in DyRef’s shift toward hard groups and more discriminative intra-group gradients, and in Focal Reward’s shift away from saturated rubric dimensions (Golchin et al., 25 Aug 2025, Huang et al., 25 Jun 2026, Huang et al., 26 May 2026).

3. Reward models as scaling objects

Another major strand of RBS treats the reward model itself as the object being scaled. RewardDance replaces regressive scalar heads with a generative reward defined as the probability of a “yes” token,

$P(\text{yes} \mid \text{context}),$

where the context includes prompt, images, task-aware instructions, and optionally chain-of-thought (CoT). This creates two scaling axes: model scaling, with reward models from 1B to 26B parameters, and context scaling, through instructions, reference examples, and CoT. On text-to-image RL fine-tuning, Seedream-3.0-SFT improves from […] with no RM to […] with the 26B reward model; OOD reward-model accuracy rises from […] at 1B to […] at 26B. The paper further argues that larger reward models maintain higher reward variance during RL, which it interprets as resistance to reward hacking and mode collapse (Wu et al., 10 Sep 2025).
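A generative reward of this kind can be read off the next-token distribution. The sketch below assumes the reward is the softmax probability of a single “yes” token over the full next-token logits; the helper name, the toy vocabulary, and the choice to normalize over the whole vocabulary (rather than only over yes/no) are assumptions, not details confirmed by the source.

```python
import math


def yes_token_reward(logits, yes_index):
    """Softmax probability assigned to the "yes" token, given next-token logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[yes_index] / sum(exps)


# Toy vocabulary ["yes", "no", "maybe"] with made-up logits.
reward = yes_token_reward([2.0, 0.0, -1.0], yes_index=0)
```

Because the reward is a probability, it is bounded in $[0, 1]$ without a separate regression head, and richer context (instructions, references, CoT) changes the reward only by changing the logits.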

ESFP-RM advances a different scaling thesis: LM-based judging reward modeling is formally consistent with natural language inference, so reward-model quality scales with “comprehension boundaries.” It defines a scoring function

[…]

and combines explanation generation with masked slot prediction. On eSNLI, autoregressive models without explanations achieve roughly […]–[…] accuracy, autoregressive models with explanations reach roughly […]–[…], while MLMs with explanations achieve about […]–[…]. The same framework yields more stable and generalizable reward signals in RLHF and OOD evaluation, with DeBERTa-large ESFP-RM reaching an average of […] on RMB and SHP (Ning et al., 25 Aug 2025).

CodeScaler demonstrates that reward-model scaling need not depend on explicit execution at deployment time. It trains an execution-free reward model on carefully curated code preferences and uses validity-preserving reward shaping,

[…]

so invalid code is always worse than valid code. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of […] points and outperforms binary execution-based RL by […] points. At inference time it matches or exceeds strong unit-test-based baselines with a […]-fold reduction in latency (Zhu et al., 4 Feb 2026).
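The ordering constraint (any invalid program scores below every valid one) can be enforced with a bounded transform plus a floor. The sketch below is one way to obtain that property and is not CodeScaler's shaping function: the sigmoid squashing, the constant `INVALID_REWARD`, and the boolean `is_valid` flag are all assumptions made for illustration.

```python
import math

INVALID_REWARD = -1.0  # hypothetical floor, strictly below the sigmoid's range


def sigmoid(x):
    """Numerically stable logistic function with values in [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def shaped_reward(rm_score, is_valid):
    """Bound valid-code rewards to [0, 1]; send invalid code to a fixed floor."""
    if not is_valid:
        return INVALID_REWARD
    return sigmoid(rm_score)
```

Bounding the valid branch means that even a very low reward-model score on valid code still ranks above invalid code with a very high raw score, which is the property the shaping is meant to guarantee.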

These results support a broad claim: RBS is not limited to modifying downstream scalar rewards. It also includes scaling the representational capacity, contextual bandwidth, and structural alignment of reward models themselves (Wu et al., 10 Sep 2025, Ning et al., 25 Aug 2025, Zhu et al., 4 Feb 2026).

4. Inference-time scaling, personalization, and rubric systems

RBS also appears at inference time, where reward functions guide search without updating model parameters. CARINOX optimizes and explores diffusion-model initial noise using a composite reward

[…]

whose components are HPS, ImageReward, DA Score, and VQA Score. It combines multi-start search with gradient ascent in noise space and reports […] average alignment gains on T2I-CompBench++ and […] on HRS. For SD-Turbo on T2I-CompBench++, the mean score rises from […] to […] under CARINOX (Kasaei et al., 22 Sep 2025).

P-GenRM extends RBS to personalized alignment. It generates structured evaluation chains containing persona descriptions and scoring rubrics, then applies dual-granularity test-time user-based scaling: […] The first term averages multiple chains for the same user; the second imports signals from similar users via learned user prototypes. The system reports an average improvement of […] on personalized reward-model benchmarks, and test-time user-based scaling provides an additional […] boost (Zhang et al., 12 Feb 2026).

Rubric systems generalize this idea by replacing opaque scalar judgments with explicit criteria. OpenRubrics constructs synthetic rubric data through Contrastive Rubric Generation and preference-label consistency filtering, and Rubric-RM surpasses strong size-matched baselines by […] on reward-modeling benchmarks (Liu et al., 9 Oct 2025). OpenRS goes further by using Pairwise Adaptive Meta-Rubrics and Pointwise Verifiable Rubrics. Its pairwise adaptive score is

[…]

and its RL reward combines subjective pairwise judgment with verifiable components,

[…]

As reward supervision for pairwise RL training, OpenRS improves the average across five benchmarks from […] with a scalar RM to […] (Jia et al., 15 Feb 2026).

A plausible implication is that inference-time RBS is especially effective when the underlying generator is frozen or expensive to retrain, and when multiple candidates, user contexts, or rubric dimensions can be cheaply re-scored relative to one another (Kasaei et al., 22 Sep 2025, Zhang et al., 12 Feb 2026, Jia et al., 15 Feb 2026).

5. Overoptimization, calibration, and evaluation

RBS also includes the study of how reward optimization fails. “Scaling Laws for Reward Model Overoptimization” formalizes the distinction between a gold reward model and a proxy reward model, and shows that optimizing the proxy eventually degrades gold performance. For best-of-$n$ sampling, gold reward follows

$R_{\text{bo}n}(d) = d\,(\alpha_{\text{bo}n} - \beta_{\text{bo}n}\,d),$

where $d = \sqrt{D_{\text{KL}}(\pi \,\|\, \pi_{\text{init}})}$. For RL, the empirical law is

$R_{\text{RL}}(d) = d\,(\alpha_{\text{RL}} - \beta_{\text{RL}}\log d).$

Larger and better-trained reward models increase the safe optimization region, but overoptimization does not disappear (Gao et al., 2022).
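The turnover implied by these scaling laws can be made concrete with a short calculation. The sketch below assumes the functional forms reported by Gao et al. (2022), $d(\alpha - \beta d)$ for best-of-$n$ and $d(\alpha - \beta \log d)$ for RL, with $d$ the square root of the KL divergence from the initial policy. The function names are hypothetical, any coefficient values passed in are illustrative, and the expression $\log n - (n-1)/n$ is the standard analytic KL estimate for best-of-$n$, included as an assumption about how $n$ maps to $d$.

```python
import math


def kl_best_of_n(n):
    """Analytic KL estimate for best-of-n sampling: log n - (n - 1) / n."""
    return math.log(n) - (n - 1) / n


def gold_reward_bon(d, alpha, beta):
    """Best-of-n form: d * (alpha - beta * d)."""
    return d * (alpha - beta * d)


def gold_reward_rl(d, alpha, beta):
    """RL form: d * (alpha - beta * log d), defined for d > 0."""
    return d * (alpha - beta * math.log(d))


def peak_distance_bon(alpha, beta):
    """Maximizer of d * (alpha - beta * d): d = alpha / (2 * beta)."""
    return alpha / (2.0 * beta)


def peak_distance_rl(alpha, beta):
    """Maximizer of d * (alpha - beta * log d): d = exp(alpha / beta - 1)."""
    return math.exp(alpha / beta - 1.0)
```

Both peaks follow from setting the derivative with respect to $d$ to zero: $\alpha - 2\beta d = 0$ for best-of-$n$ and $\alpha - \beta \log d - \beta = 0$ for RL. Beyond the peak, further optimization against the proxy lowers gold reward, which is the overoptimization regime.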

RewardBench 2 provides a benchmark-level perspective on this problem. It reports that models score about 20 points lower on RewardBench 2 than on the first RewardBench, and that RewardBench 2 average correlates strongly with best-of-$N$ performance, with Pearson correlation […] across 113 reward models. The same paper finds that high RewardBench 2 accuracy is necessary but not sufficient for PPO success: once reward models are “good enough,” PPO performance saturates for on-policy, in-distribution RMs, while lineage mismatches and prompt-distribution mismatches can degrade RLHF even for reward models with high benchmark scores (Malik et al., 2 Jun 2025).

Classical deep RL work shows that even simple scalar reward rescaling materially affects optimization. ANS studies reward scaling in ReLU-based RL and argues that reward scaling is not equivalent to learning-rate tuning. It proposes Adaptive Network Scaling to search for a suitable reward scale and transfer the critic via layer-wise rescaling. On MuJoCo, A2C + ANS reaches […] on HalfCheetah-v2 compared with […] for vanilla A2C with ReLU, while DDPG + ANS reaches […] on HalfCheetah-v2 compared with […] for vanilla DDPG with ReLU (Wu et al., 2018).

These results correct a common misconception: better reward signals, larger reward models, or larger reward magnitudes do not simply monotonically improve optimization. They improve some regimes, alter stability conditions, and widen safe operating regions, but they also create new failure modes centered on Goodhart effects, calibration drift, reward hacking, and distribution mismatch (Gao et al., 2022, Malik et al., 2 Jun 2025, Wu et al., 2018).

6. Limitations and current research directions

Several limitations recur across the literature. Dynamic weighting methods often expose new hyperparameters, such as $\alpha$, $R_{\text{target}}$, $\lambda_{\min}$, and $\lambda_{\max}$ in DRTA's update rule, or […] in DyRef and Focal Reward; systematic sensitivity analyses are often limited (Golchin et al., 25 Aug 2025, Huang et al., 25 Jun 2026, Huang et al., 26 May 2026). Non-potential-based shaping, as in DRTA, does not preserve policy invariance guarantees (Golchin et al., 25 Aug 2025). Large reward models and test-time scaling mechanisms remain computationally expensive, as illustrated by RewardDance’s 26B reward models and CARINOX’s substantial runtime and VRAM costs (Wu et al., 10 Sep 2025, Kasaei et al., 22 Sep 2025). Rubric systems and personalized judges improve transparency and controllability, but they also introduce new questions about fairness, prototype bias, rubric misspecification, and the possibility of optimizing the surface form of criteria rather than the intended latent preference (Zhang et al., 12 Feb 2026, Jia et al., 15 Feb 2026, Liu et al., 9 Oct 2025).

An important new direction is reward modeling without explicit human supervision. “Scaling Reward Modeling without Human Supervision” defines RBS as preference learning over document prefixes and suffixes from raw web corpora and trains reward models with in-batch Bradley–Terry losses plus a centering regularizer. Training on 11M tokens of math-focused web data yields steady gains on RewardBench v1 and v2, with improvements of up to […] points on RewardBench v2 average, […] on in-domain math subsets, and consistent gains on out-of-domain safety and general subsets. When used for best-of-$N$ selection and GRPO-style policy optimization, these unsupervised reward models substantially improve downstream math performance and can match or exceed strong supervised baselines of similar size (Fan et al., 11 Feb 2026).
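An in-batch Bradley–Terry objective over prefix–suffix pairs can be sketched as follows. The sketch assumes a score matrix in which entry `scores[i][j]` is the reward assigned to suffix `j` given prefix `i`, treats the true continuation (`j == i`) as preferred over every other suffix in the batch, and uses the squared batch-mean score as the centering regularizer. The matrix layout, the form of the regularizer, and the weight `center_weight` are assumptions rather than details from the paper.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def in_batch_bt_loss(scores, center_weight=0.01):
    """Mean Bradley-Terry loss of true vs. in-batch suffixes, plus a centering penalty."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two prefix-suffix pairs per batch")
    pair_losses = []
    for i in range(n):
        for j in range(n):
            if j != i:
                # The true suffix i should outscore the mismatched suffix j.
                pair_losses.append(-log_sigmoid(scores[i][i] - scores[i][j]))
    bt_loss = sum(pair_losses) / len(pair_losses)
    mean_score = sum(sum(row) for row in scores) / (n * n)
    # Penalize drift of the average reward away from zero.
    return bt_loss + center_weight * mean_score ** 2
```

Bradley–Terry losses depend only on score differences, so without some centering term the absolute reward scale is unconstrained; a penalty on the batch mean is one simple way to pin it down.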

Taken together, the literature suggests that RBS has become a general design principle rather than a single technique. It now spans scalar reward tuning, adaptive shaping, contrast amplification, rubric-conditioned judging, reward-model scaling, user-conditioned aggregation, inference-time search, and unsupervised preference induction. The central technical question is no longer whether reward should be scaled, but which aspect of the reward pipeline—magnitude, composition, contrast, representation, context, or evaluation protocol—should be scaled for a given optimization regime and failure model (Golchin et al., 25 Aug 2025, Wu et al., 10 Sep 2025, Fan et al., 11 Feb 2026).