
Model-Based Soft Rewards in Reinforcement Learning

Updated 13 November 2025
  • Model-based soft rewards are continuous, graded feedback signals generated via predictive or generative models to overcome sparse reward challenges.
  • They improve learning stability and sample efficiency by incorporating process-sensitive metrics such as rubric-based evaluations, smoothing kernels, and hidden state projections.
  • Applications span language modeling, vision-language tasks, and classic model-based RL, with demonstrated gains in accuracy and robustness across benchmarks.

Model-based soft rewards refer to a broad class of reinforcement learning (RL) methodologies in which the reward signal is derived from or enhanced by models (predictive, generative, or built on internal representations), with the crucial property that the reward is continuous ("soft") rather than sparse or strictly binary. These approaches supply richer, denser, and more informative feedback, which improves learning stability, enhances sample efficiency, and enables alignment with nuanced objectives such as logical faithfulness, human preferences, or subtasks. Model-based soft rewards span multiple research areas, including language modeling, vision-language models, and classic model-based RL.

1. Core Principles and Definitions

A model-based soft reward is defined as a real-valued signal $r(s,a,s') \in \mathbb{R}$, derived through a forward model (world model, generative model, or reward model) as opposed to direct, sparse environment signals or rule-based matchers. Key distinguishing aspects include:

  • Continuity: The reward signal reflects a probability, degree of correctness or alignment, or a graded expectation (rather than a binary 0/1).
  • Model dependence: The reward is adjudicated or synthesized by a model (e.g., LLM, VLM, reward network, Rubric judge), often leveraging internal structure, prediction, or meta-evaluation.
  • Process and outcome sensitivity: Many soft reward systems assess not only outcomes (final correctness) but also the process (e.g., intermediate reasoning steps or adherence to rubrics).

Exemplar implementations include generative verifier rewards, rubric-based process rewards, temporally smoothed rewards in model-based RL, lightweight hidden-state reward heads, and soft visual preference rewards; these are formalized and compared in the sections that follow.

2. Mathematical Formulations

Below are representative formulations of model-based soft rewards from several paradigms:

a) Generative Next-Token Probability

Given a verifier model $\pi_\phi$ and a response $y$, the soft reward is

$$r_{\phi}(x, a, y) = \begin{cases} \pi_\phi(1 \mid x, a, y^T), & c = 1 \\ 1-\pi_\phi(0 \mid x, a, y^T), & c = 0 \\ 0, & \text{otherwise} \end{cases}$$

where $c$ is the sampled judgment token ("yes"/"no") (Su et al., 31 Mar 2025).
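
The following minimal sketch, assuming a causal-LM verifier that exposes its logits at the judgment position and hypothetical `yes_id`/`no_id` token ids, shows how the sampled judgment token maps onto the case structure above; it is an illustration, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def generative_soft_reward(judgment_logits: torch.Tensor,
                           yes_id: int, no_id: int) -> torch.Tensor:
    """Soft reward from a generative verifier's next-token distribution.

    judgment_logits: [vocab_size] logits at the position where the verifier
    emits its judgment token (prompt format and token ids are assumptions).
    """
    probs = F.softmax(judgment_logits, dim=-1)
    c = torch.multinomial(probs, num_samples=1).item()  # sampled judgment token
    if c == yes_id:                                     # c = 1 ("yes")
        return probs[yes_id]
    if c == no_id:                                      # c = 0 ("no")
        return 1.0 - probs[no_id]
    return torch.tensor(0.0)                            # any other token -> reward 0
```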

b) Rubric-based Average

Given a process-level rubric $C^x = \{c_1,\dots,c_m\}$ and judgments $p_j = P_\text{judge}(\tau \vDash c_j)$,

$$r^{\text{rubric}}(\tau) = \frac{1}{|C^x|}\sum_{j=1}^{|C^x|} p_j$$

with the final reward given by the convex combination $r(\tau) = \lambda\, r^{\text{ans}}(\tau) + (1-\lambda)\, r^{\text{rubric}}(\tau)$ (Jia et al., 16 Oct 2025).
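
A minimal sketch of the rubric average and the convex combination; the judge probabilities and the value of $\lambda$ below are illustrative.

```python
from typing import List

def rubric_reward(p_satisfied: List[float]) -> float:
    """Average judge probability that the trajectory satisfies each rubric criterion."""
    return sum(p_satisfied) / len(p_satisfied)

def combined_reward(r_ans: float, p_satisfied: List[float], lam: float = 0.5) -> float:
    """Convex combination of answer-level and rubric (process-level) rewards."""
    return lam * r_ans + (1.0 - lam) * rubric_reward(p_satisfied)

# Example: correct final answer, 3 of 4 rubric criteria judged likely satisfied.
print(combined_reward(r_ans=1.0, p_satisfied=[0.9, 0.8, 0.7, 0.2], lam=0.5))  # 0.825
```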

c) Temporal Smoothing of Rewards

In MBRL, raw rewards $r_t$ are replaced by

$$\tilde r_t = \sum_{i=-L}^{L} f_i\, r_{\mathrm{clip}(t+i,\,1,\,T)}$$

with $\sum_i f_i = 1$ and $f$ a smoothing kernel (e.g., Gaussian, uniform, EMA). The reward model and policy are then trained on $\tilde r_t$ rather than $r_t$ (Lee et al., 2023).
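
A minimal sketch of the smoothing step, assuming a precomputed normalized kernel of length $2L+1$; index clipping mirrors the formula above.

```python
import numpy as np

def smooth_rewards(rewards: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Replace each r_t by a kernel-weighted sum of neighbouring rewards.

    kernel has length 2L+1 and sums to 1; indices outside the episode are
    clipped to its boundaries.
    """
    T = len(rewards)
    L = len(kernel) // 2
    smoothed = np.zeros_like(rewards, dtype=float)
    for t in range(T):
        idx = np.clip(np.arange(t - L, t + L + 1), 0, T - 1)
        smoothed[t] = np.dot(kernel, rewards[idx])
    return smoothed

# Example: a sparse terminal reward spread over neighbouring steps by a Gaussian kernel.
L, sigma = 3, 1.5
offsets = np.arange(-L, L + 1)
gauss = np.exp(-offsets**2 / (2 * sigma**2)); gauss /= gauss.sum()
r = np.zeros(10); r[-1] = 1.0
print(smooth_rewards(r, gauss).round(3))
```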

d) Linear Hidden-State Rewards

Given a path's sequence of hidden states $\{h_t\}_{t=1}^T$, with gating $g_t$ and projected reward $r_t$:

$$R(h_{1:T}) = \frac{\sum_{t=1}^T g_t\, r_t}{\max\left(\sum_{t=1}^T g_t,\ \epsilon\right)}$$

with $g_t = \sigma(\tilde g_t)$, where $\sigma$ is the sigmoid (Guo et al., 18 May 2025).
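
A minimal sketch of the gated aggregation, assuming per-step gate logits and rewards have already been produced by (hypothetical) linear heads over the hidden states.

```python
import numpy as np

def gated_path_score(gate_logits: np.ndarray, step_rewards: np.ndarray,
                     eps: float = 1e-6) -> float:
    """Gated average of per-step rewards, as in the formula above.

    gate_logits, step_rewards: [T] arrays; only the aggregation is shown here.
    """
    gates = 1.0 / (1.0 + np.exp(-gate_logits))  # sigma(g~_t)
    return float(np.sum(gates * step_rewards) / max(np.sum(gates), eps))
```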

e) Visual Preference Soft Reward

For chain-of-thought VLMs outputting logits $f(r_i \mid C)$ for rating levels $r_i$:

$$R^{\text{single}}_{\text{soft}}(I \mid P) = \sum_{r_i} r_i\, p(r_i \mid C), \qquad p(r_i \mid C) = \mathrm{softmax}\big(f(r_i \mid C)\big)$$

(Gambashidze et al., 25 Mar 2025).
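
A minimal sketch of the expected-rating computation; the rating levels and logits below are illustrative, and extracting per-level logits from a real VLM is model-specific.

```python
import numpy as np

def expected_rating_reward(rating_logits: np.ndarray,
                           rating_levels: np.ndarray) -> float:
    """Soft reward as the expected rating under the softmax over rating-level tokens."""
    z = rating_logits - rating_logits.max()      # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return float(np.dot(rating_levels, probs))

# Example: 5-point scale with most probability mass on ratings 4 and 5.
print(expected_rating_reward(np.array([-2.0, -1.0, 0.0, 2.0, 2.5]), np.arange(1, 6)))
```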

3. Construction Methodologies

a) Self-Aggregation and Rubric Mining

AutoRubric-R1V constructs process-level rubrics by aggregating $K$ chain-of-thought rollouts that yield correct answers. Rubric criteria are identified as steps with support frequency $f(c) \geq \theta$ among correct trajectories $\tau$, then used for reward judging via LLM prompts. Problems with fewer than 3 correct rollouts are discarded; GPT-OSS-20B is used as a frozen judge. Problem-specificity of rubrics is essential for effective process rewards (Jia et al., 16 Oct 2025).
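
A minimal sketch of the support-frequency filter, assuming rollout steps have already been canonicalized into comparable strings; the actual rubric extraction relies on LLM prompting, which is abstracted away here.

```python
from collections import Counter
from typing import List

def mine_rubric(correct_rollout_steps: List[List[str]],
                theta: float = 0.5, min_correct: int = 3) -> List[str]:
    """Keep steps whose support frequency among correct rollouts is >= theta;
    discard the problem if it has too few correct rollouts."""
    if len(correct_rollout_steps) < min_correct:
        return []  # problem discarded
    counts = Counter(step for rollout in correct_rollout_steps for step in set(rollout))
    k = len(correct_rollout_steps)
    return [step for step, n in counts.items() if n / k >= theta]

# Example: steps supported by at least half of the correct rollouts survive.
rollouts = [["isolate x", "check units"], ["isolate x"],
            ["isolate x", "plug back"], ["check units", "isolate x"]]
print(mine_rubric(rollouts, theta=0.5))
```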

b) Reward Model Training

Soft reward models are typically trained on signals distilled from large off-the-shelf models that adjudicate the outputs of smaller actors. For example, the RM-7B reward model in (Su et al., 31 Mar 2025) is finetuned via a binary cross-entropy loss using labels generated by a 72B verifier, relying on online generation for diversity and robustness of the reward model. No stepwise rationales are required; noisy labeling via teacher models is sufficient.
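
A minimal sketch of the training objective, assuming the reward model's probability of the "yes" judgment token has already been computed; the model, prompt format, and training loop are hypothetical.

```python
import torch
import torch.nn.functional as F

def verifier_bce_loss(p_yes: torch.Tensor, teacher_label: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between the reward model's probability of the 'yes'
    judgment token and a (possibly noisy) 0/1 label from a larger teacher verifier.
    Shapes: [batch]."""
    return F.binary_cross_entropy(p_yes, teacher_label.float())

# Assumed usage inside a training loop (reward_model and optimizer are hypothetical):
# p_yes = reward_model(prompt, answer)          # probability of the "yes" token
# loss = verifier_bce_loss(p_yes, teacher_label)
# loss.backward(); optimizer.step()
```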

c) Smoothing Kernels in MBRL

In DreamSmooth, smoothing kernels such as Gaussian, uniform, or EMA are selected via hyperparameters $(\sigma, \delta, \alpha)$ and applied symmetrically or causally (past-only) to raw trajectory rewards. The smoothed rewards then serve as targets for the learned reward model (Lee et al., 2023).
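
A minimal sketch of kernel construction under assumed parameterizations (sigma for Gaussian width, alpha for EMA decay); the paper's exact settings and normalization may differ.

```python
import numpy as np

def make_kernel(kind: str, L: int, sigma: float = 1.5, alpha: float = 0.5,
                causal: bool = False) -> np.ndarray:
    """Build a normalized smoothing kernel over offsets [-L, L]."""
    offsets = np.arange(-L, L + 1, dtype=float)
    if kind == "gaussian":
        k = np.exp(-offsets**2 / (2 * sigma**2))
    elif kind == "uniform":
        k = np.ones_like(offsets)
    elif kind == "ema":
        k = alpha ** np.abs(offsets)
    else:
        raise ValueError(kind)
    if causal:                     # past-only smoothing: zero weight on future rewards
        k[offsets > 0] = 0.0
    return k / k.sum()

print(make_kernel("gaussian", L=3).round(3))
```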

d) Reward from Hidden State or Logits

ELHSR introduces linear projections atop LLM hidden states (or logits) to score paths with minimal compute. Two local linear heads project flattened token embeddings to gating and reward logits, aggregated across the trajectory and trained on binary correctness (Guo et al., 18 May 2025).
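
A minimal sketch of the two-head design in PyTorch, not the exact ELHSR architecture; the aggregated score is treated as a logit for BCE training on binary correctness.

```python
import torch
import torch.nn as nn

class LinearHiddenStateReward(nn.Module):
    """Two tiny linear heads over per-token hidden states: one produces gating
    logits, the other per-token rewards; they are aggregated as in Section 2(d)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate_head = nn.Linear(hidden_dim, 1)
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # hidden_states: [T, hidden_dim] for one reasoning path
        gates = torch.sigmoid(self.gate_head(hidden_states)).squeeze(-1)  # [T]
        rewards = self.reward_head(hidden_states).squeeze(-1)             # [T]
        return (gates * rewards).sum() / torch.clamp(gates.sum(), min=eps)

# Assumed training step on binary correctness labels:
# score = model(hidden_states)   # scalar path score, treated as a logit
# loss = torch.nn.functional.binary_cross_entropy_with_logits(score, correct_label)
```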

e) Chain-of-Thought Soft Preference

VLMs are prompted with “Let me think step by step:” and generate intermediate tokens before producing a final rating, which is mapped into a probability or expected value over a rating scale or pairwise preference (Gambashidze et al., 25 Mar 2025).

4. Integration into Learning Algorithms

a) Policy Gradient and Advantage Estimation

Soft reward signals are used directly in policy gradients and their normalized variants for RL. In (Su et al., 31 Mar 2025), the soft reward is z-score normalized in each minibatch, and policy updates proceed via REINFORCE, REINFORCE++ (with baselines), or RLOO, optionally regularized by KL divergence to a reference policy.
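
A minimal sketch of minibatch z-score normalization with a REINFORCE-style update and a crude sequence-level KL surrogate; the coefficient and estimator choices are illustrative, not the paper's exact recipe.

```python
import torch

def normalized_reinforce_loss(logprobs: torch.Tensor, soft_rewards: torch.Tensor,
                              logprobs_ref: torch.Tensor,
                              kl_coef: float = 0.05) -> torch.Tensor:
    """REINFORCE loss with per-minibatch z-scored soft rewards and a KL-style
    penalty toward a reference policy. All tensors have shape [batch]."""
    adv = (soft_rewards - soft_rewards.mean()) / (soft_rewards.std() + 1e-8)
    pg_loss = -(adv.detach() * logprobs).mean()
    kl_penalty = (logprobs - logprobs_ref.detach()).mean()  # crude sequence-level surrogate
    return pg_loss + kl_coef * kl_penalty
```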

b) Group Relative Policy Optimization (GRPO)

GRPO is leveraged in AutoRubric-R1V and VLM preference RL to stably integrate combined (answer+rubric, or traditional+soft) reward signals. Rewards for each rollout are normalized by the group's mean and standard deviation, the clipped advantage forms the policy objective, and a KL penalty to a reference policy stabilizes learning (Jia et al., 16 Oct 2025, Gambashidze et al., 25 Mar 2025).
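
A minimal sketch of the group-relative normalization step; the clipped surrogate objective and KL penalty that consume these advantages are omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO-style advantages: normalize each rollout's combined reward by the
    mean and std of its own group (rollouts for the same prompt).

    rewards: [num_prompts, group_size] combined answer+rubric (or soft) rewards.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts x 4 rollouts
r = torch.tensor([[1.0, 0.8, 0.2, 0.0],
                  [0.9, 0.9, 0.9, 0.1]])
print(group_relative_advantages(r))
```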

c) Model-Based RL Backbones

In MBRL, smoothed rewards are inserted as regression targets into existing world models (e.g., DreamerV3), without altering planning or policy update logic. Actor-critic updates or planning with TD-MPC/MBPO can exploit “softened” reward predictions for more robust long-term planning (Lee et al., 2023).

d) Best-of-N Selection with Soft Reward Models

At inference, $N$ candidate outputs are scored by the soft reward model (e.g., ELHSR on hidden states/logits), and the highest-scoring sample is selected as the output (Guo et al., 18 May 2025). This paradigm is efficient and suitable for both open- and closed-source LLMs.
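
A minimal sketch of best-of-N selection, where `score_fn` stands in for any soft reward scorer (e.g., a hidden-state/logit head); the scorer name in the usage comment is hypothetical.

```python
from typing import Callable, List, Tuple

def best_of_n(candidates: List[str],
              score_fn: Callable[[str], float]) -> Tuple[str, float]:
    """Score N candidate outputs with a soft reward model and return the best one."""
    scored = [(c, score_fn(c)) for c in candidates]
    return max(scored, key=lambda cs: cs[1])

# Assumed usage:
# best, score = best_of_n(samples, score_fn=my_reward_model)  # my_reward_model is hypothetical
```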

5. Empirical Results and Benchmarking

Key empirical findings across domains:

| Method / Domain | Main Metric | Baseline (Top Competitor) | Soft Reward Result | Gain |
|---|---|---|---|---|
| AutoRubric-R1V (MLLM) | Avg. accuracy | 47.29% (base), 54.06% | 54.81% | +7.52pp over base |
| AutoRubric-R1V (MLLM) | Faithfulness (inconsistency) | 21.8% (GRPO) | 12.6% | ~9pp lower inconsistency |
| DreamSmooth (MBRL) | Return, sample efficiency | DreamerV3 | DreamSmooth | Faster/higher returns, esp. on sparse tasks; no loss elsewhere |
| RM-7B (LLM RLVR, multi-domain) | Accuracy | 57.2% (math RB), 24.2% (multi) | 62.3%, 30.3% | +5.1pp, +6.1pp; OOD robust |
| ELHSR (LLM reasoning, BoN@16) | Accuracy | 48.4% (Skywork) | 54.6% (MATH) | +6.2pp |
| VLM Soft Reward (ImageReward) | Mean@1 | 51.0% (zero-shot) | 64.9% | Matches single human annotator (65.1%) |

Soft reward methods generally achieve smoother reward landscapes and improved training stability, and they continue to scale with increased RL exposure, in contrast with rule-based rewards, which tend to peak and then degrade (Jia et al., 16 Oct 2025, Lee et al., 2023, Su et al., 31 Mar 2025, Guo et al., 18 May 2025, Gambashidze et al., 25 Mar 2025).

6. Advantages, Limitations, and Pitfalls

Documented Advantages

  • Reduction of spurious reasoning: Rubric and process-level rewards discourage "shortcut" solutions, yielding more faithful intermediate steps (Jia et al., 16 Oct 2025).
  • Improved sample efficiency and asymptotic performance: Reward smoothing unlocks learning in extremely sparse reward regimes and does not harm saturated domains (Lee et al., 2023).
  • Fine-grained, nuanced feedback: Soft signals support policy improvements along difficult axes (partially correct, nearly aligned, etc.) (Su et al., 31 Mar 2025, Gambashidze et al., 25 Mar 2025).
  • Computational efficiency: ELHSR provides state-of-the-art soft rewards using sub-millisecond overhead on CPU, vastly lighter than traditional reward models (Guo et al., 18 May 2025).
  • Robustness and generalization: Model-based rewards are less brittle to noisy/ambiguous structure in free-form domains and continue to improve with scale (Su et al., 31 Mar 2025).

Noted Limitations

  • Vulnerability to flawed consensus: Self-aggregation for rubrics may reinforce common but incorrect reasoning styles (Jia et al., 16 Oct 2025).
  • Model/judge reliability: Systematic errors in the reward model or LLM judge directly translate into noisy or biased reward signals.
  • Inference/training overhead: Model-based judgement and reasoning (e.g., chain-of-thought rollouts, rubric checks) increase runtime relative to rule-based binary checks (Jia et al., 16 Oct 2025, Gambashidze et al., 25 Mar 2025).
  • Potential for leakage: Symmetric reward smoothing can "leak" future signals into the present (violating temporal credit assignment), especially in environments like Crafter (Lee et al., 2023).
  • Domain specificity: Rubrics and reward models mined in one domain or dataset are often not transferable; process-specific constructions must be rebuilt for new data distributions (Jia et al., 16 Oct 2025, Su et al., 31 Mar 2025).

Potential failure modes include collapsed (overly generic) rubrics, and weak underlying model representations yielding noisy or uninformative soft rewards.

7. Implications and Future Directions

Model-based soft rewards comprise a paradigm shift for RL in process-oriented, long-horizon, and ambiguous domains. They facilitate:

  • Enhanced process supervision: Direct enforcement of logically coherent multi-step reasoning.
  • Unified reward modeling: Applicability to both structured and free-form, language and vision, task and subtask settings.
  • Efficiency and privacy: Tiny in-model reward heads (e.g., ELHSR on logits) enable on-device reward computations for privacy and speed.
  • Generality in reward shaping: Temporal smoothing, rubric-guided checkpoints, and internal-state rewards provide flexible, extensible templates for soft reward construction.

Open questions remain regarding the best integration of stepwise/process-level signals, learning dynamic or adaptive reward kernels, extending to richer verdict spaces (beyond binary or scalar), and leveraging soft rewards for exploration or in human-in-the-loop RL systems. Nevertheless, model-based soft reward frameworks underpin state-of-the-art advances in reasoning faithfulness, preference alignment, and efficient learning across complex real-world tasks.
