Model-Based Soft Rewards in Reinforcement Learning
- Model-based soft rewards are continuous, graded feedback signals generated via predictive or generative models to overcome sparse reward challenges.
- They improve learning stability and sample efficiency by incorporating process-sensitive metrics such as rubric-based evaluations, smoothing kernels, and hidden state projections.
- Applications span language modeling, vision-language tasks, and classic model-based RL, with demonstrated gains in accuracy and robustness across benchmarks.
Model-based soft rewards refer to a broad class of reinforcement learning (RL) methodologies in which the reward signal is derived from or enhanced by a model (predictive, generative, or built on internal representations), with the crucial property that the reward is continuous ("soft") rather than sparse or strictly binary. These approaches supply richer, denser, and more informative feedback, which improves learning stability, enhances sample efficiency, and enables alignment with nuanced objectives such as logical faithfulness, human preferences, or subtask completion. Model-based soft rewards span multiple research areas, including language modeling, vision-language models (VLMs), and classic model-based RL.
1. Core Principles and Definitions
A model-based soft reward is defined as a real-valued signal $r$ derived through a forward model (world model, generative model, or reward model), as opposed to direct, sparse environment signals or rule-based matchers. Key distinguishing aspects include:
- Continuity: The reward signal reflects a probability, degree of correctness or alignment, or a graded expectation (rather than a binary 0/1).
- Model dependence: The reward is adjudicated or synthesized by a model (e.g., LLM, VLM, reward network, Rubric judge), often leveraging internal structure, prediction, or meta-evaluation.
- Process and outcome sensitivity: Many soft reward systems assess not only outcomes (final correctness) but also the process (e.g., intermediate reasoning steps or adherence to rubrics).
Exemplar implementations include:
- Confidence scores from generative next-token distributions as the reward (Su et al., 31 Mar 2025, Gambashidze et al., 25 Mar 2025).
- Rubric-based criteria checked by LLM judges at reasoning checkpoints, yielding averaged fulfillment scores (Jia et al., 16 Oct 2025).
- Temporally smoothed predictive rewards in MBRL, computed by averaging over a causal or symmetric kernel (Lee et al., 2023).
- Continuous rewards from linear projections of model hidden states or logits (Guo et al., 18 May 2025).
2. Mathematical Formulations
Below are representative formulations of model-based soft rewards from several paradigms:
a) Generative Next-Token Probability
Given a verifier model $\pi_\phi$, a prompt $x$, and a response $y$, the soft reward is

$$ r_{\text{soft}}(x, y) = \pi_\phi(j = \text{"yes"} \mid x, y), $$

where $j$ is the sampled judgment token ("yes"/"no") (Su et al., 31 Mar 2025).
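A minimal sketch of this computation in Python, assuming the verifier exposes next-token logits over its vocabulary; `yes_id`/`no_id` are hypothetical indices of the judgment tokens:

```python
import numpy as np

def soft_reward_from_logits(logits: np.ndarray, yes_id: int, no_id: int) -> float:
    """Soft reward = probability the verifier assigns to the "yes" judgment token."""
    probs = np.exp(logits - logits.max())   # softmax over the vocabulary
    probs /= probs.sum()
    # A common variant renormalizes over the {"yes", "no"} pair instead:
    # return float(probs[yes_id] / (probs[yes_id] + probs[no_id]))
    return float(probs[yes_id])

# Toy usage: a 5-token vocabulary where ids 3 and 4 stand for "yes"/"no".
logits = np.array([0.1, -0.2, 0.0, 2.0, 1.0])
print(soft_reward_from_logits(logits, yes_id=3, no_id=4))  # ~0.57
```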
b) Rubric-based Average
Given a process-level rubric $\{c_1, \dots, c_K\}$ and binary judgments $s_k \in \{0, 1\}$ from an LLM judge, the rubric reward is the averaged fulfillment

$$ r_{\text{rubric}} = \frac{1}{K} \sum_{k=1}^{K} s_k, $$

and the final reward is a convex combination with the outcome reward:

$$ r = \alpha \, r_{\text{answer}} + (1 - \alpha) \, r_{\text{rubric}} $$

(Jia et al., 16 Oct 2025).
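A minimal sketch of the averaged fulfillment and its combination with the outcome reward; the mixing weight `alpha` is an assumed hyperparameter, not a value from the paper:

```python
def rubric_reward(judgments: list[int]) -> float:
    """Average fulfillment over binary per-criterion judgments from an LLM judge."""
    return sum(judgments) / len(judgments)

def combined_reward(answer_correct: bool, judgments: list[int], alpha: float = 0.5) -> float:
    """Convex combination of outcome (answer) reward and rubric (process) reward."""
    r_ans = 1.0 if answer_correct else 0.0
    return alpha * r_ans + (1.0 - alpha) * rubric_reward(judgments)

# Example: correct answer, 3 of 4 rubric criteria satisfied.
print(combined_reward(True, [1, 1, 1, 0]))  # 0.875
```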
c) Temporal Smoothing of Rewards
In MBRL, raw rewards $r_t$ are replaced by smoothed targets

$$ \tilde{r}_t = \sum_{i} f(i) \, r_{t+i}, $$

with $f$ a normalized smoothing kernel (e.g., Gaussian, uniform, EMA) applied symmetrically or causally. The reward model and policy are then trained on $\tilde{r}_t$ rather than $r_t$ (Lee et al., 2023).
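A minimal sketch of symmetric Gaussian smoothing over an episode's reward sequence; the kernel width `sigma` is an assumed hyperparameter, and DreamSmooth likewise supports uniform and EMA kernels:

```python
import numpy as np

def smooth_rewards(rewards: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    """Replace each reward r_t by a Gaussian-weighted average of its neighbors."""
    T = len(rewards)
    offsets = np.arange(-T + 1, T)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    smoothed = np.empty(T)
    for t in range(T):
        w = kernel[T - 1 - t : 2 * T - 1 - t]   # kernel value f(i - t) for each position i
        w = w / w.sum()                         # renormalize at episode boundaries
        smoothed[t] = np.dot(w, rewards)
    return smoothed

# Sparse episode: a single terminal reward gets spread over nearby steps.
r = np.zeros(20); r[-1] = 1.0
print(np.round(smooth_rewards(r), 3))
```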
d) Linear Hidden-State Rewards
Given a path's sequence of hidden states $h_1, \dots, h_T$, per-token gating values $g_t = \sigma(w_g^\top h_t + b_g)$ and projected rewards $\rho_t = w_r^\top h_t + b_r$ are computed, and the path-level soft reward is the gated aggregate

$$ R = \sum_{t=1}^{T} g_t \, \rho_t, $$

where $\sigma$ is the sigmoid (Guo et al., 18 May 2025).
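A minimal sketch of the gated linear aggregation, with randomly initialized heads standing in for the trained ELHSR projections (dimensions are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_hidden_state_reward(hidden_states, w_gate, b_gate, w_reward, b_reward):
    """Aggregate per-token linear rewards, weighted by a learned sigmoid gate."""
    gates = sigmoid(hidden_states @ w_gate + b_gate)        # shape (T,)
    token_rewards = hidden_states @ w_reward + b_reward     # shape (T,)
    return float(np.sum(gates * token_rewards))

# Illustrative shapes: 16 tokens, hidden size 32; in ELHSR the two heads are
# trained on binary correctness labels, here they are random placeholders.
rng = np.random.default_rng(0)
H = rng.normal(size=(16, 32))
print(linear_hidden_state_reward(H, rng.normal(size=32), 0.0, rng.normal(size=32), 0.0))
```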
e) Visual Preference Soft Reward
For chain-of-thought VLMs outputting logits $z_k$ for rating levels $k \in \{1, \dots, K\}$, the soft reward is the softmax-weighted expected rating

$$ r = \sum_{k=1}^{K} k \cdot \frac{\exp(z_k)}{\sum_{j} \exp(z_j)} $$

(Gambashidze et al., 25 Mar 2025).
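A minimal sketch mapping rating-level logits to the expected rating; the 1-5 scale is assumed for illustration:

```python
import numpy as np

def expected_rating(level_logits: np.ndarray, levels=(1, 2, 3, 4, 5)) -> float:
    """Soft reward = expected value of the rating under the softmax distribution."""
    z = level_logits - level_logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(np.dot(p, levels))

print(expected_rating(np.array([-1.0, 0.0, 0.5, 2.0, 1.0])))  # ~3.8
```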
3. Construction Methodologies
a) Self-Aggregation and Rubric Mining
AutoRubric-R1V constructs process-level rubrics by aggregating chain-of-thought rollouts that yield correct answers. Rubric criteria are identified as reasoning steps whose support frequency among the correct rollouts exceeds a threshold, then used for reward judging via LLM prompts. Problems with too few correct rollouts are discarded; GPT-OSS-20B is used as a frozen judge. Problem-specificity of rubrics is essential for effective process rewards (Jia et al., 16 Oct 2025).
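A minimal sketch of the frequency-based mining step, assuming reasoning steps have already been extracted as normalized strings per rollout; `min_support` is a hypothetical threshold standing in for the paper's frequency criterion:

```python
from collections import Counter

def mine_rubric(correct_rollouts: list[list[str]], min_support: float = 0.5) -> list[str]:
    """Keep reasoning steps appearing in at least `min_support` of the correct
    rollouts for this problem (rubrics are problem-specific)."""
    counts = Counter(step for rollout in correct_rollouts for step in set(rollout))
    n = len(correct_rollouts)
    return [step for step, c in counts.items() if c / n >= min_support]

rollouts = [
    ["identify the right triangle", "apply the Pythagorean theorem", "compute the square root"],
    ["identify the right triangle", "apply the Pythagorean theorem"],
    ["apply the Pythagorean theorem", "compute the square root"],
]
print(mine_rubric(rollouts))
```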
b) Reward Model Training
Soft reward models are typically trained on distilled signals from large off-the-shelf models adjudicating the outputs of smaller actors. For example, the RM-7B reward model in (Su et al., 31 Mar 2025) is finetuned via a binary cross-entropy loss using labels generated by a 72B verifier, relying on online generation for diversity and robustness of the reward model. No stepwise rationales are required; noisy labeling via teacher models is sufficient.
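A minimal sketch of the distillation setup, assuming a scalar-head reward model trained with binary cross-entropy on teacher-provided labels; `TinyRewardHead` and the feature pipeline are placeholders, not the RM-7B implementation:

```python
import torch
import torch.nn as nn

class TinyRewardHead(nn.Module):
    """Placeholder scalar reward head over pooled response features."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features).squeeze(-1)   # raw logit; sigmoid gives the soft reward

model = TinyRewardHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# One step on a toy batch: features come from the actor's sampled responses,
# labels are adjudicated by a larger teacher/verifier model (noisy but sufficient).
features = torch.randn(8, 128)
teacher_labels = torch.randint(0, 2, (8,)).float()
optimizer.zero_grad()
loss = bce(model(features), teacher_labels)
loss.backward()
optimizer.step()
print(float(loss))
```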
c) Smoothing Kernels in MBRL
In DreamSmooth, smoothing kernels such as Gaussian, Uniform, or EMA are selected by hyperparameters and applied symmetrically or causally (past-only) to raw trajectory rewards. Smoothed rewards then serve as targets for the learned reward model (Lee et al., 2023).
d) Reward from Hidden State or Logits
ELHSR introduces linear projections atop LLM hidden states (or logits) to score paths with minimal compute. Two local linear heads project flattened token embeddings to gating and reward logits, aggregated across the trajectory and trained on binary correctness (Guo et al., 18 May 2025).
e) Chain-of-Thought Soft Preference
VLMs are prompted with “Let me think step by step:” and generate intermediate tokens before producing a final rating, which is mapped into a probability or expected value over a rating scale or pairwise preference (Gambashidze et al., 25 Mar 2025).
4. Integration into Learning Algorithms
a) Policy Gradient and Advantage Estimation
Soft reward signals are used directly in policy gradients and their normalized variants for RL. In (Su et al., 31 Mar 2025), the soft reward is z-score normalized in each minibatch, and policy updates proceed via REINFORCE, REINFORCE++ (with baselines), or RLOO, optionally regularized by KL divergence to a reference policy.
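A minimal sketch of minibatch z-score normalization and the resulting REINFORCE-style loss; the log-probabilities, KL term, and coefficient `beta` are placeholders rather than the exact setup of (Su et al., 31 Mar 2025):

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, soft_rewards: torch.Tensor,
                   kl_to_ref: torch.Tensor, beta: float = 0.01) -> torch.Tensor:
    """REINFORCE with z-score normalized soft rewards and a KL penalty to a
    reference policy (beta is an assumed coefficient)."""
    adv = (soft_rewards - soft_rewards.mean()) / (soft_rewards.std() + 1e-8)
    return -(adv.detach() * logprobs).mean() + beta * kl_to_ref.mean()

# Toy batch of 4 sequences: summed token log-probs, scalar soft rewards, per-sample KL.
logprobs = torch.tensor([-12.3, -9.8, -15.1, -11.0], requires_grad=True)
loss = reinforce_loss(logprobs, torch.tensor([0.9, 0.4, 0.1, 0.7]), torch.zeros(4))
print(float(loss))
```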
b) Group Relative Policy Optimization (GRPO)
GRPO is leveraged in AutoRubric-R1V and VLM preference RL to stably integrate combined (answer + rubric, or traditional + soft) reward signals. Rewards for each rollout are normalized by the group mean and standard deviation, and the resulting clipped advantage forms the policy objective; a KL penalty to the reference policy stabilizes learning (Jia et al., 16 Oct 2025, Gambashidze et al., 25 Mar 2025).
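A minimal sketch of the group-relative advantage computation under assumed shapes (one group of rollouts per prompt); the clipped surrogate and KL penalty would be applied on top of these advantages as in standard GRPO:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each rollout's combined reward by its group's mean and std.
    `rewards` has shape (num_prompts, rollouts_per_prompt)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Combined answer+rubric rewards for 2 prompts with 4 rollouts each.
r = torch.tensor([[1.0, 0.75, 0.25, 0.0],
                  [0.5, 0.5, 1.0, 0.0]])
print(group_relative_advantages(r))
```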
c) Model-Based RL Backbones
In MBRL, smoothed rewards are inserted as regression targets into existing world models (e.g., DreamerV3), without altering planning or policy update logic. Actor-critic updates or planning with TD-MPC/MBPO can exploit “softened” reward predictions for more robust long-term planning (Lee et al., 2023).
d) Best-of-N Selection with Soft Reward Models
At inference, a set of candidate outputs is scored by the soft reward model (e.g., ELHSR on hidden states/logits), and the highest-scoring sample is selected as the output (Guo et al., 18 May 2025). This paradigm is efficient and suitable for both open- and closed-source LLMs.
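A minimal sketch of best-of-N selection; the scoring function is a placeholder standing in for a learned soft reward head such as ELHSR:

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest soft reward score."""
    return max(candidates, key=score)

# Toy usage with precomputed placeholder scores instead of a learned reward head.
scores = {"the answer is 42": 0.91, "forty-two?": 0.35, "it is 41": 0.12}
print(best_of_n(list(scores), scores.get))  # "the answer is 42"
```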
5. Empirical Results and Benchmarking
Key empirical findings across domains:
| Method/Domain | Main Metric | Baseline (Top Competitor) | Soft Reward Result | Gain |
|---|---|---|---|---|
| AutoRubric-R1V (MLLM) | Avg. accuracy | 47.29% (base), 54.06% (best baseline) | 54.81% | +7.52pp over base |
| AutoRubric-R1V (MLLM) | Faithfulness (inconsistency rate) | 21.8% (GRPO) | 12.6% | ~9pp lower inconsistency |
| DreamSmooth (MBRL) | Return, sample efficiency | DreamerV3 (raw rewards) | DreamerV3 + DreamSmooth | Faster/higher returns, esp. on sparse tasks; no loss elsewhere |
| RM-7B (LLM RLVR, multi-domain) | Accuracy | 57.2% (math, rule-based), 24.2% (multi-domain) | 62.3%, 30.3% | +5.1pp, +6.1pp; OOD robust |
| ELHSR (LLM reasoning, BoN@16) | Accuracy (MATH) | 48.4% (Skywork) | 54.6% | +6.2pp |
| VLM Soft Reward (ImageReward) | Mean@1 | 51.0% (zero-shot) | 64.9% | Matches single human annotator (65.1%) |
Soft reward methods generally achieve smoother reward landscapes and improved training stability, and they continue to scale with increased RL exposure, in contrast to rule-based rewards, which peak and then degrade (Jia et al., 16 Oct 2025, Lee et al., 2023, Su et al., 31 Mar 2025, Guo et al., 18 May 2025, Gambashidze et al., 25 Mar 2025).
6. Advantages, Limitations, and Pitfalls
Documented Advantages
- Reduction of spurious reasoning: Rubric and process-level rewards discourage "shortcut" solutions, yielding more faithful intermediate steps (Jia et al., 16 Oct 2025).
- Improved sample efficiency and asymptotic performance: Reward smoothing unlocks learning in extremely sparse reward regimes and does not harm saturated domains (Lee et al., 2023).
- Fine-grained, nuanced feedback: Soft signals support policy improvements along difficult axes (partially correct, nearly aligned, etc.) (Su et al., 31 Mar 2025, Gambashidze et al., 25 Mar 2025).
- Computational efficiency: ELHSR provides state-of-the-art soft rewards with sub-millisecond overhead on CPU, vastly lighter than traditional reward models (Guo et al., 18 May 2025).
- Robustness and generalization: Model-based rewards are less brittle to noisy/ambiguous structure in free-form domains and continue to improve with scale (Su et al., 31 Mar 2025).
Noted Limitations
- Vulnerability to flawed consensus: Self-aggregation for rubrics may reinforce common but incorrect reasoning styles (Jia et al., 16 Oct 2025).
- Model/judge reliability: Systematic errors in the reward model or LLM judge directly translate into noisy or biased reward signals.
- Inference/training overhead: Model-based judgement and reasoning (e.g., chain-of-thought rollouts, rubric checks) increase runtime relative to rule-based binary checks (Jia et al., 16 Oct 2025, Gambashidze et al., 25 Mar 2025).
- Potential for leakage: Symmetric reward smoothing can "leak" future signals into the present (violating temporal credit assignment), especially in environments like Crafter (Lee et al., 2023).
- Domain specificity: Rubrics and reward models mined in one domain or dataset are often not transferable; process-specific constructions must be rebuilt for new data distributions (Jia et al., 16 Oct 2025, Su et al., 31 Mar 2025).
Potential failure modes include collapsed (overly generic) rubrics and weak underlying model representations that yield noisy or uninformative soft rewards.
7. Implications and Future Directions
Model-based soft rewards represent a paradigm shift for RL in process-oriented, long-horizon, and ambiguous domains. They facilitate:
- Enhanced process supervision: Direct enforcement of logically coherent multi-step reasoning.
- Unified reward modeling: Applicability to both structured and free-form, language and vision, task and subtask settings.
- Efficiency and privacy: Tiny in-model reward heads (e.g., ELHSR on logits) enable on-device reward computations for privacy and speed.
- Generality in reward shaping: Temporal smoothing, rubric-guided checkpoints, and internal-state rewards provide flexible, extensible templates for soft reward construction.
Open questions remain regarding the best integration of stepwise/process-level signals, learning dynamic or adaptive reward kernels, extending to richer verdict spaces (beyond binary or scalar), and leveraging soft rewards for exploration or in human-in-the-loop RL systems. Nevertheless, model-based soft reward frameworks underpin state-of-the-art advances in reasoning faithfulness, preference alignment, and efficient learning across complex real-world tasks.