
RLERR: Reinforcement Learning with Reflection Rewards

Updated 26 January 2026
  • The paper introduces RLERR, a framework that assigns rewards to internal reflective actions at token, segment, and trajectory levels to enhance self-correction.
  • It employs dual reward mechanisms that balance outcome accuracy with reflection quality, significantly improving performance in both unimodal and multimodal settings.
  • Empirical evaluations show that integrating reflection rewards stabilizes policy optimization and overcomes sparse reward challenges in complex decision-making tasks.

Reinforcement Learning with Effective Reflection Rewards (RLERR) encompasses a suite of methodologies in reinforcement learning (RL) for language and reasoning models that explicitly incorporate reflection-oriented reward signals into training. Unlike standard RL approaches that focus only on final answer correctness or environment-defined objectives, RLERR formalizes and internalizes self-reflection—processes where the agent analyzes, critiques, or revises its own reasoning—by providing direct credit assignment for effective reflective behavior. Contemporary RLERR frameworks address both unimodal and multimodal settings, scale from hundreds of millions to tens of billions of parameters, and have demonstrated state-of-the-art gains on complex language, reasoning, and decision-making benchmarks.

1. Problem Setting and Motivation

In conventional RL for LLMs and large reasoning models (LRMs), reward signals are sparse and often limited to final output correctness, e.g., matching a ground truth answer in question answering or succeeding in an API call. However, such sparse rewards do not guide the intermediate reasoning steps, nor do they differentiate between superficial and meaningful self-reflection. Several critical failure patterns have been identified in RL-based retrieval-augmented generation (RAG) and general reasoning models: (i) information insufficiency (retrieval fails to provide needed support), (ii) faulty logic (errors despite sufficient knowledge), and (iii) answer-reasoning inconsistency (a valid reasoning chain does not yield a correct answer) (He et al., 30 Jul 2025).

RLERR was introduced to tackle these issues by incentivizing trajectories that not only converge to the right answer, but also demonstrate high-quality introspection and self-correction, either via explicit reflection tokens or differentiated reward assignments to reflection-related behaviors (He et al., 30 Jul 2025, Bensal et al., 30 May 2025, Wang et al., 19 Jan 2026, Wan et al., 2 Jun 2025, Deng et al., 26 May 2025, Min et al., 2024).

2. Core Reward Formulations

Reflection rewards in RLERR methods are constructed at multiple granularities, including token-level, segment-level, and trajectory-level scoring. The general principle is to assign reward credit not only for final success but also for internal reflective steps that demonstrably lead to improved outcomes or higher-quality reasoning.

2.1 Explicit Binary Reflection Reward

In RLERR’s original form for self-improving LLMs (Bensal et al., 30 May 2025), the basic episode structure is two-stage:

  • The model first attempts the task, and a binary validator checks the output.
  • On error, the model generates a natural-language reflection, then retries.
  • Only the tokens in the reflection segment are rewarded, and only if the retry succeeds:

$R_t = \begin{cases} 1 & \text{if the retry succeeds and } t \in \text{reflection} \\ 0 & \text{otherwise} \end{cases}$

This targeted reward for successful self-reflection accelerates the emergence of concise, high-leverage introspection policies, especially for tasks amenable to binary validation, such as function calling or symbolic math.
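As a concrete illustration, the binary scheme above can be sketched as a per-token reward mask. The `"reflection"`/`"other"` token tagging is a hypothetical preprocessing step for this sketch, not the paper's exact implementation:

```python
def reflection_token_rewards(token_tags, retry_succeeded):
    """Per-token reward under the binary reflection scheme: a token earns
    reward 1 only if it lies in the reflection segment AND the retry passes
    the binary validator; every other token gets 0.

    token_tags: list of str, each "reflection" or "other" (hypothetical tags).
    retry_succeeded: bool, the validator's verdict on the second attempt.
    """
    return [1.0 if (retry_succeeded and tag == "reflection") else 0.0
            for tag in token_tags]
```

Because reward is withheld whenever the retry fails, the policy only receives credit for reflections that demonstrably fix the error.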

2.2 Scalar and Multi-Dimensional Reflection Quality

In advanced RLERR, reflection quality is measured by external reward models that judge:

  • Truthfulness,
  • Constructiveness,
  • Specificity,
  • Substantiveness,
  • Informational gain.

For instance, (Wang et al., 19 Jan 2026) defines the total reward as

$R_{\text{total}}(q, \tau) = \lambda\,R_{\text{outcome}}(q, \tau) + (1-\lambda)\,R_{\text{reflect}}(\tau),$

where $R_{\text{reflect}}(\tau)$ is a scalar in $[0, 10]$ (sourced from a frozen reward model), and $R_{\text{outcome}}$ is the correctness indicator. This dual-objective reward drives both answer accuracy and high-quality reflection.
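A minimal sketch of this dual-objective combination follows. The normalization of the reward-model score from $[0, 10]$ to $[0, 1]$, and the default $\lambda = 0.5$, are illustrative assumptions so the two terms are on a comparable scale:

```python
def total_reward(outcome_correct, reflect_score, lam=0.5):
    """Dual-objective reward: lam * R_outcome + (1 - lam) * R_reflect.

    outcome_correct: bool, correctness indicator R_outcome in {0, 1}.
    reflect_score: float in [0, 10] from a frozen reward model; normalized
                   to [0, 1] here (an assumption, not from the paper).
    lam: mixing weight lambda (0.5 is an illustrative default).
    """
    r_outcome = 1.0 if outcome_correct else 0.0
    r_reflect = reflect_score / 10.0
    return lam * r_outcome + (1.0 - lam) * r_reflect
```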

In retrieval-augmented LLMs (He et al., 30 Jul 2025), the reflection reward is:

$R^R_i = \begin{cases} +1.0 & \text{if reflection corrects an initial error (wrong } a_1\text{, right } a_2\text{)} \\ -1.0 & \text{if reflection degrades a correct answer (right } a_1\text{, wrong } a_2\text{)} \\ 0.0 & \text{otherwise} \end{cases}$

This reward encourages corrective reflection and penalizes harmful over-reflection.
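The three-way case analysis maps directly to code; a sketch over the two answers' correctness flags:

```python
def retrieval_reflection_reward(a1_correct, a2_correct):
    """Reflection reward R^R_i over the pre- and post-reflection answers:
    +1 when reflection turns a wrong first answer into a right second one,
    -1 when it degrades a right first answer, 0 otherwise."""
    if not a1_correct and a2_correct:
        return 1.0   # corrective reflection
    if a1_correct and not a2_correct:
        return -1.0  # harmful over-reflection
    return 0.0       # no change in correctness
```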

2.3 Reflection Density and Brevity

In efficiency-oriented RLERR (e.g., REA-RL (Deng et al., 26 May 2025)), reflection reward measures the density of reflective markers in the output. A low reflection density (relative to a nominal quantile from the training distribution) incurs negative reward, penalizing "barebones" completions but avoiding reward for excessive verbosity.
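A simple density check of this kind might look as follows. The marker list, threshold, and penalty magnitude are illustrative assumptions; REA-RL derives its threshold from a nominal quantile of the training distribution rather than a fixed constant:

```python
def reflection_density_penalty(text,
                               markers=("wait", "however", "let me check"),
                               min_density=0.01, penalty=-0.5):
    """Negative reward for completions whose density of reflective markers
    falls below a nominal threshold, penalizing "barebones" outputs while
    granting no bonus for excessive verbosity."""
    low = text.lower()
    hits = sum(low.count(m) for m in markers)          # marker occurrences
    density = hits / max(len(low.split()), 1)          # per output token
    return penalty if density < min_density else 0.0
```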

3. RL Objectives and Optimization

All RLERR variants optimize a policy $\pi_\theta$ to maximize the expected sum of total rewards, with reflection rewards incorporated as central terms. The standard optimization procedure in recent works is group-normalized or group-relative policy optimization (GRPO), a variant of PPO:

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \min\big(r_t(\theta)\,A',\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,A'\big) \right]$

where $A'$ is a group-normalized advantage reflecting the auxiliary (reflection) and main (outcome) rewards.
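The two ingredients of this objective, group normalization of rewards and the clipped surrogate, can be sketched as follows (a minimal illustration of the standard GRPO/PPO machinery, not any paper's exact implementation):

```python
import math

def group_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization over a group of sampled trajectories:
    A'_i = (R_i - mean(R)) / (std(R) + eps), where R is the total reward,
    i.e., outcome plus reflection terms."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate_term(ratio, advantage, eps_clip=0.2):
    """One token's clipped surrogate: min(r*A', clip(r, 1-e, 1+e)*A')."""
    clipped = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    return min(ratio * advantage, clipped * advantage)
```

Normalizing within the sampled group removes the need for a learned value baseline, which is why GRPO is the dominant choice in these works.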

Auxiliary rewards, typically including reflection- and reasoning-quality signals, are often annealed away late in training so that the final policy is selected for main-task metrics. Difficulty-aware weighting and group-based normalization are employed to stabilize credit assignment, prioritize hard examples, and penalize lack of consistency between reasoning quality and correct final answers (He et al., 30 Jul 2025).
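A schedule for annealing the auxiliary reward weight away late in training might look like the sketch below; the linear shape and the 0.7 start fraction are illustrative assumptions, not taken from the papers:

```python
def auxiliary_reward_weight(step, total_steps, anneal_start_frac=0.7):
    """Weight on auxiliary (reflection/reasoning-quality) rewards: held at 1
    for most of training, then linearly annealed to 0 so the final policy is
    selected purely on main-task metrics."""
    start = anneal_start_frac * total_steps
    if step < start:
        return 1.0
    return max(0.0, 1.0 - (step - start) / (total_steps - start))
```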

4. Representative Architectures and Implementation Paradigms

RLERR techniques operate across a variety of model and data settings:

  • Retrieval-Augmented LLMs: TIRESRAG-R1 leverages sufficiency, reasoning quality, and reflection rewards to enhance retrieval-augmented reasoning. It employs Qwen-2.5-3B backbones, BGE-large-en-v1.5 retrievers, group filtering, and PPO with tailored weights and penalties (He et al., 30 Jul 2025).
  • Self-Reflective LLMs for Symbolic/Verifiable Tasks: RLERR for function calling and math leverages only binary validators, applies reward solely to effective reflection tokens, and trains via on-policy GRPO with KL penalty (Bensal et al., 30 May 2025).
  • Fully Supervised + RL Hybrid: RLERR often begins with an SFT stage (Self-Critique Fine-Tuning, SCFT) yielding a model with baseline self-critique capabilities; RL then further internalizes and sharpens reflective reasoning (Wang et al., 19 Jan 2026).
  • Efficient Reasoning Models: REA-RL combines reflection density penalties with parallel sampling and sequential revision, reducing inference cost without degrading accuracy—crucially, the reflection reward is necessary to avoid degenerate policies that minimize length by omitting all reflection (Deng et al., 26 May 2025).
  • Multimodal LLMs: SRPO adapts the reflection-reward schema to multimodal (text+image) inputs, rewarding effective, concise, and structurally tagged reflections while penalizing redundancy (Wan et al., 2 Jun 2025).
  • Dynamic Multi-Reward Adjustment: RLERR in dialogue generation can adaptively mix reflection, fluency, and coherence rewards in RL by dynamically optimizing reward weights via non-contextual and contextual multi-armed bandit methods (Min et al., 2024).
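The bandit-based reward mixing in the last bullet can be sketched with a simple epsilon-greedy non-contextual bandit; the arm structure and incremental-mean update below are generic textbook choices, not the paper's exact method:

```python
import random

def select_arm(values, epsilon, rng=random):
    """Epsilon-greedy choice over reward-weight configurations ("arms"):
    explore a random arm with probability epsilon, otherwise exploit the
    arm with the highest running mean payoff."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])

def update_arm(values, counts, arm, payoff):
    """Incremental-mean update after observing an episode's payoff
    (e.g., a blend of reflection, fluency, and coherence scores)."""
    counts[arm] += 1
    values[arm] += (payoff - values[arm]) / counts[arm]
```

The contextual variant conditions the arm choice on episode features, but follows the same select/observe/update loop.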

5. Empirical Efficacy and Ablation Evidence

Extensive ablations and benchmark evaluations consistently demonstrate the necessity and effectiveness of reflection-focused rewards within RLERR frameworks:

| System | Dataset | Effect of Removing Reflection Reward (EM/Accuracy) | Source |
|---|---|---|---|
| TIRESRAG-R1 | HotpotQA | –3.2 (41.0 → 37.8) | (He et al., 30 Jul 2025) |
| TIRESRAG-R1 | Musique | –8.6 (19.4 → 10.8) | (He et al., 30 Jul 2025) |
| RLERR (math) | AIME2024 | –2 (44.2 → ~42) with outcome-only reward | (Wang et al., 19 Jan 2026) |
| REA-RL | GSM8K, etc. | –6 to –16 ppt; reflection reward critical to avoid collapse | (Deng et al., 26 May 2025) |
| RLERR ("Reflect, Retry, Reward") | Math Countdown | +34.7 ppt second-try accuracy (+9 ppt from RLERR reflection credit) | (Bensal et al., 30 May 2025) |

In summary, reflection rewards are critical for error correction, self-correction, and policy stability across a wide spectrum of complex RL for structured and open-domain language reasoning.

6. Key Methodological Variants

Several RLERR instantiations differentiate themselves in approach:

  • One-Stage vs. Two-Stage Reflection: Explicit two-stage interaction loops (Reflect, Retry, Reward (Bensal et al., 30 May 2025)) assign reward solely to reflection, whereas single-stage methods (SCFT → RLERR (Wang et al., 19 Jan 2026)) incorporate reflection as part of general action selection, with a learned scorer providing graded credit.
  • Annotation-Free Reflection Credit: RLERR supports fully automatic reflection supervision in any domain where an automated final validation exists, making it broadly scalable.
  • Reflection-Aware Model Architectures: Some frameworks utilize a separate reflection model (potentially frozen), which parses or annotates reasoning steps for feedback, as in REA-RL retriever-based policies (Deng et al., 26 May 2025) and RLERR for online RM-guided reflection (Wang et al., 19 Jan 2026).

A synopsis of representative RLERR paradigms and core reward forms is summarized below:

| Paper | Reflection Reward Formulation | Architectural Feature |
|---|---|---|
| (Bensal et al., 30 May 2025) | Token credit for reflection tokens if retry succeeds | Two-stage "Reflect, Retry" pipeline |
| (He et al., 30 Jul 2025) | $\pm 1$ reward for corrected/degraded reflection | Retrieval-augmented, GRPO |
| (Wang et al., 19 Jan 2026) | Scalar RM score for reflection quality | SCFT warmup, GRPO, RM critic |
| (Wan et al., 2 Jun 2025) | $\in \{-0.25, 0, +0.25, +0.5\}$ for concise/effective improvement | Multimodal, brevity penalty |
| (Deng et al., 26 May 2025) | Penalty for low reflection density | Sequential/parallel edit architecture |

7. Applications, Limitations, and Prospects

RLERR has been applied in multi-hop QA, symbolic math, code generation, multimodal reasoning, and counselor dialogue reflection generation (He et al., 30 Jul 2025, Bensal et al., 30 May 2025, Wan et al., 2 Jun 2025, Wang et al., 19 Jan 2026, Deng et al., 26 May 2025, Min et al., 2024). Reflection reward designs have enabled smaller (1–7B) models to surpass much larger (10–70B) models trained without such credit assignment.

Principal limitations include the requirement for outcome validators or external reflection quality models in reward construction, potential computational overhead in reward modeling, and challenges in settings where successful self-reflection is rare or intractable. Not all reflective behaviors lead to improvement; naive reflection reward terms may incentivize spurious verbosity or irrelevant self-critique unless tightly coupled to solution improvement (Wang et al., 19 Jan 2026).

Future directions include: multi-turn reflection and retry (beyond two steps), task-agnostic reward models, hybrid reward blending (combining reflection and outcome credit), plug-and-play reflection objectives for PPO-based RLHF, and extension to domains with ambiguous or multi-dimensional solution spaces.

RLERR therefore represents a principled shift in RL for language reasoning: by embedding reflective, self-improving behaviors into the reward design itself, it establishes a reproducible path to endow large models with robust, efficient, and reliably self-correcting capabilities.
