RISE: Online Self-Verification for RL Models

Updated 21 December 2025
  • Online Self-Verification (RISE) is a reinforcement learning framework that integrates on-policy solution generation and verifiable self-assessment to enhance model trustworthiness.
  • It employs a unified RL objective, using deterministic outcome verifiers to jointly optimize answer correctness and self-verification accuracy.
  • RISE has demonstrated significant improvements in reasoning accuracy and verification metrics across both language and vision tasks.

Online Self-Verification (RISE) is a class of reinforcement learning (RL) frameworks and algorithms that enable neural models—particularly LLMs and Vision-LLMs (VLMs)—to simultaneously improve their problem-solving and output-verification capabilities through integrated, on-policy feedback using verifiable rewards. The defining feature of RISE is the intertwining of solution generation and self-verification within a single online RL process. This design departs from classical self-critique or offline verification approaches by leveraging the same verifiable signals to co-train both roles, producing models that not only solve challenging tasks but also develop robust abilities to self-assess and critique their outputs (Liu et al., 19 May 2025, Weng et al., 2022, Zhang et al., 2 Jun 2025, Hu et al., 17 Aug 2025).

1. Motivation and Theoretical Underpinnings

Recent advances in RL with LLMs have underscored a persistent limitation: models trained solely with outcome-based or scalar rewards frequently exhibit superficial self-reflection, lacking genuine awareness of the validity of their own reasoning. This gap is exacerbated when verification is delegated to a separate, frozen model or applied only post hoc, offline. Traditional self-verification approaches, such as two-stage self-critique or backward verification as in (Weng et al., 2022), improve accuracy by reranking or filtering outputs after generation, but they do not allow the model to adapt its verification mechanisms online or to co-adapt its generation and critique skills.

RISE addresses these challenges by embedding self-verification as a first-class RL objective. Through on-the-fly, verifiable outcome rewards—often derived from deterministic or rule-based task verifiers—RISE frameworks update the model based on both the correctness of its solutions and the accuracy of its attempted self-assessment. This dual optimization targets not only answer quality but also trustworthiness and transparency in model behavior (Liu et al., 19 May 2025, Zhang et al., 2 Jun 2025).

2. Core Frameworks and Methodological Variants

RISE instantiations differ by domain and architectural choices but share several core ingredients:

  • Generation Policy: A single LLM or VLM policy $\pi_\theta$ generates solutions (e.g., chain-of-thought for LLMs or annotated reasoning for VLMs).
  • Self-Verification Head: The same policy, via a specialized prompt, critiques or verifies its own generated output. No additional discriminative parameters are introduced (Zhang et al., 2 Jun 2025, Liu et al., 19 May 2025).
  • Outcome Verifier (OV): A deterministic, rule-based procedure $OV(x, y) \in \{0,1\}$ or a richer reward schema assigns verifiable feedback to both solutions and self-verification responses (see the sketch below).
  • Integrated RL Objective: The expected cumulative reward jointly aggregates solution and verification correctness:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} r_v(s_t, a_t)\right]$$

(Liu et al., 19 May 2025).
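
A minimal sketch of these ingredients in Python, assuming a math-style task with a `#### <answer>` convention and a Yes/No self-verdict; the helper names and parsing rules are illustrative assumptions, not the papers' exact implementation:

```python
import re
from typing import Optional

def extract_final_answer(solution_text: str) -> Optional[str]:
    """Pull the final answer from a solution; assumes a '#### <answer>' convention."""
    matches = re.findall(r"####\s*(.+)", solution_text)
    return matches[-1].strip() if matches else None

def outcome_verifier(problem: dict, solution_text: str) -> int:
    """Deterministic OV(x, y) in {0, 1}: exact match against the reference answer."""
    predicted = extract_final_answer(solution_text)
    return int(predicted is not None and predicted == problem["answer"].strip())

def parse_yes_no(verification_text: str) -> Optional[bool]:
    """Read the model's final self-verdict; assumes the response ends with 'yes' or 'no'."""
    tail = verification_text.lower().strip().rstrip(".")
    if tail.endswith("yes"):
        return True
    if tail.endswith("no"):
        return False
    return None

def verification_reward(verification_text: str, solution_reward: int) -> int:
    """r^ver = 1 iff the model's own verdict agrees with the outcome verifier's label."""
    verdict = parse_yes_no(verification_text)
    if verdict is None:
        return 0  # unparseable verdicts earn no reward
    return int(verdict == bool(solution_reward))
```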

Two notable RISE instantiations:

  • Online RL with Integrated Verification (Liu et al., 19 May 2025): Both generation and verification prompts are batched per RL step; PPO objectives govern both tasks, and the model updates via mixed-trajectory advantages.
  • Unified Generation–Verification with GRPO (Zhang et al., 2 Jun 2025): A single policy generates a solution and then verifies it with a Yes/No answer, with dynamic verification rewards tuned for group-level difficulty (one possible weighting rule is illustrated in the sketch below).
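
The group-level weighting of the verification reward could plausibly be implemented as below; the specific weighting rule is an assumption for illustration and may differ from the schedule used in (Zhang et al., 2 Jun 2025):

```python
from typing import List

def dynamic_verification_reward(verdict_correct: bool, group_solution_rewards: List[int]) -> float:
    """Scale the 0/1 verification reward by group-level difficulty (illustrative rule).

    group_solution_rewards: outcome-verifier results (0/1) for every solution sampled
    for the same problem in this batch. The weight below is an assumption: it peaks
    when the group is split (hard-to-predict problems) and vanishes when every sample
    agrees, so a verifier cannot profit from always answering "Yes" or always "No".
    """
    if not group_solution_rewards:
        return 0.0
    p_correct = sum(group_solution_rewards) / len(group_solution_rewards)
    difficulty_weight = 4.0 * p_correct * (1.0 - p_correct)
    return difficulty_weight * float(verdict_correct)
```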

3. Detailed Training Algorithms and Objective Functions

A canonical RISE training iteration proceeds as follows (Liu et al., 19 May 2025, Zhang et al., 2 Jun 2025):

  1. Problem Sampling: Draw a batch of problems.
  2. Generation Stage: For each problem $x_i$, generate multiple solutions $y_{i,k} \sim \pi_\theta$ and compute rewards $r_{i,k} = OV(x_i, y_{i,k})$.
  3. Verification Stage: For a subset, construct verification prompts $T_\text{ver}(x, y)$; sample verification responses $y^{ver}_j \sim \pi_\theta(\cdot \mid x^{ver})$; assign the verification reward $r^{ver} = \mathbf{1}_{y^{ver} = r}$.
  4. Advantage Computation: Use Generalized Advantage Estimation (GAE) or group-level normalization to compute $\hat{A}_t$.
  5. Policy and Critic Updates: Apply clipped PPO or GRPO objectives:

$$J_\text{actor}(\theta) = \mathbb{E}_t\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta), 1-\epsilon, 1+\epsilon\big)\hat{A}_t\right)\right] - \beta\,\mathrm{KL}\!\left(\pi_{\theta_\mathrm{old}} \,\|\, \pi_\theta\right)$$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_\mathrm{old}}(a_t \mid s_t)$ (Liu et al., 19 May 2025).

  6. Batch Mixing: Both solution and verification experiences are pooled before the policy update (the full iteration is sketched below).
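
The iteration can be summarized in a simplified, framework-agnostic sketch. It assumes the policy is exposed as a `generate` callable, uses group-level normalization for advantages (the GRPO-style variant; the PPO variant would instead use a critic and GAE), and treats the prompt templates, `outcome_verifier`, and `verification_reward` as hypothetical plug-ins; the KL penalty is omitted for brevity:

```python
import torch
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Experience:
    prompt: str
    response: str
    reward: float
    advantage: float

def group_normalized_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: each reward standardized within its own sampling group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

def rise_training_step(
    problems: List[dict],
    generate: Callable[[str, int], List[str]],        # (prompt, k) -> k sampled responses
    outcome_verifier: Callable[[dict, str], int],     # OV(x, y) in {0, 1}
    verification_reward: Callable[[str, int], int],   # verdict text vs. OV outcome -> {0, 1}
    solution_prompt: Callable[[dict], str],           # hypothetical prompt templates
    verification_prompt: Callable[[dict, str], str],
    k: int = 4,
) -> List[Experience]:
    """One RISE iteration: generate, self-verify, score both roles, pool experiences."""
    pool: List[Experience] = []
    for x in problems:
        # Generation stage: k on-policy solutions, each scored by the outcome verifier.
        sp = solution_prompt(x)
        solutions = generate(sp, k)
        sol_rewards = [float(outcome_verifier(x, y)) for y in solutions]
        for y, r, a in zip(solutions, sol_rewards, group_normalized_advantages(sol_rewards)):
            pool.append(Experience(sp, y, r, a))

        # Verification stage: the same policy critiques its own solutions via a prompt.
        ver_prompts, verdicts, ver_rewards = [], [], []
        for y, r in zip(solutions, sol_rewards):
            vp = verification_prompt(x, y)
            verdict = generate(vp, 1)[0]
            ver_prompts.append(vp)
            verdicts.append(verdict)
            ver_rewards.append(float(verification_reward(verdict, int(r))))
        for vp, v, r, a in zip(ver_prompts, verdicts, ver_rewards,
                               group_normalized_advantages(ver_rewards)):
            pool.append(Experience(vp, v, r, a))

    # Batch mixing: solution and verification experiences share a single policy update.
    return pool

def clipped_policy_loss(
    logprob_new: torch.Tensor,   # log pi_theta(a|s) for the pooled experiences
    logprob_old: torch.Tensor,   # log pi_theta_old(a|s), detached
    advantages: torch.Tensor,
    eps: float = 0.2,
) -> torch.Tensor:
    """Clipped surrogate objective shared by the PPO/GRPO variants (KL penalty omitted)."""
    ratio = torch.exp(logprob_new - logprob_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages)
    return -surrogate.mean()
```

The pooled experiences feed `clipped_policy_loss` once per step, matching the batch-mixing step above; the KL regularizer from the objective in Section 3 would be added to this loss.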

For mathematical tasks, verification correctness is often defined by exact match or rule satisfaction. In VLMs, the reward signal may encode composite metrics, such as Jensen-Shannon divergence for emotion classification or mean average precision (mAP) for detection tasks, and is further gated by format and leakage constraints (Hu et al., 17 Aug 2025).
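
A gated composite reward of this kind might look as follows; the gating order and the `format_ok` / `leaks_answer` helpers are assumptions, while the JSD term corresponds to the emotion-distribution metric mentioned above:

```python
import math
from typing import Callable, Dict

def jensen_shannon_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Base-2 JSD between two discrete label distributions; lies in [0, 1]."""
    labels = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in labels}

    def kl(a: Dict[str, float]) -> float:
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in labels if a.get(k, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def composite_vlm_reward(
    predicted_dist: Dict[str, float],
    target_dist: Dict[str, float],
    output_text: str,
    format_ok: Callable[[str], bool],      # hypothetical: output parses into the expected schema
    leaks_answer: Callable[[str], bool],   # hypothetical: CoT states the label before reasoning
) -> float:
    """Gated composite reward: zero unless format and no-leakage checks pass, else 1 - JSD."""
    if not format_ok(output_text) or leaks_answer(output_text):
        return 0.0
    return 1.0 - jensen_shannon_divergence(predicted_dist, target_dist)
```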

4. Domain-Specific Extensions: Vision-LLMs

The RISE paradigm has been extended to VLMs operating on complex image-annotation tasks that require free-form, visually grounded reasoning (Hu et al., 17 Aug 2025). In RISE-CoT (Reasoning-Inspire-Strengthen-Expertise with chain-of-thought), the process forms a closed loop:

  • Annotation–Reasoning–Annotation: Given an image–annotation pair $(I, A)$, the model produces a chain-of-thought (CoT) describing its visual reasoning, then reconstructs the original annotation $\hat{A}$ from that CoT.
  • Self-Supervised Reward: The reward $\mathcal{R}(A, \hat{A}, R)$ is a function of the similarity between $A$ and $\hat{A}$, the absence of answer leakage in the CoT, and the syntactic format of the outputs.
  • Data Augmentation and Two-Stage Training: High-quality CoTs (verified via this closed-loop reward) are selected for supervised and then reinforcement fine-tuning, yielding improved annotation accuracy and explainability compared to standard SFT or reinforcement tuning without verified rationales; the selection loop is sketched below.
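
The closed loop, used here as a filter that builds the fine-tuning set from self-verified CoTs, can be sketched as below. The `vlm` callable, the prompt builders, and the threshold are assumptions intended to show the loop structure rather than the exact pipeline of (Hu et al., 17 Aug 2025):

```python
from typing import Callable, List, Tuple

def collect_verified_cots(
    dataset: List[Tuple[str, str]],                        # (image_path, annotation) pairs
    vlm: Callable[[str, str], str],                        # (image_path, prompt) -> generated text
    reasoning_prompt: Callable[[str], str],                # hypothetical: "explain why annotation A fits"
    reconstruction_prompt: Callable[[str], str],           # hypothetical: "recover the annotation from this CoT"
    closed_loop_reward: Callable[[str, str, str], float],  # R(A, A_hat, CoT) from this section
    threshold: float = 0.9,
) -> List[Tuple[str, str, str]]:
    """Keep only (image, annotation, CoT) triples whose CoT survives the closed-loop check."""
    kept: List[Tuple[str, str, str]] = []
    for image, annotation in dataset:
        # Annotation -> Reasoning: produce a CoT justifying the given annotation.
        cot = vlm(image, reasoning_prompt(annotation))
        # Reasoning -> Annotation: reconstruct the annotation from the CoT alone.
        reconstructed = vlm(image, reconstruction_prompt(cot))
        # Self-supervised reward: similarity, no-leakage, and format checks combined.
        if closed_loop_reward(annotation, reconstructed, cot) >= threshold:
            kept.append((image, annotation, cot))
    return kept
```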

5. Empirical Results and Comparative Performance

Extensive experiments across multiple benchmarks demonstrate consistent improvements in both solution quality and verification capability:

  • Mathematical Reasoning (LLMs):
    • RISE-3B improves reasoning by +3.7% and verification by +33.4 percentage points compared to instruct-tuned or Zero-RL baselines.
    • RISE-7B attains 42.9% reasoning accuracy (vs. 11.3% SFT) and 69.2% verification accuracy (vs. 46.6% Zero-RL) (Liu et al., 19 May 2025).
    • On MATH500, Self-Verification Qwen-7B achieves 87.2% solution verification rate, outperforming previous LLMs and matching/exceeding GPT-4o and Claude-3.7 on verification accuracy (Zhang et al., 2 Jun 2025).
    • Test-time scaling leveraging in-model verification (weighted self-consistency) yields further accuracy gains; a sketch of this aggregation follows the table below.
  • Vision-Language Tasks:
    • RISE-trained Qwen2-VL-2B achieves lowest Jensen–Shannon divergence and highest detection mAP on complex annotation tasks (Emotion6, LISA), outperforming SFT, Visual-RFT, and even GPT-4o in specific metrics (Hu et al., 17 Aug 2025).
    • High-fidelity self-verified CoTs are critical for downstream accuracy; removal of leakage or format checks in the reward degrades model performance.

| Model / Dataset | Reasoning Acc. ↑ | Verification Acc. ↑ | Additional Metrics |
|---|---|---|---|
| RISE-7B (MATH500) | 42.9% | 69.2% | |
| Qwen-7B Self-Verify (MATH500) | 83.6% | 87.2% | F1: 92.8 |
| Qwen2-VL-2B (Emotion6) | | | JSD: 0.071; mAP: 0.404 |
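
The test-time scaling mentioned above (weighted self-consistency) can be sketched as a verification-weighted vote over sampled solutions; the `self_verify_score` interface, returning the model's own probability of a "Yes" verdict, is an assumed abstraction:

```python
from collections import defaultdict
from typing import Callable, List, Optional

def weighted_self_consistency(
    solutions: List[str],
    extract_answer: Callable[[str], Optional[str]],
    self_verify_score: Callable[[str], float],   # hypothetical: model's own P("Yes") for the solution
) -> Optional[str]:
    """Aggregate sampled solutions by verification-weighted voting over final answers."""
    votes = defaultdict(float)
    for solution in solutions:
        answer = extract_answer(solution)
        if answer is None:
            continue
        # Plain self-consistency would add 1.0 here; weighting by the self-verifier
        # lets confidently verified solutions dominate the vote.
        votes[answer] += self_verify_score(solution)
    return max(votes, key=votes.get) if votes else None
```

Setting `self_verify_score` to a constant recovers ordinary self-consistency (majority voting).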

A plausible implication is that online RL with unified generation and verification improves not only task performance but also model calibration and explainability, relative to both post-hoc offline verification and external reward models.

6. Analysis of Self-Verification Behavior and Limitations

RISE-trained models exhibit quantitative and qualitative shifts in self-verification behavior:

  • The frequency of solutions containing explicit self-verification keywords increases with model size and RISE training, reaching 7–9% for RISE-7B models (Liu et al., 19 May 2025).
  • Among self-verified solutions, correct answer rates improve, suggesting that explicit verification correlates with model confidence and genuine reasoning.
  • Case analyses reveal structured, multi-step checks (e.g., divisibility, format validation), contrasting with superficial rule restatement observed in baselines.

Nonetheless, RISE’s effectiveness is currently limited to tasks with deterministic, rule-based verifiable outcomes; extension to open-ended tasks (e.g., code generation, commonsense reasoning) requires new schemas or tool integrations (Zhang et al., 2 Jun 2025, Hu et al., 17 Aug 2025). Overconfidence in incorrect self-verification and the potential need for hybrid (internal+external) verification for critical domains remain open concerns.

7. Relation to Prior and Contemporary Methods

RISE builds upon and subsumes earlier two-stage, backward-verification, and self-consistency approaches. For example, the self-verification wrapper of (Weng et al., 2022) employs a backward masking strategy to filter candidate solutions, reliably boosting accuracy over chain-of-thought-only baselines. By deeply integrating verification into the RL loop and operating entirely online with on-policy data, however, RISE uniquely enables the co-adaptation of generation and critique, yielding stronger empirical gains, greater efficiency (fewer external verifier calls), and richer introspective behavior (Liu et al., 19 May 2025, Zhang et al., 2 Jun 2025, Hu et al., 17 Aug 2025).
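
For contrast with RISE's online co-training, here is a rough sketch of the offline backward-verification wrapper: one condition of the problem is masked, the candidate answer is supplied as given, and the model is asked to rederive the masked value, with candidates reranked by agreement. The prompt construction and matching rule are simplified assumptions:

```python
from typing import Callable, List, Tuple

def backward_verification_score(
    question_with_mask: str,          # original problem with one condition replaced by "X"
    masked_value: str,                # the hidden value of that condition
    candidate_answer: str,
    rederive: Callable[[str], str],   # hypothetical LLM call that predicts the masked value
    num_checks: int = 4,
) -> float:
    """Fraction of rederivations that recover the masked condition given the candidate answer."""
    prompt = (
        f"{question_with_mask}\n"
        f"Suppose the final answer is {candidate_answer}. What is the value of X?"
    )
    hits = sum(rederive(prompt).strip() == masked_value.strip() for _ in range(num_checks))
    return hits / num_checks

def rerank_candidates(
    question_with_mask: str,
    masked_value: str,
    candidates: List[str],
    rederive: Callable[[str], str],
) -> List[Tuple[str, float]]:
    """Offline, post-generation reranking of candidate answers by backward verification."""
    scored = [(c, backward_verification_score(question_with_mask, masked_value, c, rederive))
              for c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```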

