Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

Published 6 Apr 2026 in cs.CV | (2604.04379v1)

Abstract: Video reasoning has advanced with large multimodal models (LMMs), yet their inference is often a single pass that returns an answer without verifying whether the reasoning is evidence-aligned. We introduce Reinforce to Learn, Elect to Reason (RLER), a dual paradigm that decouples learning to produce evidence from obtaining a reliable answer. In RLER-Training, we optimize the policy with group-relative reinforcement learning (RL) and 3 novel task-driven rewards: Frame-sensitive reward grounds reasoning on explicit key frames, Think-transparency reward shapes readable and parsable reasoning traces, and Anti-repetition reward boosts information density. These signals teach the model to emit structured, machine-checkable evidence and potentiate reasoning capabilities. In RLER-Inference, we apply a train-free orchestrator that generates a small set of diverse candidates, parses their answers and cited frames, scores them by evidence consistency, confidence, transparency, and non-redundancy, and then performs a robust evidence-weighted election. This closes the loop between producing and using evidence, improving reliability and interpretability without enlarging the model. We comprehensively evaluate RLER against various open-source and RL-based LMMs on 8 representative benchmarks. RLER achieves state of the art across all benchmarks and delivers an average improvement of 6.3% over base models, while using on average 3.1 candidates per question, indicating a favorable balance between compute and quality. The results support a simple thesis: making evidence explicit during learning and electing by evidence during inference is a robust path to trustworthy video reasoning.


Authors (8)

Summary

The paper introduces RLER, a dual paradigm for video reasoning that decouples evidence reinforcement during training from evidence-based arbitration during inference.
It leverages Group Relative Policy Optimization with tailored rewards to produce precise, machine-readable evidence from video frames.
Experimental results show state-of-the-art performance, confirming improved robustness and interpretability in complex multimodal tasks.


Motivation and Problem Formulation

Video reasoning with Large Multimodal Models (LMMs) has been impeded by single-pass inference strategies that neglect systematic validation of reasoning chains and lack evidential grounding. These limitations are exacerbated by the long temporal dependencies and inherent noise of video data, which magnify the risks of spurious correlations and non-robust inferences. Even state-of-the-art LMMs, including recent RL-finetuned models, typically generate an answer directly without verifying cross-frame evidence alignment. This paradigm not only complicates error detection but also impairs reliability and interpretability in contexts that require transparent, verifiable, and evidence-centric reasoning.

RLER: Dual Evidence-Centric Training and Inference

"Reinforce to Learn, Elect to Reason" (RLER) (2604.04379) introduces a dual paradigm that explicitly decouples (1) the reinforcement of evidence emission during training from (2) evidence-aligned arbitration during inference. The goal is to shape the model to produce machine-readable, structured evidence and then to reason by aggregating and verifying that evidence, closing the loop in a way that enhances both reliability and interpretability.

Figure 1: RLER inference contrasts with traditional single-pass inference by generating structured outputs, scoring cross-candidate evidence, aggregating by evidence weight, and performing a refutation check to return credible answers.

RLER-Training: Reward Shaping for Evidence-Centric Outputs

Training leverages Group Relative Policy Optimization (GRPO) with three core task-driven reward signals:

Frame-sensitive Reward: Maximizes correct citation of key video frames within reasoning traces to enhance cross-frame and temporal grounding.
Think-transparency Reward: Favors readable and parsable chains-of-thought, optimizing for interpretable and non-verbose rationales with a robust length normalization.
Anti-repetition Reward: Penalizes $n$-gram repetition, increasing informational density and discouraging vacuous, low-information patterns.

These are aggregated with classic format and accuracy rewards to shape the policy toward structured, evidence-rich, and schema-compliant outputs.
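The reward shaping described above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the distinct-$n$-gram ratio, the F1 overlap between cited and annotated key frames, the weighted sum, and all function names and parameters are assumptions chosen for illustration. The group-relative advantage follows the commonly used GRPO form (reward minus group mean, divided by group standard deviation); the population standard deviation and the `eps` term are likewise assumptions.

```python
import statistics


def anti_repetition_reward(text: str, n: int = 3) -> float:
    """Hypothetical reward: fraction of distinct n-grams (1.0 = no repetition)."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 1.0  # text shorter than n tokens cannot repeat an n-gram
    return len(set(ngrams)) / len(ngrams)


def frame_reward(cited: set[int], key_frames: set[int]) -> float:
    """Hypothetical reward: F1 overlap between cited and annotated key frames."""
    hits = len(cited & key_frames)
    if hits == 0:
        return 0.0
    precision = hits / len(cited)
    recall = hits / len(key_frames)
    return 2 * precision * recall / (precision + recall)


def total_reward(parts: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of reward components (the weights are illustrative only)."""
    return sum(weights[name] * value for name, value in parts.items())


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled response's reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    response = "the ball rolls left in frame 12 then the ball rolls left again"
    parts = {
        "accuracy": 1.0,
        "format": 1.0,
        "frame": frame_reward({12, 30}, {12, 14}),
        "anti_repetition": anti_repetition_reward(response),
    }
    weights = {"accuracy": 1.0, "format": 0.5, "frame": 0.5, "anti_repetition": 0.25}
    print(total_reward(parts, weights))
    print(group_relative_advantages([1.8, 0.9, 1.2, 0.4]))
```

Because advantages are computed relative to the other responses sampled for the same question, a response is reinforced only when its evidence and accuracy beat its siblings, not when the question is merely easy.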

Figure 2: RLER-Training employs GRPO with novel rewards for structured evidence and explicit keyframe citation; RLER-Inference orchestrates diverse candidates, parses structure, scores evidence, aggregates, and performs refutation.

RLER-Inference: Evidence-Aligned Orchestration and Election

Inference employs a train-free orchestrator that models decision-making as an evidence-weighted election among a small, diverse set of structured candidates:

Diversified Sampling: Candidates are generated by varying decoding settings and perturbing inputs (e.g., decoding temperature, visual cropping).
Evidence Parsing: Each candidate is parsed to extract answer, cited keyframes, reasoning trace, and a confidence proxy.
Evidence Scoring: Each candidate receives a composite score reflecting frame-grounding, transparency, non-redundancy, and confidence. This mirrors the reward structuring in RLER-Training.
Weighted Aggregation: A robust election mechanism (outlier removal, evidence-weighted scoring) selects the answer supported by the most consistent evidence. Early stopping is dynamically triggered when consensus is robust.
Referee-style Self-Check: A final, automatic critique verifies sufficiency of frame citations; if counter-evidence is detected, reweighting and resampling are triggered.

This process integrates all forms of explicit evidence, assigning maximal influence to candidates whose reasoning aligns with cross-candidate, cross-frame evidence.
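The scoring and election steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the orchestrator from the paper: the `Candidate` fields, the Jaccard-based frame consistency, the weights, and the z-score outlier cutoff are all hypothetical choices, and the referee-style self-check is omitted.

```python
import statistics
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    frames: frozenset       # cited frame indices parsed from the output
    confidence: float       # confidence proxy in [0, 1]
    transparency: float     # readability/parsability score in [0, 1]
    non_redundancy: float   # e.g. a distinct n-gram ratio in [0, 1]


def frame_consistency(cand: Candidate, others: list) -> float:
    """Mean Jaccard overlap between this candidate's cited frames and the others'."""
    sims = []
    for other in others:
        union = cand.frames | other.frames
        sims.append(len(cand.frames & other.frames) / len(union) if union else 0.0)
    return statistics.fmean(sims) if sims else 0.0


def score(cand: Candidate, pool: list, w=(0.4, 0.3, 0.15, 0.15)) -> float:
    """Composite evidence score; the weights are illustrative assumptions."""
    others = [c for c in pool if c is not cand]
    parts = (frame_consistency(cand, others), cand.confidence,
             cand.transparency, cand.non_redundancy)
    return sum(wi * p for wi, p in zip(w, parts))


def elect(pool: list, z_cut: float = 1.5):
    """Drop low-score outliers, then sum scores per answer and pick the leader."""
    scores = [score(c, pool) for c in pool]
    mean, std = statistics.fmean(scores), statistics.pstdev(scores)
    tally = defaultdict(float)
    for cand, s in zip(pool, scores):
        if std > 0 and (s - mean) / std < -z_cut:
            continue  # outlier: evidence far weaker than the rest of the pool
        tally[cand.answer] += s
    winner = max(tally, key=tally.get)
    total = sum(tally.values())
    share = tally[winner] / total if total > 0 else 0.0
    return winner, share


if __name__ == "__main__":
    pool = [
        Candidate("B", frozenset({3, 4}), 0.6, 0.8, 0.9),
        Candidate("B", frozenset({3, 4, 5}), 0.6, 0.8, 0.9),
        Candidate("A", frozenset({10}), 0.9, 0.8, 0.9),
    ]
    print(elect(pool))
```

In the demo pool, the lone "A" candidate is the most confident, yet "B" is elected because two candidates cite overlapping frames. That is the intended behavior of an evidence-weighted election as opposed to picking the most confident single sample.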

Figure 3: A case study highlights how RLER forms structured candidates from diverse inputs, scores and aggregates evidence, and verifies/refutes to revise and finalize the answer.

Emergent Behaviors and "Aha Moment"

RLER-Training induces self-correction behaviors, most notably spontaneous "aha moments" in which the model detects an inconsistency, interrupts its chain-of-thought, and restarts its reasoning. These emergent self-tests are leveraged at inference for targeted self-review, further enhancing reliability and enabling principled corrections in ambiguous or error-prone cases.

Figure 4: RLER-Training elicits emergent behavior: the model engages in self-review and explicit correction, exemplified by internally identifying contradictions and recalibrating its hypothesis.

Experimental Results

Empirical evaluation on 8 public video reasoning benchmarks demonstrates consistent state-of-the-art performance among open-source LMMs. Reported results include:

Video Reasoning: Achieves 43.3% on VSIBench and 54.2% on VideoMMMU.
General Understanding: 68.5% on VideoMME (without subtitles), 76.2% on TempCompass, and 72.9% on MVBench.
Long Video Understanding: 50.7% on LVBench and 63.0% on LongVideoBench.

RLER's design delivers an average +6.3% improvement over its base models with a modest computational footprint ($\bar{K} = 3.1$ candidates per question), outperforming both single-sample and static multi-sample paradigms. Ablations indicate that removing any core reward or inference module measurably degrades accuracy and impairs evidence alignment.
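A fractional average such as $\bar{K} = 3.1$ arises naturally when sampling stops early on easy questions and continues on contested ones. The sketch below shows one way this could work; the stopping rule (leader's share of total weight above a threshold), the bounds `k_min` and `k_max`, and the `generate` callback are all assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Tuple


def sample_until_consensus(
    generate: Callable[[], Tuple[str, float]],
    k_min: int = 2,
    k_max: int = 5,
    threshold: float = 0.75,
) -> Tuple[str, int]:
    """Draw (answer, evidence_weight) pairs until the leading answer dominates.

    `generate` is a hypothetical stand-in for sampling one candidate from the
    model and scoring its evidence; weights are assumed to be positive.
    """
    tally = defaultdict(float)
    leader, used = "", 0
    for k in range(1, k_max + 1):
        answer, weight = generate()
        tally[answer] += weight
        used = k
        leader = max(tally, key=tally.get)
        total = sum(tally.values())
        if k >= k_min and total > 0 and tally[leader] / total >= threshold:
            break  # consensus is strong enough; stop spending compute
    return leader, used


if __name__ == "__main__":
    # Two toy "questions": one where candidates agree, one where they split.
    easy = iter([("A", 1.0)] * 5)
    hard = iter([("A", 1.0), ("B", 1.0), ("A", 1.0), ("B", 1.0), ("A", 1.0)])
    results = [
        sample_until_consensus(lambda: next(easy)),
        sample_until_consensus(lambda: next(hard)),
    ]
    mean_k = sum(k for _, k in results) / len(results)
    print(results, mean_k)
```

The easy question stops after the minimum number of candidates while the contested one uses the full budget, so the mean candidate count over a benchmark falls between `k_min` and `k_max`.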

Theoretical and Practical Implications

This evidence-aligned dual paradigm challenges conventional end-to-end and single-chain-of-thought frameworks, suggesting that decoupling evidence production from arbitration leads to higher trustworthiness and robustness in temporally complex, multi-view reasoning domains. The RLER approach enables principled integration of diverse rationales while keeping elected answers interpretable and verifiable through explicit evidence trails.

The implicit link between reward shaping and arbitration further suggests extensions to other high-noise, high-complexity multimodal tasks such as scientific/medical video analysis, surveillance summarization, and interactive agent systems. The paradigm's flexibility with a given LMM backbone and its decoupled inference orchestration facilitate adoption in systems constrained by compute or requiring deployment-time adjustability.

Future Directions

Potential future work includes scaling RLER to mixed-modal (video, audio, text) tasks, integrating continual learning for evolving long-horizon contexts, and formalizing the detection and handling of emergent self-correction patterns. The self-contained, schema-guided nature of training and inference also enables straightforward integration into federated or privacy-preserving reasoning systems.

Conclusion

RLER presents a principled, evidence-centric dual paradigm for video reasoning. By explicitly shaping structured, machine-checkable evidence during training and aggregating by that evidence at inference, RLER substantially enhances the robustness, reliability, and interpretability of LMM-based video understanding. The observed empirical improvements, efficient compute–quality tradeoff, and emergence of self-correction behaviors position RLER as a technical advancement for trustworthy multimodal reasoning.
