
Video-RTS: Efficient Video Reasoning

Updated 10 July 2025
  • Video-RTS is a framework for efficient video reasoning that uses pure reinforcement learning and adaptive inference to optimize computational resources.
  • It employs group relative policy optimization and output-based rewards to drastically reduce training data and annotation requirements.
  • The framework outperforms previous methods on key benchmarks, achieving state-of-the-art accuracy with only 3.6% of the traditional training data.

Video-RTS is a framework for efficient and enhanced video reasoning with LLMs, addressing longstanding challenges in data efficiency and computational scalability for multimodal video understanding. It departs from conventional supervised pipelines through the adoption of pure reinforcement learning driven by output-based rewards and introduces a video-adaptive test-time scaling strategy to better utilize computational resources during inference. Video-RTS has demonstrated state-of-the-art performance on several video reasoning benchmarks while requiring only a small fraction of the training data previously demanded by comparable methods (2507.06485).

1. Motivation and Data-Efficient Video Reasoning

A central challenge in video reasoning with LLMs and reinforcement learning (RL) lies in the heavy dependence on large-scale supervised fine-tuning (SFT) with complex chain-of-thought (CoT) annotations. This process, often involving hundreds of thousands of annotated video-question pairs (for example, Video-R1 uses 165K SFT samples plus additional RL data), incurs high costs, restricts scalability, and presents barriers to long-horizon or domain-specific reasoning. Video-RTS addresses these limitations by forgoing the SFT step and leveraging pure RL on modestly sized video QA datasets (e.g., 6K samples), achieving comparable or superior video reasoning with just 3.6% of the training data used by prior best models. This data efficiency is of practical significance, given the expense of collecting and processing high-dimensional, temporally complex video annotations (2507.06485).

2. Methodological Advances: Pure RL Training and Group Relative Policy Optimization

Video-RTS eliminates the SFT stage entirely, directly applying a data-efficient RL algorithm, Group Relative Policy Optimization (GRPO), to fine-tune a pretrained multimodal LLM (MLLM) for video reasoning. GRPO generates a group of candidate outputs $\{O_1, \ldots, O_G\}$ for each input, assigns output-based rewards, and optimizes the policy using the relative quality score:

$$S_i = \frac{R_i - \text{mean}(\{R_1, \ldots, R_G\})}{\text{std}(\{R_1, \ldots, R_G\})}$$

where $R_i$ is the reward for output $O_i$.

The RL loss is defined as:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\} \sim \pi_{\text{old}}}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\text{old}}(o_i \mid q)}\, S_i,\; \text{clip}\!\left(\frac{\pi_\theta(o_i \mid q)}{\pi_{\text{old}}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right) S_i \right) - \beta\, D_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$
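As a concrete illustration (not the authors' code), the following PyTorch sketch shows how the group-relative score $S_i$ and the clipped objective above could be computed for one query's group of $G$ sampled outputs. The sequence-level log-probability inputs, the simple KL estimator, and the default values of `clip_eps` and `beta` are assumptions made for the example.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    """GRPO objective for one query's group of G sampled outputs (illustrative).

    Args:
        logp_new: (G,) summed log-probs of each output under the current policy.
        logp_old: (G,) summed log-probs under the policy that sampled the outputs.
        logp_ref: (G,) summed log-probs under the frozen reference policy.
        rewards:  (G,) scalar output-based rewards R_i.
    Returns:
        Scalar loss to minimize (negative of the GRPO objective).
    """
    # Group-relative advantage: standardize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Importance ratio between the current and the sampling policy.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate, as in PPO, but with group-normalized advantages.
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )

    # Simple KL penalty against the reference policy (one common estimator).
    kl = (logp_new - logp_ref).mean()

    objective = surrogate.mean() - beta * kl
    return -objective  # minimize the negative objective
```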

Reward construction is solely output-based, combining a format reward $R_{\text{format}}$ for explicit CoT structure (through special tags) and an accuracy reward $R_{\text{acc}}$ for correct answers:

$$R(O) = R_{\text{format}}(O) + R_{\text{acc}}(\hat{A}; A_{\text{gt}})$$

where $\hat{A}$ is the model-predicted answer and $A_{\text{gt}}$ the gold answer.

This eliminates the need for intermediate CoT supervision, significantly reducing annotation requirements and enabling generalization across video domains and tasks.
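To make the output-based reward concrete, the sketch below checks for CoT formatting tags and compares the extracted answer against the gold label. The tag names (`<think>`, `<answer>`) and the unit reward values are illustrative assumptions rather than the paper's exact specification.

```python
import re

def compute_reward(output: str, gold_answer: str) -> float:
    """Output-based reward: format term plus accuracy term (illustrative values)."""
    # Format reward: the output wraps its reasoning and answer in the expected tags.
    has_format = bool(
        re.search(r"<think>.*</think>", output, re.DOTALL)
        and re.search(r"<answer>.*</answer>", output, re.DOTALL)
    )
    r_format = 1.0 if has_format else 0.0

    # Accuracy reward: the extracted answer matches the gold answer.
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    r_acc = 1.0 if predicted.lower() == gold_answer.strip().lower() else 0.0

    return r_format + r_acc
```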

3. Video-Adaptive Test-Time Scaling (TTS) for Efficient Inference

Inference in video reasoning models typically involves dense sampling and uniform frame usage, leading to substantial unnecessary computation for queries that can be answered using limited temporal evidence. Video-RTS introduces a sparse-to-dense TTS strategy: inference begins with sparse frame sampling; multiple model outputs are generated (with varied sampling seeds or decoding temperatures), and answer consensus is checked. If all outputs agree, inference halts; otherwise, additional frames are incorporated and the consensus check iterates, up to a predefined maximum.

Algorithmically, for a frame budget $n$, $m$ outputs are produced, and majority (or unanimous) agreement determines sufficiency. Only difficult queries, where visual information is ambiguous, trigger additional computation. This approach aligns test-time computation with query complexity, yielding both computational savings and accuracy gains (2507.06485).
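A minimal sketch of this sparse-to-dense loop, under assumed interfaces, is given below; `sample_frames` and `model.answer` are hypothetical helpers standing in for the actual frame sampler and MLLM decoding call, and the frame budgets and unanimity rule are illustrative.

```python
from collections import Counter

def video_rts_inference(model, video, question,
                        frame_budgets=(8, 16, 32, 64), m=3):
    """Sparse-to-dense test-time scaling: add frames only when outputs disagree."""
    answers = []
    for n in frame_budgets:
        frames = sample_frames(video, num_frames=n)  # uniform sampling (assumed helper)
        # Generate m candidate answers with different seeds/temperatures.
        answers = [model.answer(frames, question, seed=s) for s in range(m)]
        if len(set(answers)) == 1:
            # Consensus reached: the sparse evidence is sufficient, stop early.
            return answers[0]
    # Maximum frame budget reached without consensus: fall back to majority vote.
    return Counter(answers).most_common(1)[0][0]
```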

4. Empirical Results and Performance Metrics

Video-RTS achieves substantial improvements over prior art on a suite of video reasoning benchmarks:

  • On Video-Holmes, a recent challenging benchmark, Video-RTS surpasses the previous best by 4.2%.
  • On MMVU, it attains a 2.6% improvement, with similar gains observed on Video-MMMU, Video-MME, and LongVideoBench.
  • Video-RTS achieves an average 2.4% increase in accuracy across tasks while reducing labeled training requirements to just 3.6% of those used by SFT+RL baselines.

This efficiency is realized both in the reduction of required frames at inference (as a function of achieved output consensus) and in the elimination of costly SFT annotation.

5. Key Innovations and Advantages

Video-RTS's innovations can be summarized as follows:

  • Elimination of SFT: Pure RL training skips supervised annotation, expediting deployment to new domains and reducing human labor.
  • Output-Based Rewarding: RL optimization is performed using only output format and final-answer correctness, rather than detailed stepwise thought supervision.
  • Resource-Adaptive Inference: The sparse-to-dense TTS strategy reserves intensive computation for hard queries, avoiding redundant inference on easy cases.
  • Data and Compute Efficiency: The architecture demonstrates state-of-the-art reasoning with orders-of-magnitude smaller data and compute footprints.

These advances are particularly relevant for scaling video QA and reasoning systems to domains where annotated resources are sparse, privacy-sensitive, or where rapid task adaptation is critical.

6. Limitations and Future Directions

While Video-RTS achieves notable gains in accuracy and resource use, several future research directions are plausible:

  • Integrating fairness and bias mitigation, as pretrained MLLMs may exhibit inherited biases from upstream data.
  • Refining the TTS consensus strategy, possibly through probabilistic self-consistency or confidence-weighted voting, to further balance latency and accuracy.
  • Extending the pure RL + TTS paradigm to more complex or open-ended multimodal tasks, beyond multiple-choice questions.
  • Exploring hybrid reward mechanisms that incorporate secondary task-specific metrics or reinforcement signals.

A plausible implication is that similar pure RL + adaptive inference approaches could generalize to audio-visual, multi-document, or multi-turn video reasoning tasks.

7. Impact on the Field and Broader Implications

Video-RTS represents a paradigmatic shift in video reasoning methodology by decoupling high-quality performance from annotation- and compute-intensive SFT pipelines. It enables resource-aware training and inference, expanding accessibility to research groups and applications previously limited by dataset or compute constraints. Its methodological principles—output-based RL, adaptive test-time inference, and data minimization—are likely to inform future work not only in video QA, but in multi-modal reasoning more broadly, as the scale and diversity of video data continue to grow (2507.06485).

References (1)