Video-R4: Visual Rumination for Video QA

Updated 3 July 2026
  • Video-R4 is a system employing iterative strategies that combine frame selection, spatial zooming, and pixel re-encoding to enhance video question answering.
  • It addresses limitations of single-pass models, reducing hallucination and omissions by emulating human-like attention to transient text evidence.
  • Empirical results demonstrate state-of-the-art performance on benchmarks, with potential for future integration of richer modalities and expanded toolsets.

Video-R4

Video-R4 is a vision-and-language agent designed for text-rich video reasoning, implementing visual rumination—an iterative strategy combining targeted frame selection, localized spatial zoom, recurrent pixel re-encoding, and adaptive context updating. This approach enables robust, evidence-grounded video question answering (VideoQA) in tasks where critical information is embedded in small, transient, or spatially localized textual regions of complex video sequences. The Video-R4 system was developed to address the pronounced limitations of single-pass, non-iterative multimodal systems, notably their tendency toward hallucination and failure in fine-grained reasoning when key evidence is missed or underrepresented in a fixed pre-selected set of frames (Tang et al., 21 Nov 2025).

1. Motivation for Visual Rumination

Text-rich videos—encompassing screen recordings, technical slides, instructional videos, and broadcast news—pose unique challenges for automated understanding. Pertinent facts may be momentarily observable (e.g., error messages, subtitles, labels, GUI elements), and their spatiotemporal locality renders them inaccessible with coarse, uniform frame sampling.

Conventional VideoQA backbones execute a static perception pipeline: select $K$ frames (often by uniform or heuristic timestamping), encode those with an image-text transformer, and rely on chain-of-thought (CoT) language reasoning that is decoupled from the underlying pixels. In text-rich scenarios, this leads to:

  • Hallucination: Generation of plausible but visually ungrounded answers when the model has not actually seen the relevant content;
  • Omission/Brittleness: Short-lived critical evidence that falls outside the sampled frames cannot be recovered once the frame set has been fixed and passed through the encoder.
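The omission problem can be illustrated with a minimal sketch. The clip length, frame rate, sampling budget, and dialog interval below are hypothetical numbers chosen for illustration, and `uniform_frame_indices` and `covers` are helper names introduced here, not part of Video-R4.

```python
def uniform_frame_indices(num_frames: int, k: int) -> list[int]:
    """Pick k frame indices spread evenly over a video of num_frames frames."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    step = num_frames / k
    # Take the centre of each of the k equal-length segments.
    return [int(step * (i + 0.5)) for i in range(k)]


def covers(indices: list[int], start: int, end: int) -> bool:
    """True if any sampled index falls inside the inclusive interval [start, end]."""
    return any(start <= i <= end for i in indices)


# Hypothetical scenario: a 60 s clip at 30 fps (1800 frames), a budget of
# 16 uniformly sampled frames, and an error dialog visible for 15 frames.
sampled = uniform_frame_indices(num_frames=1800, k=16)
dialog_visible = (1000, 1014)
print(covers(sampled, *dialog_visible))  # False: no sampled frame shows the dialog
```

With these numbers the sampled frames are 112.5 frames apart, so a half-second event can fall entirely between two samples, and no amount of downstream language reasoning can recover it.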

Video-R4 emulates human strategies: repeatedly pausing, zooming, re-reading, and updating an internal hypothesis. It operationalizes "visual rumination": a loop over frame and region selection, region re-encoding, and context update conditioned on the current reasoning state (Tang et al., 21 Nov 2025).

2. System Architecture and Operations

The Video-R4 agent is built on a 7B-parameter decoder-only LLM (e.g., Qwen2.5-VL) augmented with an interface to two visual tools: frame clipping and bounding-box cropping. At iteration $t$, the model observes:

  • The question $q$;
  • Reasoning state $s_t$ (tokenized dialogue plus all prior retrieved regions);
  • The set of video frames and/or spatially localized crops encoded by a frozen vision encoder $f_{\text{enc}}$.

At each step, Video-R4 selects between two atomic actions:

  1. Frame Selection (Clipping): Sample a subset $F_t \subset \{1,\dots,N\}$ of full-frame indices based on the current context, retrieving those frames at full resolution for encoding.
  2. Spatial Zoom (Cropping): Specify a frame $i_t$ and bounding box $b_t = (x, y, w, h)$, fetch the designated pixel region, and encode it for further reasoning.
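The two atomic actions can be sketched as plain data types plus a small executor. This is an illustrative sketch only: the class names, the `execute` function, and the nested-list frame representation are assumptions made here, and the box is assumed to be given in pixels with `(x, y)` as the top-left corner, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Union

# Stand-in for an image: rows of grayscale pixel values (assumption for illustration).
Frame = list[list[int]]


@dataclass(frozen=True)
class ClipAction:
    """Frame selection: request a subset of full frames by index."""
    frame_indices: tuple[int, ...]


@dataclass(frozen=True)
class CropAction:
    """Spatial zoom: request region (x, y, w, h) of one frame."""
    frame_index: int
    box: tuple[int, int, int, int]


Action = Union[ClipAction, CropAction]


def crop(frame: Frame, box: tuple[int, int, int, int]) -> Frame:
    """Return the pixel region inside box, clamped to the frame borders."""
    x, y, w, h = box
    height = len(frame)
    width = len(frame[0]) if frame else 0
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    return [row[x0:x1] for row in frame[y0:y1]]


def execute(action: Action, video: list[Frame]) -> list[Frame]:
    """Run one atomic action and return the retrieved frames or crop."""
    if isinstance(action, ClipAction):
        # Silently skip indices outside the video.
        return [video[i] for i in action.frame_indices if 0 <= i < len(video)]
    return [crop(video[action.frame_index], action.box)]


# Tiny synthetic video: 3 frames of 3x4 pixels, pixel value = frame*100 + row*10 + col.
video = [[[f * 100 + r * 10 + c for c in range(4)] for r in range(3)] for f in range(3)]
print(execute(CropAction(frame_index=1, box=(1, 1, 2, 2)), video))
# [[[111, 112], [121, 122]]]
```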

Upon each action, the retrieved region or frame $r_t$ is re-encoded and used, together with a natural-language summary of the operation (e.g., "Zoomed into frame 8 at $(x, y, w, h)$"), to update the model's state from $s_t$ to $s_{t+1}$. This process iterates for $T$ steps before answer generation.

This explicit action/state cycle departs from a one-shot perception model in two principal respects: (1) perception is conditioned on the current reasoning state and query; (2) visual tokens may be iteratively updated and are not fixed at the outset.
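A minimal sketch of this action/state cycle is given below. It is not the authors' implementation: the policy, tool executor, and encoder are passed in as hypothetical callables, the state is simplified to a list of strings, and the state update is assumed to be a simple append of the operation summary and the newly encoded tokens.

```python
from typing import Callable

# Simplified reasoning state: a list of text entries (assumption for illustration).
State = list[str]


def ruminate(
    question: str,
    policy: Callable[[State], str],
    run_tool: Callable[[str], str],
    encode: Callable[[str], str],
    num_steps: int,
) -> State:
    """Run num_steps select / retrieve / re-encode / update iterations."""
    state: State = [f"Question: {question}"]
    for t in range(num_steps):
        action = policy(state)          # perception is conditioned on the current state
        region = run_tool(action)       # fetch the requested frames or crop
        visual_tokens = encode(region)  # re-encode the retrieved pixels
        # Assumed update rule: append the operation summary and the new tokens.
        state = state + [f"Step {t}: {action}", visual_tokens]
    return state


# Toy stand-ins so the loop can be run end to end.
def toy_policy(state: State) -> str:
    return "clip frames [3, 8]" if len(state) == 1 else "crop frame 8 at (10, 20, 64, 32)"


def toy_tool(action: str) -> str:
    return f"pixels from <{action}>"


def toy_encode(region: str) -> str:
    return f"tokens({region})"


final_state = ruminate("What error code is shown?", toy_policy, toy_tool, toy_encode, num_steps=2)
for entry in final_state:
    print(entry)
```

The point of the sketch is the data flow: each new action is chosen from a state that already contains the visual tokens gathered by earlier actions, so the visual context grows during reasoning instead of being fixed up front.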

3. Datasets and Trajectory Curation

Video-R4 was enabled by the construction of two novel datasets:

  • Video-R4-CoT-17k: 17,000 supervised chain-of-thought trajectories. These include atomic (single-tool) and compositional (multi-tool) rumination sequences, each recovering the minimal set of frames and bounding boxes containing decisive evidence. Supervision templates interleave reasoning and tool calls for cropping and clipping and are constructed with fuzzy matching of OCR and object detections, refined by pretrained QA agents and human inspection. The average trajectory involves 3–5 visual operations (Tang et al., 21 Nov 2025).
  • Video-R4-RL-30k: 30,000 reinforcement learning (RL) trajectories focusing on moderate-difficulty, partially evidence-matched questions. These serve as the substrate for RL-based optimization, where evidence completeness is less guaranteed.
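The fuzzy matching of OCR output against answers mentioned above can be sketched with Python's standard `difflib`. The source does not say which matcher or threshold was used, so the similarity measure, the `0.8` threshold, the `find_evidence` helper, and the sample OCR data are all assumptions for illustration.

```python
from difflib import SequenceMatcher

Box = tuple[int, int, int, int]  # (x, y, w, h)


def fuzzy_score(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_evidence(
    answer: str,
    ocr_by_frame: dict[int, list[tuple[str, Box]]],
    threshold: float = 0.8,
) -> list[tuple[int, Box, float]]:
    """Return (frame, box, score) for OCR detections that approximately match the answer."""
    hits = []
    for frame_idx, detections in ocr_by_frame.items():
        for text, box in detections:
            score = fuzzy_score(answer, text)
            if score >= threshold:
                hits.append((frame_idx, box, score))
    # Best match first.
    return sorted(hits, key=lambda hit: -hit[2])


# Hypothetical OCR output; frame 12 contains a misread character ("O" for "0").
ocr_by_frame = {
    12: [("Error 0x8007O057", (40, 60, 200, 24)), ("OK", (120, 100, 40, 20))],
    30: [("File saved", (10, 10, 90, 18))],
}
hits = find_evidence("Error 0x80070057", ocr_by_frame)
print(hits[0][:2])  # (12, (40, 60, 200, 24))
```

Approximate matching matters here because OCR output on small, transient text is noisy, so an exact string comparison would discard frames that do contain the decisive evidence.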

Each trajectory records a sequence of tool invocations alongside the answer text, giving an explicit, executable annotation of every reasoning episode.
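One way such an executable annotation could be stored is sketched below. The field names (`question`, `steps`, `thought`, `tool`, `args`, `answer`), the example content, and the `validate` helper are hypothetical; the released datasets may use a different schema.

```python
import json

ALLOWED_TOOLS = {"clip", "crop"}

# Hypothetical trajectory record interleaving reasoning text and tool calls.
trajectory = {
    "question": "Which error code appears in the dialog?",
    "steps": [
        {"thought": "The dialog appears briefly; fetch nearby frames.",
         "tool": "clip", "args": {"frames": [40, 41, 42]}},
        {"thought": "The text is small; zoom into the dialog.",
         "tool": "crop", "args": {"frame": 41, "box": [320, 180, 240, 60]}},
    ],
    "answer": "0x80070057",
}


def validate(traj: dict) -> int:
    """Check the record's structure and return the number of visual operations."""
    for step in traj["steps"]:
        if step["tool"] not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {step['tool']}")
        if step["tool"] == "crop" and len(step["args"]["box"]) != 4:
            raise ValueError("crop box must be (x, y, w, h)")
    if not traj["answer"]:
        raise ValueError("trajectory has no answer")
    return len(traj["steps"])


print(validate(trajectory))  # 2
print(json.loads(json.dumps(trajectory)) == trajectory)  # True: round-trips through JSON
```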

4. Training Paradigm: SFT and GRPO

Video-R4 employs a four-stage curriculum combining supervised fine-tuning (SFT) and reinforcement learning with Group Relative Policy Optimization (GRPO).

  1. Deliberate Rumination Practice (DRP-SFT): Trains on atomic trajectories (one tool per sample), optimizing token-level cross-entropy for action prediction.
  2. DRP-Reinforcement (RL$_d$): Initializes from DRP-SFT and applies GRPO on half the RL dataset with a reward comprising correctness and curiosity factors.
  3. Compositional Rumination Practice (CRP-SFT): Fine-tunes on multi-tool, compositional trajectories.
  4. CRP-Reinforcement (RL$_c$): Applies GRPO to the remaining RL data, optimizing for correctness, diversity, representativeness, and curiosity in sampled rollouts.
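The four stages above can be written down as a simple schedule. This is a configuration sketch only: the `Stage` type and `build_curriculum` function are introduced here for illustration, the atomic/compositional split of the 17k supervised trajectories is a made-up example (the source does not state it), and how the RL data is halved is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    method: str       # "SFT" or "GRPO"
    data: str         # description of the data used in this stage
    num_samples: int


def build_curriculum(num_atomic: int, num_compositional: int, num_rl: int) -> list[Stage]:
    """Order the four stages: atomic SFT, RL, compositional SFT, RL."""
    first_half = num_rl // 2
    return [
        Stage("DRP-SFT", "SFT", "atomic (single-tool) trajectories", num_atomic),
        Stage("DRP-RL", "GRPO", "first half of the RL data", first_half),
        Stage("CRP-SFT", "SFT", "compositional (multi-tool) trajectories", num_compositional),
        Stage("CRP-RL", "GRPO", "remaining RL data", num_rl - first_half),
    ]


# Hypothetical split of the 17k SFT trajectories; 30k RL samples as in Video-R4-RL-30k.
stages = build_curriculum(num_atomic=7_000, num_compositional=10_000, num_rl=30_000)
for stage in stages:
    print(f"{stage.name:8s} {stage.method:5s} {stage.num_samples:6d}  {stage.data}")
```

Alternating SFT and RL this way means each RL stage starts from a policy that has already seen demonstrations of the tool-use pattern it is then asked to optimize.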

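The reward structure is:

$$R = R_{\text{acc}} + \lambda_{\text{div}} R_{\text{div}} + \lambda_{\text{rep}} R_{\text{rep}} + \lambda_{\text{cur}} R_{\text{cur}}$$

with $R_{\text{acc}}$ for answer correctness, $R_{\text{div}}$ for trajectory diversity, $R_{\text{rep}}$ for coverage of salient evidence frames, and $R_{\text{cur}}$ disincentivizing passivity; the $\lambda$ coefficients weight the auxiliary terms. GRPO computes group-normalized advantages and applies clipped policy optimization with KL regularization against a reference policy, ensuring stable RL convergence.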
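The reward combination and the core GRPO quantities can be sketched in a few lines of standard-library Python. Several things here are assumptions rather than details from the source: the four reward components are assumed to combine as a weighted sum, the weight values are placeholders, and the KL penalty against the reference policy is omitted for brevity. The group-normalized advantage and clipped surrogate follow the usual GRPO/PPO formulation.

```python
from statistics import mean, pstdev


def total_reward(correct: float, diversity: float, representativeness: float,
                 curiosity: float, w_div: float = 0.1, w_rep: float = 0.1,
                 w_cur: float = 0.1) -> float:
    """Weighted sum of reward components (assumed form; weights are placeholders)."""
    return correct + w_div * diversity + w_rep * representativeness + w_cur * curiosity


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts for the same question; two answer correctly, two do not.
rewards = [total_reward(1.0, 0.0, 0.0, 0.0), total_reward(0.0, 0.0, 0.0, 0.0),
           total_reward(1.0, 0.0, 0.0, 0.0), total_reward(0.0, 0.0, 0.0, 0.0)]
advantages = group_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, 1.0, -1.0]
print(clipped_term(ratio=1.5, advantage=1.0))  # 1.2: the gain from a large ratio is capped
```

Because advantages are computed relative to other rollouts for the same question, no learned value function is needed; a rollout is rewarded for being better than its siblings, not for its absolute score.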

5. Empirical Results and Qualitative Analysis

Video-R4-7B achieves state-of-the-art performance on M4-ViteVQA, exceeding the prior system Pixel-Reasoner in both accuracy and Average Normalized Levenshtein Similarity (ANLS) on every split. Each cell below reports accuracy / ANLS:

| Model | Task1-Split1 | Task1-Split2 | Task2 |
| Pixel-Reasoner | 52.91 / 61.44 | 48.88 / 58.23 | 58.97 / 65.32 |
| Video-R4-7B | 56.17 / 65.22 | 52.69 / 61.89 | 64.21 / 69.99 |
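For readers unfamiliar with ANLS, the sketch below follows its standard definition from the scene-text and document VQA literature: each prediction scores one minus its normalized edit distance to the closest reference answer, and scores whose normalized distance reaches the threshold (commonly 0.5) are set to zero. The exact text normalization used by the M4-ViteVQA evaluation is not given in the source; lowercasing and trimming whitespace are assumptions here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity to any reference."""
    scores = []
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


print(anls(["0x80070057"], [["0x80070057"]]))  # 1.0: exact match
print(anls(["abc"], [["xyz"]]))                # 0.0: distance above the threshold
```

Unlike exact-match accuracy, ANLS gives partial credit for near-misses such as a single misread character, which is why the two numbers in each table cell differ.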

Without further tuning, Video-R4 generalizes to MP-DocVQA (53.21% EM, 62.22% ANLS) and SlideVQA (43.0% EM, 52.2% F1), outperforming baseline multimodal LLMs. On general VideoQA (MVBench, Video-MME, Video-MMMU), it achieves competitive or best-in-class accuracy among 7–8B parameter models (Tang et al., 21 Nov 2025).

Ablation experiments indicate that the compositional RL stage (RL$_c$) and the diversity/representativeness rewards are both critical: removing either lowers accuracy by 2–4%. Qualitative case studies show that Video-R4's iterative retrieval corrects hallucinations common in single-pass models. For example, instead of guessing an error code, it locates the precise frame, crops to the dialog, reads the pixel content, and verifies the result by revisiting alternate frames.

6. Limitations and Future Directions

Key limitations identified include:

  • OCR Dependency: The system currently hinges on upstream OCR and object detection quality, constraining evidence mining and bounding accuracy.
  • Restricted Action Set: Only frame clipping and spatial cropping are supported; richer tools (e.g., audio reasoning, multi-object temporal tracking) remain unexplored.
  • Computational Cost: The iterative rumination loop incurs additional vision encoder calls and elongates inference time relative to single-pass baselines.
  • Model Scale and Domain: Results are reported using a 7B backbone and text-heavy benchmarks; scaling to larger LMMs or domain-generalized tasks is open for future evaluation.

Promising future research avenues include integrating end-to-end OCR fine-tuning, extending the toolset with more expressive spatiotemporal controllers, scaling to larger architectures, and exploring open-ended reward designs for self-supervised visual rumination (Tang et al., 21 Nov 2025).
