ViSS-R1: Visual Self-Supervised Reinforcement
- ViSS-R1 is a framework that enhances video reasoning by integrating self-supervised reinforcement learning with visual transformation tasks.
- It employs Pretext-GRPO to force models to process visual inputs non-trivially, bridging pretext tasks with supervised and RL stages.
- The unified SFT and RL pipeline results in robust performance, showing improved generalization and resilience against visual perturbations.
ViSS-R1 (Visual Self-Supervised Reinforcement for R1 Tasks) is a framework for robust, visual-centric video reasoning with Multimodal LLMs (MLLMs). It addresses the limitations of traditional R1-based pipelines, which often prioritize text-centric cues, under-exploit the spatio-temporal information in videos, and are therefore susceptible to shortcut learning and hallucination. By integrating a self-supervised reinforcement learning algorithm, Pretext-GRPO, and fusing pretext identification tasks into both the supervised fine-tuning and reinforcement phases, ViSS-R1 compels models to process visual stimuli non-trivially and to demonstrate explicit transformation reasoning. The result is improved video understanding and generalization to transformed or perturbed inputs (Fang et al., 17 Nov 2025).
1. Architectural Foundation and Motivation
The prevailing R1 paradigm for MLLM video reasoning comprises two sequential stages: Supervised Fine-Tuning (SFT) on chain-of-thought (CoT) annotated video–question pairs, followed by RL-based post-training (e.g., PPO, GRPO) on genuine user queries with sparse, typically text-based, reward signals. In this context, models are prone to overfit on weak visual signals, exploiting statistical biases rather than extracting meaningful temporal or spatial patterns. This often results in shortcut learning (e.g., reliance on single frames) and hallucination.
ViSS-R1 introduces two core advances:
- Pretext-GRPO: An intermediate self-supervised RL stage where the model receives synthetic visual transformations and must identify them via multi-choice questions (MCQs), warming up visual representations.
- Fully Integrated ViSS-R1: A unified training protocol where both pretext and user queries are handled simultaneously during SFT and RL, obligating the model to parse and reason over transformed inputs and provide explicitly tagged outputs—including identified transformations, reconstruction CoT, and final answers.
This dual mechanism reinforces visual-centric reasoning by requiring the model to understand and invert transformations as a prerequisite to task success (Fang et al., 17 Nov 2025).
2. Pretext-GRPO Algorithmic Structure
Pretext Task Construction
Pretext-GRPO leverages task families suited to both images and videos:
- Image-based tasks:
  - Rotation (0°, 90°, 180°, 270°) — 4-way MCQ
  - Flip (none, horizontal, vertical) — 3-way MCQ
  - Puzzle (swap two of four patches) — 6-way MCQ
- Video-based tasks:
  - 3D-Rotation — 4-way MCQ
  - Reverse (forward/backward) — 2-way MCQ
  - Shuffle (swap two of four temporal clips) — 6-way MCQ
For each sample, the model is prompted with the transformed input and a corresponding multiple-choice pretext query; the policy must select the correct transformation.
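To make the construction concrete, the following is a minimal sketch of how the two video-level pretext tasks (Reverse and Shuffle) and their MCQ options might be generated. The function name, question wording, and option ordering are illustrative assumptions rather than the paper's implementation.

```python
import random
import numpy as np

def make_video_pretext_sample(frames: np.ndarray):
    """Apply one random video-level transformation and build its MCQ.

    frames: array of shape (T, H, W, C).
    Returns (transformed_frames, question, options, answer_index).
    """
    task = random.choice(["reverse", "shuffle"])
    if task == "reverse":
        # Reverse: 2-way MCQ (was the clip played forward or backward?).
        direction = random.choice([0, 1])            # 0 = forward, 1 = backward
        out = frames[::-1] if direction else frames
        options = ["forward", "backward"]
        return out, "In which temporal order is the clip shown?", options, direction
    # Shuffle: split into four temporal clips and swap two of them (6 possible swaps).
    clips = list(np.array_split(frames, 4))
    swaps = [(i, j) for i in range(4) for j in range(i + 1, 4)]   # 6 options
    k = random.randrange(len(swaps))
    i, j = swaps[k]
    clips[i], clips[j] = clips[j], clips[i]
    out = np.concatenate(clips, axis=0)
    options = [f"clips {a + 1} and {b + 1} were swapped" for a, b in swaps]
    return out, "Which two temporal clips were swapped?", options, k
```

Image-based tasks (rotation, flip, patch puzzle) can be constructed analogously by transforming single frames instead of the temporal axis.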
Reinforcement Objective
Each sample receives a binary reward:
$r^p = \begin{cases} 1, & \text{if } o \text{ matches the ground-truth transformation} \\ 0, & \text{otherwise} \end{cases}$
The RL objective is to maximize the expected pretext reward $\mathbb{E}_{o \sim \pi_\theta}[\,r^p\,]$ over outputs sampled from the policy.
GRPO Operationalization
As a variant of PPO, GRPO dispenses with a value network and optimizes the group-relative, clipped objective
$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\Big[\tfrac{1}{G}\textstyle\sum_{i=1}^{G} \min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big)\Big], \quad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}, \quad A_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})},$
where groupwise sampling, clipped ratios, and advantage normalization are maintained. For Pretext-GRPO, the pretext MCQ replaces the "real" user question, and the reward is the binary pretext reward $r^p$. This stage functions as a "warm-up," facilitating visual-centric self-supervision prior to user-oriented RL (Fang et al., 17 Nov 2025).
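A compact sketch of how the group-relative advantage and clipped objective could be computed for one group of sampled answers with binary pretext rewards is shown below; tensor shapes, variable names, and the omission of any KL penalty are simplifying assumptions.

```python
import torch

def pretext_grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """logp_new, logp_old: (G,) summed log-probs of the G sampled answers under the
    current and behaviour policies; rewards: (G,) binary pretext rewards r^p."""
    # Group-normalised advantage: no value network, only per-group statistics.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Maximising the clipped objective is equivalent to minimising its negation.
    return -torch.min(unclipped, clipped).mean()

# Toy usage: a group of G = 8 answers, three of which named the correct transformation.
logp_new = torch.randn(8, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(8)
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
loss = pretext_grpo_loss(logp_new, logp_old, rewards)
```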
3. Unified Training and Mathematical Formalism
SFT Stage
Supervised learning uses teacher-forced outputs with explicit transformational tagging, minimizing the standard token-level cross-entropy
$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y^{*})}\Big[\textstyle\sum_{t} \log \pi_\theta\big(y^{*}_{t} \mid x,\, y^{*}_{<t}\big)\Big],$
where $y^{*}$ represents the teacher-generated chain-of-thought, explicitly tagged with transformation, reasoning, and answer spans.
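A minimal sketch of the teacher-forced SFT step follows, assuming a Hugging-Face-style causal model and computing the loss only over the tagged teacher CoT; the helper name and masking convention are assumptions.

```python
import torch
import torch.nn.functional as F

def sft_step(model, input_ids, prompt_len):
    """Next-token cross-entropy over the tagged teacher CoT only.

    input_ids: (B, L) prompt tokens followed by the teacher target y*;
    prompt_len: number of leading prompt tokens excluded from the loss.
    """
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100                      # mask the prompt span
    logits = model(input_ids).logits                   # (B, L, V); HF-style output assumed
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # predict token t+1 from the prefix up to t
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```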
RL Stage
During RL, outputs are decomposed into a transformation prediction and a task answer, and the GRPO importance ratio extends over both segments of the tagged output.
Reward components per sample:
- Transformation reward: $0.5$ if the predicted transformation is correct, $0$ otherwise.
- Accuracy reward: task-specific scoring of the final answer (e.g., Exact Match, ROUGE, WER).
- Format reward: $1$ if all requisite output tags are present, $0$ otherwise.
The total reward is the sum of these components, and the final objective applies the same group-relative, clipped GRPO update as in Pretext-GRPO, with this composite reward replacing the binary pretext reward.
This design mandates simultaneous and explicit optimization of transformation identification, answer correctness, and output format, tying visual feature acquisition directly to downstream performance.
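The composite reward could be scored roughly as in the sketch below; the tag-parsing logic and the use of case-insensitive exact match for the accuracy term are assumptions, while the $0.5$ transformation bonus and binary format bonus follow the components listed above.

```python
import re

def viss_r1_reward(output: str, gt_transform: str, gt_answer: str) -> float:
    """Composite reward: transformation bonus + task accuracy + format bonus."""
    def span(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", output, re.S)
        return m.group(1).strip() if m else None

    pred_transform = span("transform")
    pred_answer = span("answer")

    r_trans = 0.5 if pred_transform == gt_transform else 0.0                       # transformation accuracy
    r_acc = 1.0 if (pred_answer or "").lower() == gt_answer.lower() else 0.0       # exact-match stand-in
    r_fmt = 1.0 if pred_transform is not None and pred_answer is not None else 0.0 # all tags present
    return r_trans + r_acc + r_fmt
```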
4. Mechanisms of Visual-Centric Reasoning
During both SFT and RL phases, the model receives a transformed video, the associated pretext MCQ, and the true user query. The required output is a structured sequence:
- <transform>...</transform>: the predicted transformation label
- A chain-of-thought span reflecting the attempted reconstruction (e.g., the original temporal order or spatial layout)
- <answer>...</answer>: the user-query answer
This structure enforces deep parsing of spatiotemporal information, as the model is evaluated not only on the user query but also on its ability to identify and invert synthetic visual perturbations. The architecture thereby encourages generalization: at inference, the pretext transformation and its MCQ are omitted, and the decoder applies the visual representations it has learned to raw (untransformed) videos.
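A hypothetical construction of one unified training example is sketched below, showing how the pretext MCQ, user query, and tagged target could be assembled; the prompt wording and the unlabeled reasoning span are assumptions, and only the <transform>/<answer> tag structure follows the description above.

```python
def build_unified_sample(pretext_mcq, user_query, gt_transform, teacher_cot, gt_answer):
    """Assemble one (prompt, target) pair for the unified SFT/RL stage."""
    prompt = (
        "The input video has been synthetically transformed.\n"
        f"Pretext question: {pretext_mcq}\n"
        f"User question: {user_query}\n"
        "First identify the transformation, then reason about the original video, "
        "then answer the user question."
    )
    target = (
        f"<transform>{gt_transform}</transform>\n"
        f"{teacher_cot}\n"
        f"<answer>{gt_answer}</answer>"
    )
    return prompt, target
```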
A plausible implication is that by jointly optimizing transformation recognition and downstream question answering within a single encoder-decoder network, ViSS-R1 achieves more robust, generalizable visual reasoning capabilities than methods focused solely on text- or image-based pretraining.
5. Training Protocol and Implementation
Key training details include:
- Base model: Qwen2.5-VL-7B
- Pretext-GRPO warm-up: 500 RL steps with groupwise candidate sampling
- Full RL (Pretext-GRPO+ for ablation): 1,000 steps post-warm-up
- Integrated ViSS-R1: unified SFT followed by RL over mixed pretext and user queries
- Frames: 32 per video, with a capped input resolution
- Hardware: 8 × NVIDIA A800 (80 GB); implementation atop Open-R1
The explicit SFT-then-RL schedule, together with tuned reward scaling, ensures balanced emphasis between pretext-task skill and user-query accuracy.
6. Experimental Evidence and Robustness
The empirical evaluation spans six video reasoning/understanding benchmarks, with results summarized below:
| Model | VSI-Bench | VideoMMMU | MMVU | MVBench | TempCompass | VideoMME |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 30.1 | 48.1 | 60.0 | 59.0 | 72.6 | 56.6 |
| Video-R1 | 35.8 | 52.3 | 63.8 | 63.9 | 73.2 | 59.3 |
| VideoRFT | 36.8 | 51.1 | 68.5 | 62.1 | 73.7 | 59.8 |
| Pretext-GRPO+ | 39.2 | 53.9 | 65.3 | 66.0 | 73.9 | 60.1 |
| ViSS-R1 | 37.3 | 51.7 | 66.1 | 65.6 | 75.3 | 60.5 |
Key findings:
- Pretext-GRPO+ achieves new SOTA on VSI-Bench (+3.4) and VideoMMMU (+1.6) relative to Video-R1.
- Integrated ViSS-R1 further improves on temporal benchmarks (TempCompass +2.1), highlighting its enhanced temporal reasoning.
- Video-only pretext tasks yield richer spatio-temporal features compared to image-only.
- The SFT→RL sequence (ViSS-R1 proper) outperforms SFT-only (susceptible to overfitting) and RL-only (lacks structural formatting).
- The model shows high robustness; on augmented inputs (e.g., Rotation, Reversal, Shuffle), prior models degrade −3 to −6 points, while ViSS-R1 loses only 1–2 points.
- Qualitatively, ViSS-R1 outputs visual-centric CoT: first assessing global video structure (object layouts, order), then reconstructing transformations before answering, mirroring human multistep reasoning.
7. Context Within R1-Based and Spatial Reasoning Frameworks
ViSS-R1 extends the R1 methodology originally developed for language modeling and since adapted to MLLM settings. Unlike contemporary approaches such as SVQA-R1 (Wang et al., 2 Jun 2025), which emphasize view-consistent reward optimization for static images using spatial perturbations (e.g., mirroring) and view-consistent semantic rewards, ViSS-R1's novelty lies in self-supervised video transformation tasks within the RL loop, directly binding vision-centric skills to multi-stage reasoning and user-task performance.
A plausible implication is that incorporating explicit pretext transformation handling at both SFT and RL stages, as in ViSS-R1, offers a general strategy for improving multimodal model robustness—not only for video but also potentially for other dynamic modalities.
ViSS-R1 advances video-centric multimodal reasoning by systematically embedding visual transformation identification and reconstruction into both supervised and reinforcement learning pipelines. It compels MLLMs to move beyond text-centric shortcut strategies, yielding superior empirical results and resilience to input perturbations through rigorous, vision-grounded chain-of-thought reasoning (Fang et al., 17 Nov 2025).