
ViSS-R1: Visual Self-Supervised Reinforcement

Updated 24 November 2025
  • ViSS-R1 is a framework that enhances video reasoning by integrating self-supervised reinforcement learning with visual transformation tasks.
  • It employs Pretext-GRPO to force models to process visual inputs non-trivially, bridging pretext tasks with supervised and RL stages.
  • The unified SFT and RL pipeline results in robust performance, showing improved generalization and resilience against visual perturbations.

ViSS-R1 (Visual Self-Supervised Reinforcement for R1 Tasks) is a framework for robust, visual-centric video reasoning with Multimodal LLMs (MLLMs). It addresses the limitations of traditional R1-based pipelines, which often prioritize text-centric cues, insufficiently leverage the spatio-temporal information in videos, and are consequently susceptible to shortcut learning and hallucination. By integrating a self-supervised reinforcement learning algorithm, Pretext-GRPO, and fusing pretext identification tasks into both the supervised fine-tuning and reinforcement learning phases, ViSS-R1 compels models to non-trivially process visual stimuli and demonstrate explicit transformation reasoning. The result is improved video understanding and generalization to transformed or perturbed inputs (Fang et al., 17 Nov 2025).

1. Architectural Foundation and Motivation

The prevailing R1 paradigm for MLLM video reasoning comprises two sequential stages: Supervised Fine-Tuning (SFT) on chain-of-thought (CoT) annotated video–question pairs, followed by RL-based post-training (e.g., PPO, GRPO) on genuine user queries with sparse, typically text-based, reward signals. In this context, models are prone to overfit on weak visual signals, exploiting statistical biases rather than extracting meaningful temporal or spatial patterns. This often results in shortcut learning (e.g., reliance on single frames) and hallucination.

ViSS-R1 introduces two core advances:

  • Pretext-GRPO: An intermediate self-supervised RL stage where the model receives synthetic visual transformations and must identify them via multi-choice questions (MCQs), warming up visual representations.
  • Fully Integrated ViSS-R1: A unified training protocol where both pretext and user queries are handled simultaneously during SFT and RL, obligating the model to parse and reason over transformed inputs and provide explicitly tagged outputs—including identified transformations, reconstruction CoT, and final answers.

This dual mechanism reinforces visual-centric reasoning by requiring the model to understand and invert transformations as a prerequisite to task success (Fang et al., 17 Nov 2025).

2. Pretext-GRPO Algorithmic Structure

Pretext Task Construction

Pretext-GRPO leverages task families suited to both images and videos:

  • Image-based Tasks:
    • Rotation (0°, 90°, 180°, 270°) — 4-way MCQ
    • Flip (none, horizontal, vertical) — 3-way MCQ
    • Puzzle (swap two of 4 patches) — 6-way MCQ
  • Video-based Tasks:
    • 3D-Rotation — 4-way MCQ
    • Reverse (forward/backward) — 2-way MCQ
    • Shuffle (swap two out of four temporal clips) — 6-way MCQ

For each sample, the model is prompted with a transformed input $Tr(V)$ and a corresponding multiple-choice pretext query $Q_p$. The policy $\pi_\theta$ must select the correct transformation.
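As a concrete illustration, the video-level pretext samples could be constructed roughly as follows. This is a minimal sketch under assumed conventions (frames held as a list of arrays, MCQ labels encoded as strings); the function names and option encodings are illustrative, not taken from the paper's implementation.

```python
import random

import numpy as np

# Illustrative construction of two video pretext tasks from the list above;
# names and label encodings are assumptions, not the paper's code.
REVERSE_OPTIONS = ["forward", "backward"]                     # 2-way MCQ
SHUFFLE_OPTIONS = ["1-2", "1-3", "1-4", "2-3", "2-4", "3-4"]  # 6-way MCQ: which clip pair was swapped

def make_reverse_sample(frames: list[np.ndarray]):
    """Optionally play the video backwards; the model must identify the direction."""
    label = random.choice(REVERSE_OPTIONS)
    transformed = list(frames) if label == "forward" else list(frames)[::-1]
    return transformed, label, REVERSE_OPTIONS

def make_shuffle_sample(frames: list[np.ndarray]):
    """Split the video into four temporal clips, swap one random pair, and
    return (transformed frames, ground-truth MCQ label, answer options)."""
    n = len(frames)
    bounds = [0, n // 4, n // 2, 3 * n // 4, n]
    clips = [frames[b:e] for b, e in zip(bounds[:-1], bounds[1:])]
    label = random.choice(SHUFFLE_OPTIONS)
    i, j = (int(c) - 1 for c in label.split("-"))
    clips[i], clips[j] = clips[j], clips[i]
    transformed = [f for clip in clips for f in clip]
    return transformed, label, SHUFFLE_OPTIONS
```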

Reinforcement Objective

Each sample receives a binary reward:

$$r^p = \begin{cases} 1, & \text{if } o = \text{ground-truth transformation} \\ 0, & \text{otherwise} \end{cases}$$

The RL objective is to maximize the expected pretext reward $E_{V, Q_p, \pi_\theta}[R^p]$.

GRPO Operationalization

As a variant of PPO, GRPO dispenses with a value network and optimizes:

$$\mathcal{J}_{GRPO}(\theta) = E_{q,\{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon) A_i\right) - \beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref}) \right]$$

where groupwise sampling, clipped ratios, and advantage normalization are maintained. For Pretext-GRPO, $(Tr(V), Q_p)$ replaces the "real" question, and the reward is the binary pretext reward. This stage functions as a "warm-up," facilitating visual-centric self-supervision prior to user-oriented RL (Fang et al., 17 Nov 2025).
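A schematic PyTorch sketch of this objective is given below, assuming per-completion log-probabilities have already been gathered (sequence-level rather than per-token, for brevity); the KL term uses a k3-style estimator common in GRPO implementations. This is not the authors' code.

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G,) log pi_theta(o_i | input) for the G sampled completions
              logp_old: torch.Tensor,   # (G,) log pi_theta_old(o_i | input), detached
              logp_ref: torch.Tensor,   # (G,) log pi_ref(o_i | input), for the KL penalty
              rewards: torch.Tensor,    # (G,) scalar reward per completion (binary pretext reward here)
              clip_eps: float = 0.2,
              beta: float = 0.1) -> torch.Tensor:
    """Group-relative policy optimization: advantages are rewards normalized
    within the sampled group, so no value network is required."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    # k3 estimator of KL(pi_theta || pi_ref): exp(d) - d - 1 with d = log pi_ref - log pi_theta.
    delta = logp_ref - logp_new
    kl = torch.exp(delta) - delta - 1.0

    # Maximize the clipped surrogate minus the KL penalty -> minimize the negative.
    return -(surrogate - beta * kl).mean()
```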

3. Unified Training and Mathematical Formalism

SFT Stage

Supervised learning uses teacher-forced outputs with explicit transformational tagging:

$$\mathcal{L}_{SFT} = -\, E_{(Tr(V), Q_p, Q),\, o^*} \left[ \sum_t \log \pi_\theta\left(o^*_t \mid o^*_{<t}, Tr(V), Q_p, Q\right) \right]$$

where $o^*$ represents the teacher-generated chain-of-thought, explicitly tagged with transformation, reasoning, and answer spans.
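In practice this reduces to standard masked cross-entropy over the tagged target tokens. The sketch below assumes logits already aligned with the target sequence (shifting handled upstream) and is illustrative only.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, target_ids: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Teacher-forced cross-entropy over the tagged target o*.

    logits:     (B, T, V) model outputs conditioned on [Tr(V), Q_p, Q] and previous target tokens
    target_ids: (B, T)    token ids of o* (transformation tag, reasoning span, answer)
    loss_mask:  (B, T)    1.0 on o* tokens, 0.0 on prompt and padding positions
    """
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         target_ids.reshape(-1),
                         reduction="none").reshape_as(loss_mask)
    return (ce * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)
```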

RL Stage

During RL, outputs are decomposed into a transformation prediction ($o_i^t$) and a task answer ($o_i^a$). The importance weight extends as:

$$\rho_i = \frac{\pi_\theta(o_i^t, o_i^a \mid Tr(V), Q_p, Q)}{\pi_{\theta_{old}}(o_i^t, o_i^a \mid Tr(V), Q_p, Q)}$$

Reward components per sample:

  • $R_t$: Transformation accuracy, $0.5$ if correct, $0$ otherwise.
  • $R_a$: Task-specific scoring (e.g., Exact Match, ROUGE, WER).
  • $R_f$: Formatting bonus, $1$ if all requisite output tags are present, $0$ otherwise.

Total reward:

$$R = R_t + R_a + R_f$$

Final objective:

$$\mathcal{J}_{ViSS\text{-}R1}(\theta) = E \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left(\rho_i A_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon) A_i\right) - \beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref}) \right]$$

with $A_i = (R_i - \mu_R)/\sigma_R$.

This design mandates simultaneous and explicit optimization of transformation identification, answer correctness, and output format, tying visual feature acquisition directly to downstream performance.
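A compact sketch of how this composite reward could be assembled from a sampled completion is shown below; the tag-parsing helper and the exact-match stand-in for the task score are assumptions for illustration, since the paper uses task-specific metrics such as EM, ROUGE, or WER.

```python
import re

def extract_tag(text: str, tag: str):
    """Return the content of <tag>...</tag>, or None if the span is missing."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return m.group(1) if m else None

def compute_reward(output: str, gt_transform: str, gt_answer: str) -> float:
    """R = R_t + R_a + R_f as defined above (illustrative scoring only)."""
    transform = extract_tag(output, "transform")
    answer = extract_tag(output, "answer")

    r_t = 0.5 if transform is not None and transform.strip() == gt_transform else 0.0
    # Task-specific scoring; exact match stands in for EM / ROUGE / WER here.
    r_a = 1.0 if answer is not None and answer.strip() == gt_answer else 0.0
    # Formatting bonus: the requisite tag spans must all be present.
    r_f = 1.0 if transform is not None and answer is not None else 0.0
    return r_t + r_a + r_f
```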

4. Mechanisms of Visual-Centric Reasoning

During both SFT and RL phases, the model receives a transformed video, the associated pretext MCQ, and the true user query. The required output is a structured sequence:

  • <transform>...</transform>: Predicted transformation label
  • A reasoning span: CoT reflecting the attempted reconstruction (e.g., recovering the original order/spatial layout)
  • <answer>...</answer>: User-query answer

This structure enforces deep parsing of spatiotemporal information, as the model is evaluated not only on the user query but also on its ability to identify and invert synthetic visual perturbations. The architecture thereby encourages generalization: at inference, $Q_p$ is omitted, but the decoder leverages its visual representations for raw (untransformed) videos.
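To make the training-versus-inference difference concrete, a prompt might be assembled roughly as below; the wording, the video placeholder, and the instruction line are hypothetical, since the source only names the <transform> and <answer> tags.

```python
def build_prompt(user_question: str, pretext_question: str | None = None) -> str:
    """During SFT/RL the pretext MCQ Q_p accompanies the user query Q;
    at inference time Q_p is omitted and the raw video is supplied."""
    parts = ["<video>"]  # placeholder for the (possibly transformed) visual input
    if pretext_question is not None:
        parts.append(f"Pretext question: {pretext_question}")
    parts.append(f"Question: {user_question}")
    parts.append("Respond with <transform>...</transform>, your reasoning, and <answer>...</answer>.")
    return "\n".join(parts)

# Training-time prompt (pretext MCQ present) vs. inference-time prompt (omitted).
train_prompt = build_prompt("What happens after the ball is thrown?",
                            "Which two clips were swapped? (A) 1-2 (B) 1-3 (C) 1-4 (D) 2-3 (E) 2-4 (F) 3-4")
infer_prompt = build_prompt("What happens after the ball is thrown?")
```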

A plausible implication is that by jointly optimizing transformation recognition and downstream question answering within a single encoder-decoder network, ViSS-R1 achieves more robust, generalizable visual reasoning capabilities than methods focused solely on text- or image-based pretraining.

5. Training Protocol and Implementation

Key training details include:

  • Base model: Qwen2.5-VL-7B
  • Pretext-GRPO warm-up: 500 RL steps ($G \geq 4$ candidates/group)
  • Full RL (Pretext-GRPO+ for ablation): 1,000 steps post-warmup
  • Integrated ViSS-R1:
    • Re-prompt 72B teacher for tagged CoTs on Video-R1 CoT-165k and Video-R1-260k datasets
    • SFT with standard maximum likelihood on tagged CoT sequences, epochs until convergence
    • RL for 1,000 steps using $R = R_t + R_a + R_f$, with $\beta \approx 0.1$, $\epsilon = 0.2$
  • Frames: 32 per video, with a per-frame resolution cap of $256 \times 28 \times 28$ pixels
  • Hardware: 8 × NVIDIA A800 (80 GB); implementation atop Open-R1
  • Inference: top-p $= 0.001$, temperature $= 0.01$

The explicit SFT and RL schedule, with fine-tuned reward scaling ($R_t = 0.5$), ensures balanced emphasis between pretext skill and user-query accuracy.
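For reference, the reported setup could be summarized in a configuration like the following; the values mirror the list above, while the key names themselves are illustrative rather than taken from released code.

```python
# Illustrative configuration mirroring the reported training protocol;
# key names are hypothetical, values follow the list above.
VISS_R1_CONFIG = {
    "base_model": "Qwen2.5-VL-7B",
    "pretext_grpo_warmup_steps": 500,
    "rl_steps": 1000,
    "group_size": 4,                 # G >= 4 sampled completions per prompt
    "kl_beta": 0.1,
    "clip_eps": 0.2,
    "reward_weights": {"transform": 0.5, "answer": 1.0, "format": 1.0},
    "frames_per_video": 32,
    "max_pixels_per_frame": 256 * 28 * 28,
    "inference": {"top_p": 0.001, "temperature": 0.01},
}
```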

6. Experimental Evidence and Robustness

The empirical evaluation spans six video reasoning/understanding benchmarks, with results summarized below:

| Model | VSI-Bench | VideoMMMU | MMVU | MVBench | TempCompass | VideoMME |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 30.1 | 48.1 | 60.0 | 59.0 | 72.6 | 56.6 |
| Video-R1 | 35.8 | 52.3 | 63.8 | 63.9 | 73.2 | 59.3 |
| VideoRFT | 36.8 | 51.1 | 68.5 | 62.1 | 73.7 | 59.8 |
| Pretext-GRPO+ | 39.2 | 53.9 | 65.3 | 66.0 | 73.9 | 60.1 |
| ViSS-R1 | 37.3 | 51.7 | 66.1 | 65.6 | 75.3 | 60.5 |

Key findings:

  • Pretext-GRPO+ achieves new SOTA on VSI-Bench (+3.4) and VideoMMMU (+1.6) relative to Video-R1.
  • Integrated ViSS-R1 further improves on temporal benchmarks (TempCompass +2.1), highlighting its enhanced temporal reasoning.
  • Video-only pretext tasks yield richer spatio-temporal features compared to image-only.
  • The SFT→RL sequence (ViSS-R1 proper) outperforms SFT-only (susceptible to overfitting) and RL-only (lacks structural formatting).
  • The model shows high robustness; on augmented inputs (e.g., Rotation, Reversal, Shuffle), prior models degrade −3 to −6 points, while ViSS-R1 loses only 1–2 points.
  • Qualitatively, ViSS-R1 outputs visual-centric CoT: first assessing global video structure (object layouts, order), then reconstructing transformations before answering, mirroring human multistep reasoning.

7. Context Within R1-Based and Spatial Reasoning Frameworks

ViSS-R1 extends the R1 methodology originally developed for language modeling and subsequently adapted to MLLM settings. Unlike contemporary approaches such as SVQA-R1 (Wang et al., 2 Jun 2025), which emphasize view-consistent reward optimization for static images using spatial perturbations (e.g., mirroring) and view-consistent semantic rewards, ViSS-R1's novelty lies in self-supervised video transformation tasks within the RL loop, directly binding vision-centric skills to multi-stage reasoning and user-task performance.

A plausible implication is that incorporating explicit pretext transformation handling at both SFT and RL stages, as in ViSS-R1, offers a general strategy for improving multimodal model robustness—not only for video but also potentially for other dynamic modalities.


ViSS-R1 advances video-centric multimodal reasoning by systematically embedding visual transformation identification and reconstruction into both supervised and reinforcement learning pipelines. It compels MLLMs to move beyond text-centric shortcut strategies, yielding superior empirical results and resilience to input perturbations through rigorous, vision-grounded chain-of-thought reasoning (Fang et al., 17 Nov 2025).
