Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 78 tok/s

Gemini 2.5 Pro 56 tok/s Pro

GPT-5 Medium 33 tok/s Pro

GPT-5 High 21 tok/s Pro

GPT-4o 107 tok/s Pro

Kimi K2 196 tok/s Pro

GPT OSS 120B 436 tok/s Pro

Claude Sonnet 4.5 34 tok/s Pro

2000 character limit reached

Visual Jigsaw Post-Training Improves MLLMs (2509.25190v1)

Published 29 Sep 2025 in cs.CV

Abstract: Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal LLMs (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. There exist a few approaches in this direction, however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/

Summary

The paper introduces Visual Jigsaw, a self-supervised post-training method that enhances fine-grained perception in image, video, and 3D modalities.
It employs a jigsaw ordering task with graded rewards to improve spatial, temporal, and depth reasoning without requiring architectural changes.
Empirical results demonstrate notable improvements, including up to +17.11 points on depth ordering tasks and significant gains in both temporal and image benchmarks.

Visual Jigsaw Post-Training for Enhanced Multimodal LLM Perception

Motivation and Context

The paper introduces Visual Jigsaw, a self-supervised post-training framework designed to improve the intrinsic visual understanding capabilities of multimodal LLMs (MLLMs). While recent advances in MLLMs have focused on text-centric reasoning and alignment via reinforcement learning from verifiable rewards (RLVR), the direct enhancement of vision-centric perception has been underexplored. Existing approaches often require architectural modifications or additional generative modules, which complicate integration and scalability. Visual Jigsaw addresses this gap by formulating a lightweight, verifiable ordering task that leverages only the model's existing text output format, enabling seamless post-training across image, video, and 3D modalities.

Figure 1: Visual Jigsaw post-training substantially strengthens fine-grained perception, spatial, compositional, temporal, and geometry-aware understanding across modalities, as shown in radar chart evaluations.

Methodology

Task Formulation

Visual Jigsaw is instantiated as a general ordering problem. For each modality:

Image Jigsaw: An image is partitioned into a $3 \times 3$ grid of non-overlapping patches, shuffled, and the model must predict the correct raster order.
Video Jigsaw: A video is segmented into six temporal clips, shuffled, and the model reconstructs the original chronological order.
3D Jigsaw: Six points with distinct depth values are sampled from an RGB-D image, shuffled and annotated, and the model predicts the correct depth order from nearest to farthest.

The model outputs a permutation of indices in natural language, which is compared against the ground truth. A graded reward function is used: exact matches receive full reward, partial matches are scaled by a discount factor ( $\gamma=0.2$ ), and invalid permutations receive zero reward. Outputs must adhere to a strict format, with reasoning enclosed in > tags and answers in <answer> tags.

Figure 2: Illustration of Visual Jigsaw tasks for images, videos, and 3D data, showing partitioning, shuffling, and the ordering objective.

Training Protocol

Post-training is performed using Group Relative Policy Optimization (GRPO) without KL or entropy regularization. The base model is Qwen2.5-VL-7B-Instruct. Training datasets include 118K COCO images, 100K LLaVA-Video clips, and 300K ScanNet RGB-D samples. Batch sizes and learning rates are modality-specific, and multiple responses are sampled per prompt to encourage exploration.

Empirical Results
Image Jigsaw

Visual Jigsaw post-training yields consistent improvements across fine-grained perception, monocular spatial understanding, and compositional visual understanding benchmarks. Gains over strong baselines (ThinkLite-VL, VL-Cogito, LLaVA-Critic-R1) are observed in MMVP, MMStar, HR-Bench, VSR, Winoground, and others, with improvements up to +6.06 points on fine-grained tasks and +5.90 on spatial reasoning.
Figure 3: Examples of the image jigsaw task, showing shuffled patches and ground-truth raster order.

Video Jigsaw

On video understanding benchmarks (AoTBench, Vinoground, TOMATO, FAVOR-Bench, TUNA-Bench, Video-MME, TempCompass, TVBench, MotionBench, LVBench, VSI-Bench, Video-TT, CVBench), Video Jigsaw consistently outperforms both RL and SFT baselines across all frame settings. Gains are most pronounced on temporal-centric tasks, e.g., AoTBench (+6.15), and cross-video reasoning (CVBench, +3.00).
Figure 4: Example of the video jigsaw task, with shuffled clips and ground-truth chronological order.

3D Jigsaw

3D Jigsaw post-training leads to substantial improvements on SAT-Real, 3DSRBench, ViewSpatial, All-Angles, OminiSpatial, VSI-Bench, SPARBench, and DA-2K. The largest gain is on DA-2K (+17.11), directly related to depth ordering, but improvements are also observed on single-view, multi-view, and egocentric video tasks, indicating generalization of 3D spatial reasoning.
Figure 5: Examples of the 3D jigsaw task, requiring depth ordering of annotated points.

Qualitative Analysis

Qualitative examples demonstrate that models trained with Visual Jigsaw exhibit improved attention to local details, global spatial layouts, and inter-element relationships in images, videos, and 3D scenes.
Figure 6: Qualitative examples on image tasks, showing enhanced fine-grained perception.

Figure 7: Qualitative examples on video tasks, illustrating improved temporal reasoning.

Figure 8: Qualitative examples on 3D tasks, highlighting geometry-aware understanding.

Ablation Studies

SFT vs. RL: RL post-training generalizes better than SFT, which can lead to overfitting and degraded transfer on certain benchmarks.

Task Difficulty: Increasing jigsaw complexity (e.g., $3 \times 3$ vs. $2 \times 2$ grids) provides stronger supervision and larger performance gains. Partial accuracy rewards are critical for learning in difficult setups.

Reasoning Models: Visual Jigsaw post-training on reasoning-oriented MLLMs (e.g., ThinkLite-VL) improves visual perception without sacrificing reasoning ability.

3D Variants: Alternative 3D jigsaw tasks (view-motion matching, BEV-pose matching) do not outperform depth ordering, likely due to limited 3D priors in current MLLMs.

Implementation Considerations

Computational Requirements: The approach is lightweight, requiring no architectural changes or generative modules, and is compatible with text-only MLLMs.

Scalability: While effective at moderate scale, further scaling of data and model size is a promising direction.

Generalization: RL-based post-training with verifiable, self-supervised objectives enables robust transfer to diverse downstream tasks.

Limitations: The method relies on the model's existing visual priors; more complex jigsaw tasks or richer 3D representations may require stronger base models.

Implications and Future Directions

Visual Jigsaw demonstrates that self-supervised, vision-centric post-training can substantially enhance the perceptual grounding of MLLMs across modalities. The approach is general, efficient, and verifiable, making it suitable for large-scale deployment. Future work should explore more complex jigsaw configurations, hybrid spatial-temporal tasks, and integration with models possessing richer 3D priors. The paradigm also motivates broader investigation into self- and weakly-supervised objectives for multimodal model development.

Conclusion

Visual Jigsaw provides a principled, scalable framework for improving vision-centric perception and understanding in MLLMs via self-supervised RL post-training. The method achieves consistent gains across image, video, and 3D modalities, without requiring architectural modifications or generative components. These results establish vision-centric self-supervised tasks as a complementary and effective strategy for advancing multimodal model capabilities.