Spatial-SSRL: Self-Supervised Spatial RL
- Spatial-SSRL is a self-supervised reinforcement learning paradigm that replaces costly annotations with verifiable spatial pretext tasks from RGB or RGB-D images.
- It employs a brief supervised fine-tuning phase followed by Group Relative Policy Optimization (GRPO), with deterministic rewards computed directly from the task construction rather than from annotators or external tools.
- Empirical results show significant accuracy improvements across 2D and 3D tasks, highlighting a scalable, domain-general approach to LVLM training.
Spatial-SSRL is a self-supervised reinforcement learning paradigm developed to enhance the spatial reasoning abilities of large vision-language models (LVLMs). It replaces costly annotated supervision and the specialized tools of prior RLVR pipelines with verifiable signals derived directly from RGB or RGB-D images, making spatially grounded training scalable and domain-general. The framework automatically generates five intrinsically verifiable spatial pretext tasks for RL optimization, enabling systematic improvement of both two-dimensional and three-dimensional spatial understanding.
1. Methodological Foundations
Spatial-SSRL defines a pipeline in which self-supervised learning pretext tasks are reformulated so their solutions are algorithmically verifiable. These pretext tasks provide instant ground-truth labels without human or model-based annotation, circumventing bottlenecks of prior RLVR paradigms.
The training protocol consists first of a short supervised fine-tuning (SFT) “cold-start” phase (using a small fraction of the data) to stabilize output formatting, followed by reinforcement learning (RL) with Group Relative Policy Optimization (GRPO) on deterministic, exactly computable rewards. RL training employs these self-supervised tasks as prompts, with strictly computable answer sets and reasoning formats yielding consistent reward signals.
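GRPO dispenses with a learned value critic and instead normalizes rewards within a group of responses sampled for the same prompt. A minimal sketch of this group-relative advantage computation is shown below; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Compute GRPO-style advantages for one prompt.

    rewards: scalar rewards for G responses sampled from the current
    policy for the same prompt. Each response's advantage is its reward
    standardized against the group mean and standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to one spatial pretext prompt,
# scored by the deterministic verifier (1 = correct, 0 = wrong).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```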
2. Self-Supervised Spatial Task Suite
Spatial-SSRL formulates five tasks which collectively capture a spectrum of spatial reasoning skills. Each task is defined so that the sample construction yields a deterministic solution vector, ensuring verifiable reward computation.
A. Depth-Free (RGB-Only)
- Shuffled Patch Reordering: The image is split into an $n \times n$ grid and the patches are randomly permuted by $\pi$; the ground truth is the inverse permutation $\pi^{-1}$ that restores the original layout (see the construction sketch after this list).
- Flipped Patch Recognition: A randomly selected patch is horizontally or vertically flipped; the answer is the patch index and flip orientation, with the flips given by $(x, y) \mapsto (W - 1 - x,\, y)$ (horizontal) and $(x, y) \mapsto (x,\, H - 1 - y)$ (vertical) for a $W \times H$ patch.
- Cropped Patch Inpainting: A patch is cropped and masked; among four candidates (true patch, rotated ground-truth, internal subregion, neighboring patch), select which correctly fills the region.
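The depth-free tasks can be generated with a few lines of array manipulation. Below is a minimal sketch of the shuffled-patch-reordering construction, assuming a grid that evenly divides the image; the function name and grid size are illustrative, not taken from the paper.

```python
import numpy as np

def make_patch_reordering_task(image, n=3, rng=None):
    """Split an image into an n x n grid, shuffle the patches, and return
    the shuffled image together with the verifiable ground truth: the
    inverse permutation that restores the original patch order."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[0] // n * n, image.shape[1] // n * n
    ph, pw = h // n, w // n
    patches = [image[i*ph:(i+1)*ph, j*pw:(j+1)*pw]
               for i in range(n) for j in range(n)]
    perm = rng.permutation(n * n)          # shuffled slot k shows original patch perm[k]
    shuffled = np.zeros_like(image[:h, :w])
    for k, p in enumerate(perm):
        i, j = divmod(k, n)
        shuffled[i*ph:(i+1)*ph, j*pw:(j+1)*pw] = patches[p]
    inverse = np.argsort(perm)             # exact answer: the slot holding each original patch
    return shuffled, inverse.tolist()

# The reward check is an exact sequence match against `inverse`.
```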
B. Depth-Based (RGB-D)
- Regional Depth Ordering: Three well-separated spatial regions are presented in a shuffled order; the task is to order the regions by increasing depth. The ground-truth construction enforces a clear margin between regional mean depths, e.g. $|\bar{d}_i - \bar{d}_j| > \delta$ for all $i \neq j$, and low depth variation within each region, so the correct ordering is unique and exactly computable.
- Relative 3D Position Prediction: Given two image locations $A$ and $B$ and a reference orientation, the task is to determine $B$'s position relative to $A$ in the reference frame. Each location is back-projected into camera coordinates using the depth map and intrinsics $K$, $P = d\, K^{-1} [u, v, 1]^\top$, and the displacement is expressed in the reference frame as $\Delta = R^\top (P_B - P_A)$. The categorical answer (e.g. left/right, above/below, in front of/behind) is assigned from the signs and magnitudes of the components of $\Delta$ (a geometric sketch follows this list).
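Both depth-based tasks reduce to deterministic geometry over the depth map. Below is a minimal sketch of the relative-3D-position construction under standard pinhole-camera assumptions; the intrinsics, axis conventions, and label thresholds are illustrative placeholders rather than the paper's exact settings.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with metric depth into camera coordinates."""
    return depth * np.linalg.inv(K) @ np.array([u, v, 1.0])

def relative_position_label(pA, pB, R, margin=0.05):
    """Express B relative to A in a reference frame given by rotation R,
    then map the signed displacement components to categorical labels.
    Axis conventions and the margin are illustrative assumptions."""
    d = R.T @ (pB - pA)                    # displacement in the reference frame
    labels = []
    if abs(d[0]) > margin:
        labels.append("right" if d[0] > 0 else "left")
    if abs(d[1]) > margin:
        labels.append("below" if d[1] > 0 else "above")   # image y points down
    if abs(d[2]) > margin:
        labels.append("behind" if d[2] > 0 else "in front")
    return labels or ["same position"]

# Example with identity orientation and a toy intrinsics matrix.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pA = backproject(100, 200, 2.0, K)
pB = backproject(400, 200, 3.0, K)
print(relative_position_label(pA, pB, np.eye(3)))
```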
3. Reinforcement Learning Framework and Reward Structure
Spatial-SSRL utilizes Group Relative Policy Optimization (GRPO), a policy-gradient technique well suited to deterministic, verifiable rewards. The reward for each QA prompt comprises two components, $r_{\text{acc}}$ (answer correctness) and $r_{\text{fmt}}$ (format compliance), combined linearly as $r = r_{\text{acc}} + \lambda\, r_{\text{fmt}}$.
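A minimal sketch of such a verifiable reward is shown below, assuming a `<think> ... </think>` reasoning block followed by a `\boxed{...}` final answer; the tags, weight, and parsing rules are assumptions for illustration, not specifications from the paper.

```python
import re

def spatial_ssrl_reward(response: str, gold_answer: str, fmt_weight: float = 0.1) -> float:
    """Deterministic reward = answer correctness + weighted format compliance.

    Format compliance: the response contains a reasoning block and exactly
    one boxed final answer. Correctness: the boxed answer matches the ground
    truth produced by the task construction (exact match after simple
    normalization). No external model or annotator is involved.
    """
    has_think = bool(re.search(r"<think>.*?</think>", response, flags=re.S))
    boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
    r_fmt = 1.0 if has_think and len(boxed) == 1 else 0.0
    r_acc = 1.0 if boxed and boxed[-1].strip().lower() == gold_answer.strip().lower() else 0.0
    return r_acc + fmt_weight * r_fmt
```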
Training proceeds in two phases:
- Cold-start SFT: Stabilizes the answer format using 4.4% of the Spatial-SSRL data.
- RL phase: GRPO runs on the full dataset, with tasks sampled as prompts; the model must produce structured reasoning inside designated tags (e.g. `<think> ... </think>`), intermediate steps, and a final boxed answer.
All reward signals are computed directly from the task construction, obviating the need for external teachers, simulators, or annotated QAs.
4. Benchmark Evaluation and Empirical Outcomes
Spatial-SSRL models were evaluated on seven spatial reasoning benchmarks (e.g., Spatial457, 3DSRBench, QSpatial-plus, ViewSpatial, What'sUp, SpatialEval, VSI-Bench), as well as general VQA and fine-grained recognition tasks.
Key results:
| Model | Avg. Spatial Acc. | vs. Baseline |
| --- | --- | --- |
| Qwen2.5-VL-3B | 45.91% | Base |
| Spatial-SSRL-3B | 50.54% | +4.63% |
| Qwen2.5-VL-7B | 52.69% | Base |
| Spatial-SSRL-7B | 56.58% | +3.89% |

On Spatial457 (complex 3D reasoning), a 12.37% absolute accuracy improvement was observed.
- Both 2D and 3D tasks contributed—neither subset alone was optimal, confirming complementarity.
- No regression in general visual or fine-grained capabilities.
- Chain-of-thought reasoning further improved spatial task results.
Ablation confirms the necessity of all task types for holistic spatial reasoning improvement.
5. Scaling, Domain Generality, and Future Extensions
Spatial-SSRL demonstrates that RLVR training for spatial intelligence can be scaled without reliance on annotated QA pairs, synthetic environments, or specialized perception modules. The framework is modular; new self-supervised tasks can be added directly by defining verifiable transformations and answer functions.
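As an illustration of that modularity, a hypothetical task interface could pair a transformation with an exact verifier; this interface is an expository assumption, not an API defined by the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable
import numpy as np

@dataclass
class VerifiablePretextTask:
    """A self-supervised pretext task usable as an RLVR prompt source.

    transform: builds the model input (e.g. an edited image) and the exact
               ground-truth answer from a raw sample.
    verify:    deterministically scores a model answer against that ground
               truth; no annotator or external tool is needed.
    """
    name: str
    transform: Callable[[np.ndarray], tuple[Any, Any]]
    verify: Callable[[str, Any], float]

def exact_match(answer: str, gold: Any) -> float:
    return 1.0 if answer.strip() == str(gold) else 0.0

# A new task is registered by supplying its transformation and verifier,
# e.g. the patch-reordering construction sketched earlier:
# task = VerifiablePretextTask("patch_reordering", make_patch_reordering_task, exact_match)
```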
Because supervision is intrinsic and universally computable from image transformations or depth geometry, the paradigm supports domain-general learning and adaptation to highly diverse settings, e.g., arbitrary scenes, domains, or imaging modalities (RGB, RGB-D). Video-native SSL tasks (temporal coherence, optical flow) are viable future additions.
6. Conceptual Impact and Research Significance
Spatial-SSRL provides a principled methodology for improving spatial reasoning ability in LVLMs, directly addressing major weaknesses observed in prior architectures. By reframing self-supervised transformations into exact RL reward sources, this approach facilitates scalable RLVR, robust spatial intelligence, and strong empirical results on a range of benchmarks. The separation of reward evaluation from annotation or external tools represents a methodological shift in LVLM training, fostering practical solutions for spatially grounded tasks in robotics, navigation, and embodied AI.
A plausible implication is that the introduction of verifiable self-supervised spatial tasks as RL rewards may inform broader frameworks for tool-free, scalable alignment across other data domains, not limited to visual or spatial reasoning.
7. Summary Table of Spatial Pretext Tasks (from the paper)
| Task Type | Input Modality | Reasoning Target | Output Format |
| --- | --- | --- | --- |
| Patch Reordering | RGB | 2D layout, ordering | Patch sequence permutation |
| Flipped Recognition | RGB | Orientation, locality | (patch, direction) tuple |
| Inpainting | RGB | Semantic-structural fill | Patch choice |
| Depth Ordering | RGB-D | Ordinal 3D structure | Region sequence |
| Rel. 3D Position | RGB-D | Egocentric spatial relation | Categorical relation label |

Spatial-SSRL's contributions are grounded entirely in self-supervised, verifiable spatial task design and corresponding RL optimization, as documented in (Liu et al., 31 Oct 2025).