Spatial-SSRL: Self-Supervised Spatial RL
- Spatial-SSRL is a self-supervised reinforcement learning paradigm that replaces costly annotations with verifiable spatial pretext tasks from RGB or RGB-D images.
- It employs a brief supervised fine-tuning phase followed by Group Relative Policy Optimization (GRPO), with deterministic rewards computed directly from the task construction rather than from annotators or external tools.
- Empirical results show significant accuracy improvements across 2D and 3D tasks, highlighting a scalable, domain-general approach to LVLM training.
Spatial-SSRL is a self-supervised reinforcement learning paradigm developed to enhance the spatial reasoning abilities of large vision-language models (LVLMs). It replaces costly annotated supervision and the specialized tools of prior RLVR pipelines with verifiable signals derived directly from RGB or RGB-D images, making spatially grounded training scalable and domain-general. The framework automatically generates five intrinsically verifiable spatial pretext tasks for RL optimization, enabling systematic improvement of both two-dimensional and three-dimensional spatial understanding.
1. Methodological Foundations
Spatial-SSRL defines a pipeline in which self-supervised learning pretext tasks are reformulated so their solutions are algorithmically verifiable. These pretext tasks provide instant ground-truth labels without human or model-based annotation, circumventing bottlenecks of prior RLVR paradigms.
The training protocol consists first of a short supervised fine-tuning (SFT) “cold-start” phase (using a small fraction of the data) to stabilize output formatting, followed by reinforcement learning (RL) with Group Relative Policy Optimization (GRPO) on deterministic, exactly computable rewards. RL training employs these self-supervised tasks as prompts, with strictly computable answer sets and reasoning formats yielding consistent reward signals.
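GRPO dispenses with a learned value critic and instead normalizes rewards within a group of responses sampled for the same prompt. A minimal sketch of this group-relative advantage computation is shown below; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Compute GRPO-style advantages for one prompt.

    rewards: scalar rewards for G responses sampled from the current
    policy for the same prompt. Each response's advantage is its reward
    standardized against the group mean and standard deviation.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to one spatial pretext prompt,
# scored by the deterministic verifier (1 = correct, 0 = wrong).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```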
2. Self-Supervised Spatial Task Suite
Spatial-SSRL formulates five tasks which collectively capture a spectrum of spatial reasoning skills. Each task is defined so that the sample construction yields a deterministic solution vector, ensuring verifiable reward computation.
A. Depth-Free (RGB-Only)
- Shuffled Patch Reordering: The image is split into an $n \times n$ grid and the patches are randomly permuted by $\pi$; the ground truth is the inverse permutation $\pi^{-1}$ that restores the original layout (see the construction sketch after this list).
- Flipped Patch Recognition: A randomly selected patch is horizontally or vertically flipped; the answer is the patch index and flip orientation, with the flips given by $(x, y) \mapsto (W - 1 - x,\, y)$ (horizontal) and $(x, y) \mapsto (x,\, H - 1 - y)$ (vertical) for a $W \times H$ patch.
- Cropped Patch Inpainting: A patch is cropped and masked; among four candidates (true patch, rotated ground-truth, internal subregion, neighboring patch), select which correctly fills the region.
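The depth-free tasks can be generated with a few lines of array manipulation. Below is a minimal sketch of the shuffled-patch-reordering construction, assuming a grid that evenly divides the image; the function name and grid size are illustrative, not taken from the paper.

```python
import numpy as np

def make_patch_reordering_task(image, n=3, rng=None):
    """Split an image into an n x n grid, shuffle the patches, and return
    the shuffled image together with the verifiable ground truth: the
    inverse permutation that restores the original patch order."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[0] // n * n, image.shape[1] // n * n
    ph, pw = h // n, w // n
    patches = [image[i*ph:(i+1)*ph, j*pw:(j+1)*pw]
               for i in range(n) for j in range(n)]
    perm = rng.permutation(n * n)          # shuffled slot k shows original patch perm[k]
    shuffled = np.zeros_like(image[:h, :w])
    for k, p in enumerate(perm):
        i, j = divmod(k, n)
        shuffled[i*ph:(i+1)*ph, j*pw:(j+1)*pw] = patches[p]
    inverse = np.argsort(perm)             # exact answer: the slot holding each original patch
    return shuffled, inverse.tolist()

# The reward check is an exact sequence match against `inverse`.
```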
B. Depth-Based (RGB-D)
- Regional Depth Ordering: Three well-separated spatial regions are presented in a shuffled order; the task is to order the regions by increasing depth. The ground-truth construction enforces a clear margin between regional mean depths, e.g. $|\bar{d}_i - \bar{d}_j| > \delta$ for all $i \neq j$, and low depth variation within each region, so the correct ordering is unique and exactly computable.
- Relative 3D Position Prediction: Given two image locations $A$ and $B$ and a reference orientation, the task is to determine $B$'s position relative to $A$ in the reference frame. Each location is back-projected into camera coordinates using the depth map and intrinsics $K$, $P = d\, K^{-1} [u, v, 1]^\top$, and the displacement is expressed in the reference frame as $\Delta = R^\top (P_B - P_A)$. The categorical answer (e.g. left/right, above/below, in front of/behind) is assigned from the signs and magnitudes of the components of $\Delta$ (a geometric sketch follows this list).
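Both depth-based tasks reduce to deterministic geometry over the depth map. Below is a minimal sketch of the relative-3D-position construction under standard pinhole-camera assumptions; the intrinsics, axis conventions, and label thresholds are illustrative placeholders rather than the paper's exact settings.

```python
import numpy as np

def backproject(u, v, depth, K):
    """Lift pixel (u, v) with metric depth into camera coordinates."""
    return depth * np.linalg.inv(K) @ np.array([u, v, 1.0])

def relative_position_label(pA, pB, R, margin=0.05):
    """Express B relative to A in a reference frame given by rotation R,
    then map the signed displacement components to categorical labels.
    Axis conventions and the margin are illustrative assumptions."""
    d = R.T @ (pB - pA)                    # displacement in the reference frame
    labels = []
    if abs(d[0]) > margin:
        labels.append("right" if d[0] > 0 else "left")
    if abs(d[1]) > margin:
        labels.append("below" if d[1] > 0 else "above")   # image y points down
    if abs(d[2]) > margin:
        labels.append("behind" if d[2] > 0 else "in front")
    return labels or ["same position"]

# Example with identity orientation and a toy intrinsics matrix.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pA = backproject(100, 200, 2.0, K)
pB = backproject(400, 200, 3.0, K)
print(relative_position_label(pA, pB, np.eye(3)))
```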
3. Reinforcement Learning Framework and Reward Structure
Spatial-SSRL utilizes Group Relative Policy Optimization (GRPO), a policy-gradient technique well suited to deterministic, verifiable rewards. The reward for each QA prompt comprises two components, $r_{\text{acc}}$ (answer correctness) and $r_{\text{fmt}}$ (format compliance), combined linearly as $r = r_{\text{acc}} + \lambda\, r_{\text{fmt}}$.
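A minimal sketch of such a verifiable reward is shown below, assuming a `<think> ... </think>` reasoning block followed by a `\boxed{...}` final answer; the tags, weight, and parsing rules are assumptions for illustration, not specifications from the paper.

```python
import re

def spatial_ssrl_reward(response: str, gold_answer: str, fmt_weight: float = 0.1) -> float:
    """Deterministic reward = answer correctness + weighted format compliance.

    Format compliance: the response contains a reasoning block and exactly
    one boxed final answer. Correctness: the boxed answer matches the ground
    truth produced by the task construction (exact match after simple
    normalization). No external model or annotator is involved.
    """
    has_think = bool(re.search(r"<think>.*?</think>", response, flags=re.S))
    boxed = re.findall(r"\\boxed\{([^}]*)\}", response)
    r_fmt = 1.0 if has_think and len(boxed) == 1 else 0.0
    r_acc = 1.0 if boxed and boxed[-1].strip().lower() == gold_answer.strip().lower() else 0.0
    return r_acc + fmt_weight * r_fmt
```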
Training proceeds in two phases:
- Cold-start SFT: Stabilizes the answer format using 4.4% of the Spatial-SSRL data.
- RL phase: GRPO runs on the full dataset, with tasks sampled as prompts; the model must produce structured reasoning inside designated tags (e.g. `<think> ... </think>`), intermediate steps, and a final boxed answer.
All reward signals are computed directly from the task construction, obviating the need for external teachers, simulators, or annotated QAs.
4. Benchmark Evaluation and Empirical Outcomes
Spatial-SSRL models were evaluated on seven spatial reasoning benchmarks (e.g., Spatial457, 3DSRBench, QSpatial-plus, ViewSpatial, What'sUp, SpatialEval, VSI-Bench), as well as general VQA and fine-grained recognition tasks.
Key results:
| Model | Avg. Spatial Acc. | vs. Baseline |
| --- | --- | --- |
| Qwen2.5-VL-3B | 45.91% | Base |
| Spatial-SSRL-3B | 50.54% | +4.63% |
| Qwen2.5-VL-7B | 52.69% | Base |
| Spatial-SSRL-7B | 56.58% | +3.89% |

On Spatial457 (complex 3D reasoning), a 12.37% absolute accuracy improvement was observed.
- Both 2D and 3D tasks contributed—neither subset alone was optimal, confirming complementarity.
- No regression in general visual or fine-grained capabilities.
- Chain-of-thought reasoning further improved spatial task results.
Ablation confirms the necessity of all task types for holistic spatial reasoning improvement.
5. Scaling, Domain Generality, and Future Extensions
Spatial-SSRL demonstrates that RLVR training for spatial intelligence can be scaled without reliance on annotated QA pairs, synthetic environments, or specialized perception modules. The framework is modular; new self-supervised tasks can be added directly by defining verifiable transformations and answer functions.
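As an illustration of that modularity, a hypothetical task interface could pair a transformation with an exact verifier; this interface is an expository assumption, not an API defined by the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable
import numpy as np

@dataclass
class VerifiablePretextTask:
    """A self-supervised pretext task usable as an RLVR prompt source.

    transform: builds the model input (e.g. an edited image) and the exact
               ground-truth answer from a raw sample.
    verify:    deterministically scores a model answer against that ground
               truth; no annotator or external tool is needed.
    """
    name: str
    transform: Callable[[np.ndarray], tuple[Any, Any]]
    verify: Callable[[str, Any], float]

def exact_match(answer: str, gold: Any) -> float:
    return 1.0 if answer.strip() == str(gold) else 0.0

# A new task is registered by supplying its transformation and verifier,
# e.g. the patch-reordering construction sketched earlier:
# task = VerifiablePretextTask("patch_reordering", make_patch_reordering_task, exact_match)
```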
Because supervision is intrinsic and universally computable from image transformations or depth geometry, the paradigm supports domain-general learning and adaptation to highly diverse settings, e.g., arbitrary scenes, domains, or imaging modalities (RGB, RGB-D). Video-native SSL tasks (temporal coherence, optical flow) are viable future additions.
6. Conceptual Impact and Research Significance
Spatial-SSRL provides a principled methodology for improving spatial reasoning ability in LVLMs, directly addressing major weaknesses observed in prior architectures. By reframing self-supervised transformations into exact RL reward sources, this approach facilitates scalable RLVR, robust spatial intelligence, and strong empirical results on a range of benchmarks. The separation of reward evaluation from annotation or external tools represents a methodological shift in LVLM training, fostering practical solutions for spatially grounded tasks in robotics, navigation, and embodied AI.
A plausible implication is that the introduction of verifiable self-supervised spatial tasks as RL rewards may inform broader frameworks for tool-free, scalable alignment across other data domains, not limited to visual or spatial reasoning.
7. Summary Table of Spatial Pretext Tasks (from the paper)
| Task Type | Input Modality | Reasoning Target | Output Format |
| --- | --- | --- | --- |
| Patch Reordering | RGB | 2D layout, ordering | Patch sequence permutation |
| Flipped Recognition | RGB | Orientation, locality | (patch, direction) tuple |
| Inpainting | RGB | Semantic-structural fill | Patch choice |
| Depth Ordering | RGB-D | Ordinal 3D structure | Region sequence |
| Rel. 3D Position | RGB-D | Egocentric spatial relation | Categorical relation label |

Spatial-SSRL's contributions are grounded entirely in self-supervised, verifiable spatial task design and corresponding RL optimization, as documented in (Liu et al., 31 Oct 2025).