Spatial-SSRL: Self-Supervised Spatial RL

Updated 4 November 2025
  • Spatial-SSRL is a self-supervised reinforcement learning paradigm that replaces costly annotations with verifiable spatial pretext tasks from RGB or RGB-D images.
  • It employs a brief supervised fine-tuning phase followed by Group Relative Policy Optimization (GRPO), with deterministic rewards computed from reproducible spatial reasoning prompts.
  • Empirical results reveal significant accuracy improvements across 2D and 3D tasks, highlighting its scalable and domain-general approach for LVLM training.

Spatial-SSRL is a self-supervised reinforcement learning paradigm developed to enhance the spatial reasoning abilities of large vision-language models (LVLMs). Instead of costly annotated supervision or the restricted tooling of prior reinforcement learning with verifiable rewards (RLVR) pipelines, it derives verifiable training signals directly from RGB or RGB-D images, making spatially grounded training scalable and domain-general. This framework automatically generates five intrinsically verifiable spatial pretext tasks for RL optimization, enabling systematic improvement of both two-dimensional and three-dimensional spatial understanding.

1. Methodological Foundations

Spatial-SSRL defines a pipeline in which self-supervised learning pretext tasks are reformulated so their solutions are algorithmically verifiable. These pretext tasks provide instant ground-truth labels without human or model-based annotation, circumventing bottlenecks of prior RLVR paradigms.

The training protocol consists first of a short supervised fine-tuning (SFT) “cold-start” phase (using a small fraction of the data) to stabilize output formatting, followed by reinforcement learning (RL) with deterministic, exact rewards, optimized using Group Relative Policy Optimization (GRPO). RL training employs these self-supervised tasks as prompts, with strictly computable answer sets and reasoning formats for consistent reward signals.
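
For readers unfamiliar with GRPO, the following minimal sketch shows the group-relative advantage computation at its core, assuming the standard formulation in which several responses are sampled per prompt and their verifiable rewards are normalized within the group; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one prompt: each of the G sampled responses
    is scored by its verifiable reward, then normalized within the group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four responses sampled for one spatial pretext prompt,
# two fully correct, one wrong, one earning only a partial (format) reward.
print(group_relative_advantages([1.0, 1.0, 0.0, 0.1]))
```

In Spatial-SSRL, the per-response rewards would come from the verifiable pretext-task checks described below.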

2. Self-Supervised Spatial Task Suite

Spatial-SSRL formulates five tasks which collectively capture a spectrum of spatial reasoning skills. Each task is defined so that the sample construction yields a deterministic solution vector, ensuring verifiable reward computation.

A. Depth-Free (RGB-Only)

  1. Shuffled Patch Reordering: Random patch permutation in an $M \times N$ grid; the ground truth is the inverse permutation $\pi^{-1}$ (a construction sketch for the first two tasks follows this list).
  2. Flipped Patch Recognition: A randomly selected patch is horizontally or vertically flipped; the answer is the patch index and flip orientation, with precise flip formulas (where $P_H \times P_W$ is the patch size in pixels):

$$x_{\mathrm{vert}}(r, c) = x(P_H - 1 - r,\ c), \qquad x_{\mathrm{horz}}(r, c) = x(r,\ P_W - 1 - c)$$

  3. Cropped Patch Inpainting: A patch is cropped and masked; among four candidates (the true patch, a rotated copy of the true patch, an internal subregion, and a neighboring patch), the model must select which correctly fills the region.
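
As a concrete illustration of the RGB-only constructions above, here is a minimal NumPy sketch, under the assumption of a simple row-major patch grid; the helper names and the 3×3 default grid are illustrative and not the authors' code.

```python
import numpy as np

def assemble(patches, M, N):
    """Stitch a row-major list of patches back into one image."""
    rows = [np.concatenate(patches[r * N:(r + 1) * N], axis=1) for r in range(M)]
    return np.concatenate(rows, axis=0)

def make_patch_tasks(img, grid=(3, 3), rng=None):
    """Build shuffled-reordering and flipped-patch pretext samples.

    img: H x W x 3 array whose height and width are divisible by the grid.
    Returns (shuffled_image, inverse_permutation) and
    (flipped_image, (patch_index, flip_direction)) as verifiable labels.
    """
    rng = rng if rng is not None else np.random.default_rng()
    H, W, _ = img.shape
    M, N = grid
    ph, pw = H // M, W // N
    # Split the image into an M*N row-major list of patches.
    patches = [img[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw].copy()
               for r in range(M) for c in range(N)]

    # Task 1: shuffled patch reordering -- the label is the inverse permutation.
    perm = rng.permutation(M * N)
    shuffled = assemble([patches[i] for i in perm], M, N)
    inverse_perm = np.argsort(perm)  # ground truth pi^{-1}

    # Task 2: flipped patch recognition -- the label is (patch index, direction).
    idx = int(rng.integers(M * N))
    direction = str(rng.choice(["vertical", "horizontal"]))
    flipped_patch = np.flipud(patches[idx]) if direction == "vertical" else np.fliplr(patches[idx])
    flipped = assemble(patches[:idx] + [flipped_patch] + patches[idx + 1:], M, N)

    return (shuffled, inverse_perm), (flipped, (idx, direction))
```

The returned inverse permutation and (index, direction) tuple serve directly as the verifiable ground-truth answers for the reordering and flip-recognition prompts.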

B. Depth-Based (RGB-D)

  1. Regional Depth Ordering: Three well-separated spatial regions $R_1, R_2, R_3$ are permuted visually; the task is to order the regions by increasing depth. Sample construction enforces, for each region, a bounded internal depth range

$$r(R_i) = \max_{(x, y)\in R_i} D(x,y) - \min_{(x, y)\in R_i} D(x,y) < r_{\max}$$

and a minimum depth gap between consecutive regions

$$d(R_i, R_{i+1}) = \min_{(x, y)\in R_{i+1}} D(x, y) - \max_{(x, y)\in R_i} D(x, y) > d_{\min}$$

  2. Relative 3D Position Prediction: Given two locations and an orientation, the task is to determine $R_2$'s position relative to $R_1$ in the reference frame anchored at $R_1$ with heading $\theta$, using the transformation:

$$\begin{pmatrix} \tilde{x}_2 \\ \tilde{z}_2 \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & -x_1 \\ 0 & 1 & -z_1 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_2 \\ z_2 \\ 1 \end{pmatrix}$$

Categorical answer assignment is based on the signs and magnitudes of $(\tilde{x}_2,\, \tilde{z}_2)$.
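
To make the depth-based constructions concrete, the sketch below (a hypothetical implementation, not the paper's code) checks the range and separation constraints on a depth map $D$ and applies the homogeneous transform above to obtain a relative-position label; the axis conventions and label vocabulary here are assumptions.

```python
import numpy as np

def depth_ordering_valid(depth, regions, r_max, d_min):
    """Check the Regional Depth Ordering constraints on a depth map.

    depth: H x W array; regions: list of boolean masks already sorted by
    increasing depth. Each region's internal depth range must stay below
    r_max, and consecutive regions must be separated by more than d_min.
    """
    vals = [depth[mask] for mask in regions]
    ranges_ok = all(v.max() - v.min() < r_max for v in vals)
    gaps_ok = all(vals[i + 1].min() - vals[i].max() > d_min
                  for i in range(len(vals) - 1))
    return ranges_ok and gaps_ok

def relative_position(p1, p2, theta):
    """Express point p2 = (x2, z2) in the frame centered at p1 = (x1, z1)
    and rotated by heading angle theta, via the homogeneous transform."""
    x1, z1 = p1
    x2, z2 = p2
    rot = np.array([[np.cos(theta),  np.sin(theta), 0.0],
                    [-np.sin(theta), np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    trans = np.array([[1.0, 0.0, -x1],
                      [0.0, 1.0, -z1],
                      [0.0, 0.0, 1.0]])
    x_t, z_t, _ = rot @ trans @ np.array([x2, z2, 1.0])
    # Illustrative categorical label from the signs of the transformed
    # coordinates; the paper's exact label set and sign conventions may differ.
    label = ("front" if z_t > 0 else "behind") + "-" + ("right" if x_t > 0 else "left")
    return (x_t, z_t), label
```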

3. Reinforcement Learning Framework and Reward Structure

Spatial-SSRL utilizes Group Relative Policy Optimization (GRPO), a policy-gradient technique effective for deterministic, verifiable rewards. The reward for each QA prompt comprises two components, $r_{\mathrm{acc}}$ (answer correctness) and $r_{\mathrm{fmt}}$ (format compliance), linearly combined as $r = 0.9\, r_{\mathrm{acc}} + 0.1\, r_{\mathrm{fmt}}$.
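
A minimal sketch of such a combined reward, assuming an exact-match accuracy check and a simple format check for tagged reasoning followed by a boxed answer (the specific tag and answer conventions are assumptions for illustration):

```python
import re

def spatial_ssrl_reward(response: str, gold_answer: str) -> float:
    """Combine answer accuracy and format compliance: r = 0.9*r_acc + 0.1*r_fmt."""
    # Format reward: reasoning wrapped in tags followed by a boxed final answer
    # (an assumed convention for illustration).
    fmt = 1.0 if re.search(r"<think>.*</think>.*\\boxed\{.*\}", response, re.S) else 0.0
    # Accuracy reward: exact match between the extracted boxed answer and the
    # deterministic ground truth produced by the task construction.
    match = re.search(r"\\boxed\{(.*?)\}", response)
    acc = 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0
    return 0.9 * acc + 0.1 * fmt
```

With this weighting, answer correctness dominates the reward while format compliance contributes a small additional bonus.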

Training proceeds in two phases:

  • Cold-start SFT: Stabilizes the answer format using approximately 4.4% of the Spatial-SSRL data.
  • RL phase: GRPO runs on the full dataset, with tasks sampled as prompts; the model must produce structured reasoning enclosed in designated reasoning tags, intermediate steps, and a final boxed answer.

All reward signals are computed directly from the task construction, obviating the need for external teachers, simulators, or annotated QA pairs.

4. Benchmark Evaluation and Empirical Outcomes

Spatial-SSRL models were evaluated on seven spatial reasoning benchmarks (Spatial457, 3DSRBench, QSpatial-plus, ViewSpatial, What'sUp, SpatialEval, VSI-Bench), as well as on general VQA and fine-grained recognition tasks.

Key results:

| Model | Avg. Spatial Acc. | vs. Baseline |
| --- | --- | --- |
| Qwen2.5-VL-3B | 45.91% | Base |
| Spatial-SSRL-3B | 50.54% | +4.63% |
| Qwen2.5-VL-7B | 52.69% | Base |
| Spatial-SSRL-7B | 56.58% | +3.89% |
  • On Spatial457 (complex 3D reasoning), a 12.37% absolute accuracy improvement was observed.
  • Both 2D and 3D tasks contributed; neither subset alone was optimal, confirming their complementarity.
  • No regression was observed in general visual or fine-grained recognition capabilities.
  • Chain-of-thought reasoning further improved spatial task results.

Ablation studies confirm the necessity of all task types for holistic spatial reasoning improvement.

5. Scaling, Domain Generality, and Future Extensions

Spatial-SSRL demonstrates that RLVR training for spatial intelligence can be scaled without reliance on annotated QA pairs, synthetic environments, or specialized perception modules. The framework is modular; new self-supervised tasks can be added directly by defining verifiable transformations and answer functions.

Because supervision is intrinsic and universally computable from image transformations or depth geometry, the paradigm supports domain-general learning and adaptation to highly diverse settings, e.g., arbitrary scenes, domains, or imaging modalities (RGB, RGB-D). Video-native SSL tasks (temporal coherence, optical flow) are viable future additions.

6. Conceptual Impact and Research Significance

Spatial-SSRL provides a principled methodology for improving spatial reasoning ability in LVLMs, directly addressing major weaknesses observed in prior architectures. By reframing self-supervised transformations into exact RL reward sources, this approach facilitates scalable RLVR, robust spatial intelligence, and strong empirical results on a range of benchmarks. The separation of reward evaluation from annotation or external tools represents a methodological shift in LVLM training, fostering practical solutions for spatially grounded tasks in robotics, navigation, and embodied AI.

A plausible implication is that the introduction of verifiable self-supervised spatial tasks as RL rewards may inform broader frameworks for tool-free, scalable alignment across other data domains, not limited to visual or spatial reasoning.

7. Summary Table of Spatial Pretext Tasks (from the paper)

| Task Type | Input Modality | Reasoning Target | Output Format |
| --- | --- | --- | --- |
| Patch Reordering | RGB | 2D layout, ordering | Patch sequence permutation |
| Flipped Recognition | RGB | Orientation, locality | (patch, direction) tuple |
| Inpainting | RGB | Semantic-structural fill | Patch choice |
| Depth Ordering | RGB-D | Ordinal 3D structure | Region sequence |
| Rel. 3D Position | RGB-D | Egocentric spatial relation | Categorical relation label |

Spatial-SSRL’s contributions are grounded entirely in self-supervised, verifiable spatial task design and corresponding RL optimization, as documented in (Liu et al., 31 Oct 2025).
