Object-Centric 3D Rollout (OCR)
- OCR is a training method for video spatial reasoning that perturbs 3D object regions and projects them into 2D to encourage holistic scene reasoning.
- It employs a rollout-based training pipeline that interleaves clean and region-noisy videos, achieving state-of-the-art results on spatial reasoning benchmarks with compact architectures.
- Ablation studies show that object-centric noise with a linear annealing schedule over the perturbed regions outperforms prior global or temporal noise strategies.
Object-Centric 3D Rollout (OCR) is a training methodology designed to enhance video spatial reasoning in Multi-modal LLMs (MLLMs), specifically targeting the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes. OCR introduces structured perturbations to the 3D geometry of selected objects and projects these perturbations into 2D, compelling models to perform holistic scene reasoning rather than relying solely on visible, query-named objects. Combined with a rollout-based training pipeline that interleaves clean and region-noisy videos, OCR achieves state-of-the-art results on spatial reasoning benchmarks using relatively compact neural architectures (Tang et al., 17 Nov 2025).
1. Structured Perturbations of 3D Object Geometry
OCR operates on reconstructed 3D scenes parsed from video inputs. Let $V = \{I_t\}_{t=1}^{T}$ denote a video depicting a dynamic scene. Axis-aligned 3D bounding boxes $\{B_i\}_{i=1}^{N}$ are extracted for all visible objects. Each bounding box $B_i$ contains its constituent 3D points $X \in B_i$, which are projected into each frame $t$ via the calibrated camera model:

$$u = \pi\!\left(K\,[R_t \mid \mathbf{t}_t]\,\tilde{X}\right),$$

where $K$ and $[R_t \mid \mathbf{t}_t]$ define the camera intrinsics and extrinsics, respectively.
The projected 2D regions $R_i^t$ encode object $i$'s image-space footprint across all frames. During training, a scheduler selects $k$ of these boxes, builds the region union $\mathcal{R}^t = \bigcup_{i \in \mathcal{S}} R_i^t$, and injects Gaussian noise into exactly those image regions in each frame. This mechanism specifically degrades the visual signature of selected objects, forcing the model to rely on broader spatial and relational reasoning rather than shortcutting via query-focused cues.
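A minimal NumPy sketch of this projection step, assuming per-frame intrinsics $K$ and world-to-camera extrinsics $(R_t, \mathbf{t}_t)$; function names are illustrative, and the paper's implementation may rasterize the full box region rather than a sampled point set:

```python
import numpy as np

def project_points(X, K, R, t):
    """Pinhole projection of an (N, 3) array of world points into pixels.
    K: 3x3 intrinsics; R (3x3), t (3,): world-to-camera extrinsics."""
    Xc = X @ R.T + t                 # world -> camera coordinates
    uv = Xc @ K.T                    # camera -> homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]    # perspective divide -> (u, v)

def box_footprint_mask(box_points, K, R, t, H, W):
    """Boolean (H, W) mask marking the 2D footprint of one 3D box's points."""
    uv = np.round(project_points(box_points, K, R, t)).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    mask = np.zeros((H, W), dtype=bool)
    mask[uv[inside, 1], uv[inside, 0]] = True
    return mask
```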
2. 2D Projection and Synthesis of Region-Noisy Videos
Each selected object $i$ is defined by box center $c_i \in \mathbb{R}^3$ and half-sizes $s_i \in \mathbb{R}^3$. A 3D point $X$ projects to pixel $u$ in frame $t$. If $\lvert X - c_i \rvert \le s_i$ holds element-wise (i.e., $X$ lies inside box $B_i$), the pixel $u$ is replaced with independent Gaussian noise $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$. Repeating this operation for all $k$ selected boxes synthesizes a region-noisy variant $\tilde{V}$ of video $V$, where only the 2D projections of specific objects are perturbed, preserving global scene coherence and real object boundaries.
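A sketch of the synthesis step under the same assumptions, where `masks` holds the per-object footprint masks for one frame; the noise statistics `mu` and `sigma` are placeholders, since the paper does not specify them here:

```python
import numpy as np

def synthesize_noisy_frame(frame, masks, mu=127.5, sigma=50.0, rng=None):
    """Replace pixels in the union of selected object regions with Gaussian noise.
    frame: (H, W, 3) uint8 image; masks: list of boolean (H, W) footprint masks."""
    rng = rng or np.random.default_rng()
    union = np.logical_or.reduce(masks)              # region union R^t over selected boxes
    noisy = frame.astype(np.float32).copy()
    noise = rng.normal(mu, sigma, size=frame.shape)  # independent per-pixel Gaussian noise
    noisy[union] = noise[union]                      # perturb only the projected regions
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Applying this per frame yields the region-noisy video $\tilde{V}$ while leaving all pixels outside the selected object footprints untouched.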
3. Rollout-Based Training Pipeline and Objective
The OCR pipeline integrates both clean and region-noisy videos into Group Relative Policy Optimization (GRPO). At each training step:
- Sample a (video, query) pair $(V, q)$ from the training set.
- Extract all boxes $\{B_i\}$ and select $k$ of them via the scheduler.
- Construct the region union $\mathcal{R}^t$ and synthesize the region-noisy video $\tilde{V}$ as above.
- Using the old policy $\pi_{\theta_{\text{old}}}$, generate $n$ rollouts each for the clean pair $(V, q)$ and the noisy pair $(\tilde{V}, q)$.
- Evaluate all $2n$ rollouts with a rule-based reward $r$ and compute group-normalized advantages:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{2n}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{2n}\right)}$$

- Compute the PPO-style policy loss, where only the clean rollouts ($1 \le i \le n$) contribute to the gradient, but advantage statistics use all $2n$ rollouts:

$$\mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \min\!\left(\rho_i \hat{A}_i,\; \operatorname{clip}\!\left(\rho_i,\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right) + \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

with importance ratio $\rho_i = \pi_\theta(o_i \mid V, q)\,/\,\pi_{\theta_{\text{old}}}(o_i \mid V, q)$ and a KL-penalty to the reference policy $\pi_{\text{ref}}$.
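A minimal PyTorch-style sketch of this objective, assuming sequence-level log-probabilities and rewards ordered clean-first; the helper name `ocr_grpo_loss` and the hyperparameter defaults are illustrative, not from the paper:

```python
import torch

def ocr_grpo_loss(logp_new, logp_old, logp_ref, rewards, n_clean,
                  clip_eps=0.2, kl_coef=0.04):
    """Sketch of the OCR objective: advantages are normalized over all 2n
    rollouts, but gradients flow only through the n clean rollouts."""
    # Group-normalized advantages over clean + noisy rollouts (shape [2n]).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Restrict the surrogate loss to the clean rollouts (indices 0..n-1).
    lp_new = logp_new[:n_clean]
    lp_old = logp_old[:n_clean].detach()
    a = adv[:n_clean]

    ratio = torch.exp(lp_new - lp_old)                   # rho_i
    surr = torch.min(ratio * a,
                     torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * a)

    # k3 KL estimator toward the frozen reference policy, as is standard in GRPO.
    log_r = logp_ref[:n_clean].detach() - lp_new
    kl = torch.exp(log_r) - log_r - 1.0

    return -(surr - kl_coef * kl).mean()
```

Note that the noisy rollouts never receive gradient: they influence training only through the shared mean and standard deviation of the reward group, which is what couples clean-video behavior to the degraded evidence.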
4. Comparison with Existing Rollout Strategies
OCR is contrasted with prior rollout regularization techniques:
| Strategy | Type of Perturbation | Spatial Structure |
|---|---|---|
| T-GRPO (Video-R1) | Temporal shuffling of video frames | No |
| NoisyRollout | Global pixel noise to whole frames | No |
| OCR | Noise on projected 2D object regions | Yes |
OCR specifically targets the 2D regions corresponding to projected 3D object bounding boxes, unlike T-GRPO and NoisyRollout, which alter global or temporal aspects but lack spatial object selectivity. This region-based perturbation compels the model toward holistic global scene reasoning through tightly controlled degradation of object evidence.
5. Experimental Evaluation and Benchmarking
The OCR methodology was instantiated with a Qwen2.5-VL-3B-Instruct backbone, trained with 16-frame input videos and evaluated at 32-frame inference. The two-stage training comprised SFT (2 epochs on ~2K chain-of-thought spatial reasoning samples—“OCR-SFT”) and RL fine-tuning (2,000 GRPO steps over 98K samples—“OCR-RL”), mixing 4 clean and 4 noisy rollouts per step. Training utilized an 8×A100 GPU cluster.
OCR was assessed on VSI-Bench, which contains ~5,130 samples covering eight video spatial reasoning skills, scored as percent correct. Main results for selected models are presented below:
| Model | #Params | Avg. Accuracy |
|---|---|---|
| GPT-4o (API) | – | 34.0 |
| Gemini-1.5-Pro (API) | – | 45.4 |
| InternVL2-8B | 8B | 34.6 |
| VG-LLM (4B) | 4B | 46.1 |
| Video-R1 (7B) | 7B | 37.1 |
| SpaceR (7B) | 7B | 45.6 |
| OCR (3B) | 3B | 47.5 |
Per-category results for OCR compared to the next-best open-source method:
- Object count: 63.2% (vs. 59.9%)
- Absolute distance: 34.1% (vs. 29.6%)
- Object size: 57.4% (vs. 50.8%)
- Room size: 46.7% (vs. 48.3%)
- Relative distance: 39.6% (vs. 35.4%)
- Relative direction: 45.5% (vs. 35.6%)
- Route planning: 44.3% (vs. 34.0%)
- Appearance order: 49.8% (vs. 31.5%)
6. Ablation Studies and Design Analysis
Ablation and design-validation experiments confirm the critical contributions of OCR’s object-centric noise and cold-start data:
- Replacing the Video-R1 cold-start set (51 samples) with the OCR-SFT set (~2K samples) yields +5.3 pp in vanilla GRPO and +1.9 pp in OCR.
- Policy variant outcomes (average accuracy):
- Baseline (no GRPO): 35.1
- +Vanilla GRPO: 43.9
- +T-GRPO: 44.1
- +NoisyRollout: 45.6
- +Downsample Rollout: 44.2
- +OCR: 47.5
- Object scheduler policies:
- Fixed 25% of objects perturbed: 45.0
- Linear decay 50%→0%: 47.5
- Exponential decay: 46.0
- Cosine decay: 46.3
A linear annealing schedule over both the number of perturbed objects and the noise level most effectively balances curriculum difficulty.
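A minimal sketch of the best-performing scheduler, assuming linear decay from 50% to 0% of objects over training, with an analogous decay applied to the noise scale; function names and the `sigma_max` default are illustrative:

```python
def linear_anneal(step, total_steps, start=0.5, end=0.0):
    """Linearly decay a value from `start` to `end` over training."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress

def schedule_perturbation(step, total_steps, num_boxes, sigma_max=50.0):
    """Number of boxes to perturb and the noise scale at the current step."""
    frac = linear_anneal(step, total_steps, start=0.5, end=0.0)   # 50% -> 0%
    k = round(frac * num_boxes)
    sigma = linear_anneal(step, total_steps, start=sigma_max, end=0.0)
    return k, sigma
```

Early in training many objects are heavily degraded, forcing holistic reasoning; by the end the model sees mostly clean videos, matching the evaluation distribution.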
7. Significance for Video Spatial Reasoning in MLLMs
OCR addresses the observed limitation of “query-locked” reasoning in current vision-LLMs, which tend to focus disproportionately on objects explicitly mentioned in the query and neglect wider scene structure. By enforcing global scene reasoning through spatially and geometrically structured noise, OCR achieves state-of-the-art results on challenging spatial reasoning tasks with a smaller parameter count (3B) than competing 7B-class baselines. The combination of high-quality cold-start data, strict region-based perturbations, mathematically grounded projection, and rollout-based curriculum is validated via comprehensive ablation and benchmark studies, highlighting its impact within the spatial reasoning landscape (Tang et al., 17 Nov 2025).