Object-Centric 3D Rollout (OCR)
- OCR is a training method for video spatial reasoning that perturbs 3D object regions and projects them into 2D to encourage holistic scene reasoning.
- It employs a rollout-based training pipeline that interleaves clean and region-noisy videos, achieving state-of-the-art results on spatial reasoning benchmarks with compact architectures.
- Ablation studies show that object-centric noise with a linear annealing schedule over the perturbed regions outperforms prior global or temporal noise strategies.
Object-Centric 3D Rollout (OCR) is a training methodology designed to enhance video spatial reasoning in Multi-modal LLMs (MLLMs), specifically targeting the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes. OCR introduces structured perturbations to the 3D geometry of selected objects and projects these perturbations into 2D, compelling models to perform holistic scene reasoning rather than relying solely on visible, query-named objects. Combined with a rollout-based training pipeline that interleaves clean and region-noisy videos, OCR achieves state-of-the-art results on spatial reasoning benchmarks using relatively compact neural architectures (Tang et al., 17 Nov 2025).
1. Structured Perturbations of 3D Object Geometry
OCR operates on reconstructed 3D scenes parsed from video inputs. Let $V = \{I_t\}_{t=1}^{T}$ denote a video depicting a dynamic scene. Axis-aligned 3D bounding boxes $\{B_i\}_{i=1}^{N}$ are extracted for all visible objects. Each bounding box $B_i$ contains its constituent 3D points $X \in B_i$, which are projected into each frame $t$ via the calibrated camera model:

$$u = \pi\!\left(K\,[R_t \mid \mathbf{t}_t]\,\tilde{X}\right),$$

where $K$ and $[R_t \mid \mathbf{t}_t]$ define the camera intrinsics and extrinsics, respectively.
The projected 2D regions $R_i^t$ encode object $i$'s image-space footprint across all frames. During training, a scheduler selects $k$ of these boxes, builds the region union $\mathcal{R}^t = \bigcup_{i \in \mathcal{S}} R_i^t$, and injects Gaussian noise into exactly those image regions in each frame. This mechanism specifically degrades the visual signature of selected objects, forcing the model to rely on broader spatial and relational reasoning rather than shortcutting via query-focused cues.
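A minimal NumPy sketch of this projection step, assuming per-frame intrinsics $K$ and world-to-camera extrinsics $(R_t, \mathbf{t}_t)$; function names are illustrative, and the paper's implementation may rasterize the full box region rather than a sampled point set:

```python
import numpy as np

def project_points(X, K, R, t):
    """Pinhole projection of an (N, 3) array of world points into pixels.
    K: 3x3 intrinsics; R (3x3), t (3,): world-to-camera extrinsics."""
    Xc = X @ R.T + t                 # world -> camera coordinates
    uv = Xc @ K.T                    # camera -> homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]    # perspective divide -> (u, v)

def box_footprint_mask(box_points, K, R, t, H, W):
    """Boolean (H, W) mask marking the 2D footprint of one 3D box's points."""
    uv = np.round(project_points(box_points, K, R, t)).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    mask = np.zeros((H, W), dtype=bool)
    mask[uv[inside, 1], uv[inside, 0]] = True
    return mask
```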
2. 2D Projection and Synthesis of Region-Noisy Videos
Each selected object $i$ is defined by box center $c_i \in \mathbb{R}^3$ and half-sizes $s_i \in \mathbb{R}^3$. A 3D point $X$ projects to pixel $u$ in frame $t$. If $\lvert X - c_i \rvert \le s_i$ holds element-wise (i.e., $X$ lies inside box $B_i$), the pixel $u$ is replaced with independent Gaussian noise $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$. Repeating this operation for all $k$ selected boxes synthesizes a region-noisy variant $\tilde{V}$ of video $V$, where only the 2D projections of specific objects are perturbed, preserving global scene coherence and real object boundaries.
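A sketch of the synthesis step under the same assumptions, where `masks` holds the per-object footprint masks for one frame; the noise statistics `mu` and `sigma` are placeholders, since the paper does not specify them here:

```python
import numpy as np

def synthesize_noisy_frame(frame, masks, mu=127.5, sigma=50.0, rng=None):
    """Replace pixels in the union of selected object regions with Gaussian noise.
    frame: (H, W, 3) uint8 image; masks: list of boolean (H, W) footprint masks."""
    rng = rng or np.random.default_rng()
    union = np.logical_or.reduce(masks)              # region union R^t over selected boxes
    noisy = frame.astype(np.float32).copy()
    noise = rng.normal(mu, sigma, size=frame.shape)  # independent per-pixel Gaussian noise
    noisy[union] = noise[union]                      # perturb only the projected regions
    return np.clip(noisy, 0, 255).astype(np.uint8)
```

Applying this per frame yields the region-noisy video $\tilde{V}$ while leaving all pixels outside the selected object footprints untouched.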
3. Rollout-Based Training Pipeline and Objective
The OCR pipeline integrates both clean and region-noisy videos into Group Relative Policy Optimization (GRPO). At each training step:
- Sample a (video, query) pair $(V, q)$ from the training set.
- Extract all boxes $\{B_i\}$ and select $k$ of them via the scheduler.
- Construct the region union $\mathcal{R}^t$ and synthesize the region-noisy video $\tilde{V}$ as above.
- Using the old policy $\pi_{\theta_{\text{old}}}$, generate $n$ rollouts each for the clean pair $(V, q)$ and the noisy pair $(\tilde{V}, q)$.
- Evaluate all $2n$ rollouts with a rule-based reward $r$ and compute group-normalized advantages:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{2n}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{2n}\right)}$$

- Compute the PPO-style policy loss, where only the clean rollouts ($1 \le i \le n$) contribute to the gradient, but advantage statistics use all $2n$ rollouts:

$$\mathcal{L}(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \min\!\left(\rho_i \hat{A}_i,\; \operatorname{clip}\!\left(\rho_i,\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right) + \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

with importance ratio $\rho_i = \pi_\theta(o_i \mid V, q)\,/\,\pi_{\theta_{\text{old}}}(o_i \mid V, q)$ and a KL-penalty to the reference policy $\pi_{\text{ref}}$.
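A minimal PyTorch-style sketch of this objective, assuming sequence-level log-probabilities and rewards ordered clean-first; the helper name `ocr_grpo_loss` and the hyperparameter defaults are illustrative, not from the paper:

```python
import torch

def ocr_grpo_loss(logp_new, logp_old, logp_ref, rewards, n_clean,
                  clip_eps=0.2, kl_coef=0.04):
    """Sketch of the OCR objective: advantages are normalized over all 2n
    rollouts, but gradients flow only through the n clean rollouts."""
    # Group-normalized advantages over clean + noisy rollouts (shape [2n]).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Restrict the surrogate loss to the clean rollouts (indices 0..n-1).
    lp_new = logp_new[:n_clean]
    lp_old = logp_old[:n_clean].detach()
    a = adv[:n_clean]

    ratio = torch.exp(lp_new - lp_old)                   # rho_i
    surr = torch.min(ratio * a,
                     torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * a)

    # k3 KL estimator toward the frozen reference policy, as is standard in GRPO.
    log_r = logp_ref[:n_clean].detach() - lp_new
    kl = torch.exp(log_r) - log_r - 1.0

    return -(surr - kl_coef * kl).mean()
```

Note that the noisy rollouts never receive gradient: they influence training only through the shared mean and standard deviation of the reward group, which is what couples clean-video behavior to the degraded evidence.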
4. Comparison with Existing Rollout Strategies
OCR is contrasted with prior rollout regularization techniques:
| Strategy | Type of Perturbation | Spatial Structure |
|---|---|---|
| T-GRPO (Video-R1) | Temporal shuffling of video frames | No |
| NoisyRollout | Global pixel noise to whole frames | No |
| OCR | Noise on projected 2D object regions | Yes |
OCR specifically targets the 2D regions corresponding to projected 3D object bounding boxes, unlike T-GRPO and NoisyRollout, which alter global or temporal aspects but lack spatial object selectivity. This region-based perturbation compels the model toward holistic global scene reasoning through tightly controlled degradation of object evidence.
5. Experimental Evaluation and Benchmarking
The OCR methodology was instantiated with a Qwen2.5-VL-3B-Instruct backbone, trained with 16-frame input videos and evaluated at 32-frame inference. The two-stage training comprised SFT (2 epochs on ~2K chain-of-thought spatial reasoning samples—“OCR-SFT”) and RL fine-tuning (2,000 GRPO steps over 98K samples—“OCR-RL”), mixing 4 clean and 4 noisy rollouts per step. Training utilized an 8×A100 GPU cluster.
OCR was assessed on VSI-Bench, which contains ~5,130 samples covering eight video spatial reasoning skills, scored as percent correct. Main results for selected models are presented below:
| Model | #Params | Avg. Accuracy |
|---|---|---|
| GPT-4o (API) | – | 34.0 |
| Gemini-1.5-Pro (API) | – | 45.4 |
| InternVL2-8B | 8B | 34.6 |
| VG-LLM (4B) | 4B | 46.1 |
| Video-R1 (7B) | 7B | 37.1 |
| SpaceR (7B) | 7B | 45.6 |
| OCR (3B) | 3B | 47.5 |
Per-category results for OCR compared to the next-best open-source method:
- Object count: 63.2% (vs. 59.9%)
- Absolute distance: 34.1% (vs. 29.6%)
- Object size: 57.4% (vs. 50.8%)
- Room size: 46.7% (vs. 48.3%)
- Relative distance: 39.6% (vs. 35.4%)
- Relative direction: 45.5% (vs. 35.6%)
- Route planning: 44.3% (vs. 34.0%)
- Appearance order: 49.8% (vs. 31.5%)
6. Ablation Studies and Design Analysis
Ablation and design-validation experiments confirm the critical contributions of OCR’s object-centric noise and cold-start data:
- Replacing the Video-R1 cold-start set (51 samples) with the OCR-SFT set (~2K samples) yields +5.3 pp in vanilla GRPO and +1.9 pp in OCR.
- Policy variant outcomes (average accuracy):
- Baseline (no GRPO): 35.1
- +Vanilla GRPO: 43.9
- +T-GRPO: 44.1
- +NoisyRollout: 45.6
- +Downsample Rollout: 44.2
- +OCR: 47.5
- Object scheduler policies:
- Fixed 25% of objects perturbed: 45.0
- Linear decay 50%→0%: 47.5
- Exponential decay: 46.0
- Cosine decay: 46.3
A linear annealing schedule over both the number of perturbed objects and the noise level most effectively balances curriculum difficulty.
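A minimal sketch of the best-performing scheduler, assuming linear decay from 50% to 0% of objects over training, with an analogous decay applied to the noise scale; function names and the `sigma_max` default are illustrative:

```python
def linear_anneal(step, total_steps, start=0.5, end=0.0):
    """Linearly decay a value from `start` to `end` over training."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress

def schedule_perturbation(step, total_steps, num_boxes, sigma_max=50.0):
    """Number of boxes to perturb and the noise scale at the current step."""
    frac = linear_anneal(step, total_steps, start=0.5, end=0.0)   # 50% -> 0%
    k = round(frac * num_boxes)
    sigma = linear_anneal(step, total_steps, start=sigma_max, end=0.0)
    return k, sigma
```

Early in training many objects are heavily degraded, forcing holistic reasoning; by the end the model sees mostly clean videos, matching the evaluation distribution.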
7. Significance for Video Spatial Reasoning in MLLMs
OCR addresses the observed limitation of “query-locked” reasoning in current vision-LLMs, which tend to focus disproportionately on objects explicitly mentioned in the query and neglect wider scene structure. By enforcing global scene reasoning through spatially and geometrically structured noise, OCR achieves state-of-the-art results on challenging spatial reasoning tasks with a smaller parameter count (3B) than competing 7B-class baselines. The combination of high-quality cold-start data, strict region-based perturbations, mathematically grounded projection, and rollout-based curriculum is validated via comprehensive ablation and benchmark studies, highlighting its impact within the spatial reasoning landscape (Tang et al., 17 Nov 2025).