
Object-Centric 3D Rollout (OCR)

Updated 24 November 2025
  • OCR is a training method for video spatial reasoning that perturbs the 3D regions of selected objects and projects the perturbations into 2D, encouraging holistic scene reasoning.
  • Its rollout-based training pipeline interleaves clean and region-noisy videos, achieving state-of-the-art results on spatial reasoning benchmarks with compact architectures.
  • Ablation studies show that object-centric noise outperforms prior global and temporal noise strategies, and that a linear annealing schedule for the perturbation is the most effective scheduler design.

Object-Centric 3D Rollout (OCR) is a training methodology designed to enhance video spatial reasoning in Multi-modal LLMs (MLLMs), specifically targeting the ability to comprehend object locations, orientations, and inter-object relationships in dynamic 3D scenes. OCR introduces structured perturbations to the 3D geometry of selected objects and projects these perturbations into 2D, compelling models to perform holistic scene reasoning rather than relying solely on visible, query-named objects. Combined with a rollout-based training pipeline that interleaves clean and region-noisy videos, OCR achieves state-of-the-art results on spatial reasoning benchmarks using relatively compact neural architectures (Tang et al., 17 Nov 2025).

1. Structured Perturbations of 3D Object Geometry

OCR operates on reconstructed 3D scenes parsed from video inputs. Let $V$ denote a video depicting a dynamic scene. Axis-aligned 3D bounding boxes $B_1, \ldots, B_M$ are extracted for all visible objects. Each bounding box $B_j$ contains its constituent 3D points $(x, y, z)$, which are projected into each frame $f_k$ via the calibrated camera model:

(u, v, 1)^T \propto K[R|t] \cdot (x, y, z, 1)^T

where $K$ and $[R|t]$ denote the camera intrinsics and extrinsics, respectively.
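
As a concrete illustration, here is a minimal numpy sketch of this projection; the array conventions (row-major points, $K$ as a 3×3 matrix, $R$ and $t$ as rotation and translation) are our assumptions rather than details from the paper.

```python
import numpy as np

def project_points(points_3d: np.ndarray, K: np.ndarray,
                   R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Project (N, 3) world points to (N, 2) pixels via (u, v, 1)^T ∝ K [R|t] (x, y, z, 1)^T."""
    cam = points_3d @ R.T + t        # extrinsics: world -> camera coordinates
    uvw = cam @ K.T                  # intrinsics: camera -> homogeneous pixel coords
    return uvw[:, :2] / uvw[:, 2:3]  # dehomogenize to (u, v)
```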

The projected 2D regions $R_j = \{b_{j, f_1}, \ldots, b_{j, f_k}\}$ encode object $j$'s image-space footprint across all frames. During training, a scheduler $T_{\delta_t}$ selects $m \ll M$ of these boxes, builds the region union $\hat{R} = \bigcup_{j=1}^m R_j$, and injects Gaussian noise into exactly those image regions in each frame. This mechanism specifically degrades the visual signature of the selected objects, forcing the model to rely on broader spatial and relational reasoning rather than shortcutting via query-focused cues.
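
A non-authoritative sketch of this selection-and-union step, assuming each object's per-frame footprint is stored as an axis-aligned 2D box and using random choice as a stand-in for the scheduler $T_{\delta_t}$:

```python
import numpy as np

def build_region_masks(regions, m, frame_hw, rng=None):
    """regions: list of M dicts, each mapping frame index -> (u0, v0, u1, v1).
    Returns {frame: (H, W) bool mask} for the union R_hat of m selected objects."""
    rng = rng or np.random.default_rng()
    H, W = frame_hw
    selected = rng.choice(len(regions), size=m, replace=False)  # stand-in for T_{delta_t}
    masks = {}
    for j in selected:
        for f, (u0, v0, u1, v1) in regions[j].items():
            mask = masks.setdefault(f, np.zeros((H, W), dtype=bool))
            mask[int(v0):int(v1), int(u0):int(u1)] = True  # accumulate the union
    return masks
```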

2. 2D Projection and Synthesis of Region-Noisy Videos

Each selected object is defined by its box center $(x_j, y_j, z_j)$ and half-sizes $(\Delta x_j, \Delta y_j, \Delta z_j)$. A 3D point $p \in B_j$ projects to pixel $p_f \propto K[R|t]\,[x\ y\ z\ 1]^T$ in frame $f$. If $p_f \in \hat{R}$, the pixel is replaced with independent Gaussian noise $\mathcal{N}(0, \sigma^2)$. Repeating this operation for all $m$ selected boxes synthesizes a region-noisy variant $\hat{W}$ of video $V$, in which only the 2D projections of the selected objects are perturbed, preserving global scene coherence and real object boundaries.
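
A minimal sketch of the pixel replacement, assuming float-valued frames and the union masks from the previous step; the noise scale $\sigma$ here is illustrative:

```python
import numpy as np

def noisify_frame(frame: np.ndarray, mask: np.ndarray, sigma: float = 1.0,
                  rng=None) -> np.ndarray:
    """frame: (H, W, C) float array; mask: (H, W) bool union of projected object regions."""
    rng = rng or np.random.default_rng()
    noisy = frame.copy()
    noise = rng.normal(0.0, sigma, size=frame.shape)
    noisy[mask] = noise[mask]  # replace only pixels inside R_hat with N(0, sigma^2)
    return noisy
```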

3. Rollout-Based Training Pipeline and Objective

The OCR pipeline integrates both clean and region-noisy videos into Group Relative Policy Optimization (GRPO). At each training step:

  1. Sample a (video, query) pair $(V, q)$.
  2. Extract all $B_j$ and select $m$ boxes via $T_{\delta_t}$.
  3. Construct $\hat{R}$ and synthesize the region-noisy video $\hat{W}$ as above.
  4. Using the old policy $\pi_{\theta_{\text{old}}}$, generate $n$ rollouts each for the clean pair $(V, q)$ and the noisy pair $(\hat{W}, q)$.
  5. Evaluate all rollouts with a rule-based reward $r(\cdot)$ and normalize advantages:

\mu = \text{mean}(r_1, \ldots, r_{2n}), \quad \sigma = \text{std}(r_1, \ldots, r_{2n}), \quad \hat{A}_i = \frac{r_i - \mu}{\sigma}

  6. Compute the PPO-style policy loss, where only the clean rollouts ($i \leq n$) contribute to the gradient while the advantage statistics use all $2n$ rollouts:

J_{\text{OCR}}(\theta) = \mathbb{E}_{(V, q),\, o_{1 \ldots 2n}} \left[ \frac{1}{2n} \sum_{i=1}^{2n} \min \left( \rho_i \hat{A}_i,\; \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\, \hat{A}_i \right) - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \right]

with $\rho_i = \pi_\theta(o_i \mid V, q) / \pi_{\theta_{\text{old}}}(o_i \mid V, q)$ and a KL penalty toward the reference policy $\pi_{\text{ref}}$.
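
To make the objective concrete, here is a simplified PyTorch sketch; sequence-level log-probabilities, rewards, and the KL term are assumed to be computed elsewhere, and the tensor names are placeholders rather than the paper's implementation.

```python
import torch

def ocr_policy_loss(logp_new, logp_old, rewards, clip_eps=0.2,
                    kl_term=0.0, beta=0.01):
    """logp_new / logp_old: (2n,) rollout log-probs under current / old policy;
    rewards: (2n,) rule-based rewards over clean and noisy rollouts."""
    # Advantage statistics pool all 2n rollouts, clean and noisy alike.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    rho = torch.exp(logp_new - logp_old)  # importance ratio rho_i
    surrogate = torch.minimum(rho * adv,
                              torch.clamp(rho, 1 - clip_eps, 1 + clip_eps) * adv)
    # Following the displayed objective, the surrogate is averaged over all 2n
    # rollouts; per the text, gradient flow can additionally be restricted to
    # the clean rollouts (i <= n).
    return -(surrogate.mean() - beta * kl_term)  # minimize the negative of J_OCR
```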

4. Comparison with Existing Rollout Strategies

OCR is contrasted with prior rollout regularization techniques:

| Strategy | Type of Perturbation | Spatial Structure |
|---|---|---|
| T-GRPO (Video-R1) | Temporal shuffling of video frames | No |
| NoisyRollout | Global pixel noise over whole frames | No |
| OCR | Noise on projected 2D object regions | Yes |

OCR specifically targets the 2D regions corresponding to projected 3D object bounding boxes, unlike T-GRPO and NoisyRollout, which alter global or temporal aspects but lack spatial object selectivity. This region-based perturbation compels the model toward holistic global scene reasoning through tightly controlled degradation of object evidence.

5. Experimental Evaluation and Benchmarking

The OCR methodology was instantiated with a Qwen2.5-VL-3B-Instruct backbone, trained on 16-frame input videos and evaluated with 32-frame inference. Training proceeded in two stages: SFT (2 epochs on ~2K chain-of-thought spatial reasoning samples, the "OCR-SFT" set) followed by RL fine-tuning (2,000 GRPO steps over 98K samples, "OCR-RL"), mixing 4 clean and 4 noisy rollouts per step. Training ran on an 8×A100 GPU cluster.
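
For reference, the reported recipe can be collected into an illustrative configuration; the field names below are ours, and only the values come from the description above.

```python
ocr_train_config = {
    "backbone": "Qwen2.5-VL-3B-Instruct",
    "train_frames": 16,   # frames per video during training
    "eval_frames": 32,    # frames per video at inference
    "sft": {"epochs": 2, "data": "~2K CoT spatial reasoning samples (OCR-SFT)"},
    "rl": {"grpo_steps": 2000, "data": "98K samples (OCR-RL)"},
    "rollouts_per_step": {"clean": 4, "noisy": 4},
    "hardware": "8x A100 GPUs",
}
```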

OCR was assessed on VSI-Bench, which contains ~5,130 samples measuring eight video spatial reasoning skills, scored as percent correct. Main results for selected models are presented below:

| Model | #Params | Avg. Accuracy |
|---|---|---|
| GPT-4o (API) | – | 34.0 |
| Gemini-1.5-Pro (API) | – | 45.4 |
| InternVL2-8B | 8B | 34.6 |
| VG-LLM | 4B | 46.1 |
| Video-R1 | 7B | 37.1 |
| SpaceR | 7B | 45.6 |
| OCR | 3B | 47.5 |

Per-category results for OCR compared to the next-best open-source method:

  • Object count: 63.2% (vs. 59.9%)
  • Absolute distance: 34.1% (vs. 29.6%)
  • Object size: 57.4% (vs. 50.8%)
  • Room size: 46.7% (vs. 48.3%)
  • Relative distance: 39.6% (vs. 35.4%)
  • Relative direction: 45.5% (vs. 35.6%)
  • Route planning: 44.3% (vs. 34.0%)
  • Appearance order: 49.8% (vs. 31.5%)

6. Ablation Studies and Design Analysis

Ablation and design-validation experiments confirm the critical contributions of OCR’s object-centric noise and cold-start data:

  • Replacing the Video-R1 cold-start set (51 samples) with the OCR-SFT set (~2K samples) yields +5.3 pp in vanilla GRPO and +1.9 pp in OCR.
  • Policy variant outcomes (average accuracy):
    • Baseline (no GRPO): 35.1
    • +Vanilla GRPO: 43.9
    • +T-GRPO: 44.1
    • +NoisyRollout: 45.6
    • +Downsample Rollout: 44.2
    • +OCR: 47.5
  • Object scheduler policies:
    • Fixed 25% object perturb: 45.0
    • Linear decay 50%→0%: 47.5
    • Exponential decay: 46.0
    • Cosine decay: 46.3

A linear annealing schedule over both number of perturbed objects and noise level most effectively balances curriculum difficulty.
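
A minimal sketch of such a linear schedule, decaying the fraction of perturbed objects from 50% to 0% over training; the exact endpoint handling is an assumption:

```python
def linear_perturb_fraction(step: int, total_steps: int,
                            start: float = 0.5, end: float = 0.0) -> float:
    """Linearly anneal the fraction of objects to perturb (the role of T_{delta_t})."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress

# e.g. with total_steps=2000: step 0 -> 0.50, step 1000 -> 0.25, step 2000 -> 0.00
```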

7. Significance for Video Spatial Reasoning in MLLMs

OCR addresses the observed limitation of “query-locked” reasoning in current vision-LLMs, which tend to focus disproportionately on objects explicitly mentioned in the query and neglect wider scene structure. By enforcing global scene reasoning through spatially and geometrically structured noise, OCR achieves state-of-the-art results on challenging spatial reasoning tasks with a smaller parameter count (3B) than competing 7B-class baselines. The combination of high-quality cold-start data, strict region-based perturbations, mathematically grounded projection, and rollout-based curriculum is validated via comprehensive ablation and benchmark studies, highlighting its impact within the spatial reasoning landscape (Tang et al., 17 Nov 2025).
