- The paper presents ReOI, a test-time observation intervention strategy that counteracts distractors in visual MPC.
- It uses segmentation and inpainting to transform distractor-affected observations into in-distribution inputs, achieving SSIM 0.94 and LPIPS 0.04.
- The approach improves robot planning safety and task success, raising success rates to 70% with effective VLM-based action plan verification.
Distractor-Robust World Model Predictions via Test-time Observation Interventions for Visual MPC
This paper introduces the Reimagination with Observation Intervention (ReOI) strategy to address a crucial limitation in world model-based robot planning: the brittleness of visual world models to out-of-distribution visual distractors. Rather than proposing further training-time solutions, the method intervenes on observations at test time and demonstrates effective mitigation of hallucinated outcomes in visual model predictive control (MPC) tasks.
Problem Scope and Motivation
World models have grown central in robot learning and visual MPC, simulating the effect of candidate action plans via action-conditioned rollouts in visual space. Current state-of-the-art models, including those leveraging large-scale visual representations (e.g., DINOv2), nonetheless remain acutely sensitive to novel scene elements at deployment not encountered during training. These elements—distractors—may be background artifacts, innocuous items, or objects with safety implications. Their presence typically leads to implausible predictions, such as object disappearance, warping, or the robot hallucinating unobstructed trajectories through occluded or critical regions, thereby compromising task performance and safety.
Training-time strategies such as domain randomization, privileged signals, or context-specific distractor suppression are limited by their dependence on the environments and distractor distributions present during training. In real-world open-set settings, deployment inevitably exposes systems to unmodeled distractors. Thus, the authors argue that addressing model generalization limitations at test time is essential for robust deployment.
Method: Reimagination with Observation Intervention (ReOI)
ReOI is a cascaded, modular, and largely model-agnostic approach applied entirely at test time, without retraining or altering the world model. The pipeline comprises:
- Distractor Identification: Leveraging the tendency of unfamiliar distractors to exhibit physically implausible temporal evolution under the world model, ReOI performs a “rollout” using a canonical (in-distribution) action plan. A vision-language model (VLM; specifically GPT-4o) is prompted to reason over the initial frame and a later frame, referencing semantic segmentation masks with unique IDs, to identify object regions that disappear or degrade implausibly. Identified regions are flagged as novel distractors.
- Segmentation and Inpainting: Using state-of-the-art segmentation (Grounded-SAM2), the system extracts spatial masks for distractors, then inpaints these with a diffusion-based model (ROVI-AUG), restoring the occluded image regions to plausible, in-distribution states. This produces a distractor-free observation, closer to the model’s training manifold.
- Reimagination and Rollout: The inpainted, distractor-free observation serves as the initial input for the world model. Rollouts from this “intervened” observation yield future predictions less influenced by unseen artifacts, avoiding physical implausibility in both robot and object dynamics.
- Distractor Re-insertion for Consistent MPC: To maintain temporal and spatial consistency for downstream planners and verifiers, distractors are composited back into predicted frames post-hoc. The layering uses depth-aware compositing, enforcing correct occlusion relationships and re-introducing the distractors visually (though the world model itself did not simulate interaction with them). Any predicted trajectory intersecting an occluded region is automatically rejected by the verifier as unsafe.
- Action Plan Selection and Verification: A VLM verifier evaluates proposed action plans, using the composited rollouts and the textual task specification to select trajectories that avoid safety violations and best match user intent.
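The inpaint, rollout, and re-insertion steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `inpaint`, `world_model`, and `depth_of` are hypothetical callables standing in for the off-the-shelf inpainting, world model, and depth estimation components, and the convention that a smaller depth value means closer to the camera is also assumed.

```python
import numpy as np

def composite_distractor(pred_rgb, pred_depth, dis_rgb, dis_depth, dis_mask):
    """Paste distractor pixels into a predicted frame wherever the distractor
    lies in front of the predicted scene content.

    Assumption: smaller depth means closer to the camera.
    pred_rgb, dis_rgb: (H, W, 3) arrays; pred_depth, dis_depth: (H, W) arrays;
    dis_mask: (H, W) boolean array marking distractor pixels.
    """
    # The distractor is drawn only inside its mask and only where it occludes
    # whatever the world model predicted at that pixel.
    visible = dis_mask & (dis_depth < pred_depth)
    out = pred_rgb.copy()
    out[visible] = dis_rgb[visible]
    return out, visible

def reoi_rollout(obs, obs_depth, dis_mask, plan, inpaint, world_model, depth_of):
    """Inpaint, roll out, then re-insert the distractor for one action plan.

    inpaint(obs, mask) -> distractor-free image        (hypothetical)
    world_model(obs, plan) -> list of predicted frames (hypothetical)
    depth_of(frame) -> (H, W) depth map                (hypothetical)
    """
    clean_obs = inpaint(obs, dis_mask)      # distractor-free observation
    frames = world_model(clean_obs, plan)   # rollout from the intervened input
    composited = []
    for frame in frames:
        # The distractor is assumed static, so its appearance and depth are
        # taken from the original observation for every predicted frame.
        out, _ = composite_distractor(frame, depth_of(frame), obs, obs_depth, dis_mask)
        composited.append(out)
    return composited
```

Because the world model never sees the distractor, the composited frames only show where it would occlude the predicted scene; any judgment about whether a plan collides with it is left to the downstream verifier.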
Implementation Details
- Policy and Model Training: The robot policy is trained with Diffusion Policy on 120 demonstration trajectories. The world model is based on DINOv2 latent representations (“medium_gray”) with 500 training trajectories (200 policy rollouts; 300 random). Inpainting, segmentation, and depth estimation use off-the-shelf models. All inference is performed without model retraining, and the ReOI pipeline is deployed in a real-world tabletop manipulation environment.
- Hardware: Training is performed on a single NVIDIA A6000 GPU; execution speed depends primarily on the efficiency of the segmentation and inpainting models and on the world model's rollout throughput.
Results
The paper presents both qualitative and quantitative evaluations of ReOI in challenging tabletop manipulation tasks.
Prediction Quality
- SSIM/LPIPS: On full-frame metrics, ReOI achieves SSIM 0.94 and LPIPS 0.04, substantially outperforming the baseline world model (SSIM 0.51, LPIPS 0.17) in scenarios with novel distractors. When distractor regions are masked out for a fair assessment of in-distribution object dynamics, ReOI still shows marked superiority.
- Qualitative: ReOI prevents the disappearance or hallucination of task-relevant objects and robot arms (e.g., green pepper consistently present, no target object erasure), where the baseline model exhibits severe distortions.
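To illustrate what masking out distractor regions does to a similarity score, the sketch below computes a simplified, single-window SSIM over only the pixels that are kept. It is an assumption-laden illustration: standard SSIM averages the statistic over local windows, and the paper's exact evaluation code is not described in this summary. The function name `masked_global_ssim` and the grayscale inputs are hypothetical choices made for brevity.

```python
import numpy as np

def masked_global_ssim(x, y, keep, data_range=1.0):
    """Simplified SSIM computed once over all pixels where `keep` is True.

    NOTE: this is a global (single-window) variant used for illustration;
    standard SSIM averages the same statistic over local windows.
    x, y: (H, W) grayscale images; keep: (H, W) boolean mask.
    """
    # Stabilizing constants from the usual SSIM definition (K1=0.01, K2=0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    a = x[keep].astype(np.float64)
    b = y[keep].astype(np.float64)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

# Usage: corrupt only the distractor region of a prediction, then score it
# with and without that region included.
rng = np.random.default_rng(0)
ground_truth = rng.random((8, 8))
prediction = ground_truth.copy()
prediction[:4, :4] = 0.0                    # error confined to the distractor
keep = np.ones((8, 8), dtype=bool)
keep[:4, :4] = False                        # exclude the distractor pixels
score_masked = masked_global_ssim(prediction, ground_truth, keep)
score_full = masked_global_ssim(prediction, ground_truth, np.ones((8, 8), dtype=bool))
```

In this toy case the masked score ignores the corrupted region entirely, while the full-frame score is pulled down by it, which is why the two views of the metric are reported separately.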
System-level Planning and Safety
- Task Success Rate: Using VLM-based plan verification, ReOI produces a 70% success rate (compared to 20% for the raw model and 0% for a conservative trust-region rejection baseline). The collision rate is reduced to 10% (on par with the conservative baseline).
Plan Verification
- Verifier Accuracy: GPT-4o as a plan verifier achieves 94% verification and 85% selection accuracy on ReOI-processed rollouts, indicating competent selection of safe, intent-aligned action plans based on improved predictions.
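The selection step that follows verification can be sketched as a filter-then-rank rule. This is a hedged illustration only: the summary does not specify how the GPT-4o verifier's output is structured, so the `Verdict` fields, the `verify` callable, and the numeric alignment score are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Verdict:
    safe: bool     # hypothetical: verifier found no safety violation
    score: float   # hypothetical: task-alignment score, higher is better

def select_plan(rollouts: Sequence[object],
                verify: Callable[[object], Verdict]) -> Optional[int]:
    """Return the index of the safe rollout with the highest alignment score,
    or None when every candidate is judged unsafe."""
    verdicts = [verify(r) for r in rollouts]
    safe_ids = [i for i, v in enumerate(verdicts) if v.safe]
    if not safe_ids:
        return None  # no plan passes verification; caller must replan or stop
    return max(safe_ids, key=lambda i: verdicts[i].score)

# Usage with a stub verifier standing in for the VLM call.
stub = {
    "plan_a": Verdict(safe=False, score=0.9),
    "plan_b": Verdict(safe=True, score=0.6),
    "plan_c": Verdict(safe=True, score=0.8),
}
names = list(stub)
best = select_plan(names, lambda name: stub[name])
```

Filtering on safety before ranking means a high-scoring but unsafe plan (here `plan_a`) can never be chosen, at the cost of returning no plan when all candidates are rejected.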
Practical and Theoretical Implications
Practical Implications:
- Safe Real-world Deployment: Test-time observation intervention provides an effective plug-in module to existing visual MPC pipelines, markedly improving robustness to novel distractors without retraining. This is highly relevant for domestic robots, warehouse automation, and any system regularly encountering cluttered, dynamic environments.
- VLM-driven Causal Reasoning: Integrating VLMs for both distractor identification and action plan verification facilitates generalization across new scenes with minimal human supervision or bespoke engineering.
- Modularity: The method is agnostic to the world model or inpainting/segmentation backbone, easing adoption across varied robotics pipelines.
Theoretical Implications:
- Limits of Training-time Augmentation: The results reinforce the view that no finite training distribution can cover the boundless variety of distractors encountered in unstructured environments, thus necessitating adaptive test-time compensation.
- Latent Manifold Stability: The observed degradation of unseen visual features during autoregressive world model rollouts highlights the importance of “manifold alignment” during inference, not only during learning.
Potential Future Directions
- Dynamic Distractor Interventions: While this work focuses on static distractors, extending the pipeline to accommodate dynamic, interacting novel objects remains a critical next step.
- End-to-End Differentiable Interventions: Integrating the observation intervention mechanism within the world model’s learning and inference loop may yield more seamless adaptability.
- Multi-modal Sensor Fusion: Combining visual interventions with depth, tactile, or audio cues could further enhance robustness in complex or ambiguous settings.
- Real-time Performance Optimization: Reducing the compute cost of segmentation, inpainting, and VLM verification is needed for deployment on embedded systems or edge hardware.
Summary Table of Core Results
| Metric | Baseline (medium_gray) | ReOI (org-purp-1) |
| --- | --- | --- |
| SSIM (full obs) | 0.51 | 0.94 |
| LPIPS (full obs) | 0.17 | 0.04 |
| Task Success Rate | 0.20 | 0.70 |
| Collision Rate | 0.40 | 0.10 |
Conclusion
ReOI demonstrates that test-time observation interventions can substantially improve the reliability, consistency, and safety of world model-based robot planning under open-set, visually cluttered conditions. By algorithmically bridging the gap between training and deployment distributions, the approach enables practical deployment of model-based visual MPC in unconstrained environments without retraining or domain-specific engineering. This line of work opens promising directions for modular, VLM-in-the-loop adaptive perception and planning systems in robotics.