- The paper introduces a pipeline that fuses VLM priors with interactive online adaptation to enable uncertainty-aware sim-to-real transfer in manipulation tasks.
- The methodology integrates real-to-sim mesh reconstruction, physics-conditioned policy learning, and inverse-variance fusion to balance visual estimates with interaction data.
- Experimental evaluations reveal improved success rates and reduced errors on planar pushing tasks compared to domain randomization and imitation learning baselines.
Phys2Real: Integrating VLM Priors and Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation
Introduction
Phys2Real introduces a real-to-sim-to-real pipeline for robotic manipulation that fuses vision-language model (VLM) priors with interactive online adaptation, enabling uncertainty-aware sim-to-real transfer for tasks requiring precise physical dynamics. The framework is motivated by the limitations of domain randomization (DR) and static system identification, which often fail to generalize to out-of-distribution object properties or adapt to novel physical configurations. Phys2Real leverages VLMs for initial physical parameter estimation and refines these estimates through interaction-based adaptation, using ensemble-based uncertainty quantification to fuse the two information sources. The approach is evaluated on planar pushing tasks involving objects with varying center of mass (CoM), demonstrating substantial improvements over DR and imitation learning baselines.
Figure 1: Phys2Real pipeline overview, illustrating real-to-sim mesh reconstruction, policy learning, and uncertainty-aware fusion of VLM priors with interaction-based adaptation.
Real-to-Sim Mesh Reconstruction
Phys2Real begins with a real-to-sim pipeline that reconstructs simulation-ready object meshes from video frames. Object segmentation is performed using SAM-2, followed by 3D Gaussian Splatting (GSplat) to generate high-fidelity geometric representations. Surface-aligned meshes are extracted using SuGaR, resulting in watertight assets suitable for physics simulation. This automated pipeline minimizes manual intervention and enables rapid creation of digital twins that capture both geometric and physical properties.
Figure 2: Real-to-sim mesh reconstruction pipeline from video frames to simulation-ready mesh using SAM-2, GSplat, and SuGaR.
Physics-Conditioned Policy Learning
Policy learning in Phys2Real is structured in three phases, inspired by Rapid Motor Adaptation (RMA):
- Phase 1: Train RL policies conditioned on ground-truth physical parameters (e.g., CoM) in simulation, enabling optimal behaviors for different object configurations.
- Phase 1.5: Fine-tune policies with noisy parameter estimates to build robustness against downstream estimation errors.
- Phase 2: Freeze policy weights and train an ensemble of adaptation models to predict physical parameters from historical state-action sequences. Ensemble variance quantifies epistemic uncertainty, while each model outputs aleatoric uncertainty via Gaussian negative log-likelihood.
Policies are trained using PPO with asymmetric actor-critic architectures, conditioning the actor on object pose, end-effector position, and physical parameters. The critic receives privileged observations. This explicit conditioning on interpretable physical parameters, rather than latent vectors, is critical for downstream fusion with VLM priors.
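The Phase 2 uncertainty decomposition can be sketched numerically: epistemic uncertainty is the variance of the ensemble members' mean predictions, while aleatoric uncertainty is the average of the variances each member outputs. The snippet below is a minimal illustration with toy stand-ins for the trained networks (`ensemble_predict` and all its internals are hypothetical, not the paper's implementation); only the variance bookkeeping reflects the method described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the Phase 2 ensemble: each member maps an
# interaction history to a Gaussian over the physical parameter (here, a
# 2-D center of mass). Real members would be neural networks trained with
# a Gaussian negative log-likelihood loss; these toy members just perturb
# a base guess derived from the history.
def ensemble_predict(history, n_members=5):
    base = history.mean(axis=0)[:2]               # toy point estimate
    means = np.stack(
        [base + rng.normal(0.0, 0.01, size=2) for _ in range(n_members)]
    )
    variances = np.full_like(means, 0.02 ** 2)    # each member's aleatoric variance
    return means, variances

history = rng.normal(size=(50, 6))                # 50 steps of state-action features
means, variances = ensemble_predict(history)

theta_rma = means.mean(axis=0)                    # ensemble-mean parameter estimate
epistemic = means.var(axis=0)                     # disagreement across members
aleatoric = variances.mean(axis=0)                # average predicted observation noise
sigma_rma_sq = epistemic + aleatoric              # total predictive variance
```

The total variance `sigma_rma_sq` is what the fusion step downstream weighs against the VLM prior's variance.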
Figure 3: Phys2Real policy training phases, including ground-truth conditioning, robustness fine-tuning, and ensemble-based adaptation for uncertainty quantification.
Uncertainty-Aware Fusion of VLM Priors and Interaction-Based Estimates
Phys2Real addresses the challenge of uninformative interaction histories by fusing VLM-based visual estimates with adaptation model predictions using inverse-variance weighting. Both sources provide parameter estimates and associated uncertainties. The fusion mechanism dynamically balances reliance on VLM priors and interaction-based adaptation according to their respective uncertainties:
$$\hat{\theta} = \frac{\theta_{\mathrm{vlm}}/\sigma_{\mathrm{vlm}}^{2} + \theta_{\mathrm{rma}}/\sigma_{\mathrm{rma}}^{2}}{1/\sigma_{\mathrm{vlm}}^{2} + 1/\sigma_{\mathrm{rma}}^{2}}$$
This approach enables robust adaptation even during intermittent contact scenarios, where interaction data may be sparse or noisy.
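The inverse-variance weighting above is straightforward to implement. The sketch below (the values are illustrative, not from the paper) shows the intended behavior: when interaction data is uninformative the adaptation variance is large and the fused estimate stays close to the VLM prior; as contact accumulates and the adaptation variance shrinks, the interaction-based estimate dominates.

```python
import numpy as np

def fuse(theta_vlm, sigma_vlm_sq, theta_rma, sigma_rma_sq):
    """Precision-weighted (inverse-variance) fusion of two Gaussian estimates."""
    w_vlm = 1.0 / sigma_vlm_sq
    w_rma = 1.0 / sigma_rma_sq
    theta_hat = (w_vlm * theta_vlm + w_rma * theta_rma) / (w_vlm + w_rma)
    sigma_hat_sq = 1.0 / (w_vlm + w_rma)          # fused variance never exceeds either input
    return theta_hat, sigma_hat_sq

# Before contact: uninformative history, large adaptation variance,
# so the fused CoM estimate stays near the VLM prior (0.10).
theta_pre, var_pre = fuse(0.10, sigma_vlm_sq=0.02, theta_rma=0.30, sigma_rma_sq=2.0)

# After sustained contact: adaptation variance shrinks,
# and the fused estimate moves toward the interaction-based value (0.30).
theta_post, var_post = fuse(0.10, sigma_vlm_sq=0.02, theta_rma=0.30, sigma_rma_sq=0.001)
```

Because the fused variance is the reciprocal of the summed precisions, fusion is never less certain than the more confident of the two sources.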
Figure 4: VLM priors for task-relevant physical parameters, illustrating multi-view querying and uncertainty estimation for CoM.
Experimental Evaluation
Phys2Real is evaluated on two planar pushing tasks: a T-block with variable CoM (via repositioned metal weights) and a hammer with off-center mass distribution. Metrics include success rate, position error, orientation error, and task completion time. The evaluation addresses four key questions:
- Q1: Does accurate physical parameter estimation improve manipulation policy performance?
- Q2: Are VLM-estimated parameters sufficient for policy improvement?
- Q3: Is interaction-based adaptation alone sufficient?
- Q4: Can Phys2Real generalize to real-world objects without known meshes?
T-block Pushing
Phys2Real achieves a 100% success rate and the lowest position error when the weight is at the bottom of the T-block, outperforming DR (79%) and RMA-only (79%). In the more challenging top-weight configuration, Phys2Real (57.14%) outperforms VLM-only (4.76%) and RMA-only (14.29%), but is surpassed by the privileged oracle (90.48%). CDF analysis of position errors shows that Phys2Real maintains low error across percentiles, while DR and the diffusion policy exhibit long tails of high error.

Figure 5: CDFs of position error for T-block pushing with weight at top and bottom, showing Phys2Real's consistently low error distribution.
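For readers unfamiliar with the CDF comparison used here, the sketch below shows how such curves are computed. The error samples are synthetic stand-ins (not the paper's measurements), chosen only to mimic the qualitative contrast between a tight, low-error distribution and one with a long tail.

```python
import numpy as np

def empirical_cdf(errors):
    """Return sorted errors and cumulative fractions, suitable for a CDF plot."""
    x = np.sort(np.asarray(errors, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Illustrative (not measured) position-error samples, in cm:
# a tight distribution vs. one where 20% of trials land in a high-error tail.
rng = np.random.default_rng(1)
tight = rng.normal(1.0, 0.2, 100).clip(min=0.0)
long_tail = np.concatenate(
    [rng.normal(1.5, 0.4, 80), rng.normal(6.0, 1.0, 20)]
).clip(min=0.0)

x_t, y_t = empirical_cdf(tight)
x_l, y_l = empirical_cdf(long_tail)

# The 90th-percentile error exposes the tail that mean error can hide.
p90_tight = x_t[np.searchsorted(y_t, 0.9)]
p90_long = x_l[np.searchsorted(y_l, 0.9)]
```

A curve that rises quickly and saturates near 1 at small error (like `tight`) indicates consistently accurate pushes; a slow-rising tail (like `long_tail`) reveals the occasional large failures reported for the baselines.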
Ablation studies reveal that neither VLM-only nor RMA-only conditioning is sufficient; both sources are necessary for robust performance. The fusion mechanism allows Phys2Real to adapt as interaction uncertainty decreases, converging toward ground truth CoM during contact and reverting to VLM priors when contact ceases.

Figure 6: Time series of CoM estimates during T-block pushing, illustrating dynamic fusion of VLM and RMA estimates as interaction progresses.
Hammer Pushing
For hammer pushing, both Phys2Real and DR achieve 100% success, but Phys2Real completes the task 15% faster on average, indicating more efficient trajectories. The mesh for the hammer is generated via the real-to-sim pipeline, demonstrating Phys2Real's applicability to objects without pre-existing models.
Robustness to VLM Estimation Error
Additional ablations show Phys2Real is robust to inaccurate VLM parameter estimates, maintaining high success rates even as VLM CoM deviates from ground truth. In contrast, policies trained with noisy parameter estimates (Phase 1.5) degrade significantly as VLM error increases. This highlights the advantage of uncertainty-aware fusion over naive conditioning.
Implications and Future Directions
Phys2Real demonstrates that integrating VLM priors with interactive adaptation enables robust, uncertainty-aware sim-to-real transfer for manipulation tasks requiring precise physical reasoning. The explicit conditioning on interpretable physical parameters facilitates dynamic fusion and adaptation, outperforming DR and imitation learning baselines. The approach generalizes to objects without known meshes via automated reconstruction pipelines.
Practically, Phys2Real enables more adaptive robotic systems capable of manipulating novel objects with unknown or variable physical properties. Theoretically, the work suggests that foundation models can be leveraged for low-level control when combined with uncertainty-aware adaptation, bridging the gap between high-level reasoning and physical interaction.
Future research may extend Phys2Real to asymmetric objects, integrate perception-based tracking in place of motion capture, and explore more complex manipulation tasks. The fusion of vision, language, and interaction-based physical reasoning represents a promising direction for general-purpose robotic manipulation.
Conclusion
Phys2Real presents a principled framework for sim-to-real transfer in robotic manipulation, combining VLM-inferred physical parameter priors with interactive online adaptation via uncertainty-aware fusion. The method achieves strong empirical results across multiple tasks and configurations, approaching privileged oracle performance without access to ground-truth physical information. The integration of visual geometry, physical understanding, and adaptive control sets a foundation for more general and robust robotic systems capable of manipulating diverse, novel objects in real-world environments.