
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots (2509.02530v1)

Published 2 Sep 2025 in cs.RO, cs.AI, and cs.CV

Abstract: Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.

Summary

  • The paper introduces Camera Depth Models (CDMs) that denoise raw depth maps using a dual-branch ViT encoder and cross-modal token fusion for accurate 3D perception.
  • The paper demonstrates state-of-the-art performance in sim-to-real transfer, enabling policies trained in simulation to succeed on real-world robotic manipulation tasks.
  • The paper validates CDMs with minimal latency and robust generalization across diverse camera sensors, reducing the need for extensive real-world data collection.

Accurate Geometry Perception for Robotic Manipulation via Camera Depth Models

Motivation and Problem Statement

Robotic manipulation in real-world environments fundamentally depends on accurate perception of 3D geometry. While recent advances in visuomotor policy learning have leveraged 2D RGB images, these approaches are limited by their inability to capture metric spatial relationships, leading to poor generalization across object shapes, sizes, and environmental conditions. Depth cameras, in principle, provide direct access to 3D geometry, but their outputs are plagued by characteristic noise, missing data (holes), and sensor-specific artifacts. This noise severely degrades the utility of depth data for manipulation, especially in tasks involving reflective, transparent, or slender objects. As a result, most prior work either restricts evaluation to simulation (where perfect depth is available) or applies aggressive downsampling and post-processing to mitigate noise, sacrificing geometric fidelity.

Camera Depth Models: Architecture and Data Synthesis

The paper introduces Camera Depth Models (CDMs), a family of neural models designed as plug-ins for commodity depth cameras. CDMs take as input an RGB image and a raw depth map from a specific camera and output a denoised, accurate metric depth map. The architecture is a dual-branch ViT-based encoder, with separate branches for RGB and depth, followed by a cross-modal token fusion module and a DPT-style decoder. This design enables the model to leverage both semantic cues from RGB and scale cues from the depth prompt, without requiring pre-processing such as hole-filling.
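
To make the architecture concrete, the following PyTorch sketch shows one way a dual-branch encoder with cross-modal token fusion and an upsampling decoder could be wired. Class names, dimensions, and the simplified convolutional decoder are illustrative assumptions, not the authors' implementation (which uses DINOv2-initialized ViT encoders and a DPT-style decoder).

```python
# Illustrative sketch of a dual-branch encoder with cross-modal token fusion.
# Names and dimensions are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn an image-like tensor into a sequence of patch tokens."""
    def __init__(self, in_ch, dim=256, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        tokens = self.proj(x)                                # (B, dim, H/16, W/16)
        b, d, h, w = tokens.shape
        return tokens.flatten(2).transpose(1, 2), (h, w)     # (B, N, dim)

class CameraDepthModel(nn.Module):
    """Dual-branch ViT-style encoder (RGB + raw depth) with token fusion
    and a lightweight upsampling decoder standing in for a DPT-style head."""
    def __init__(self, dim=256, layers=4, heads=8):
        super().__init__()
        self.rgb_embed = PatchEmbed(3, dim)
        self.depth_embed = PatchEmbed(1, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.rgb_encoder = nn.TransformerEncoder(layer, layers)
        self.depth_encoder = nn.TransformerEncoder(layer, layers)
        # Cross-modal fusion: depth tokens attend to RGB tokens.
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decoder = nn.Sequential(                        # simple conv decoder
            nn.ConvTranspose2d(dim, 128, 4, stride=4), nn.GELU(),
            nn.ConvTranspose2d(128, 64, 4, stride=4), nn.GELU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, rgb, raw_depth):
        rgb_tok, (h, w) = self.rgb_embed(rgb)
        dep_tok, _ = self.depth_embed(raw_depth)
        rgb_tok = self.rgb_encoder(rgb_tok)
        dep_tok = self.depth_encoder(dep_tok)
        fused, _ = self.fuse(dep_tok, rgb_tok, rgb_tok)      # query=depth, key/value=RGB
        fused = fused.transpose(1, 2).reshape(rgb.shape[0], -1, h, w)
        return self.decoder(fused)                           # denoised metric depth (B, 1, H, W)

model = CameraDepthModel()
pred = model(torch.rand(1, 3, 224, 224), torch.rand(1, 1, 224, 224))
print(pred.shape)  # torch.Size([1, 1, 224, 224])
```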

A critical challenge is the lack of high-quality, paired RGB-depth data for real cameras. The authors address this by constructing ByteCameraDepth, a large-scale dataset collected using a multi-camera rig with seven different depth cameras across ten modes, capturing over 170,000 RGB-depth pairs in diverse real-world scenes. To enable scalable training, they develop a neural data engine that learns to synthesize realistic camera-specific noise (both value and hole noise) from simulation data. The noise models are trained to mimic the statistical properties of real camera outputs, and a guided filter is introduced to address scale mismatches in synthesized noise, ensuring that the resulting depth maps preserve both local structure and global metric accuracy.
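
A conceptual sketch of the noise-synthesis step is shown below: a small network stands in for the learned noise models and predicts a value-noise field plus a hole-probability map that are applied to clean simulated depth. `NoiseNet`, its architecture, and the noise magnitudes are hypothetical placeholders, and the guided-filter correction described above is omitted for brevity.

```python
# Conceptual sketch of the noise-synthesis step of a data engine: a learned model
# predicts per-pixel value noise and hole probability, which are applied to clean
# simulated depth to mimic a target camera. NoiseNet is a hypothetical stand-in.
import torch
import torch.nn as nn

class NoiseNet(nn.Module):
    """Predicts a multiplicative value-noise field and a hole-probability map
    from RGB and clean depth (a stand-in for the learned noise models)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, rgb, clean_depth):
        out = self.body(torch.cat([rgb, clean_depth], dim=1))
        value_noise = 0.05 * torch.tanh(out[:, :1])    # bounded relative depth error
        hole_prob = torch.sigmoid(out[:, 1:])          # probability a pixel is missing
        return value_noise, hole_prob

def synthesize_raw_depth(rgb, clean_depth, noise_net):
    """Corrupt clean simulated depth so it resembles a real sensor's raw output."""
    value_noise, hole_prob = noise_net(rgb, clean_depth)
    noisy = clean_depth * (1.0 + value_noise)          # value noise keeps metric scale
    holes = torch.bernoulli(hole_prob)                 # sample missing-data mask
    return noisy * (1.0 - holes)                       # holes encoded as zero depth

noise_net = NoiseNet()
raw = synthesize_raw_depth(torch.rand(1, 3, 120, 160),
                           torch.rand(1, 1, 120, 160) * 2.0, noise_net)
```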

Model Training and Losses

CDMs are trained on synthesized data generated by applying the learned noise models to clean simulated depth maps. The training objective combines an L1 loss and a gradient loss to encourage both global accuracy and sharp depth discontinuities. The ViT encoders are initialized from DINOv2 weights, and the decoder is trained from scratch. The training corpus includes over 280,000 images from four open-source simulation datasets, with noise synthesized to match the characteristics of each target camera.
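
A hedged sketch of such a combined objective is given below; the exact gradient formulation and loss weighting used in the paper may differ.

```python
# Sketch of a combined L1 + gradient loss for depth training (weighting assumed).
import torch
import torch.nn.functional as F

def depth_loss(pred, target, grad_weight=0.5):
    """L1 term for global metric accuracy plus an image-gradient term
    that encourages sharp depth discontinuities."""
    l1 = F.l1_loss(pred, target)
    # Finite-difference gradients along x and y.
    dx_p, dx_t = pred[..., :, 1:] - pred[..., :, :-1], target[..., :, 1:] - target[..., :, :-1]
    dy_p, dy_t = pred[..., 1:, :] - pred[..., :-1, :], target[..., 1:, :] - target[..., :-1, :]
    grad = F.l1_loss(dx_p, dx_t) + F.l1_loss(dy_p, dy_t)
    return l1 + grad_weight * grad

loss = depth_loss(torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64))
```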

Sim-to-Real Transfer Pipeline

The core contribution is demonstrating that CDMs enable direct transfer of visuomotor policies trained in simulation (on clean depth) to real robots, without the need for domain randomization, noise injection, or real-world fine-tuning. The sim-to-real pipeline consists of:

  • Scene Construction: Geometrically similar objects and backgrounds are constructed in simulation, with only approximate physical properties.
  • Camera Alignment: Camera extrinsics are calibrated using differentiable rendering, and small pose randomizations are introduced to improve robustness.
  • Data Generation: Demonstrations are generated in simulation using an extension of MimicGen with whole-body control (WBCMimicGen), producing smooth, high-quality trajectories.
  • Policy Learning: Policies are trained via imitation learning on single-view depth images, using a ResNet encoder and a diffusion head for action prediction. No noise is added to simulated depth during training.
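
To illustrate the final step of the pipeline, the sketch below pairs a ResNet depth encoder with a simple diffusion-style action head and shows one DDPM-like training step. The backbone choice, dimensions, and toy noise schedule are assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of a depth-image policy with a ResNet encoder and a diffusion
# action head; hyperparameters and the noise schedule are illustrative only.
import torch
import torch.nn as nn
import torchvision

class DepthDiffusionPolicy(nn.Module):
    def __init__(self, action_dim=7, hidden=256, timesteps=100):
        super().__init__()
        self.encoder = torchvision.models.resnet18(weights=None)
        self.encoder.conv1 = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)  # single-channel depth
        self.encoder.fc = nn.Linear(self.encoder.fc.in_features, hidden)
        self.timesteps = timesteps
        # Noise-prediction head conditioned on observation features, noisy action, timestep.
        self.head = nn.Sequential(
            nn.Linear(hidden + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, depth, noisy_action, t):
        obs = self.encoder(depth)
        t_feat = t.float().unsqueeze(-1) / self.timesteps
        return self.head(torch.cat([obs, noisy_action, t_feat], dim=-1))

# One DDPM-style training step on a (depth, action) pair from simulated demos.
policy = DepthDiffusionPolicy()
depth = torch.rand(8, 1, 96, 96)               # clean simulated depth, no injected noise
action = torch.rand(8, 7)
t = torch.randint(0, policy.timesteps, (8,))
noise = torch.randn_like(action)
alpha = 1.0 - t.float().unsqueeze(-1) / policy.timesteps   # toy noise schedule
noisy_action = alpha.sqrt() * action + (1 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(policy(depth, noisy_action, t), noise)
```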

At deployment, the CDM processes the real camera's RGB and depth images, producing a clean depth map that is fed to the policy. This approach eliminates the need for any post-processing or hand-crafted augmentations.
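
A minimal sketch of this deployment loop follows, assuming hypothetical `camera`, `robot`, and `policy.sample_action` interfaces:

```python
# Hedged sketch of the deployment loop: the CDM cleans each raw frame and the
# policy consumes the cleaned depth directly, with no post-processing.
# camera.read(), robot.execute(), and policy.sample_action() are hypothetical APIs.
import torch

@torch.no_grad()
def control_step(camera, cdm, policy, robot, device="cuda"):
    rgb, raw_depth = camera.read()                  # hypothetical sensor interface
    rgb, raw_depth = rgb.to(device), raw_depth.to(device)
    clean_depth = cdm(rgb, raw_depth)               # denoised metric depth
    action = policy.sample_action(clean_depth)      # e.g. diffusion denoising at test time
    robot.execute(action)                           # hypothetical robot interface
```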

Experimental Results

Depth Prediction Performance

CDMs are evaluated on the Hammer dataset, which contains real RGB-depth pairs from multiple sensors. CDMs achieve state-of-the-art performance in metric depth prediction, outperforming prior prompt-based methods (PromptDA, PriorDA) both with and without hole-filling. Notably, CDMs generalize well across different camera types, and the model trained on L515 noise even outperforms the D435-specific model on some D435 data splits. Zero-shot generalization to unseen sensors (e.g., Lucid Helios) is also observed, indicating that CDMs capture common noise patterns across devices.

Real-World Manipulation

In depth-only imitation learning tasks (e.g., stacking bowls, placing toothpaste in a cup), policies using CDM-processed depth achieve significantly higher success rates and generalize to unseen object sizes, compared to policies using raw depth. In sim-to-real experiments on long-horizon tasks (e.g., placing a bowl in a microwave and closing the door, manipulating a fork and plate), policies trained solely in simulation and deployed with CDMs achieve real-world success rates matching or exceeding simulation performance. Competing methods using raw depth or prior prompt-based models fail to generalize, with near-zero success rates in most cases.

Latency and Deployment

CDMs introduce minimal inference latency (approximately 0.15 s per frame on an RTX 4090 GPU), comparable to or faster than prior methods, and require no pre- or post-processing. This enables real-time deployment in closed-loop control.

Analysis, Limitations, and Implications

The results provide strong evidence that accurate, simulation-level depth perception is a critical enabler for robust, generalizable robotic manipulation. By bridging the sim-to-real geometry gap, CDMs allow for direct exploitation of large-scale simulation data, reducing the need for costly real-world data collection and manual domain adaptation. The architecture's reliance on both RGB and depth enables recovery from typical sensor failures, but the model can still be misled when the depth prompt is severely corrupted and RGB cues are insufficient (e.g., large reflective surfaces with ambiguous appearance).

The approach is currently limited to single-view depth, and performance may degrade in highly cluttered or occluded scenes where multi-view fusion would be beneficial. The generalization of CDMs to new camera types is promising but not guaranteed; retraining or fine-tuning may be required for novel sensors or extreme environments.

Future Directions

CDMs open several avenues for future research:

  • Integration with 3D Foundation Models: By providing high-fidelity depth, CDMs can enable large-scale relabeling of RGB-D datasets, facilitating the training of generalist 3D robot policies.
  • Multi-View and Multi-Modal Fusion: Extending CDMs to handle multi-camera setups or fuse with other modalities (e.g., tactile, force) could further improve robustness.
  • Active Perception and Uncertainty Modeling: Incorporating uncertainty estimates from CDMs into downstream planning and control could enable more reliable manipulation in safety-critical settings.
  • Hardware-Software Co-Design: The plug-in nature of CDMs suggests opportunities for co-designing camera hardware and neural post-processing for optimal end-to-end performance.

Conclusion

This work demonstrates that camera-specific neural depth models can effectively bridge the sim-to-real gap in geometry perception, enabling direct transfer of depth-only visuomotor policies from simulation to real robots. The approach achieves strong empirical results on challenging manipulation tasks, highlighting the centrality of accurate 3D perception for generalizable robot learning. CDMs provide a practical, scalable solution for leveraging simulation data in real-world robotics, with broad implications for the development of robust, foundation-level manipulation policies.
