- The paper introduces Camera Depth Models (CDMs) that denoise raw depth maps using a dual-branch ViT encoder and cross-modal token fusion for accurate 3D perception.
- The paper demonstrates state-of-the-art performance in sim-to-real transfer, enabling policies trained in simulation to succeed on real-world robotic manipulation tasks.
- The paper shows that CDMs add minimal inference latency and generalize robustly across diverse camera sensors, reducing the need for extensive real-world data collection.
Accurate Geometry Perception for Robotic Manipulation via Camera Depth Models
Motivation and Problem Statement
Robotic manipulation in real-world environments fundamentally depends on accurate perception of 3D geometry. While recent advances in visuomotor policy learning have leveraged 2D RGB images, these approaches are limited by their inability to capture metric spatial relationships, leading to poor generalization across object shapes, sizes, and environmental conditions. Depth cameras, in principle, provide direct access to 3D geometry, but their outputs are plagued by characteristic noise, missing data (holes), and sensor-specific artifacts. This noise severely degrades the utility of depth data for manipulation, especially in tasks involving reflective, transparent, or slender objects. As a result, most prior work either restricts evaluation to simulation (where perfect depth is available) or applies aggressive downsampling and post-processing to mitigate noise, sacrificing geometric fidelity.
Camera Depth Models: Architecture and Data Synthesis
The paper introduces Camera Depth Models (CDMs), a family of neural models designed as plug-ins for commodity depth cameras. CDMs take as input an RGB image and a raw depth map from a specific camera and output a denoised, accurate metric depth map. The architecture is a dual-branch ViT-based encoder, with separate branches for RGB and depth, followed by a cross-modal token fusion module and a DPT-style decoder. This design enables the model to leverage both semantic cues from RGB and scale cues from the depth prompt, without requiring pre-processing such as hole-filling.
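To make the architecture concrete, below is a minimal PyTorch sketch of a dual-branch model with cross-modal token fusion. The module sizes, the attention-based fusion, and the simplified convolutional stand-in for the DPT-style decoder are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of a CDM-style dual-branch model (illustrative only).
# Dimensions, fusion mechanism, and decoder are assumptions, not the
# paper's exact architecture.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project to tokens."""
    def __init__(self, in_ch, dim=384, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                          # (B, C, H, W)
        tokens = self.proj(x)                      # (B, dim, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim)


class CameraDepthModelSketch(nn.Module):
    def __init__(self, dim=384, layers=6, heads=6):
        super().__init__()
        self.rgb_embed = PatchEmbed(3, dim)        # RGB branch
        self.depth_embed = PatchEmbed(1, dim)      # raw-depth "prompt" branch
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.rgb_encoder = nn.TransformerEncoder(enc_layer, layers)
        self.depth_encoder = nn.TransformerEncoder(enc_layer, layers)
        # Cross-modal token fusion: RGB tokens attend to depth tokens.
        self.fusion = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Simplified stand-in for a DPT-style decoder: upsample fused tokens
        # back to a dense metric depth map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, 4, stride=4),
            nn.GELU(),
            nn.ConvTranspose2d(128, 32, 4, stride=4),
            nn.GELU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, rgb, raw_depth):
        B, _, H, W = rgb.shape
        rgb_tok = self.rgb_encoder(self.rgb_embed(rgb))
        dep_tok = self.depth_encoder(self.depth_embed(raw_depth))
        fused, _ = self.fusion(rgb_tok, dep_tok, dep_tok)        # (B, N, dim)
        grid = fused.transpose(1, 2).reshape(B, -1, H // 16, W // 16)
        return self.decoder(grid)                                 # (B, 1, H, W)


if __name__ == "__main__":
    model = CameraDepthModelSketch()
    rgb = torch.randn(1, 3, 224, 224)
    raw_depth = torch.randn(1, 1, 224, 224)
    print(model(rgb, raw_depth).shape)  # torch.Size([1, 1, 224, 224])
```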
A critical challenge is the lack of high-quality, paired RGB-depth data for real cameras. The authors address this by constructing ByteCameraDepth, a large-scale dataset collected using a multi-camera rig with seven different depth cameras across ten modes, capturing over 170,000 RGB-depth pairs in diverse real-world scenes. To enable scalable training, they develop a neural data engine that learns to synthesize realistic camera-specific noise (both value and hole noise) from simulation data. The noise models are trained to mimic the statistical properties of real camera outputs, and a guided filter is introduced to address scale mismatches in synthesized noise, ensuring that the resulting depth maps preserve both local structure and global metric accuracy.
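The sketch below illustrates the overall noise-synthesis idea on a clean simulated depth map. The learned value-noise and hole-noise models are replaced here by simple random stand-ins, and the guided-filter step only approximates the paper's scale-correction mechanism; none of the function names or parameters come from the paper.

```python
# Illustrative sketch of camera-noise synthesis on clean simulated depth.
# The learned value-noise / hole-noise models are replaced by placeholders;
# the guided filter mirrors the idea of correcting scale mismatch in the
# synthesized noise, but the exact formulation is an assumption.
import numpy as np
from scipy.ndimage import uniform_filter


def guided_filter(guide, src, win=8, eps=1e-4):
    """Edge-preserving filtering of `src` guided by `guide` (both HxW floats)."""
    mean_g = uniform_filter(guide, win)
    mean_s = uniform_filter(src, win)
    cov_gs = uniform_filter(guide * src, win) - mean_g * mean_s
    var_g = uniform_filter(guide * guide, win) - mean_g * mean_g
    a = cov_gs / (var_g + eps)
    b = mean_s - a * mean_g
    return uniform_filter(a, win) * guide + uniform_filter(b, win)


def synthesize_noisy_depth(clean_depth, hole_prob=0.05, noise_std=0.01):
    """Turn a clean simulated depth map into a pseudo-real sensor reading."""
    # 1) Value noise: placeholder Gaussian perturbation (the paper learns this).
    noisy = clean_depth + np.random.randn(*clean_depth.shape) * noise_std

    # 2) Hole noise: placeholder random dropout (the paper learns hole patterns).
    holes = np.random.rand(*clean_depth.shape) < hole_prob
    noisy[holes] = 0.0

    # 3) Guided filtering against the clean depth: keep local noise structure
    #    while pulling synthesized values back toward the correct metric scale.
    valid = noisy > 0
    refined = guided_filter(clean_depth, noisy, win=8)
    noisy[valid] = refined[valid]
    return noisy


if __name__ == "__main__":
    clean = np.random.uniform(0.5, 2.0, size=(240, 320))
    print(synthesize_noisy_depth(clean).shape)  # (240, 320)
```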
Model Training and Losses
CDMs are trained on synthesized data generated by applying the learned noise models to clean simulated depth maps. The training objective combines an L1 loss with a gradient loss to encourage both global metric accuracy and sharp depth discontinuities. The ViT encoders are initialized from DINOv2 weights, and the decoder is trained from scratch. The training corpus includes over 280,000 images from four open-source simulation datasets, with noise synthesized to match the characteristics of each target camera.
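A minimal sketch of this combined objective, assuming a simple finite-difference gradient term and an unspecified loss weight:

```python
# Sketch of the combined training loss: an L1 term on predicted vs.
# ground-truth depth plus a gradient term penalizing differences in spatial
# derivatives. The single-scale formulation and weight are assumptions.
import torch
import torch.nn.functional as F


def depth_loss(pred, target, grad_weight=0.5):
    """pred, target: (B, 1, H, W) metric depth maps."""
    l1 = F.l1_loss(pred, target)

    # Finite-difference image gradients; encourages sharp depth edges.
    dx_p = pred[..., :, 1:] - pred[..., :, :-1]
    dx_t = target[..., :, 1:] - target[..., :, :-1]
    dy_p = pred[..., 1:, :] - pred[..., :-1, :]
    dy_t = target[..., 1:, :] - target[..., :-1, :]
    grad = F.l1_loss(dx_p, dx_t) + F.l1_loss(dy_p, dy_t)

    return l1 + grad_weight * grad
```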
Sim-to-Real Transfer Pipeline
The core contribution is demonstrating that CDMs enable direct transfer of visuomotor policies trained in simulation (on clean depth) to real robots, without the need for domain randomization, noise injection, or real-world fine-tuning. The sim-to-real pipeline consists of:
- Scene Construction: Geometrically similar objects and backgrounds are constructed in simulation, with only approximate physical properties.
- Camera Alignment: Camera extrinsics are calibrated using differentiable rendering, and small pose randomizations are introduced to improve robustness.
- Data Generation: Demonstrations are generated in simulation using an extension of MimicGen with whole-body control (WBCMimicGen), producing smooth, high-quality trajectories.
- Policy Learning: Policies are trained via imitation learning on single-view depth images, using a ResNet encoder and a diffusion head for action prediction. No noise is added to simulated depth during training.
At deployment, the CDM processes the real camera's RGB and depth images, producing a clean depth map that is fed to the policy. This approach eliminates the need for any post-processing or hand-crafted augmentations.
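A hedged sketch of what this closed-loop deployment might look like; the camera, CDM, policy, and robot interfaces are hypothetical placeholders standing in for the real components, not an API defined in the paper.

```python
# Hypothetical deployment loop: `camera`, `cdm`, `policy`, and `robot` are
# placeholder objects, not interfaces from the paper or any specific library.
import torch


@torch.no_grad()
def control_loop(camera, cdm, policy, robot, horizon=1000):
    """Closed-loop control with CDM-cleaned depth as the only visual input."""
    for _ in range(horizon):
        rgb, raw_depth = camera.read()                  # raw sensor frames
        clean_depth = cdm(rgb, raw_depth)               # denoised metric depth
        action = policy(clean_depth, robot.proprio())   # depth-only policy
        robot.execute(action)                           # send action to robot
```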
Experimental Results
CDMs are evaluated on the Hammer dataset, which contains real RGB-depth pairs from multiple sensors. CDMs achieve state-of-the-art performance in metric depth prediction, outperforming prior prompt-based methods (PromptDA, PriorDA) both with and without hole-filling. Notably, CDMs generalize well across different camera types, and the model trained on L515 noise even outperforms the D435-specific model on some D435 data splits. Zero-shot generalization to unseen sensors (e.g., Lucid Helios) is also observed, indicating that CDMs capture common noise patterns across devices.
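For context, standard metric-depth evaluation measures (absolute relative error, RMSE, and delta-threshold accuracy) can be computed as below; whether these are the exact metrics reported on Hammer is an assumption.

```python
# Common metric-depth evaluation measures; which of these the paper reports
# is an assumption.
import numpy as np


def depth_metrics(pred, gt):
    """pred, gt: metric depth arrays; only valid (gt > 0) pixels are scored."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    delta1 = np.mean(np.maximum(pred / gt, gt / pred) < 1.25)
    return {"AbsRel": abs_rel, "RMSE": rmse, "delta<1.25": delta1}
```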
Real-World Manipulation
In depth-only imitation learning tasks (e.g., stacking bowls, placing toothpaste in a cup), policies using CDM-processed depth achieve significantly higher success rates and generalize to unseen object sizes, compared to policies using raw depth. In sim-to-real experiments on long-horizon tasks (e.g., placing a bowl in a microwave and closing the door, manipulating a fork and plate), policies trained solely in simulation and deployed with CDMs achieve real-world success rates matching or exceeding simulation performance. Competing methods using raw depth or prior prompt-based models fail to generalize, with near-zero success rates in most cases.
Latency and Deployment
CDMs introduce minimal inference latency (∼0.15s per frame on a 4090 GPU), comparable to or faster than prior methods, and require no pre- or post-processing. This enables real-time deployment in closed-loop control.
Analysis, Limitations, and Implications
The results provide strong evidence that accurate, simulation-level depth perception is a critical enabler for robust, generalizable robotic manipulation. By bridging the sim-to-real geometry gap, CDMs allow for direct exploitation of large-scale simulation data, reducing the need for costly real-world data collection and manual domain adaptation. The architecture's reliance on both RGB and depth enables recovery from typical sensor failures, but the model can still be misled when the depth prompt is severely corrupted and RGB cues are insufficient (e.g., large reflective surfaces with ambiguous appearance).
The approach is currently limited to single-view depth, and performance may degrade in highly cluttered or occluded scenes where multi-view fusion would be beneficial. The generalization of CDMs to new camera types is promising but not guaranteed; retraining or fine-tuning may be required for novel sensors or extreme environments.
Future Directions
CDMs open several avenues for future research:
- Integration with 3D Foundation Models: By providing high-fidelity depth, CDMs can enable large-scale relabeling of RGB-D datasets, facilitating the training of generalist 3D robot policies.
- Multi-View and Multi-Modal Fusion: Extending CDMs to handle multi-camera setups or fuse with other modalities (e.g., tactile, force) could further improve robustness.
- Active Perception and Uncertainty Modeling: Incorporating uncertainty estimates from CDMs into downstream planning and control could enable more reliable manipulation in safety-critical settings.
- Hardware-Software Co-Design: The plug-in nature of CDMs suggests opportunities for co-designing camera hardware and neural post-processing for optimal end-to-end performance.
Conclusion
This work demonstrates that camera-specific neural depth models can effectively bridge the sim-to-real gap in geometry perception, enabling direct transfer of depth-only visuomotor policies from simulation to real robots. The approach achieves strong empirical results on challenging manipulation tasks, highlighting the centrality of accurate 3D perception for generalizable robot learning. CDMs provide a practical, scalable solution for leveraging simulation data in real-world robotics, with broad implications for the development of robust, foundation-level manipulation policies.