- The paper introduces a novel method that uses 3D Gaussian splats from mobile captures to generate simulation-ready mesh reconstructions.
- The paper demonstrates that fine-tuning on personalized meshes significantly boosts navigation performance compared to zero-shot approaches.
- The approach reduces cost and complexity by relying on consumer-grade devices, making sim-to-real transfer scalable across diverse robotic applications.
EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device
This technical essay provides an overview of the "EmbodiedSplat" method, which aims to simplify building sim-to-real navigation systems in embodied AI. The authors propose using Gaussian Splats reconstructed from mobile phone captures to create simulated stand-ins for real deployment environments, in which navigation policies can be trained. This document summarizes the methodology, experiments, and potential implications for embodied AI as described in the paper.
Introduction
Embodied AI policies trained in high-fidelity simulation often struggle when transferred to unpredictable real-world settings. This sim-to-real gap stems primarily from limits in simulation fidelity and from the complexity of real scenes, which is hard to capture fully even with high-resolution but cumbersome capture devices. The paper introduces EmbodiedSplat as a way to narrow this gap, using 3D Gaussian Splatting (GS) to reconstruct meshes from iPhone captures. Training is thereby personalized: the policy is tuned to the specific scene in which it will be deployed.
Figure 1: Overview of EmbodiedSplat showing the sim-to-real transfer via 3D Gaussian Splatting using mobile phone captures.
Methodology
The EmbodiedSplat framework integrates four key stages: scene capture, 3D Gaussian Splat training, ImageNav episode synthesis, and real-world policy deployment. The process begins with a consumer-grade smartphone capture of the deployment environment, from which a mesh can then be reconstructed quickly.
Scene Capture and Gaussian Splats
The pipeline starts with iPhone captures, processed with Polycam and Nerfstudio, that are converted into 3D Gaussian Splats. DN-Splatter then converts the splats into a mesh using depth and normal regularization. The reconstructed meshes are loaded into Habitat-Sim for training, where episodic navigation tasks are synthesized to prepare the policy for real-world evaluation.
Figure 2: The EmbodiedSplat Pipeline illustrating integration and deployment processes.
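The episode synthesis step can be illustrated with a minimal sketch. It is not the paper's implementation: instead of Habitat-Sim and a reconstructed mesh, it assumes a toy 2D occupancy grid, and the names `geodesic_distances`, `sample_episodes`, and `TOY_GRID`, as well as the distance bounds, are hypothetical. The idea shown is only the general one: pick a navigable start, then pick a goal whose shortest-path (geodesic) distance from the start falls within a chosen range.

```python
import random
from collections import deque


def geodesic_distances(grid, start):
    """Breadth-first search over free cells (0 = free, 1 = obstacle).

    Returns a dict mapping every reachable cell to its shortest-path
    length from `start`, measured in 4-connected grid steps.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def sample_episodes(grid, num_episodes, min_dist, max_dist, seed=0, max_tries=10_000):
    """Sample (start, goal) pairs whose geodesic distance lies in [min_dist, max_dist]."""
    rng = random.Random(seed)
    free = [(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v == 0]
    episodes = []
    if not free:
        return episodes
    tries = 0
    while len(episodes) < num_episodes and tries < max_tries:
        tries += 1
        start = rng.choice(free)
        dist = geodesic_distances(grid, start)
        candidates = [cell for cell, d in dist.items() if min_dist <= d <= max_dist]
        if not candidates:
            continue  # no goal in range from this start; try another start
        goal = rng.choice(candidates)
        episodes.append({"start": start, "goal": goal, "geodesic_distance": dist[goal]})
    return episodes


# Toy stand-in for a navigable map derived from a reconstructed scene (assumption).
TOY_GRID = [
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
]

episodes = sample_episodes(TOY_GRID, num_episodes=5, min_dist=3, max_dist=8)
for ep in episodes:
    print(ep)
```

In a simulator the grid would be replaced by the scene's navigation mesh and the goal would additionally carry a rendered goal image, since ImageNav specifies the target by an image rather than by coordinates.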
Experimental Design
The effectiveness of EmbodiedSplat is validated through a series of experiments split into zero-shot evaluations, fine-tuning scenarios, and real-world applications.
Zero-shot and Fine-tuning
Initial zero-shot testing shows that policies pre-trained on large-scale datasets (HM3D, HSSD) achieve varying success rates when transferred directly to new environments. Fine-tuning on the reconstructed scene significantly improves performance, particularly in large, complex scenes that differ from the typical training data.
Figures 3 & 4: Zero-shot validation success rates for different pretrained policies.
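To make the reported metric concrete, the following sketch computes a per-scene success rate from a list of episode outcomes. It rests on assumptions not stated in this summary: an episode counts as a success when the agent calls stop within a fixed radius of the goal (a common convention in navigation benchmarks), and the 1.0 m radius, the `EpisodeResult` fields, and the scene names are hypothetical. The sample values are made up for illustration and are not results from the paper.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class EpisodeResult:
    scene: str              # identifier of the evaluated scene (hypothetical names below)
    final_position: tuple   # (x, y) where the agent ended, in meters
    goal_position: tuple    # (x, y) of the goal, in meters
    called_stop: bool       # whether the agent issued a stop action


def is_success(result, success_radius=1.0):
    """Assumed criterion: the agent stopped within `success_radius` of the goal."""
    close_enough = dist(result.final_position, result.goal_position) <= success_radius
    return result.called_stop and close_enough


def success_rate_by_scene(results, success_radius=1.0):
    """Return {scene: fraction of successful episodes} over the given results."""
    totals, wins = {}, {}
    for r in results:
        totals[r.scene] = totals.get(r.scene, 0) + 1
        wins[r.scene] = wins.get(r.scene, 0) + int(is_success(r, success_radius))
    return {scene: wins[scene] / totals[scene] for scene in totals}


# Made-up outcomes, for illustration only.
toy_results = [
    EpisodeResult("room_a", (0.5, 0.0), (0.0, 0.0), True),   # stopped near the goal
    EpisodeResult("room_a", (3.0, 0.0), (0.0, 0.0), True),   # stopped too far away
    EpisodeResult("room_b", (0.2, 0.2), (0.0, 0.0), False),  # near the goal, never stopped
]

rates = success_rate_by_scene(toy_results)
print(rates)
```

Running the same function over outcomes from a zero-shot policy and from a fine-tuned policy on the same episodes gives the kind of side-by-side comparison the figures summarize.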
Real-world Transfer
The paper includes real-world validation in which fine-tuned policies are deployed on a Stretch robot in a real scene. The gains from fine-tuning carry over from simulation to practice: in the real-world evaluation, fine-tuned policies achieve higher success rates than the zero-shot baselines.
Figure 5: Real-world success rates demonstrating improvement post-fine-tuning on respective meshes.
Discussion and Future Work
EmbodiedSplat's main contribution is a significant reduction in the cost and complexity of building effective training environments for navigation policies. The methodology offers an efficient route to personalized, scene-specific sim-to-real transfer, and may extend to other robotics domains such as manipulation and mobile assistance.
Future research could explore feeding Gaussian Splatting visual data directly into policy architectures, or testing similar methodologies on broader embodied tasks such as object manipulation and complex room interaction. The approach's ability to scale from small rooms to entire buildings could benefit both small- and large-scale embodied AI systems.
Conclusion
EmbodiedSplat represents a promising stride in personalized navigation policy training using low-cost yet high-quality mesh reconstructions from consumer devices. By aligning simulation and real-world performance more closely, it addresses traditional limitations in embodied AI, paving the way for diverse applications across robotic systems and tasks.