EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device (2509.17430v2)

Published 22 Sep 2025 in cs.CV and cs.RO

Abstract: The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform both zero-shot baselines pre-trained on large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success rate improvements of 20% and 40% on real-world Image Navigation task. Moreover, our approach yields a high sim-vs-real correlation (0.87-0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort. Project page: https://gchhablani.github.io/embodied-splat.

Summary

The paper introduces a novel method that uses 3D Gaussian splats from mobile captures to generate simulation-ready mesh reconstructions.
The paper demonstrates that fine-tuning on personalized meshes significantly boosts navigation performance compared to zero-shot approaches.
The approach reduces cost and complexity by relying on consumer-grade devices, making sim-to-real transfer scalable for diverse robotic applications.

This essay gives an overview of EmbodiedSplat, a method that aims to simplify sim-to-real navigation in embodied AI. The authors use Gaussian Splats built from mobile phone captures to reconstruct the deployment environment as a simulation-ready mesh. The sections below cover the methodology, the experiments, and the implications for embodied AI as described in the paper.

Introduction

Policies for embodied AI are mostly trained and evaluated in simulation, and they often degrade when deployed in the real world. This sim-to-real gap arises because training environments are either fully synthetic and lack photorealism, or are high-fidelity real-world reconstructions that require expensive capture hardware. The paper introduces EmbodiedSplat to narrow this gap: it uses 3D Gaussian Splatting (GS) to reconstruct meshes from iPhone captures of the deployment environment. Policies are then fine-tuned within these reconstructed scenes, which personalizes training to the specific place where the robot will navigate.

Figure 1: Overview of EmbodiedSplat showing the sim-to-real transfer via 3D Gaussian Splatting using mobile phone captures.

Methodology

The EmbodiedSplat framework integrates four key stages: scene capture, 3D Gaussian Splat training, ImageNav episode synthesis, and real-world policy deployment. The process begins with a consumer-grade smartphone capturing the deployment environment, from which a mesh is then reconstructed.

Scene Capture and Gaussian Splats

The pipeline starts with an iPhone capture of the scene using Polycam, which is then converted into 3D Gaussian Splats with Nerfstudio. DN-Splatter, which applies depth and normal regularization, is used to convert the splats into a mesh. The reconstructed meshes are loaded into Habitat-Sim for training, where navigation episodes are synthesized to prepare the policy for real-world evaluation.

Figure 2: The EmbodiedSplat Pipeline illustrating integration and deployment processes.
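The episode synthesis step can be illustrated with a minimal sketch. It does not use Habitat-Sim or the authors' code: as an assumption, the reconstructed scene is stood in for by a small 2D occupancy grid, and the names geodesic_distances and sample_episodes and the distance bounds are hypothetical. The idea shown is only that start and goal positions are sampled from navigable space and kept when the shortest-path distance between them falls in a chosen range.

```python
import random
from collections import deque


def geodesic_distances(grid, start):
    """Breadth-first-search distances (in cells) from start over free cells.

    grid is a list of rows; 0 marks a free cell and 1 marks an obstacle.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist


def sample_episodes(grid, num_episodes, min_dist, max_dist, seed=0, max_tries=10000):
    """Sample (start, goal) pairs whose shortest-path distance is in range.

    The distance bounds are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    free = [(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v == 0]
    if not free:
        raise ValueError("grid has no free cells")
    episodes = []
    tries = 0
    while len(episodes) < num_episodes and tries < max_tries:
        tries += 1
        start = rng.choice(free)
        dist = geodesic_distances(grid, start)
        goals = [p for p, d in dist.items() if min_dist <= d <= max_dist]
        if not goals:
            continue  # no reachable goal in range from this start; resample
        goal = rng.choice(goals)
        episodes.append({"start": start, "goal": goal, "geodesic_distance": dist[goal]})
    return episodes


# Toy scene: a corridor looping around a central obstacle.
toy_grid = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
toy_episodes = sample_episodes(toy_grid, num_episodes=5, min_dist=2, max_dist=6)
```

In an ImageNav setting the goal would be given to the agent as an image rendered at the goal pose; that rendering step is outside the scope of this sketch.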

Experimental Design

The effectiveness of EmbodiedSplat is validated through a series of experiments split into zero-shot evaluations, fine-tuning scenarios, and real-world applications.

Zero-shot and Fine-tuning

Initial zero-shot testing reveals that pre-trained models on large-scale datasets (HM3D, HSSD) display varying success rates when directly transferred to new environments. Notably, fine-tuning significantly enhances performance, particularly in complex and large-scale scenes divergent from typical training datasets.

Figures 3 & 4: Zero-shot validation success rates for different pretrained policies.
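The comparison between zero-shot and fine-tuned policies rests on success rate, and the abstract reports gains as absolute improvements. A minimal sketch of that arithmetic follows. The helper names and the outcome lists are hypothetical placeholders, not results from the paper; an absolute improvement is taken here to mean a difference in percentage points.

```python
def success_rate(outcomes):
    """Fraction of episodes that ended in success (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def absolute_improvement(baseline_outcomes, finetuned_outcomes):
    """Difference in success rate, expressed in percentage points."""
    return 100.0 * (success_rate(finetuned_outcomes) - success_rate(baseline_outcomes))


# Placeholder outcomes for illustration only (not data from the paper).
zero_shot_outcomes = [True] * 5 + [False] * 5
finetuned_outcomes = [True] * 7 + [False] * 3
gain_points = absolute_improvement(zero_shot_outcomes, finetuned_outcomes)
```

With these placeholder lists, the zero-shot policy succeeds in 5 of 10 episodes and the fine-tuned policy in 7 of 10, so the gain is about 20 percentage points rather than a 40% relative increase.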

Real-world Transfer

The paper includes real-world validation in which fine-tuned policies are deployed on a Stretch robot in a real scene. Fine-tuning improves success rates over the zero-shot baselines in simulation, and the real-world evaluation shows that these gains carry over to the physical robot: the abstract reports absolute success rate improvements of 20% and 40% over zero-shot baselines pre-trained on HM3D and HSSD.

Figure 5: Real-world success rates demonstrating improvement post-fine-tuning on respective meshes.
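The abstract also reports a sim-vs-real correlation of 0.87-0.97 for the reconstructed meshes. The sketch below shows one way such a number could be computed, assuming it is a Pearson correlation between success rates measured in simulation and in the real world for the same set of policies. That assumption, the function name, and the sample values are illustrative and not taken from the paper.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder success rates per policy (illustrative, not from the paper).
sim_success = [0.10, 0.50, 0.90]
real_success = [0.20, 0.60, 1.00]
correlation = pearson_correlation(sim_success, real_success)
```

A value close to 1 would indicate that ranking policies by their success in the reconstructed scene predicts their ranking in the real environment, which is the sense in which the paper speaks of sim-to-real predictivity.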

Discussion and Future Work

EmbodiedSplat significantly reduces the cost and complexity of building effective training environments for navigation policies. The methodology offers an efficient route to personalized, scene-specific sim-to-real transfer, and it could possibly extend to other robotics domains such as manipulation and mobile assistance.

Future research could explore integrating Gaussian Splatting visual data directly into policy architectures, or testing similar methodologies on broader embodied tasks such as object manipulation and complex room interaction. The ability to scale from small rooms to entire buildings would make the approach relevant to both small- and large-scale embodied AI systems.

Conclusion

EmbodiedSplat represents a promising stride in personalized navigation policy training using low-cost yet high-quality mesh reconstructions from consumer devices. By aligning simulation and real-world performance more closely, it addresses traditional limitations in embodied AI, paving the way for diverse applications across robotic systems and tasks.