
EmbodiedSplat: Real-to-Sim Transfer Pipeline

Updated 26 February 2026
  • EmbodiedSplat is an integrated pipeline for real-to-sim-to-real transfer that uses mobile 3D capture and 3D Gaussian Splatting to reconstruct high-fidelity environments.
  • It converts captured data into detailed simulation scenes in Habitat-Sim, enabling effective policy pre-training and fine-tuning for navigation tasks.
  • The approach significantly improves navigation success rates by personalizing policies to scene-specific data, reducing the sim-to-real gap.

EmbodiedSplat is an end-to-end pipeline for real-to-sim-to-real transfer in Embodied AI. It combines fast, casual 3D scene capture from mobile devices, high-fidelity 3D Gaussian Splatting reconstruction, and Habitat-Sim simulation to personalize navigation policies for real-robot deployment. The methodology targets sim-to-real transfer gaps in navigation by providing a low-cost yet geometrically and visually accurate reconstruction of the in situ deployment environment, enabling fine-tuning of agents on scene-specific data and yielding significant improvements in policy transferability and success rates (Chhablani et al., 22 Sep 2025).

1. Pipeline Architecture and Workflow

EmbodiedSplat consists of four major stages: scene capture, 3D Gaussian Splatting-based reconstruction, simulation environment instantiation, and policy adaptation and deployment. The workflow is as follows:

  • Capture: An operator traverses the target scene with a commodity iPhone (e.g., iPhone 13 Pro Max) using Polycam, generating aligned RGB-D video streams, camera intrinsics and extrinsics, and optionally a preliminary mesh.
  • 3D Gaussian Splatting Reconstruction: Input data is fed to DN-Splatter, a depth-and-normal regularized 3D Gaussian Splatting pipeline. After ~30,000 gradient steps, a radiance field comprised of 3D Gaussians encodes the scene and is transformed into a watertight triangle mesh via Poisson surface reconstruction, producing .ply and .glb assets.
  • Simulation Instantiation: The reconstructed .glb mesh is loaded into Habitat-Sim. A navigation mesh (navmesh) is computed over the largest floor island, and task episodes (e.g., 1,000 training and 100 validation for ImageNav) are sampled via random, reachable start-goal pairs.
  • Policy Pre-training and Fine-tuning: Two off-the-shelf navigation policies (HM3D- and HSSD-pretrained) are fine-tuned on the reconstructed environment for 20 million steps with adapted learning rates. Agents are then deployed on real robots (e.g., Hello Robot Stretch) and evaluated on hand-placed episodes (Chhablani et al., 22 Sep 2025).
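The four stages above form a linear hand-off of artifacts. A minimal sketch of that orchestration is shown below; the stage functions are hypothetical placeholders standing in for Polycam export, DN-Splatter reconstruction, Habitat-Sim instantiation, and DD-PPO fine-tuning, not the actual tool APIs.

```python
def run_embodiedsplat(scene_capture, stages):
    """Thread a captured scene through the pipeline stages in order,
    recording which stage produced each intermediate artifact."""
    artifact, log = scene_capture, []
    for name, stage_fn in stages:
        artifact = stage_fn(artifact)
        log.append(name)
    return artifact, log

# Hypothetical stand-ins for the real tools named above.
stages = [
    ("capture",     lambda x: x + ["rgbd+poses"]),        # Polycam export
    ("reconstruct", lambda x: x + ["mesh.glb"]),          # DN-Splatter + Poisson
    ("instantiate", lambda x: x + ["navmesh+episodes"]),  # Habitat-Sim
    ("finetune",    lambda x: x + ["policy.ckpt"]),       # DD-PPO
]
artifact, log = run_embodiedsplat([], stages)
```

Each stage consumes the previous stage's output, which is what makes the pipeline end-to-end: a single capture session ultimately yields a deployable policy checkpoint.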

2. 3D Gaussian Splatting and Scene Reconstruction

EmbodiedSplat employs 3D Gaussian Splatting (GS) as its core scene representation. The scene is modeled as a set of $N$ anisotropic Gaussian splats, each described by a mean vector $\mu_n$, covariance $\Sigma_n$, weight $w_n$, and color $c_n$. The volumetric density and color fields are defined as:

$$\rho(x) = \sum_{i=1}^{N} w_i \exp\left(-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right)$$

$$C(x) = \sum_{i=1}^{N} w_i c_i \exp\left(-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right)$$

Rendering proceeds via ray marching along a ray $r(t)$:

$$I = \int_0^\infty T(t)\,\sigma(r(t))\,C(r(t))\,dt, \qquad T(t) = \exp\left(-\int_0^t \sigma(r(s))\,ds\right),$$

with $\sigma \equiv \rho$.
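The density and color fields above are straightforward to evaluate pointwise. The following numpy sketch computes $\rho(x)$ and $C(x)$ for a toy set of Gaussians; the splat parameters are illustrative values, not reconstruction output.

```python
import numpy as np

def gaussian_mixture_fields(x, mu, cov, w, c):
    """Evaluate the density rho(x) and color field C(x) of a set of
    3D Gaussian splats at a single query point x."""
    rho, color = 0.0, np.zeros(3)
    for mu_i, cov_i, w_i, c_i in zip(mu, cov, w, c):
        d = x - mu_i
        g = w_i * np.exp(-0.5 * d @ np.linalg.inv(cov_i) @ d)  # Gaussian kernel
        rho += g
        color += g * c_i  # color weighted by the same kernel
    return rho, color

# Two isotropic Gaussians (toy values): a red splat at the origin
# and a weaker blue splat 1 m away.
mu = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cov = np.array([np.eye(3) * 0.1] * 2)
w = np.array([1.0, 0.5])
c = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

rho, color = gaussian_mixture_fields(np.zeros(3), mu, cov, w, c)
```

At the first splat's mean, the density is dominated by that splat's weight, and the color field is correspondingly dominated by its red component.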

DN-Splatter incorporates two regularization losses to improve accuracy and smoothness:

  • Depth regularization: $L_d = \| D_{pred} - D_{sensor} \|_1$, weighted by $\lambda_d = 0.2$.
  • Normal smoothness: $L_n = \| N_{pred} - N_{mono} \|_1$, where $N_{mono}$ is derived from Metric3D-V2. A smoothness prior on $\Sigma$ and a color rendering loss are also used. After optimization, Poisson surface reconstruction yields a mesh suitable for navigation (Chhablani et al., 22 Sep 2025).
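The two regularization terms are simple L1 penalties over per-pixel maps. A minimal sketch, with toy maps in place of real renders and sensor data:

```python
import numpy as np

def dn_splatter_losses(d_pred, d_sensor, n_pred, n_mono, lambda_d=0.2):
    """L1 depth and normal regularization terms as used by DN-Splatter.
    Inputs are per-pixel depth maps (H, W) and normal maps (H, W, 3)."""
    L_d = np.abs(d_pred - d_sensor).mean()  # depth supervised by sensor depth
    L_n = np.abs(n_pred - n_mono).mean()    # normals supervised by monocular priors
    return lambda_d * L_d, L_n

# Toy 2x2 maps: predicted depth off by 0.5 m everywhere, normals perfect.
d_pred = np.full((2, 2), 2.5)
d_sensor = np.full((2, 2), 2.0)
n = np.zeros((2, 2, 3)); n[..., 2] = 1.0  # all normals pointing up
weighted_Ld, Ln = dn_splatter_losses(d_pred, d_sensor, n, n)
```

With a uniform 0.5 m depth error and $\lambda_d = 0.2$, the weighted depth term is 0.1, and the normal term vanishes when predictions match the prior exactly.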

3. Simulation Environment and Task Setup

The photorealistic mesh is imported into Habitat-Sim for episodic navigation task composition:

  • Observation space: 640×480 RGB images at 42° FOV, with a goal-image sensor mirroring the main camera.
  • Action space: Discrete set $\{\text{MOVE\_FORWARD } (0.25\ \text{m}),\ \text{TURN\_LEFT } (10°),\ \text{TURN\_RIGHT } (10°),\ \text{STOP}\}$.
  • Agent specification: A 1.41 m-tall cylinder with an RGB camera at 1.31 m height.
  • Navmesh sampling: Non-trivial start-goal pairs sampled with at least 1 m separation and guaranteed reachability.
  • Photorealism: Vertex-colored meshes provide real-world texture fidelity. No explicit domain randomization is performed, but PPO data augmentations (color jitter, random crops) are applied to the encoder (Chhablani et al., 22 Sep 2025).
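The navmesh sampling constraint (≥ 1 m separation, guaranteed reachability) amounts to rejection sampling over navigable points. A minimal sketch, where the two callables are placeholders for the simulator's navmesh queries rather than the actual Habitat-Sim API:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode(sample_point, is_reachable, min_sep=1.0, max_tries=1000):
    """Rejection-sample a start/goal pair at least `min_sep` metres apart
    and connected on the navmesh."""
    for _ in range(max_tries):
        start, goal = sample_point(), sample_point()
        if np.linalg.norm(start - goal) >= min_sep and is_reachable(start, goal):
            return start, goal
    raise RuntimeError("no valid start-goal pair found")

# Toy navmesh: a 5 m x 5 m free square where every pair is reachable.
sample_point = lambda: rng.uniform(0.0, 5.0, size=2)
is_reachable = lambda s, g: True
start, goal = sample_episode(sample_point, is_reachable)
```

Repeating this 1,000 times for training and 100 times for validation yields the episode sets described above.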

4. Policy Pre-training, Fine-tuning, and Deployment

Pre-trained models are based on Habitat-Matterport 3D (HM3D) and Habitat Synthetic Scene Dataset (HSSD):

  • Architectures: VC-1-Base image encoders, 2-layer LSTM policy heads.
  • Learning algorithm: Decentralized Distributed PPO (DD-PPO) with the AdamW optimizer, weight decay $c = 10^{-6}$, and PPO objective:

$$L^{PPO} = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right] - c\,\mathbb{E}\left[\|\theta\|^2\right]$$

where $r_t(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{old}}(a_t|s_t)$ and $\epsilon = 0.2$.

  • Reward function:

$$r_t = c_s \cdot \mathbb{1}[d_t < r_g \wedge a_t = \text{STOP}] + c_a \cdot \mathbb{1}[\theta_t < \theta_g \wedge a_t = \text{STOP}] + (d_{t-1} - d_t) + (\angle_{t-1} - \angle_t) - \lambda - c_{coll} \cdot \mathbb{1}[\text{collision}]$$

with $c_s = 5.0$, $c_a = 5.0$, $r_g = 1.0$ m, $\theta_g = 25°$, $\lambda = 0.01$, and $c_{coll} = 0.03$.

  • Fine-tuning strategy: Learning rates of $2.5 \times 10^{-6}$ (LSTM) and $6 \times 10^{-7}$ (visual encoder), 20M steps, 2 epochs per update, 2 mini-batches, 64-frame rollouts, and 8–10 environments per GPU. The encoder remains unfrozen for adaptation.
  • Deployment: The fine-tuned policy is hosted on a remote inference server and executed on the Stretch robot, evaluated with 10 physical trials per scene (Chhablani et al., 22 Sep 2025).
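The PPO objective and reward shaping above can be sketched in a few lines of numpy. The constants follow the paper; function names are my own, the weight-decay term is omitted (AdamW applies it directly to the parameters), and the units of the angle-progress term are an assumption (degrees here, for consistency with $\theta_g = 25°$).

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO surrogate; returns the negated objective to minimize."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    surrogate = np.minimum(ratio * advantages,
                           np.clip(ratio, 1 - eps, 1 + eps) * advantages)
    return -surrogate.mean()

def imagenav_reward(d_prev, d_t, ang_prev, ang_t, stopped, collided,
                    c_s=5.0, c_a=5.0, r_g=1.0, theta_g=25.0,
                    slack=0.01, c_coll=0.03):
    """Per-step shaped reward: success/angle bonuses on STOP, distance
    and angle progress terms, slack penalty, collision penalty."""
    r = (d_prev - d_t) + (ang_prev - ang_t) - slack
    if stopped and d_t < r_g:
        r += c_s   # success bonus: stopped within goal radius
    if stopped and ang_t < theta_g:
        r += c_a   # angle bonus: stopped facing the goal
    if collided:
        r -= c_coll
    return r

# Unchanged policy (ratio = 1) with unit advantages; a successful stop.
loss = ppo_clip_loss(np.zeros(4), np.zeros(4), np.ones(4))
r = imagenav_reward(d_prev=1.2, d_t=0.8, ang_prev=30.0, ang_t=20.0,
                    stopped=True, collided=False)
```

When the new and old policies coincide, the ratio is 1 and the clip is inactive, so the surrogate reduces to the mean advantage.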

5. Experimental Results and Sim-to-Real Evaluation

EmbodiedSplat's evaluation encompasses both simulated and real-world environments for four hand-captured scenes (e.g., “castleberry”, “piedmont”, “classroom”, “grad_lounge”):

  • Datasets: HM3D (800/100 train/val splits, 10k episodes/scene), HSSD (134/33 train/val), captured scenes (1000/100 train/val).
  • Navigation evaluation:
    • Simulation: Maximum 1,000 steps per episode; success if STOP is called within 1 m of the goal.
    • Real-world: Maximum 100 steps per episode; success if the agent stops within 1 m of the goal.
  • Key performance metrics and results:
| Setting | Sim SR (avg) | Real SR (avg) | Fine-tune Δ (real) | Sim–real corr ($\rho_{sim,real}$) |
|---|---|---|---|---|
| Zero-shot HM3D | 60% | 50% | N/A | 0.87–0.97 |
| Zero-shot HSSD | 20–30% | 10% | N/A | 0.87–0.97 |
| HM3D fine-tuned | ≥90% | 70% | +20 pts | 0.87–0.97 |
| HSSD fine-tuned | ≥80% | 50% | +40 pts | 0.87–0.97 |

Sim-to-real correlation is computed as the Pearson coefficient

$$\rho_{sim,real} = \frac{\mathrm{Cov}(S, R)}{\sigma_S\,\sigma_R}$$

with $S$ and $R$ the vectors of simulator and real-world success rates, respectively.
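The coefficient above is the standard Pearson correlation over per-scene success rates. A minimal numpy sketch, with hypothetical per-scene values chosen for illustration only:

```python
import numpy as np

def sim_real_corr(sim_sr, real_sr):
    """Pearson correlation between per-scene simulation and real success rates."""
    S, R = np.asarray(sim_sr, float), np.asarray(real_sr, float)
    # Population covariance and standard deviations (ddof=0).
    return float(np.cov(S, R, ddof=0)[0, 1] / (S.std() * R.std()))

# Hypothetical success rates for four scenes (sim vs. real).
rho = sim_real_corr([0.9, 0.6, 0.8, 0.3], [0.7, 0.5, 0.6, 0.1])
```

A value near 1 indicates that scenes ranked by simulator success are ranked nearly identically by real-world success, which is what makes the simulator a useful proxy for policy selection.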

Fine-tuned policies improve real-world success rates on image navigation by 20 percentage points (HM3D) and 40 percentage points (HSSD) over their zero-shot counterparts (Chhablani et al., 22 Sep 2025).

6. Analysis, Implications, and Limitations

  • Reconstruction Quality: Success rate (SR) is positively correlated with mesh PSNR and negatively correlated with scene scale. Stabilized gimbal captures (e.g., MuSHRoom) outperform casual hand-held captures.
  • Pre-training: HM3D pre-trained policies provide superior zero-shot performance (∼60%) over HSSD (∼20%), but both benefit comparably from fine-tuning.
  • Training Strategy: Overfitting a policy from scratch on a single mesh yields high simulation SR but poor real-world generalization; fine-tuning on the reconstruction achieves the best balance between specificity and generalization.
  • Over-specialization: Excessive continued pre-training reduces zero-shot generalization to novel environments, indicating an overfitting effect.
  • Limitations: Artifacts in DN-Splatter meshes (holes, mismatched lighting) can hinder visual matching; Polycam outputs richer textures but applies less geometric regularization. No explicit domain randomization is performed in GS-based reconstructions, representing an open direction for further exploration. Opportunities include direct GS integration with the policy observation pipeline and extension to additional embodied AI tasks (Chhablani et al., 22 Sep 2025).

7. Broader Context and Future Directions

EmbodiedSplat demonstrates the feasibility of rapid, personalized robot policy adaptation using only commodity hardware and scalable simulation, significantly narrowing the sim-to-real gap for navigation. Integrating GS reconstruction with domain randomization, leveraging advanced scene semantics, and automating data collection/stabilization are seen as promising extensions. Potential generalizations to object-goal navigation, embodied rearrangement, and manipulation are suggested as future work (Chhablani et al., 22 Sep 2025). Recent related frameworks, such as SplatR for rearrangement with 3DGS representations (S et al., 2024) and hybrid differentiable real-to-sim pipelines for manipulation and pose calibration (Moran et al., 4 Jun 2025), further contextualize EmbodiedSplat within the landscape of fast-adaptive, scene-specific Embodied AI methodologies.
