SimFoundry: Modular and Automated Scene Generation for Policy Learning and Evaluation

Published 26 Jun 2026 in cs.RO | (2606.28276v1)

Abstract: Training and evaluating robot policies in the real world is costly and difficult to scale. We introduce SimFoundry, a modular and automated system for zero-shot real-to-sim scene construction from a video. SimFoundry generates sim-ready digital twins and supports object, scene, and task editing, enabling the automated generation of diverse digital cousins: affordance-preserving variations of reconstructed real-world scenes. Policies trained on SimFoundry data transfer zero-shot to challenging real tasks involving multi-step manipulation, articulated object interaction, and bimanual interaction, and its digital cousins (variations of the original scene, objects, and tasks) facilitate generalization to new real-world conditions. Across 7 manipulation tasks and 5 policy architectures, SimFoundry simulation evaluations strongly predict real-world performance, with mean Pearson correlation 0.911 and mean maximum ranking violation 0.018. When evaluating sim-trained policies zero-shot in the real world, policies trained with object, scene, and task cousins in simulation show average task success rate improvements of 17%, 21%, and 40%, respectively. Additional details at https://research.nvidia.com/labs/gear/simfoundry/ .

Authors (18)

Summary

The paper introduces a modular pipeline that extracts, generates, and augments scene data from real-world video for policy learning.
It integrates state-of-the-art VLMs, 2D-to-3D mesh generation, physics simulation, and human-in-the-loop refinement to ensure high geometric fidelity and sim-to-real transfer.
Empirical results demonstrate significant performance improvements in simulated and real-world tasks, confirming better policy generalization through structured data augmentation.

SimFoundry: Automated Modular Scene Generation for Policy Learning and Evaluation

System Architecture and Methodology

SimFoundry introduces a modular pipeline for constructing interactive, sim-ready digital twins from a single real-world video input. The architecture decomposes scene generation into three distinct stages: Extraction, Generation, and Augmentation. This modularity enables seamless integration of state-of-the-art VLMs and mesh generation tools, providing flexibility for leveraging improvements in perception, asset creation, pose alignment, and physics annotation.

Extraction leverages RGB-D estimation and segmentation models to produce object-level masks, depth maps, and point clouds. Iterative foreground removal and inpainting isolate each object for downstream mesh generation. The Generation stage synthesizes per-object meshes via 2D-to-3D models, aligns them to reconstructed point clouds, produces collision geometries, and annotates physics properties. Articulated objects are processed via joint type and location inference, part segmentation, and URDF compilation using VLM-guided APIs. The system ensures scene stability through depenetration and physics-based settling in PyBullet, followed by export to robotics simulators (IsaacLab, OmniGibson) for policy integration.
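As a minimal sketch of how an object mask and a depth map can be combined into an object-level point cloud, the snippet below back-projects masked pixels through a pinhole camera model. The function name `backproject`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy 2x2 inputs are illustrative assumptions; the source does not state which camera model or data layout SimFoundry uses.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift masked pixels with valid depth into camera-frame 3D points.

    Assumes a pinhole camera: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or z <= 0.0:
                continue  # skip background pixels and missing depth
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Toy 2x2 example: depth in metres and a hypothetical object mask.
depth = [[1.0, 1.0], [2.0, 0.0]]
mask = [[True, False], [True, True]]
cloud = backproject(depth, mask, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In a real pipeline the same operation would normally be vectorised over full-resolution images; the loop form is used here only to keep the geometry explicit.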

Augmentation systematically generates digital cousins—affordance-preserving scene variants—along three axes: object instance, scene layout, and task specification. Object cousins diversify geometry, appearance, and topology while preserving functional affordances. Scene cousins use spatial predicates and distractor sampling to create structured layout variations. Task cousins exploit contextual semantics and robot constraints to propose goal-conditioned manipulation tasks compatible with the reconstructed scene, enabling scalable procedural data generation.
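One way to picture scene cousins built from spatial predicates and distractor sampling is rejection sampling over tabletop layouts. The sketch below is an assumption-laden illustration, not the paper's implementation: the predicate helpers `left_of` and `near`, the function `sample_scene_cousin`, the table bounds, and the object names are all hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Pose = Tuple[float, float]  # (x, y) position on the table, in metres
Layout = Dict[str, Pose]
Predicate = Callable[[Layout], bool]


def left_of(a: str, b: str) -> Predicate:
    """Predicate: object a has a smaller x coordinate than object b."""
    return lambda layout: layout[a][0] < layout[b][0]


def near(a: str, b: str, max_dist: float) -> Predicate:
    """Predicate: objects a and b are at most max_dist apart."""
    return lambda layout: math.dist(layout[a], layout[b]) <= max_dist


def well_separated(layout: Layout, gap: float) -> bool:
    """True if every pair of objects is at least `gap` apart."""
    names = list(layout)
    return all(
        math.dist(layout[p], layout[q]) >= gap
        for i, p in enumerate(names)
        for q in names[i + 1:]
    )


def sample_scene_cousin(
    objects: Sequence[str],
    distractor_pool: Sequence[str],
    predicates: List[Predicate],
    rng: random.Random,
    n_distractors: int = 1,
    table: Tuple[float, float, float, float] = (0.0, 0.6, 0.0, 0.4),
    gap: float = 0.05,
    max_tries: int = 1000,
) -> Optional[Layout]:
    """Rejection-sample a layout that satisfies all predicates, or None."""
    x0, x1, y0, y1 = table
    for _ in range(max_tries):
        names = list(objects) + rng.sample(list(distractor_pool), n_distractors)
        layout = {n: (rng.uniform(x0, x1), rng.uniform(y0, y1)) for n in names}
        if well_separated(layout, gap) and all(p(layout) for p in predicates):
            return layout
    return None


rng = random.Random(0)
cousin = sample_scene_cousin(
    objects=["marker", "cup"],
    distractor_pool=["bowl", "sponge", "can"],
    predicates=[left_of("marker", "cup"), near("marker", "cup", 0.3)],
    rng=rng,
)
```

Keeping the predicates as plain callables means new relations can be added without touching the sampler, which mirrors the modular intent described above.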

Figure 1: Overview of the SimFoundry pipeline, encompassing extraction, mesh generation, physical annotation, and structured augmentation across objects, scenes, and tasks.

Scene Generation and Diversity

SimFoundry reconstructs digital twins and produces plausible cousins with meaningful physical affordances and semantic richness. The pipeline supports articulated object modeling and photorealistic backgrounds via Gaussian Splatting. Two background reconstruction strategies are available: a fully automated one based on inpainting and metric alignment, and a manual one based on foreground removal and interactive registration. Together they allow tradeoffs between capture effort, reproducibility, and surface fidelity.

Generated cousins are checked for scene consistency and realism via VLM prompts covering component decomposition, topology feasibility, and scene-aware image editing. The system filters out variants that fail these checks in order to preserve semantic integrity and real-world plausibility.

Figure 2: Real-world inputs (top), reconstructed digital twins (middle), and sampled digital cousin variants (bottom), illustrating geometry and layout diversity.

Policy Benchmarking and Evaluation

SimFoundry enables both real-to-sim policy evaluation and sim-to-real policy training. Benchmarks on the DROID (single-arm) and YAM (bimanual) platforms demonstrate high-fidelity scene replication and strong correlation between simulated and real-world policy success rates. Seven manipulation tasks spanning multi-step manipulation, articulated-object interaction, and bimanual coordination are evaluated across five policy architectures. Simulation-based policy evaluations achieve a mean Pearson correlation of 0.911 and a mean maximum ranking violation (MMRV) of 0.018, a substantial improvement over previous frameworks such as PolaRiS (Jain et al., 18 Dec 2025).
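The two agreement metrics can be sketched in a few lines. The MMRV formulation below is an assumption: it follows the commonly used definition in which a pair of policies counts as a violation when simulation and the real world rank them differently, weighted by the real-world performance gap. The source does not spell out its exact formula. The success rates are made-up numbers for illustration only.

```python
import math
from typing import Sequence


def pearson_r(sim: Sequence[float], real: Sequence[float]) -> float:
    """Pearson correlation between per-policy sim and real success rates."""
    n = len(sim)
    mean_s = sum(sim) / n
    mean_r = sum(real) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(sim, real))
    var_s = sum((s - mean_s) ** 2 for s in sim)
    var_r = sum((r - mean_r) ** 2 for r in real)
    return cov / math.sqrt(var_s * var_r)


def mmrv(sim: Sequence[float], real: Sequence[float]) -> float:
    """Mean over policies of the worst rank violation involving that policy.

    A pair (i, j) is a violation when sim and real disagree on whether
    policy i is worse than policy j; its cost is |real[i] - real[j]|.
    """
    n = len(sim)
    worst = []
    for i in range(n):
        costs = [
            abs(real[i] - real[j])
            for j in range(n)
            if (sim[i] < sim[j]) != (real[i] < real[j])
        ]
        worst.append(max(costs, default=0.0))
    return sum(worst) / n


# Hypothetical success rates for three policies (not from the paper).
sim_rates = [0.2, 0.9, 0.5]
real_rates = [0.1, 0.6, 0.8]
r = pearson_r(sim_rates, real_rates)
violation = mmrv(sim_rates, real_rates)
```

A lower MMRV means that whenever simulation misorders two policies, their real-world performance was close anyway, so it complements Pearson correlation rather than duplicating it.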

Sub-task evaluations further enhance correlation, especially for long-horizon tasks, and enable fine-grained identification of performance bottlenecks in complex policy deployments. SimFoundry supports evaluation protocols with strict initial condition sampling and spatial distribution alignment between sim and real environments.

Figure 3: Task roster and real-to-sim policy evaluation correlations, highlighting strong agreement between simulation and real-world results across policy architectures and manipulation complexities.

Data Augmentation and Policy Generalization

Policies trained on SimFoundry-generated cousin data are more robust and generalize better than the digital twin-only baseline. Object cousins yield an average zero-shot sim-to-real success improvement of 17%, with gains of up to 50% on held-out objects. Scene cousins improve layout generalization by 21% on average, with up to 28% improvement on challenging tasks. Task cousins amplify downstream learning, facilitating intra-task transfer and boosting success rates by up to 60% in simulation and 40% in real-world settings.

Sim-and-real co-training raises performance further by combining scalable synthetic demonstrations with limited real-world data. Policies trained on both sources reach 92% success on Store Marker (up from 60% with real data only) and achieve a 36% improvement on simulated Throw Away Trash.
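Co-training on two data sources is often implemented by fixing the share of each source in every training batch. The sketch below illustrates that idea only; the function name `mixed_batch`, the `sim_fraction` parameter, and the 75/25 split are assumptions, since the source does not state how SimFoundry mixes simulated and real demonstrations.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batch(
    sim_demos: Sequence[T],
    real_demos: Sequence[T],
    batch_size: int,
    sim_fraction: float,
    rng: random.Random,
) -> List[T]:
    """Draw one batch with a fixed share of simulated demonstrations.

    Sampling is with replacement, so a small real dataset can still fill
    its share of every batch.
    """
    n_sim = round(batch_size * sim_fraction)
    n_real = batch_size - n_sim
    batch = rng.choices(sim_demos, k=n_sim) + rng.choices(real_demos, k=n_real)
    rng.shuffle(batch)
    return batch


# Hypothetical demonstration identifiers.
sim_demos = [f"sim_{i}" for i in range(100)]
real_demos = [f"real_{i}" for i in range(5)]
batch = mixed_batch(sim_demos, real_demos, batch_size=8, sim_fraction=0.75,
                    rng=random.Random(0))
```

Fixing the ratio per batch, rather than concatenating the datasets, keeps the scarce real demonstrations from being drowned out by the much larger synthetic set.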

Figure 4: Data diversity ablation, quantifying the impact of object, scene, and task cousin augmentation on policy performance across DROID and YAM platforms.

Figure 5: Detailed breakdown of success rates and generalization gains attributable to structured data augmentation along object, scene, task, and sim/real co-training axes.

SimFoundry incorporates articulated object generation using VLM-based part segmentation and joint synthesis. Accurate segmentation and assignment of mesh components is achieved through iterative prompting and human refinement via interactive pose editors. Gaussian Splatting pipelines (both automated and manual) maintain photorealistic backgrounds, supporting high geometric and visual fidelity.

The interactive scene editor provides operators with GUI-based controls for pose and scale adjustment, facilitating precision tuning of object alignment in occluded or cluttered contexts with minimal time overhead.

Figure 6: Pipeline for articulated object generation and 3D Gaussian Splat background modeling.

Figure 7: Interactive scene editor workflow for iterative pose refinement and mesh alignment.

Empirical Results and System Scalability

SimFoundry demonstrates higher geometric fidelity than SAM3D (Team et al., 20 Nov 2025), achieving F1 scores of 0.81–0.92 zero-shot and 0.93–0.99 with 3 minutes of per-object tuning, along with significantly lower Chamfer distances and bounding box errors. Scene reconstruction time scales linearly with object count, averaging 5 minutes per object, and produces robust environments suitable for large-scale policy benchmarking and training.
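For readers unfamiliar with these geometry metrics, the following sketch computes a point-cloud F1 score and a symmetric Chamfer distance by brute force. The distance threshold `tau`, the unsquared Chamfer variant, and the toy point sets are assumptions; the source does not specify which variants or thresholds were used.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_dist(p: Point, cloud: Sequence[Point]) -> float:
    """Euclidean distance from p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def f_score(pred: Sequence[Point], gt: Sequence[Point], tau: float) -> float:
    """F1 of precision and recall at distance threshold tau."""
    precision = sum(nearest_dist(p, gt) <= tau for p in pred) / len(pred)
    recall = sum(nearest_dist(g, pred) <= tau for g in gt) / len(gt)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def chamfer_distance(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance both ways."""
    pred_to_gt = sum(nearest_dist(p, gt) for p in pred) / len(pred)
    gt_to_pred = sum(nearest_dist(g, pred) for g in gt) / len(gt)
    return pred_to_gt + gt_to_pred


# Toy example: one of three reconstructed points is 0.5 m off.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
f1 = f_score(pred, gt, tau=0.1)
cd = chamfer_distance(pred, gt)
```

The brute-force nearest-neighbour search is quadratic in the number of points; evaluations on dense meshes would typically use a spatial index instead.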

Tasks such as Stack Dishware, Store Marker, and Throw Away Trash are benchmarked with detailed spatial initialization grids, real-to-sim evaluation metrics (Pearson $r$, MMRV), and quantitative breakdowns of performance across twin-only, cousin-augmented, and co-trained policies.
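A spatial initialization grid can be thought of as a fixed set of object start positions shared between the simulated and real evaluations. The sketch below generates such a grid over a rectangular workspace; the function name `init_grid`, the workspace bounds, and the 3x2 resolution are illustrative assumptions rather than values from the paper.

```python
from typing import List, Tuple


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """n evenly spaced values from lo to hi (the midpoint when n == 1)."""
    if n == 1:
        return [(lo + hi) / 2.0]
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def init_grid(
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    nx: int,
    ny: int,
) -> List[Tuple[float, float]]:
    """Row-major grid of (x, y) start positions covering the workspace."""
    xs = linspace(x_range[0], x_range[1], nx)
    ys = linspace(y_range[0], y_range[1], ny)
    return [(x, y) for y in ys for x in xs]


# Hypothetical 0.4 m x 0.2 m workspace sampled on a 3 x 2 grid.
grid = init_grid((0.0, 0.4), (0.0, 0.2), nx=3, ny=2)
```

Using the same deterministic grid in both domains keeps the initial-condition distributions aligned, so differences in success rate reflect the policy and simulator rather than sampling noise.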

Figure 8: Task grid sampling for Stack Dishware, providing spatial diversity for policy benchmarking.

Limitations and Future Directions

SimFoundry's modularity ensures compatibility with evolving foundation models, but inherits their limitations, including non-deterministic outputs and fidelity constraints for monocular input. Current assumptions restrict the pipeline to tabletop-style layouts; relaxing these for multi-level environments and broader object categories is a natural extension.

Automated background modeling incurs computational overhead (e.g., two-pass inpainting), but this can be mitigated via multi-GPU parallelism. Articulation quality is bounded by mesh segmentation accuracy, especially for occluded internal structures or non-rigid parts.

Conclusion

SimFoundry delivers a modular, automated real-to-sim pipeline for robot policy learning and evaluation, with optional human-in-the-loop refinement and demonstrated efficacy across diverse manipulation regimes, articulated objects, and policy architectures. The system achieves high geometric fidelity, strong sim-to-real transfer, and actionable policy benchmarking enabled by structured data augmentation. Its automated scene generation, task diversity, and co-training capabilities position SimFoundry as an effective platform for scalable policy development, evaluation, and generalization in robotic manipulation. Because it integrates interchangeable VLMs, mesh generators, and articulation models, the system can be extended toward more complex and dynamic real-world environments as those components improve.

References

SimFoundry: Modular and Automated Scene Generation for Policy Learning and Evaluation (2606.28276)
PolaRiS: Scalable Real-to-Sim Evaluations for Generalist Robot Policies (Jain et al., 18 Dec 2025)
SAM3D: 3Dfy Anything in Images (Team et al., 20 Nov 2025)
DROID: A large-scale in-the-wild robot manipulation dataset (Khazatsky et al., 2024)
Hunyuan3D 2.1: High-Fidelity 3D Asset Generation (Hunyuan3D et al., 18 Jun 2025)
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects (pp. 17868-17879)
MimicGen: Data Generation System for Scalable Robot Learning (Mandlekar et al., 2023)
Articulate Anything: Automatic Modeling of Articulated Objects (Le et al., 2024)