In-depth Understanding and Embodied Decision-Making on Generated 3D Scenes

Investigate how to perform in-depth scene understanding and adapt embodied decision-making on top of high-quality generated 3D scenes, identifying methodologies that enable agents to use such generated environments for downstream understanding and control.

Background

The paper introduces SceneMaker, a decoupled framework for open-set 3D scene generation that separates de-occlusion, object generation, and pose estimation to better learn open-set priors. Comprehensive experiments show improved geometry quality and pose accuracy, particularly under severe occlusion.
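To make the decoupled structure concrete, the following minimal Python sketch runs three independent stages per object (de-occlusion, object generation, pose estimation) and then assembles the scene. The data structures, function names, and stub outputs are hypothetical placeholders for illustration; they are not SceneMaker's actual interfaces or implementation.

```python
from dataclasses import dataclass

# Hypothetical per-object record; field names are illustrative only.
@dataclass
class DetectedObject:
    label: str
    masked_crop: list            # occluded 2D observation of the object (placeholder)
    amodal_crop: list = None     # completed (de-occluded) appearance
    mesh: object = None          # generated 3D geometry
    pose: tuple = None           # (rotation, translation, scale) in scene coordinates


def de_occlude(obj: DetectedObject) -> DetectedObject:
    """Stage 1: complete the occluded appearance of each object."""
    obj.amodal_crop = obj.masked_crop  # stub: pass-through in this sketch
    return obj


def generate_object(obj: DetectedObject) -> DetectedObject:
    """Stage 2: lift the completed appearance to 3D geometry."""
    obj.mesh = f"mesh<{obj.label}>"    # stub: placeholder mesh handle
    return obj


def estimate_pose(obj: DetectedObject) -> DetectedObject:
    """Stage 3: place the generated object in the scene frame."""
    obj.pose = ("R_identity", (0.0, 0.0, 0.0), 1.0)  # stub pose
    return obj


def compose_scene(objects):
    """Run the three decoupled stages per object, then assemble the scene."""
    return [estimate_pose(generate_object(de_occlude(o))) for o in objects]


if __name__ == "__main__":
    scene = compose_scene([DetectedObject("chair", masked_crop=[]),
                           DetectedObject("table", masked_crop=[])])
    for o in scene:
        print(o.label, o.mesh, o.pose)
```

Because each stage is a separate callable over a shared object record, each can in principle be trained or swapped independently, which is the motivation the paper gives for decoupling open-set priors.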

In the limitations discussion, the authors highlight that moving from scene generation to downstream use by agents, specifically for deeper understanding and embodied decision-making, remains unresolved. This points to the gap between generating high-quality 3D scenes and leveraging them effectively in tasks where agents must interpret and act within those scenes.
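One way to see the gap concretely: once a scene exists as a set of objects with geometry and poses, an agent still needs primitives for querying and acting on it. The sketch below assumes a hypothetical list-of-dicts scene representation with illustrative labels and thresholds, and shows a spatial-understanding query plus a trivial pick/explore decision built on top of such a scene; it is not part of SceneMaker and only indicates the kind of downstream interface the open question concerns.

```python
import math

# Hypothetical generated scene: each object has a label, a position in scene
# coordinates, and a rough size. Values are illustrative, not from the paper.
scene = [
    {"label": "table", "position": (0.0, 0.0, 0.0), "size": 1.2},
    {"label": "mug",   "position": (0.1, 0.0, 0.8), "size": 0.1},
    {"label": "chair", "position": (1.5, 0.2, 0.0), "size": 0.9},
]


def objects_near(scene, anchor_label, radius):
    """Scene-understanding primitive: objects within `radius` of an anchor object."""
    anchor = next(o for o in scene if o["label"] == anchor_label)
    return [o for o in scene
            if o["label"] != anchor_label
            and math.dist(anchor["position"], o["position"]) <= radius]


def choose_pick_target(scene, goal_label, max_graspable_size=0.3):
    """Embodied decision-making stub: pick the goal object if it looks graspable."""
    for o in scene:
        if o["label"] == goal_label and o["size"] < max_graspable_size:
            return {"action": "pick", "target": o["label"], "at": o["position"]}
    return {"action": "explore"}


if __name__ == "__main__":
    print(objects_near(scene, "table", radius=1.0))  # e.g. the mug near the table
    print(choose_pick_target(scene, "mug"))          # a pick action on the mug
```

The open question is how far such simple geometric queries and heuristics can be replaced by learned understanding and policies that exploit generated scenes at scale.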

References

Moreover, how to perform more in-depth understanding tasks and adapt embodied decision-making based on generated high-quality 3D scenes is also an unsolved challenge.

SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model (2512.10957, Shi et al., 11 Dec 2025), in Conclusion: Limitations and future work