Robust visual understanding of complex 3D scenes for embodied agents
Establish robust visual scene understanding for agents acting in 3D virtual environments, so that they can accurately perceive and interpret complex 3D layouts, on-screen cues, text, and menus, and thereby behave reliably toward their goals across diverse worlds.
References
Finally, executing precise, low-level actions via the keyboard-and-mouse interface and achieving robust visual understanding of complex 3D scenes remain open challenges that the entire field continues to work to address.
— SIMA 2: A Generalist Embodied Agent for Virtual Worlds
(2512.04797 - Team et al., 4 Dec 2025) in Discussion