Robust visual understanding of complex 3D scenes for embodied agents

Establish robust visual scene understanding for agents acting in 3D virtual environments, so that they can accurately perceive and interpret complex 3D layouts, on-screen cues, text, and menus, and act reliably toward goals across diverse worlds.

Background

SIMA 2 leverages Gemini’s multimodal understanding but still relies on OCR and heuristics for programmatic evaluations, and struggles in visually abstract or highly complex settings (e.g., some Minecraft tasks).

The paper's Discussion section explicitly frames robust visual understanding of complex 3D scenes as an open challenge for the field. Better perception and interpretation are needed for agents to generalize reliably across diverse visual configurations and interfaces.

References

Finally, executing precise, low-level actions via the keyboard-and-mouse interface and achieving robust visual understanding of complex 3D scenes remain open challenges that the entire field continues to work to address.

SIMA 2: A Generalist Embodied Agent for Virtual Worlds (arXiv:2512.04797, Team et al., 4 Dec 2025), Discussion section