
OceanGym: A Benchmark Environment for Underwater Embodied Agents (2509.26536v1)

Published 30 Sep 2025 in cs.CL, cs.AI, cs.CV, cs.LG, and cs.RO

Abstract: We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal LLMs (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.

Summary

  • The paper introduces OceanGym, a simulation platform built on Unreal Engine 5.3 that benchmarks underwater embodied agents across perception and decision tasks.
  • It demonstrates significant performance gaps between state-of-the-art MLLMs and human experts, particularly in sonar integration and memory retention under low-visibility conditions.
  • The study emphasizes the need for enhanced multimodal fusion, adaptive memory architectures, and sim-to-real transfer methods to advance marine robotics.

OceanGym: A Benchmark Environment for Underwater Embodied Agents

Motivation and Benchmark Design

OceanGym establishes a comprehensive simulation platform for evaluating embodied agents in oceanic underwater environments, addressing the unique perceptual and decision-making challenges inherent to marine domains. Unlike terrestrial or aerial benchmarks, OceanGym models low visibility, dynamic currents, and partial observability, requiring agents to integrate multimodal sensory data (RGB, sonar) and execute long-horizon tasks under uncertainty. The environment is constructed atop Unreal Engine 5.3, featuring realistic semantic regions (seabed, cliffs, pipelines, wreckage, energy infrastructure) and supporting configurable depth and lighting conditions (Figure 1).

Figure 1: OceanGym provides a unified agent framework for eight real-world underwater scenarios, integrating language instructions, multimodal perception, and AUV control.

Eight task domains are included, spanning both perception (multi-view and context-based) and decision tasks (detection and tracking), with explicit support for memory-augmented agent architectures. The agent-environment interaction is formalized as a POMDP with contextual memory: the agent receives synchronized RGB and sonar images from six directions along with textual instructions, and maintains a sliding-window memory of past actions and descriptions. The action space comprises both translational and rotational maneuvers, enabling realistic 3D navigation (Figure 2).

Figure 2: OceanGym tasks are divided into perception (multi-view, context-based) and decision tasks, reflecting operational requirements in marine robotics.
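To make this formulation concrete, the sketch below outlines one possible interaction loop: six-direction RGB and sonar observations plus a textual instruction go in, a discrete maneuver comes out, and a sliding window of past summaries and actions is carried along. The class names, action vocabulary, and method signatures here are illustrative assumptions, not the benchmark's actual API.

```python
# Illustrative POMDP-with-memory loop; OceanGymEnv-style `env`, `Observation`,
# and the action names below are hypothetical stand-ins for the benchmark's API.
from collections import deque
from dataclasses import dataclass

@dataclass
class Observation:
    rgb: dict           # six directional RGB frames, e.g. {"front": ..., "down": ...}
    sonar: dict         # six directional sonar images aligned with the RGB views
    instruction: str    # natural-language task description

ACTIONS = [
    "forward", "backward", "left", "right", "ascend", "descend",   # translational
    "yaw_left", "yaw_right", "pitch_up", "pitch_down",             # rotational
]

def run_episode(env, agent, max_steps=200, window=8):
    """Roll out one decision-task episode with a sliding-window memory."""
    memory = deque(maxlen=window)                    # last K (summary, action) pairs
    obs = env.reset()
    for _ in range(max_steps):
        summary, action = agent.act(obs, list(memory))  # single MLLM call per step
        assert action in ACTIONS
        memory.append((summary, action))
        obs, done = env.step(action)
        if done:
            break
    return memory
```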

Agent Architecture and Task Formulation

The agent policy is parameterized by a Multi-modal LLM (MLLM), mapping instructions, observations, and memory to either perception responses or control actions. For perception tasks, the agent must identify and localize underwater objects from multi-view or sequential RGB (and optionally sonar) images. For decision tasks, the agent integrates sensory input, memory, and goal specifications to navigate towards targets or perform inspection, subject to temporal and spatial constraints.
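The dual output space (perception responses versus control actions) can be viewed as two prompting modes over the same MLLM. The dispatch below is a hypothetical sketch of that idea, with `mllm_call` standing in for whatever multimodal model client is used; it is not the paper's implementation.

```python
def policy(mllm_call, task_type, instruction, images, memory_context=""):
    """Hypothetical dispatch over one MLLM: perception tasks return an answer,
    decision tasks return the next control action."""
    if task_type == "perception":
        prompt = (f"{instruction}\n"
                  "Identify and localize the target object in the provided views.")
        return mllm_call(prompt, images)     # textual identification / localization
    if task_type == "decision":
        prompt = (f"{instruction}\n"
                  f"Recent steps:\n{memory_context}\n"
                  "Choose the next maneuver from the allowed action list.")
        return mllm_call(prompt, images)     # one control action per step
    raise ValueError(f"unknown task type: {task_type}")
```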

Memory is explicitly modeled as a sliding window over the last K steps, storing textual summaries and actions, with the perception module generating context-aware updates. The Markov property is preserved via an augmented hidden state (s_t, m_t), and the trajectory distribution is induced by the policy and the memory-augmented transition dynamics.
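A minimal sketch of such a fixed-length memory, assuming textual summaries and discrete actions as entries, might look like the following (the class is hypothetical, not the paper's implementation):

```python
from collections import deque

class SlidingWindowMemory:
    """Keeps only the last K (summary, action) pairs, i.e. the m_t in (s_t, m_t)."""
    def __init__(self, k: int):
        self.buffer = deque(maxlen=k)

    def update(self, summary: str, action: str) -> None:
        self.buffer.append((summary, action))   # oldest entry is evicted at capacity

    def as_context(self) -> str:
        # Serialize the window for inclusion in the next MLLM prompt.
        return "\n".join(f"[t-{len(self.buffer) - i}] {s} -> {a}"
                         for i, (s, a) in enumerate(self.buffer))
```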

Experimental Results and Analysis

Perception Task Performance

Experiments reveal substantial performance gaps between state-of-the-art MLLMs and human experts, especially under deep-water, low-illumination conditions. In shallow water, Qwen2.5-VL-7B achieves the highest average accuracy (57.12%), while in deep water all models degrade sharply (Qwen2.5-VL-7B: 28.48%; Minicpm-4.5: 27.35%). Human accuracy remains near 100% in shallow water and 92.35% in deep water. Notably, the addition of sonar data does not consistently improve MLLM performance, in contrast to humans, who leverage sonar effectively (Figure 3).

Figure 3: Human experts consistently benefit from sonar data, while MLLMs show limited gains even with reference examples in deep water environments.

Case analysis indicates that perception errors are primarily due to occlusions, multi-object scenes, and low illumination, with agents failing to disambiguate targets or maintain temporal consistency.

Decision Task Performance

Decision tasks further highlight the limitations of current MLLMs. GPT-4o-mini achieves the best average score in both shallow (18.4%) and deep water (14.8%), with other models trailing. Human performance is 100% in shallow water and 69.6% in deep water. Several tasks yield zero scores for MLLMs, particularly when targets are small or time constraints are stringent. Failure cases are attributed to perception errors and memory forgetting, where agents lose track of visited locations or misinterpret observations (Figure 4).

Figure 4: Decision task failures are dominated by perception errors and memory lapses, especially in cluttered or low-visibility environments.

Scaling and Memory Transfer

Extended exploration time improves agent performance up to a plateau, indicating a scaling law for embodied agents. However, performance saturates due to limitations in perception and memory and the lack of intrinsic exploration strategies (Figure 5).

Figure 5: (Top) Performance increases with longer operation time but plateaus, reflecting scaling limits. (Bottom) Memory transfer improves decision-making, especially in cross-task scenarios under perceptual degradation.

Memory transfer experiments show that leveraging prior experience enhances decision-making, particularly in cross-task transfer under deep water conditions. Within-task transfer yields limited benefits in challenging environments, underscoring the need for adaptive memory retrieval mechanisms.
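In code terms, memory transfer amounts to initializing a new episode's sliding window with summaries harvested from a prior run rather than starting empty. The snippet below is a hypothetical sketch of that seeding step, not the paper's transfer procedure.

```python
from collections import deque

def seed_memory(prior_entries, k=8):
    """Hypothetical cross-task seeding: initialize the sliding window with
    the most recent (summary, action) pairs from an earlier episode."""
    memory = deque(maxlen=k)
    memory.extend(prior_entries[-k:])   # older prior experience is dropped
    return memory
```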

Limitations and Implications

OceanGym provides a high-fidelity, extensible testbed for underwater embodied agents, but several limitations persist. The simulation does not fully capture real-world oceanic factors such as currents, salinity, marine life, and sonar noise. Sonar simulation introduces artifacts, and the environment requires substantial computational resources (a GPU with at least 24 GB of memory). Despite these constraints, OceanGym enables synthetic data generation, reinforcement learning, and sim-to-real transfer for AUV deployment, reducing reliance on hazardous field trials.

The persistent gap between MLLM-driven agents and human experts in perception and decision-making tasks highlights fundamental challenges in robust multimodal fusion, spatial reasoning, and long-horizon memory. The inability of current MLLMs to exploit sonar data and maintain consistent strategies under uncertainty suggests that advances in multimodal representation learning, memory architectures, and adaptive planning are required.

Future Directions

OceanGym sets the stage for research in robust embodied AI for marine applications. Future work should focus on:

  • Enhancing physical realism via generative models and physics-informed simulation.
  • Improving sonar and multimodal sensor fusion for perception under extreme conditions.
  • Developing scalable memory and retrieval mechanisms for long-horizon tasks.
  • Optimizing agent architectures for edge deployment in resource-constrained environments.
  • Bridging sim-to-real gaps for reliable transfer to real-world AUVs.

Conclusion

OceanGym introduces a rigorous benchmark for underwater embodied agents, revealing significant limitations in current MLLMs and providing a foundation for advancing robust autonomous systems in marine robotics. The results underscore the need for continued research in multimodal perception, memory, and decision-making under uncertainty, with practical implications for ocean exploration, resource monitoring, and autonomous intervention.
