UnderwaterVLA: Autonomous Underwater VLA Systems

Updated 1 June 2026

UnderwaterVLA is a multimodal framework that fuses vision, language, and control modules to tackle challenges in submerged, dynamic environments.
The architecture employs a dual-brain design, with a cloud-based mission reasoning brain for high-level planning and a local reactive control brain for real-time execution.
The framework leverages both simulation-based and real-world datasets (USIM, UWBench, UVLM) to benchmark visual-language tasks and improve robotic autonomy.

UnderwaterVLA (Underwater Vision-Language(-Action) Architecture) encompasses a family of multimodal approaches, datasets, benchmarks, and control frameworks for autonomous operation and environmental understanding in underwater domains. This paradigm fuses vision, natural language, and control/action modules to address the unique challenges imposed by submerged environments, including hydrodynamic complexity, severe image degradation, constrained communication, and the need for high-level semantic reasoning. UnderwaterVLA research ranges from foundational datasets and end-to-end system architectures for robotic autonomy, through perception and mapping pipelines, to domain-specific video–language understanding benchmarks tailored for marine science and exploration.

1. Vision-Language-Action System Architectures for Underwater Robotics

The UnderwaterVLA framework introduces multimodal foundation models specifically adapted for underwater autonomous systems (Wang et al., 26 Sep 2025). The leading architectural scheme is a biologically inspired “dual-brain” split:

Mission Reasoning Brain (Cloud Brain): Executes high-level task decomposition, planning, and interpretable chain-of-thought (CoT) reasoning via foundation VLA models (e.g., QVQ-MAX). It receives natural language directives, fuses them with periodic visual/sonar snapshots, and outputs a serialized sequence of discrete, short-horizon sub-goals $\{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_N\}$ together with their rationales.
Reactive Control Brain (Cerebellum): Onboard the AUV, this local branch parses sub-goals, refines their execution using local state (camera, IMU, DVL, pressure, sonar), and dispatches commands to a hydrodynamics-informed Model Predictive Control (MPC) core.

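The central interface is formalized as: $\{\mathcal{S}_i\}_{i=1}^N = \Phi_{\mathrm{cloud}}(L,I), \qquad \mathbf{u}_t = \Phi_{\mathrm{local}}(\mathcal{S}_t, x_t)$ where $L$ is the input language instruction, $I$ is the latest sensory imagery, $\mathcal{S}_t$ is the sub-goal active at time $t$, $x_t$ is the local vehicle state, and $\mathbf{u}_t$ is the resulting control command.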

This dual-brain design enables robust operation under severe bandwidth limitations and high-latency acoustic communication, decoupling long-horizon semantic planning from time-critical, disturbance-resilient control loops (Wang et al., 26 Sep 2025).
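The split between $\Phi_{\mathrm{cloud}}$ and $\Phi_{\mathrm{local}}$ can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `SubGoal` fields, the keyword rule standing in for the VLA model, the proportional law standing in for the MPC core, and the one-dimensional depth state. The point it shows is structural: the cloud planner is called once, while the local loop runs at every tick.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubGoal:
    """One short-horizon sub-goal S_i (field names are hypothetical)."""
    command: str      # discrete command label
    target: float     # depth setpoint in metres
    rationale: str    # short explanation emitted alongside the sub-goal


def cloud_plan(instruction: str, current_depth: float) -> List[SubGoal]:
    """Stand-in for Phi_cloud(L, I); a real system would query a VLA model."""
    target = 5.0 if "descend" in instruction.lower() else current_depth
    return [
        SubGoal("set_depth", target, "reach the depth requested by the operator"),
        SubGoal("hold_depth", target, "stabilise before the next snapshot"),
    ]


def local_control(goal: SubGoal, depth: float, gain: float = 0.5) -> float:
    """Stand-in for Phi_local(S_t, x_t); a proportional law replaces the MPC."""
    return gain * (goal.target - depth)


def run(instruction: str, depth: float = 0.0,
        ticks_per_goal: int = 20, dt: float = 0.1) -> float:
    goals = cloud_plan(instruction, depth)      # single slow, high-latency call
    for goal in goals:
        for _ in range(ticks_per_goal):         # fast onboard loop
            u = local_control(goal, depth)
            depth += u * dt                     # toy kinematics: u is vertical velocity
    return depth
```

Because the local loop only consumes the already-received list of sub-goals, a dropped or delayed acoustic link does not stall control in this sketch; it only delays the next replanning step.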

2. Data Foundations: Datasets and Benchmarks

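Three lines of work underpin UnderwaterVLA data infrastructure: simulation-based VLA datasets, real-image vision-language datasets, and visual-inertial localization datasets.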

2.1 Simulation-Based VLA Datasets

USIM (Gu et al., 9 Oct 2025) is a simulation-based, multi-task dataset supporting vision-language-action training and evaluation for underwater robots. Generated in the Stonefish simulator with an instrumented BlueROV2 setup (5-DOF arm, 6 thrusters), it comprises:

561,260 frames (10 Hz)
1,852 trajectories (~15.6 hours)
20 instructed tasks (grasping, inspection, tracking, transport) across 9 scenarios featuring randomized object placement, turbidity, and sun angle.

A single data sample is structured: $\mathcal{D}_i = (I_{L,i}, I_{R,i}, a_i, p_i, v_i, u_i, q_i, \ell_i)$ with stereo images, IMU, pressure, DVL, normalized actions, manipulator state, and language instruction.
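As a rough sketch, the sample tuple can be written as a Python dataclass, and the quoted dataset size can be cross-checked against the frame count and frame rate. The field names, types, and vector lengths below are assumptions for illustration; only the tuple ordering and the three headline numbers come from the description above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class UsimSample:
    """One sample D_i; field names and types are hypothetical."""
    left_image: bytes                    # I_L
    right_image: bytes                   # I_R
    imu: Sequence[float]                 # a_i
    pressure: float                      # p_i
    dvl_velocity: Sequence[float]        # v_i
    action: Sequence[float]              # u_i, normalized
    manipulator_state: Sequence[float]   # q_i
    instruction: str                     # l_i


FRAMES = 561_260
RATE_HZ = 10
TRAJECTORIES = 1_852


def dataset_hours(frames: int = FRAMES, rate_hz: int = RATE_HZ) -> float:
    """Total recorded time in hours."""
    return frames / rate_hz / 3600


def mean_trajectory_seconds(frames: int = FRAMES,
                            trajectories: int = TRAJECTORIES,
                            rate_hz: int = RATE_HZ) -> float:
    """Average trajectory length in seconds."""
    return frames / trajectories / rate_hz
```

At 10 Hz, 561,260 frames correspond to about 15.6 hours, which agrees with the stated duration, and the average trajectory lasts roughly 30 seconds.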

2.2 Real-Image Vision-Language Datasets

UWBench (Zhang et al., 21 Oct 2025) provides a human-verified corpus of 15,003 underwater images from reefs, open ocean, and deep-sea habitats, annotated in total with:

15,281 object referring expressions (158 categories)
124,983 question-answer pairs (~8 per image)

Three core tasks are defined: detailed image captioning, visual grounding (bounding box prediction), and visual question answering. The benchmark exposes significant domain gaps attributed to color attenuation, turbidity, and domain-specific visual context: captioning BLEU-4 falls by ~40% relative to COCO, and mean IoU in grounding is 80.18% (Zhang et al., 21 Oct 2025).
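For readers unfamiliar with grounding metrics, the following sketch computes box IoU, mean IoU, and a thresholded accuracy. The `(x1, y1, x2, y2)` corner format and the 0.5 default threshold are common conventions assumed here, not details confirmed for UWBench.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_scores(preds: Sequence[Box], gts: Sequence[Box],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Return (mean IoU, fraction of predictions with IoU >= threshold)."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    mean_iou = sum(ious) / len(ious)
    accuracy = sum(iou >= threshold for iou in ious) / len(ious)
    return mean_iou, accuracy
```

Mean IoU rewards partial overlap, whereas thresholded accuracy only counts a prediction once it overlaps the reference box sufficiently, so the two numbers can differ noticeably on the same predictions.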

UVLM (Xue et al., 3 Jul 2025) extends this paradigm to video–language, with 2,109 videos (~860k frames, 419 species, multiple habitats) and a task taxonomy comprising 20 biological and environmental content/change tasks. Specialized metrics such as Turbidity-Aware IoU and the composite UVLU score account for the underwater visibility challenge.

2.3 Visual-Inertial Localization and Mapping

AQUALOC (Ferrera et al., 2019, Ferrera et al., 2018) is a visual-inertial-pressure dataset for underwater SLAM. It consists of sequences from harbor and archaeological sites with synchronized monocular imagery, IMU, and depth, along with offline Structure-from-Motion ground truth. Standard evaluation metrics include Absolute Trajectory Error (ATE) and Relative Pose Error (RPE).
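The two trajectory metrics can be sketched as follows. This is a simplified, translation-only version that assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame; a full evaluation would first align them (for example with a similarity transform) and would also account for rotation in RPE.

```python
import math
from typing import Sequence

Point = Sequence[float]


def ate_rmse(est: Sequence[Point], gt: Sequence[Point]) -> float:
    """Absolute Trajectory Error: RMSE of per-pose position differences."""
    sq = [sum((e - g) ** 2 for e, g in zip(p, q)) for p, q in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))


def rpe_rmse(est: Sequence[Point], gt: Sequence[Point], delta: int = 1) -> float:
    """Relative Pose Error (translation only): RMSE of displacement differences
    over a fixed step of `delta` poses."""
    errs = []
    for i in range(len(est) - delta):
        d_est = [b - a for a, b in zip(est[i], est[i + delta])]
        d_gt = [b - a for a, b in zip(gt[i], gt[i + delta])]
        errs.append(sum((x - y) ** 2 for x, y in zip(d_est, d_gt)))
    return math.sqrt(sum(errs) / len(errs))
```

ATE measures global consistency and grows as drift accumulates, while RPE measures local accuracy per step, which is why the two are usually reported together.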

3. VLA Model Design and Adaptation

UnderwaterVLA systems typically employ pre-trained multimodal transformer backbones (e.g., Qwen-VL, QVQ-MAX) with underwater domain adaptation via prompt engineering or simulation-based fine-tuning (Wang et al., 26 Sep 2025, Gu et al., 9 Oct 2025).

Notable Model Variants

U0 (Gu et al., 9 Oct 2025): A VLA model integrating binocular vision, proprioception, and diffusion transformer fusion, leveraging a convolution-attention perception (CAP) module for spatial focus. CAP is used in training to predict egocentric target poses via convolution and attention mechanisms acting on VLM-derived visual tokens.

$\mathrm{Token} = \mathrm{VLM}(I_L,I_R,\ell)\,,\quad F = \mathrm{Conv}( \mathrm{Token} ),\quad A = \sigma(\mathrm{Conv}(F)),\quad F' = F \odot A,\quad T = \mathrm{MLP}(\mathrm{Pool}(F'))$

Loss combines action MSE and CAP auxiliary loss.
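The CAP data flow above can be traced in a small pure-Python sketch. Several simplifications are assumptions made for illustration: each `Conv` is treated as a 1x1 convolution (the same linear map applied to every token), pooling is a mean over tokens, the MLP is reduced to one linear layer, and the weighting `lam` between the two loss terms is hypothetical since the text does not state how they are balanced.

```python
import math
from typing import List

Matrix = List[List[float]]


def pointwise_conv(x: Matrix, w: Matrix) -> Matrix:
    """1x1 convolution: apply the linear map w (out_dim x in_dim) to each token."""
    return [[sum(wi * xi for wi, xi in zip(row, tok)) for row in w] for tok in x]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cap_head(tokens: Matrix, w_feat: Matrix, w_attn: Matrix, w_out: Matrix) -> List[float]:
    feat = pointwise_conv(tokens, w_feat)                        # F = Conv(Token)
    attn = [[sigmoid(v) for v in row]
            for row in pointwise_conv(feat, w_attn)]             # A = sigma(Conv(F))
    gated = [[f * a for f, a in zip(fr, ar)]
             for fr, ar in zip(feat, attn)]                      # F' = F (.) A
    pooled = [sum(col) / len(gated) for col in zip(*gated)]      # mean pool over tokens
    return [sum(wi * pi for wi, pi in zip(row, pooled))
            for row in w_out]                                    # T, one linear layer


def mse(pred: List[float], true: List[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def total_loss(action_pred: List[float], action_true: List[float],
               pose_pred: List[float], pose_true: List[float],
               lam: float = 1.0) -> float:
    """Action MSE plus a weighted CAP auxiliary term (lam is an assumed weight)."""
    return mse(action_pred, action_true) + lam * mse(pose_pred, pose_true)
```

With zero attention weights every gate equals 0.5, so the head simply halves the features before pooling; learned weights let the gate suppress tokens that do not help locate the target.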

Structured CoT Reasoning (Wang et al., 26 Sep 2025): Every VLA inference outputs not only action but a structured, interpretable rationalization. Prompts elicit step-wise JSON with discrete commands and sub-task flags, facilitating task traceability and system debuggability.
Prompt-Based Zero-Data Transfer: UnderwaterVLA demonstrates robust navigation performance in the absence of new underwater demonstration data by leveraging foundation models with carefully designed prompts and physics-aware downstream modules.
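A structured reply of this kind might be consumed as in the sketch below. The JSON schema (`steps`, `thought`, `command`, `subtask_done`) and the command vocabulary are hypothetical; the source states only that replies are step-wise JSON containing discrete commands and sub-task flags.

```python
import json
from typing import List, Tuple

# Hypothetical discrete command vocabulary.
ALLOWED_COMMANDS = {"move_forward", "turn_left", "turn_right", "ascend", "descend", "stop"}


def parse_reply(raw: str) -> List[Tuple[str, bool, str]]:
    """Parse a model reply into (command, subtask_done, thought) tuples,
    rejecting any command outside the allowed vocabulary."""
    data = json.loads(raw)
    steps = []
    for i, step in enumerate(data["steps"]):
        command = step["command"]
        if command not in ALLOWED_COMMANDS:
            raise ValueError(f"step {i}: unknown command {command!r}")
        steps.append((command, bool(step.get("subtask_done", False)), step.get("thought", "")))
    return steps


EXAMPLE_REPLY = """
{"steps": [
  {"thought": "Pipe is visible ahead and slightly below.", "command": "descend", "subtask_done": false},
  {"thought": "Depth matches the pipe; approach it.", "command": "move_forward", "subtask_done": true}
]}
"""
```

Validating commands before they reach the controller keeps a malformed or hallucinated reply from being executed, and the retained `thought` strings give a per-step trace for debugging.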

4. Control, Mapping, and Integration

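Hydrodynamics-informed Model Predictive Control (MPC) is a central component in UnderwaterVLA robotic execution (Wang et al., 26 Sep 2025). The MPC core minimizes a cost over a 1 s horizon: $J = \sum_{k=0}^{N-1} [\alpha \|v_k - v_k^{\mathrm{ref}}\|^2 + \beta \|\theta_k - \theta_k^{\mathrm{ref}}\|^2 + \gamma \|\tau_k\|^2 + \delta D(v_k, r_k)]$ subject to actuator constraints. Here $N$ is the number of horizon steps, $v_k$ and $\theta_k$ are the velocity and orientation with references $v_k^{\mathrm{ref}}$ and $\theta_k^{\mathrm{ref}}$, $r_k$ is the yaw rate, $\tau_k$ is the control effort, $D(v_k, r_k)$ is the drag penalty, and $\alpha, \beta, \gamma, \delta$ are weights. The drag coefficients $D_v, D_r$ are adapted in real time from sensor feedback: $\hat{D}_v = \frac{\tau_v - M\dot{v}}{v|v|}, \qquad \hat{D}_r = \frac{\tau_r - I_z\dot{r}}{r|r|}$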
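The drag estimator and the stage cost can be sketched with scalar states as below. Three things are assumptions: the form of the drag penalty $D(v, r)$, taken here as $D_v v^2 + D_r r^2$ because the text does not define it; the guard that skips the update when the velocity is near zero, where the estimator's denominator vanishes; and the weight names, which are placeholders for the cost weights.

```python
from typing import Optional, Sequence, Tuple


def estimate_drag(tau: float, inertia: float, accel: float, vel: float,
                  eps: float = 1e-3) -> Optional[float]:
    """Quadratic drag estimate (tau - inertia * accel) / (vel * |vel|).

    Returns None when vel * |vel| is too small for a reliable estimate
    (this guard is an assumption, not part of the stated formula).
    """
    denom = vel * abs(vel)
    if abs(denom) < eps:
        return None
    return (tau - inertia * accel) / denom


def mpc_cost(trajectory: Sequence[Tuple[float, float, float, float]],
             references: Sequence[Tuple[float, float]],
             w_vel: float, w_att: float, w_effort: float, w_drag: float,
             d_v: float, d_r: float) -> float:
    """Sum the stage costs over the horizon.

    trajectory: per-step (v, theta, r, tau); references: per-step (v_ref, theta_ref).
    """
    total = 0.0
    for (v, theta, r, tau), (v_ref, theta_ref) in zip(trajectory, references):
        drag_penalty = d_v * v * v + d_r * r * r   # assumed form of D(v, r)
        total += (w_vel * (v - v_ref) ** 2
                  + w_att * (theta - theta_ref) ** 2
                  + w_effort * tau ** 2
                  + w_drag * drag_penalty)
    return total
```

In a full controller, each new drag estimate would typically be low-pass filtered before being fed back into the prediction model, since the raw ratio is noisy whenever acceleration measurements are.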

SVIn2-based visual-inertial odometry pipelines (Wang et al., 2023) provide dense mapping in real time, fusing stereo depth maps with multi-view confidence weighting and occlusion handling.

5. Empirical Evaluation and Performance Metrics

5.1 Robot Autonomy and Control

Task Completion: UnderwaterVLA's dual-brain system yields 19–27% higher completion rates in complex navigation tasks under turbidity and low-light than simulation-only or single-brain PID control (Wang et al., 26 Sep 2025).
Action Error: USIM/U0 evaluation demonstrates that fine-tuning on underwater simulation data reduces action prediction error by 67%, with CAP providing an additional 4–8% improvement (Gu et al., 9 Oct 2025). Binocular vision brings further benefits over monocular input.

5.2 Vision-Language Understanding Benchmarks

Captioning: On UWBench, GPT-5 achieves BLEU-4 = 14.90; open-source models lag by ~6 BLEU-4. Underwater visual degradation causes a drop of ~40% in BLEU-4 relative to terrestrial datasets (Zhang et al., 21 Oct 2025).
Grounding: Specialized pre-training is critical; Qwen3-VL-30B-Instruct attains a grounding accuracy of 94.40%, while general-purpose models such as GPT-4o perform substantially worse.
Video-Language Understanding: UVLM fine-tuning provides 10–26% improvements in composite UVLU score, but still leaves open challenges in object localization under high turbidity and in fine-grained behavior description (Xue et al., 3 Jul 2025).

6. Communication, Distributed Perception, and Networked Estimation

UnderwaterVLA is also used to refer to adaptive distributed estimation over underwater visible light communication (UVLC) networks (Abdavinejad et al., 2022). Here, diffusion adaptation is analyzed in the presence of optical noise and turbulence, with analytic and simulation results establishing:

Reliable operation at node spacings $d < 10$ m, within bounded salinity (ppt) and water temperature (°C) ranges.
Combine-Then-Adapt (CTA) diffusion adaptation outperforms Adapt-Then-Combine (ATC) by 2–3 dB in steady-state Mean Square Deviation (MSD), with network MSD remaining below –30 dB in favorable regimes.

Key operational guidance includes choosing a suitable step size, using larger receiver apertures and higher transmit power under harsher channel conditions, and strictly controlling node deployment topology and environmental factors.
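The difference between the two diffusion orderings can be shown with a toy scalar diffusion LMS. This sketch assumes a ring of nodes with uniform averaging, ideal links, and plain Gaussian measurement noise; it does not model the optical noise or turbulence of a UVLC channel, so it illustrates the update order and the MSD metric rather than reproducing the reported 2–3 dB gap. All parameter values are illustrative.

```python
import math
import random
from typing import List


def diffusion_lms(mode: str, n_nodes: int = 6, w_true: float = 1.5,
                  mu: float = 0.05, noise_std: float = 0.1,
                  iters: int = 2000, seed: int = 0) -> float:
    """Run scalar diffusion LMS and return steady-state network MSD in dB."""
    rng = random.Random(seed)
    w = [0.0] * n_nodes
    # Ring topology: each node averages itself and its two neighbours.
    nbrs = [[(k - 1) % n_nodes, k, (k + 1) % n_nodes] for k in range(n_nodes)]

    def combine(vals: List[float]) -> List[float]:
        return [sum(vals[l] for l in nbrs[k]) / 3 for k in range(n_nodes)]

    def adapt(vals: List[float]) -> List[float]:
        out = []
        for wk in vals:
            u = rng.gauss(0.0, 1.0)                        # regressor
            d = u * w_true + rng.gauss(0.0, noise_std)     # noisy measurement
            out.append(wk + mu * u * (d - u * wk))         # LMS step
        return out

    tail = []
    for i in range(iters):
        if mode == "ATC":
            w = combine(adapt(w))      # adapt locally, then average neighbours
        elif mode == "CTA":
            w = adapt(combine(w))      # average neighbours, then adapt locally
        else:
            raise ValueError("mode must be 'ATC' or 'CTA'")
        if i >= iters - 200:           # average MSD over the last 200 iterations
            tail.append(sum((w_true - wk) ** 2 for wk in w) / n_nodes)
    return 10 * math.log10(sum(tail) / len(tail))
```

Calling `diffusion_lms("CTA")` and `diffusion_lms("ATC")` with the same seed gives the two steady-state MSD values for this idealized setting; raising `mu` speeds up convergence at the cost of a higher steady-state MSD, which is the trade-off behind the step size guidance.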

7. Open Research Challenges and Future Directions

Persistent underwater VLA challenges include:

Achieving robust generalization in the presence of domain shift caused by absorption, scattering, and chromatic distortion. Physics-aware augmentation and domain-specific self-supervised pretraining are outstanding directions (Zhang et al., 21 Oct 2025).
Integrating additional modalities such as active sonar and multi-channel sensors to mitigate visibility loss.
Enhancing prompting and model architectures with ecological knowledge, turbidity-aware attention, and color correction subnetworks.
Real-world deployment and full-scale coordination under asynchronous, high-latency acoustic communications, including multi-AUV settings (Wang et al., 26 Sep 2025).
Continuous mapping with volumetric representations, learning-based perception fusion, and stronger drift correction in feature-sparse subsea environments (Wang et al., 2023).

USIM, UWBench, and UVLM collectively offer scalable, rigorously constructed benchmarks for advancing research in underwater vision-language-action, with documented gaps relative to terrestrial performance indicating substantial room for innovation in both perception and control domains (Gu et al., 9 Oct 2025, Zhang et al., 21 Oct 2025, Xue et al., 3 Jul 2025).