
OmniVLA: Omni-Modal Robot Navigation

Updated 25 September 2025
  • OmniVLA is an omni-modal vision–language–action model that fuses visual, linguistic, and positional inputs for flexible robot navigation.
  • It employs modality dropout and a shared token space to achieve robust cross-modal generalization across diverse goal specifications.
  • Extensive evaluations show superior performance over single-modality baselines, enabling scalable deployment in varied real-world scenarios.

OmniVLA is an omni-modal vision–language–action (VLA) model for robot navigation designed to enable flexible, robust navigation policies that condition on multiple, heterogeneous goal specifications. In contrast to prior robotic navigation systems that typically support a single modality—such as language, spatial coordinates, or visual references—OmniVLA can simultaneously interpret and compose diverse goal inputs (including egocentric images, 2D poses, and natural language) and demonstrates strong generalization and robustness in both simulation and real-world evaluation (Hirose et al., 23 Sep 2025).

1. Architectural Overview

At its core, OmniVLA utilizes a high-capacity VLA backbone initially based on OpenVLA, extended to incorporate omni-modal goal conditioning. The architecture processes the current egocentric visual observation of the robot through a dedicated visual encoder, and flexibly integrates three distinct goal modalities:

  • Natural language instructions (linguistic)
  • Egocentric goal images (visual)
  • 2D goal poses (positional)

Each modality is embedded into a shared token space before input into the LLM backbone, enabling unified multimodal reasoning. During training, random masking (denoted "modality dropout") is applied to subsets of goal modalities, preventing overreliance on any single input and fostering resilient cross-modal representation learning. The output is produced using a linear action head appended to the LLM backbone, predicting the next sequence of navigation actions $\{\bar{a}_i\}$.

The policy function may be summarized as:

$$\{\bar{a}_i\}_{i=1}^{n} = \pi_\theta(I_c,\, I_g,\, p_g,\, l_g,\, t_m)$$

where $I_c$ is the current egocentric image, $I_g$ the goal image, $p_g$ the 2D goal pose, $l_g$ the language instruction, and $t_m$ encodes the specific combination of modalities merged in a given training instance.
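
To make the conditioning concrete, the following PyTorch-style sketch shows how heterogeneous goal inputs could be projected into a shared token space, passed through a backbone, and decoded by a linear action head. All feature sizes, the placeholder encoders, the small Transformer standing in for the LLM backbone, and the pooling scheme are illustrative assumptions, not the released OmniVLA implementation.

```python
import torch
import torch.nn as nn

class OmniModalPolicy(nn.Module):
    """Minimal sketch of the omni-modal policy pi_theta(I_c, I_g, p_g, l_g, t_m).

    Module sizes and the tiny Transformer backbone are illustrative assumptions.
    """

    def __init__(self, d_model=512, n_actions=8, action_dim=2):
        super().__init__()
        # Per-modality projections into the shared token space.
        self.image_enc = nn.Linear(768, d_model)   # stand-in for a vision-encoder feature
        self.pose_enc = nn.Linear(2, d_model)      # 2D goal pose (x, y)
        self.lang_enc = nn.Linear(384, d_model)    # stand-in for a text embedding
        # Stand-in for the LLM backbone.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Linear action head predicting n_actions future actions.
        self.action_head = nn.Linear(d_model, n_actions * action_dim)
        self.n_actions, self.action_dim = n_actions, action_dim

    def forward(self, obs_feat, goal_img_feat, goal_pose, goal_lang_feat, drop_mask):
        # Token order: [current image, goal image, goal pose, language] -> (B, 4, d_model).
        tokens = torch.stack([
            self.image_enc(obs_feat),
            self.image_enc(goal_img_feat),
            self.pose_enc(goal_pose),
            self.lang_enc(goal_lang_feat),
        ], dim=1)
        # drop_mask (B, 4): True marks goal tokens masked out ("modality dropout").
        h = self.backbone(tokens, src_key_padding_mask=drop_mask)
        # Mean-pool the retained tokens and decode the action sequence.
        keep = (~drop_mask).float().unsqueeze(-1)
        pooled = (h * keep).sum(1) / keep.sum(1).clamp(min=1.0)
        return self.action_head(pooled).view(-1, self.n_actions, self.action_dim)

# Example: one instance conditioned on a goal image only, one on language only.
policy = OmniModalPolicy()
mask = torch.tensor([[False, False, True, True],
                     [False, True, True, False]])
actions = policy(torch.randn(2, 768), torch.randn(2, 768),
                 torch.randn(2, 2), torch.randn(2, 384), mask)
print(actions.shape)  # torch.Size([2, 8, 2])
```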

2. Training Paradigm and Objective Functions

OmniVLA is trained on an extensive mixture of datasets—over 9,500 hours, spanning 10 hardware platforms—incorporating navigation trajectories with raw 2D pose, language annotation, and visual goal images. The randomized modality fusion strategy independently samples the available modalities for each instance and constructs a corresponding attention mask to drop unavailable or intentionally omitted modalities.
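
A minimal way to realize this per-instance modality sampling is sketched below. The column layout ([current image, goal image, goal pose, language]), the drop probability, and the rule that at least one goal modality is always retained are assumptions for illustration; the resulting boolean mask (True = drop) can be passed as the attention mask in the policy sketch above.

```python
import torch

def sample_modality_mask(batch_size, p_drop=0.5, generator=None):
    """Sample a per-instance modality dropout mask (True = drop this token).

    Columns: [current image, goal image, goal pose, language]. The current
    observation is always kept; each goal modality is dropped independently
    with probability p_drop, with at least one goal modality retained.
    """
    goal_drop = torch.rand(batch_size, 3, generator=generator) < p_drop
    # Wherever every goal modality was dropped, un-drop one at random.
    all_dropped = goal_drop.all(dim=1)
    if all_dropped.any():
        idx = torch.randint(0, 3, (int(all_dropped.sum()),), generator=generator)
        goal_drop[all_dropped, idx] = False
    keep_obs = torch.zeros(batch_size, 1, dtype=torch.bool)
    return torch.cat([keep_obs, goal_drop], dim=1)

mask = sample_modality_mask(batch_size=4)
print(mask)  # shape (4, 4); column 0 (current image) is never dropped
```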

The central loss function is imitation loss, quantifying the mean-squared error between predicted and expert action sequences:

$$J_{\mathrm{il}} = \frac{1}{N} \sum_{i=1}^{N} \left(a^{\mathrm{ref}}_i - \bar{a}_i\right)^2$$

where $a^{\mathrm{ref}}_i$ is the reference (expert) action.

Additional objective terms further sculpt policy behavior:

  • $J_{\mathrm{obj}}$: Object-centric loss (active for language tasks) ensuring the final action is close to the target object's pose.
  • $J_{\mathrm{sm}}$: Smoothing loss penalizing abrupt changes between successive actions.

The full optimization target:

$$\min_\theta J = J_{\mathrm{il}} + m_{\mathrm{obj}} J_{\mathrm{obj}} + J_{\mathrm{sm}}$$

where $m_{\mathrm{obj}}$ is 1 on language tasks, 0 otherwise. The LoRA (Low-Rank Adaptation) technique is employed to constrain parameter updates, enabling increased batch sizes and improved convergence properties.
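
The combined objective can be written compactly in code. The imitation term follows the formula above exactly; the concrete forms of the object-centric and smoothing terms are assumptions, since only their roles are described here.

```python
import torch
import torch.nn.functional as F

def omnivla_loss(pred, ref, is_language_task, obj_pose):
    """Sketch of J = J_il + m_obj * J_obj + J_sm.

    pred, ref:        (B, N, action_dim) predicted / expert action sequences
    is_language_task: (B,) bool gate m_obj (1 on language tasks, 0 otherwise)
    obj_pose:         (B, action_dim) target object pose for J_obj
    The exact forms of J_obj and J_sm are illustrative assumptions.
    """
    # Imitation loss: MSE between predicted and expert actions.
    j_il = F.mse_loss(pred, ref)

    # Object-centric loss: the final predicted action should land near the
    # target object's pose, gated per sample by m_obj.
    dist = ((pred[:, -1] - obj_pose) ** 2).mean(dim=-1)
    j_obj = (is_language_task.float() * dist).mean()

    # Smoothing loss: penalize abrupt changes between successive actions.
    j_sm = ((pred[:, 1:] - pred[:, :-1]) ** 2).mean()

    return j_il + j_obj + j_sm
```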

3. Modality Fusion and Cross-Modal Generalization

A central aspect of OmniVLA is the policy’s robustness and generalizability across missing or noisy modalities, realized through two mechanisms:

  • Modality Dropout: At training time, tokens from one or more modalities are randomly masked, ensuring the model must rely on the remaining information for decision making. This induces cross-modal robustness, such that the model can navigate even when, for instance, only a partial goal specification is available.
  • Flexible Encoder Swapping: Experiments demonstrate that new modalities (e.g., satellite imagery) can be introduced by adding or swapping encoders and briefly fine-tuning, leveraging the transferability of the shared representations (see the sketch below).
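
The following standalone sketch illustrates the idea of attaching an encoder for a new modality (here, satellite-image features) and adapting a pretrained projection with a LoRA-style low-rank update. The feature sizes, rank, and choice of which layer to adapt are assumptions, not the released fine-tuning recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

# New encoder for the added modality (assumed 1024-d satellite-image features),
# projecting into the shared 512-d token space.
satellite_enc = nn.Linear(1024, 512)

# A stand-in for one pretrained projection inside the backbone, adapted with
# LoRA instead of updating its full weight matrix.
pretrained_proj = nn.Linear(512, 512)
adapted_proj = LoRALinear(pretrained_proj)

# Only the new encoder and the low-rank factors receive gradients.
trainable = list(satellite_enc.parameters()) + [adapted_proj.A, adapted_proj.B]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```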

OmniVLA also adapts to OOD (out-of-distribution) instructions, for instance processing behavioral prompts not encountered in training (e.g., “move along the wall” or “move on the grass”). This suggests that the LLM backbone, pretrained on broad language corpora, imparts priors enabling flexible interpretation beyond seen navigation directives.

4. Evaluation and Comparative Performance

Extensive evaluations indicate that OmniVLA outperforms specialist baselines across all supported modalities and task variants:

  • Language-Conditioned Navigation: Achieves a reported language success rate (SR) of 0.73, surpassing baselines such as CoW, LeLaN, and CounterfactualVLA.
  • 2D Goal-Pose Navigation: Delivers approximately 9% improvement in both success rate and progress over MBRA-pose.
  • Egocentric Image-Conditioned Navigation: Matches the 100% SR of top single-modality experts (e.g., MBRA-image) while maintaining robustness.
  • Multi-modal (Combined) Goals: Maintains high performance (~80% SR) on tasks requiring compositional understanding, e.g., following “where” (pose) and “how” (language instruction) jointly.

When fine-tuned on new modalities, such as satellite images, the OmniVLA-edge variant demonstrates a substantial increase in SR (from 0.19 to ~0.62), underscoring the effectiveness of cross-modal pretraining and adaptation.

5. Applications, Scalability, and Extensions

OmniVLA's omni-modal input space enables deployment across diverse scenarios:

  • Autonomous delivery and inspection: Where tasks can be described via GPS, landmark images, or free-form language.
  • Human–robot interaction: Supporting intuitive goal specification by natural language or visual referencing.
  • Cross-embodiment: Policy transfer across disparate platforms (wheeled robots such as FrodoBots, VizBot, and quadrupeds like Unitree Go1) as a consequence of its data diversity and robust pretraining.

The model is architected for extensibility; encoders for new modalities—such as spatial sketches, higher abstraction instructions, or map-based references—can in principle be integrated and fine-tuned with minimal additional effort. This suggests a plausible trajectory toward continually expanding omni-modal policies.

6. Project Resources and Future Directions

The OmniVLA project releases comprehensive resources, including:

  • Video demonstrations documenting performance in diverse environments and platforms.
  • Pre-trained checkpoints for both the full model and a resource-efficient “OmniVLA-edge” variant.
  • Source code and documentation for training and fine-tuning on custom datasets and modalities.

Future research directions highlighted include:

  • Enhanced fusion of behavioral and spatial targets in language-conditioned navigation.
  • Expansion to carefully curated, larger datasets to further improve generalization.
  • Exploration of more complex multi-modal input combinations and policy compositionality.

The release of project materials at https://omnivla-nav.github.io supports reproducibility and accelerates further development.


OmniVLA advances the foundation model paradigm in robotics by integrating vision, language, and action into a single, general-purpose policy capable of robustly interpreting and composing diverse, real-world goal specifications. Its architectural flexibility, omni-modal fusion, and strong empirical performance establish a scalable reference for future omni-modal navigation research (Hirose et al., 23 Sep 2025).
