OmniVLA: Omni-Modal Robot Navigation
- OmniVLA is an omni-modal vision–language–action model that fuses visual, linguistic, and positional inputs for flexible robot navigation.
- It employs modality dropout and a shared token space to achieve robust cross-modal generalization across diverse goal specifications.
- Extensive evaluations show superior performance over single-modality baselines, enabling scalable deployment in varied real-world scenarios.
OmniVLA is an omni-modal vision–language–action (VLA) model for robot navigation, designed to enable flexible, robust policies that condition on multiple, heterogeneous goal specifications. In contrast to prior robotic navigation systems that typically support a single modality—such as language, spatial coordinates, or visual references—OmniVLA can simultaneously interpret and compose diverse goal inputs (including egocentric images, 2D poses, and natural language) and demonstrates strong generalization and robustness in both simulation and real-world evaluation (Hirose et al., 23 Sep 2025).
1. Architectural Overview
At its core, OmniVLA utilizes a high-capacity VLA backbone initialized from OpenVLA and extended to incorporate omni-modal goal conditioning. The architecture processes the robot's current egocentric visual observation through a dedicated visual encoder, and flexibly integrates three distinct goal modalities:
- Natural language instructions (linguistic)
- Egocentric goal images (visual)
- 2D goal poses (positional)
Each modality is embedded into a shared token space before input into the LLM backbone, enabling unified multimodal reasoning. During training, random masking—denoted “modality dropout”—is applied to subsets of goal modalities, preventing overreliance on any single input and fostering resilient cross-modal representation learning. The output is produced using a linear action head appended to the LLM backbone, predicting the next sequence of navigation actions.
The policy function may be summarized as

$$\hat{a} = \pi_\theta(I_t, I_g, p_g, \ell, m),$$

where $I_t$ is the current egocentric image, $I_g$ the goal image, $p_g$ the 2D goal pose, $\ell$ the language instruction, and $m$ encodes the specific combination of modalities merged in a given training instance.
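To make this conditioning scheme concrete, the following PyTorch-style sketch shows one plausible way to realize it. The module names, feature dimensions, and single-token-per-modality layout are illustrative assumptions, not the released OmniVLA implementation.

```python
import torch
import torch.nn as nn

class OmniModalPolicySketch(nn.Module):
    """Minimal sketch of omni-modal goal conditioning: each goal modality is
    projected into a shared token space, dropped modalities are masked out,
    and a linear action head decodes the next action sequence. All names and
    sizes are assumptions for illustration."""

    def __init__(self, d_model=1024, horizon=8, action_dim=2):
        super().__init__()
        # Hypothetical projections from pre-extracted features into the
        # shared token space (one token per modality for simplicity).
        self.image_proj = nn.Linear(768, d_model)   # egocentric / goal image features
        self.pose_proj = nn.Linear(2, d_model)      # 2D goal pose (x, y)
        self.lang_proj = nn.Linear(768, d_model)    # pooled language features
        # Stand-in for the LLM backbone: a single transformer encoder layer.
        self.backbone = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Linear action head predicting a short sequence of navigation actions.
        self.action_head = nn.Linear(d_model, horizon * action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, obs_feat, goal_img_feat, goal_pose, lang_feat, modality_mask):
        # Goal tokens in the order [goal image, goal pose, language].
        goal_tokens = torch.stack([
            self.image_proj(goal_img_feat),
            self.pose_proj(goal_pose),
            self.lang_proj(lang_feat),
        ], dim=1)                                          # (B, 3, d_model)
        # Modality dropout: zero out the tokens of dropped goal modalities.
        goal_tokens = goal_tokens * modality_mask.unsqueeze(-1)
        obs_token = self.image_proj(obs_feat).unsqueeze(1)  # observation is always kept
        h = self.backbone(torch.cat([obs_token, goal_tokens], dim=1))
        # Decode the action sequence from the final token representation.
        return self.action_head(h[:, -1]).view(-1, self.horizon, self.action_dim)
```

In the real system the observation and goals each contribute multiple tokens and the backbone is the OpenVLA-derived LLM; the sketch only mirrors the shared-token-space idea and the masking of dropped modalities.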
2. Training Paradigm and Objective Functions
OmniVLA is trained on an extensive mixture of datasets—over 9,500 hours, spanning 10 hardware platforms—incorporating navigation trajectories with raw 2D pose, language annotation, and visual goal images. The randomized modality fusion strategy independently samples the available modalities for each instance and constructs a corresponding attention mask to drop unavailable or intentionally omitted modalities.
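A short sketch of this sampling step is given below; the three-way modality ordering, drop probability, and the guarantee that at least one goal modality survives are assumptions used for exposition.

```python
import torch

def sample_modality_mask(available: torch.Tensor, p_drop: float = 0.5) -> torch.Tensor:
    """Randomized modality fusion sketch: independently keep or drop each
    *available* goal modality per instance, returning a binary mask over
    [goal image, goal pose, language] tokens. `available` is (B, 3) with 1
    where the dataset provides that modality; values here are illustrative."""
    keep = (torch.rand_like(available) > p_drop).float() * available
    # Ensure at least one available goal modality survives per sample.
    none_kept = (keep.sum(dim=1) == 0) & (available.sum(dim=1) > 0)
    if none_kept.any():
        idx = torch.multinomial(available[none_kept], num_samples=1)
        keep[none_kept] = keep[none_kept].scatter(1, idx, 1.0)
    return keep  # 1 = attend to this modality's tokens, 0 = masked out

# Example: goal image and language annotated, 2D pose missing for this batch.
available = torch.tensor([[1.0, 0.0, 1.0]]).repeat(4, 1)
mask = sample_modality_mask(available)   # e.g. rows like [1, 0, 0] or [1, 0, 1]
```

The resulting mask plays the role of the attention mask over goal tokens (cf. `modality_mask` in the earlier policy sketch).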
The central loss function is an imitation loss, quantifying the mean-squared error between predicted and expert action sequences:

$$\mathcal{L}_{\text{imit}} = \left\| \hat{a} - a^{*} \right\|_2^2,$$

where $a^{*}$ is the reference (expert) action.
Additional objective terms further sculpt policy behavior:
- $\mathcal{L}_{\text{obj}}$: Object-centric loss (active for language tasks) ensuring the final action is close to the target object's pose.
- $\mathcal{L}_{\text{smooth}}$: Smoothing loss penalizing abrupt changes between successive actions.
The full optimization target is

$$\mathcal{L} = \mathcal{L}_{\text{imit}} + \mathbb{1}_{\text{lang}}\,\mathcal{L}_{\text{obj}} + \mathcal{L}_{\text{smooth}},$$

where $\mathbb{1}_{\text{lang}}$ is 1 on language tasks and 0 otherwise. The LoRA (Low-Rank Adaptation) technique is employed to constrain parameter updates, enabling increased batch sizes and improved convergence properties.
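Put together, the objective can be sketched as follows; the loss weights, the object-pose target format, and the function name are assumptions consistent with the terms described above.

```python
import torch
import torch.nn.functional as F

def omnivla_style_loss(pred_actions, expert_actions, target_obj_pose=None,
                       is_language_task=False, w_obj=1.0, w_smooth=0.1):
    """Sketch of the full objective: MSE imitation loss, an object-centric
    term gated by the language-task indicator, and a smoothing penalty on
    successive actions. Weights and shapes are illustrative assumptions."""
    # L_imit: mean-squared error against the expert action sequence.
    loss = F.mse_loss(pred_actions, expert_actions)
    # L_obj: pull the final predicted waypoint toward the target object's
    # pose, active only when the language-task indicator is 1.
    if is_language_task and target_obj_pose is not None:
        loss = loss + w_obj * F.mse_loss(pred_actions[:, -1], target_obj_pose)
    # L_smooth: penalize abrupt changes between consecutive actions.
    diffs = pred_actions[:, 1:] - pred_actions[:, :-1]
    return loss + w_smooth * diffs.pow(2).mean()
```

LoRA itself is orthogonal to the loss: it restricts which backbone parameters receive gradient updates, which is what permits the larger batch sizes noted above.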
3. Modality Fusion and Cross-Modal Generalization
A central aspect of OmniVLA is the policy’s robustness and generalizability across missing or noisy modalities, realized through two mechanisms:
- Modality Dropout: At training time, tokens from one or more modalities are randomly masked, ensuring the model must rely on the remaining information for decision making. This induces cross-modal robustness, such that the model can navigate even when, for instance, only a partial goal specification is available.
- Flexible Encoder Swapping: Experiments demonstrate that new modalities (e.g., satellite imagery) can be introduced by adding/swapping encoders and briefly fine-tuning, leveraging the transferability of shared representations.
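As a hedged illustration of this encoder swapping (reusing the hypothetical `OmniModalPolicySketch` class from the earlier sketch), a new satellite-image encoder only needs to map into the same shared token space, after which it can be briefly fine-tuned while the backbone stays largely frozen:

```python
import torch
import torch.nn as nn

class SatelliteGoalEncoderSketch(nn.Module):
    """Hypothetical encoder for a newly added goal modality (e.g. satellite
    image crops), projecting pre-extracted features into the shared token
    space used by the other goal modalities."""
    def __init__(self, in_features=512, d_model=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_features, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, sat_feat):
        return self.proj(sat_feat)   # one token in the shared space

# Brief fine-tuning: train only the new encoder (and optionally the action
# head) while the pretrained backbone remains frozen.
policy = OmniModalPolicySketch()            # from the earlier sketch
sat_encoder = SatelliteGoalEncoderSketch()
for p in policy.backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    list(sat_encoder.parameters()) + list(policy.action_head.parameters()), lr=1e-4)
```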
OmniVLA also adapts to OOD (out-of-distribution) instructions, for instance processing behavioral prompts not encountered in training (e.g., “move along the wall” or “move on the grass”). This suggests that the LLM backbone, pretrained on broad language corpora, imparts priors enabling flexible interpretation beyond seen navigation directives.
4. Evaluation and Comparative Performance
Extensive evaluations indicate that OmniVLA outperforms specialist baselines across all supported modalities and task variants:
- Language-Conditioned Navigation: Achieves a reported language success rate (SR) of 0.73, surpassing baselines such as CoW, LeLaN, and CounterfactualVLA.
- 2D Goal-Pose Navigation: Delivers approximately 9% improvement in both success rate and progress over MBRA-pose.
- Egocentric Image-Conditioned Navigation: Matches the 100% SR of top single-modality experts (e.g., MBRA-image) while maintaining robustness.
- Multi-modal (Combined) Goals: Maintains high performance (~80% SR) on tasks requiring compositional understanding, e.g., following “where” (pose) and “how” (language instruction) jointly.
When fine-tuned on new modalities, such as satellite images, the OmniVLA-edge variant demonstrates a substantial increase in SR (from 0.19 to ~0.62), underscoring the effectiveness of cross-modal pretraining and adaptation.
5. Applications, Scalability, and Extensions
OmniVLA's omni-modal input space enables deployment across diverse scenarios:
- Autonomous delivery and inspection: Where tasks can be described via GPS, landmark images, or free-form language.
- Human–robot interaction: Supporting intuitive goal specification by natural language or visual referencing.
- Cross-embodiment: Policy transfer across disparate platforms (wheeled robots such as FrodoBots, VizBot, and quadrupeds like Unitree Go1) as a consequence of its data diversity and robust pretraining.
The model is architected for extensibility; encoders for new modalities—such as spatial sketches, higher abstraction instructions, or map-based references—can in principle be integrated and fine-tuned with minimal additional effort. This suggests a plausible trajectory toward continually expanding omni-modal policies.
6. Project Resources and Future Directions
The OmniVLA project releases comprehensive resources, including:
- Video demonstrations documenting performance in diverse environments and platforms.
- Pre-trained checkpoints for both the full model and a resource-efficient “OmniVLA-edge” variant.
- Source code and documentation for training and fine-tuning on custom datasets and modalities.
Future research directions highlighted include:
- Enhanced fusion of behavioral and spatial targets in language-conditioned navigation.
- Expansion to carefully curated, larger datasets to further improve generalization.
- Exploration of more complex multi-modal input combinations and policy compositionality.
The release of project materials at https://omnivla-nav.github.io supports reproducibility and accelerates further development.
OmniVLA advances the foundation model paradigm in robotics by integrating vision, language, and action into a single, general-purpose policy capable of robustly interpreting and composing diverse, real-world goal specifications. Its architectural flexibility, omni-modal goal fusion, and strong empirical performance establish a scalable reference for future omni-modal navigation research (Hirose et al., 23 Sep 2025).