OmniVLA-edge: Compact VLA Navigation

Updated 27 September 2025
  • OmniVLA-edge is a compact omni-modal vision-language-action navigation model that integrates visual, pose, and language inputs for flexible goal specification.
  • It employs modality-specific encoders and randomized fusion strategies, enhancing robustness to incomplete or ambiguous sensor data.
  • Its efficient design enables real-time deployment on edge devices, supporting cross-embodiment navigation in varied robotic applications.

OmniVLA-edge is an omni-modal vision-language-action (VLA) navigation model for resource-constrained hardware, derived as a compact variant of the OmniVLA architecture for robot navigation. Designed explicitly for edge devices, OmniVLA-edge enables real-time deployment on platforms with limited computational capability while maintaining the core properties of omni-modal goal specification, generalization, and robustness. It achieves flexible conditioning on multiple goal modalities (egocentric images, 2D spatial poses, and natural language instructions) through a streamlined transformer backbone rather than a large-scale vision-language model.

1. Model Design and Architecture

OmniVLA-edge implements omni-modal goal conditioning using a lightweight backbone. Built on ViNT, a navigation transformer with approximately 50 million parameters, it utilizes three parallel encoders for processing goal modalities:

  • Visual encoder for current and goal egocentric images,
  • Pose projector for 2D goal coordinates,
  • Language encoder derived from pre-trained LLMs.

These modality-specific embeddings are fused early: tokens from the visual, pose, and linguistic channels are aggregated before entering the transformer. During training, a randomly determined goal modality (denoted t_m) informs the construction of an attention mask, which restricts the model's focus according to available goal information. The transformer aggregates tokens from all modalities, and a linear prediction head produces a sequence of navigation actions, formalized as:

\{\hat{a}_i\}_{i=1}^{N} = \pi_{\theta}(I_c, I_g, p_g, l_g, t_m)

where I_c is the current image, I_g the goal image, p_g the goal 2D pose, l_g the natural language instruction, and t_m the selected goal modality.
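
A minimal sketch of this forward pass is given below. The module names (e.g., OmniModalPolicy), the single-token-per-input simplification, and the integer encoding of t_m (0 = image, 1 = pose, 2 = language) are illustrative assumptions, not the released ViNT-based implementation.

```python
# Minimal sketch of the omni-modal policy forward pass described above.
# Module names, dimensions, and the single-token-per-input simplification
# are illustrative assumptions, not the released OmniVLA-edge code.
import torch
import torch.nn as nn

class OmniModalPolicy(nn.Module):
    def __init__(self, d_model=512, n_actions=8, action_dim=2):
        super().__init__()
        # Modality-specific encoders projecting into a shared token space.
        self.visual_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_model))  # stand-in for an image encoder
        self.pose_projector = nn.Linear(2, d_model)                                # 2D goal pose -> token
        self.lang_encoder = nn.LazyLinear(d_model)                                 # stand-in for a language embedding projector
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.action_head = nn.Linear(d_model, n_actions * action_dim)
        self.n_actions, self.action_dim = n_actions, action_dim

    def forward(self, I_c, I_g, p_g, l_g, t_m):
        # One token per input; the real model uses many tokens per image.
        tokens = torch.stack([
            self.visual_encoder(I_c),   # current observation
            self.visual_encoder(I_g),   # goal image
            self.pose_projector(p_g),   # goal pose
            self.lang_encoder(l_g),     # language instruction embedding
        ], dim=1)                       # (B, 4, d_model)
        # Attention mask derived from the sampled goal modality t_m:
        # the current image is always visible; a goal token is kept only
        # if its modality was selected (0 = image, 1 = pose, 2 = language).
        keep = torch.stack([
            torch.ones_like(t_m, dtype=torch.bool),
            t_m == 0,
            t_m == 1,
            t_m == 2,
        ], dim=1)
        h = self.transformer(tokens, src_key_padding_mask=~keep)   # masked early fusion
        actions = self.action_head(h[:, 0])                        # predict from the current-image token
        return actions.view(-1, self.n_actions, self.action_dim)   # {a_hat_i}_{i=1..N}
```

In the actual model the image and language encoders are far richer (ViNT image tokens and a pre-trained language encoder), but the early fusion and modality masking follow the structure sketched here.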

The main training objective is the mean squared error between predicted and reference action sequences:

J_{il} = \frac{1}{N}\sum_{i=1}^{N} (a^{ref}_i - \hat{a}_i)^2

Additional loss terms, such as J_obj (to bias towards reaching language-specified targets) and J_sm (to regularize action smoothness), are incorporated as needed for specific datasets or tasks.
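
The overall objective can be assembled as a weighted sum of these terms. The snippet below shows one plausible composition; the weights lambda_obj and lambda_sm and the exact forms of J_obj and J_sm are placeholders, since their precise definitions are not reproduced here.

```python
# One plausible composition of the training objective, assuming action
# tensors of shape (B, N, action_dim); lambda_obj, lambda_sm, and the
# exact forms of J_obj and J_sm are placeholders.
import torch

def training_loss(a_hat, a_ref, a_obj=None, lambda_obj=0.0, lambda_sm=0.0):
    # J_il: mean squared error between predicted and reference action sequences.
    j_il = ((a_ref - a_hat) ** 2).mean()
    loss = j_il
    if a_obj is not None and lambda_obj > 0:
        # J_obj: bias the final predicted action toward a language-specified target.
        loss = loss + lambda_obj * ((a_obj - a_hat[:, -1]) ** 2).mean()
    if lambda_sm > 0:
        # J_sm: penalize large differences between consecutive actions (smoothness).
        loss = loss + lambda_sm * ((a_hat[:, 1:] - a_hat[:, :-1]) ** 2).mean()
    return loss
```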

2. Training Methodology and Data Regime

The OmniVLA-edge training regime draws from an extensive multi-modal, multi-platform data mixture, totaling approximately 9,500 hours and spanning 10 different robotic platforms. The framework employs two critical strategies:

  • Omni-modal conditioning: The policy is exposed to goal specifications drawn from images, poses, language, and their combinations. Each modality is encoded and projected into a common token space, enabling shared geometric, semantic, and visual representation learning.
  • Randomized modality fusion: For each batch in training, the model selects a subset of available modalities (including possible dropout) and constructs an attention mask to exclude unused modalities. This randomized selection and dropout enforce adaptation to incomplete, ambiguous, or noisy real-world input.
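
The listing below sketches one way this per-batch randomization could be implemented as a boolean keep mask over goal tokens. The Bernoulli probabilities, the full-dropout rate, and the function names are illustrative assumptions rather than the published recipe.

```python
# Sketch of per-sample randomized goal-modality selection with dropout.
# Probabilities, dropout rate, and function names are assumptions for
# illustration, not the published OmniVLA-edge training recipe.
import torch

def sample_modality_subset(batch_size, p_keep=(0.5, 0.5, 0.5), p_drop_all=0.1, generator=None):
    """Per-sample availability of [image goal, pose goal, language goal]."""
    probs = torch.tensor(p_keep).expand(batch_size, 3)
    avail = torch.bernoulli(probs, generator=generator).bool()
    # With probability p_drop_all, withhold every goal modality entirely,
    # forcing the policy to act from the current observation alone.
    drop_all = torch.rand(batch_size, generator=generator) < p_drop_all
    return avail & ~drop_all.unsqueeze(1)            # (B, 3) boolean

def attention_keep_mask(avail):
    """Mask over [current image, image goal, pose goal, language goal] tokens."""
    always = torch.ones(avail.shape[0], 1, dtype=torch.bool)
    return torch.cat([always, avail], dim=1)         # (B, 4); False = excluded from attention
```

The complement of this keep mask can serve as the key-padding mask in a policy such as the sketch in Section 1, so that tokens from unused modalities are ignored by the transformer.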

The training pipeline may also leverage dataset-specific objectives, such as explicitly guiding the final action to align more closely with labeled targets under language conditioning, or incorporating pre-annotated / synthetic action targets (e.g., MBRA actions). This diverse and randomized regime is central to omni-modal representation learning.

3. Robustness and Generalization

OmniVLA-edge develops strong generalization through unified, robust representation learning:

  • Unified modality representations: Jointly encoding image, pose, and language goals enables the model to ground navigation behavior in both geometric and semantic domains.
  • Randomized modality dropout: Randomly varying which goal modalities are available during training imparts robustness to missing, incomplete, or noisy goal signals at inference time.
  • Cross-embodiment, cross-modality proficiency: The training distribution incorporates disparate platforms (e.g., small ground robots and quadrupeds) and diverse goal formats. Empirical evaluation demonstrates that OmniVLA-edge adapts successfully to new goal modalities (e.g., satellite goal images) and transfers across robot types (e.g., from VizBot to Unitree Go1), maintaining high navigation success.

This approach translates into resilience to out-of-distribution (OOD) language directives, including behavioral instructions unseen during initial training (such as "move along the wall"), as well as tolerance to partial or missing sensory input.

4. Evaluation Criteria and Comparative Performance

OmniVLA-edge is evaluated using metrics standard to the navigation domain:

  • Success Rate (SR): percentage of trials in which the specified goal is reached
  • Progress (Prog.): degree of partial advancement toward the specified goal
  • Behavior Score: qualitative adherence to behavioral language prompts
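
For concreteness, the two quantitative metrics can be computed roughly as follows; the success criterion and the progress normalization used in the actual evaluation may differ from this sketch.

```python
# Rough computation of the quantitative metrics over a set of trials;
# the success criterion and progress normalization are assumptions.
def success_rate(goal_reached):
    """Fraction of trials in which the goal was reached (list of bool)."""
    return sum(goal_reached) / len(goal_reached)

def progress(initial_dist, final_dist):
    """Normalized advancement toward the goal for one trial, clipped to [0, 1]."""
    if initial_dist <= 0:
        return 1.0
    return max(0.0, min(1.0, (initial_dist - final_dist) / initial_dist))
```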

In head-to-head comparisons with specialist baselines (e.g., MBRA-pose for 2D pose, LeLaN and CounterfactualVLA for language), OmniVLA-edge achieves either superior or equivalent results across modalities. Notably, it demonstrates increased success rates in pose-conditioned navigation and higher behavior scores in language-conditioned tasks, while maintaining robust partial progress when full goal information is unavailable.

5. Deployment Scenarios and Prospective Directions

OmniVLA-edge is suitable for scenarios where on-device efficiency, adaptability, and multi-modal human-robot interaction are critical, including:

  • Real-world navigation by delivery or inspection robots across complex, unstructured environments,
  • Cross-embodiment deployment (enabling a consistent control policy across heterogeneous robotic form factors),
  • Multi-modal scalable navigation (allowing human operators to specify goals via coordinates, images, or natural language as appropriate).

Prospective research directions include using larger, higher-quality language datasets to further boost language-mediated navigation performance, extending the model to additional task modalities, refining fine-tuning paradigms for rapid adaptation with minimal new data, and integrating additional internet-scale pre-training resources while mitigating domain gaps between pre-training corpora and robotic deployment environments.

6. Resources and Supplementary Materials

The following resources are made available to promote research and deployment:

  • Curated videos demonstrating OmniVLA-edge across diverse modalities and platforms,
  • Release of pre-trained model checkpoints and source training code,
  • Documentation and updates provided at https://omnivla-nav.github.io/.

These materials are intended to facilitate replication, benchmarking, and extension of omni-modal navigation models, and to support ongoing research in generalizable foundation models for robotics.
