Large VLM-Based VLA Models Overview

Updated 3 July 2026

Large VLM-based VLA models are multimodal architectures that fuse pretrained vision-language features with action control mechanisms to execute instructions and generalize across tasks.
They employ monolithic or hierarchical designs with techniques like behavioral cloning, diffusion objectives, and parameter-efficient fine-tuning to overcome adaptation challenges.
Empirical evaluations reveal robust performance on basic perceptual tasks while highlighting semantic knowledge retention issues in complex, action-oriented evaluations.

Large Vision-Language Model–Based Vision-Language-Action (VLA) Models

Large Vision-Language Model (VLM)–based Vision-Language-Action (VLA) models are a foundational paradigm in modern embodied AI and robotic manipulation. These architectures use pretrained, multi-billion-parameter VLMs (such as InternVL, Qwen-VL, and Prismatic) as multimodal perceptual backbones, which are then adapted to downstream action policy learning. VLA models leverage high-level vision–language representations for instruction following, task generalization, and world-knowledge probing, and are now evaluated systematically for their semantic retention, adaptation bottlenecks, and architectural pathways from perception to motor control (Shao et al., 18 Aug 2025, Kachaev et al., 17 Jun 2026).

1. Core Architectural Principles and Taxonomy

A large VLM-based VLA model satisfies two criteria: (1) it leverages a pretrained large VLM for raw visual and linguistic input fusion, and (2) it produces actions—either as continuous control vectors or discrete tokens—by reasoning over this joint semantic space (Shao et al., 18 Aug 2025).

Two principal architectural classes are observed:

Monolithic (Single- or Dual-System): These models concatenate all input modalities (images, natural-language instructions, optional proprioception) and process them jointly via a transformer VLM backbone. In single-system designs, the action is decoded directly; in dual-system designs, a VLM is coupled with a lightweight action expert (e.g., a flow-matching or diffusion module), sometimes via shared attention (Zhang et al., 6 Jan 2026, Xiao et al., 27 Apr 2026). Example: π₀ with shared attention between InternVL and a flow-matching action head.
Hierarchical Models: Planning and low-level execution are separated. A high-level planner (VLM or LLM) decomposes tasks (e.g., into subgoals or programs), while a lower-level VLA controls physical execution, often via learned action modules. This design is prominent in options-style and dual-process VLA systems (Hu et al., 9 Jun 2026, Han et al., 2024).

Key architectural modules universally include:

Vision encoder: ViT or CLIP-style patch embedder.
Language encoder: BPE or LLM-based.
Fusion transformer: Alternating cross- and self-attention layers produce joint embeddings.
Action head or expert: May be an MLP, flow-matching, or diffusion transformer, frequently with low-rank adaptation.
Optional: episodic memory, BEV (bird’s-eye view) spatial encoders (as in DriveStack-VLA (Wang et al., 23 Jun 2026)), or specialized routing modules.

2. Training Objectives, Adaptation Pathways, and Parameter Efficiency

Adaptation of pretrained large VLMs to VLA control is nontrivial due to catastrophic forgetting, representation drift, and a fundamental semantic–motor gap (Kachaev et al., 17 Jun 2026, Jiang et al., 7 May 2026). The dominant adaptation and training strategies are:

Behavioral Cloning (BC): The policy is trained to maximize the likelihood of expert actions given fused vision–language states:

$\mathcal{L}_{\rm BC} = -\frac1N\sum_{i=1}^N\log \pi_\theta(a_i^* \mid s_i)$

where $s_i$ denotes the fused features, and $a_i^*$ are ground-truth actions (Kachaev et al., 17 Jun 2026).
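
As a concrete illustration of the objective above, the following minimal sketch computes $\mathcal{L}_{\rm BC}$ for a policy that emits discrete action tokens. The function name `bc_loss`, the logit shapes, and the toy values are assumptions made for illustration and are not taken from any cited model.

```python
import numpy as np

def bc_loss(logits: np.ndarray, expert_actions: np.ndarray) -> float:
    """Mean negative log-likelihood of expert actions under a softmax policy.

    logits: (N, K) unnormalized scores of pi_theta(. | s_i) over K action tokens.
    expert_actions: (N,) integer indices of the expert action a_i^*.
    """
    # Numerically stable log-softmax over the action dimension.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Select log pi_theta(a_i^* | s_i) for each sample, then average.
    picked = log_probs[np.arange(len(expert_actions)), expert_actions]
    return float(-picked.mean())

# Toy batch: 3 states, 4 discrete action tokens (hypothetical values).
logits = np.array([[2.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0, 0.0]])
expert = np.array([0, 2, 1])
loss = bc_loss(logits, expert)

# A policy with all-zero logits is uniform, so its loss is log(K).
uniform_loss = bc_loss(np.zeros((3, 4)), expert)
```

For continuous control vectors, the same structure applies with the log-softmax replaced by the log-density of the chosen action distribution.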

Diffusion or Flow-Matching Objectives: Action heads are often trained under denoising or flow-matching criteria on noised action trajectories, enabling parallel or chunked action prediction and improving convergence in low-level control (Ye et al., 27 Dec 2025, Xiao et al., 27 Apr 2026).
Parameter-Efficient Fine-Tuning (PEFT): To limit forgetting of semantic knowledge, techniques such as LoRA and more advanced structured adapters (e.g., Generalized & Specialized Experts (GSE) with spectral decomposition) are employed, updating less than 3% of total parameters while anchoring critical visual–semantic subspaces (Jiang et al., 7 May 2026). For example, VLA-GSE is reported to outperform both full fine-tuning and standard LoRA on zero-shot LIBERO-Plus with only 2.51% trainable parameters.
Multi-Task and Continual Learning: Interleaving VQA and action objectives during fine-tuning improves semantic retention in the backbone (notably in models such as Magma and Xiaomi-R0) (Kachaev et al., 17 Jun 2026).
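
To make the parameter-efficiency argument concrete, the sketch below implements a plain LoRA layer and counts its trainable fraction. It illustrates standard LoRA only, not the GSE adapters of the cited work; the class name `LoRALinear`, the 1024-dimensional layer, and the rank and scaling values are assumptions chosen for illustration.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in: int, d_out: int, r: int, alpha: float, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, size=(d_out, d_in))  # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(r, d_in))      # trainable down-projection
        self.B = np.zeros((d_out, r))                        # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x: (batch, d_in). The low-rank path adds a rank-r correction to the frozen path.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_fraction(self) -> float:
        trainable = self.A.size + self.B.size
        return trainable / (trainable + self.W.size)

layer = LoRALinear(d_in=1024, d_out=1024, r=8, alpha=16.0)
x = np.ones((2, 1024))
y = layer(x)

# 2 * 8 * 1024 trainable values against 1024 * 1024 frozen ones.
frac = layer.trainable_fraction()
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, which is one reason low-rank adapters tend to perturb pretrained representations less than full fine-tuning at the start of training.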

3. Empirical Functionality, Knowledge Retention, and Probing

Large VLM-based VLA models exhibit robust performance on perceptual primitives but marked deficits on semantic and commonsense-rich action tasks if naive adaptation is used:

Primitive tasks (Color, Shape): Success rates (SR) of 80–95% are common across modern VLAs.
Semantic/world knowledge tasks: Action-grounded SR drops to 50–60%, barely above chance, on “Act2Answer” tests, even when backbones achieve 70–95% in text-only QA probes (Kachaev et al., 17 Jun 2026).

Layerwise linear probing reveals:

Answer-relevant information peaks in mid-backbone layers (probe accuracy of ≈75–80% on semantic tasks in layers 6–12 of 24) but degrades to ≈55–60% in the final action-predicting layer.
The chance-normalized retention metric typically ranges from 0.36 (π₀) to 0.87 (Magma), implicating control adaptation as the primary cause of knowledge loss.
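
The layerwise probing procedure described above can be sketched on synthetic data. Everything here is an assumption for illustration: the hidden states are simulated Gaussian features, the per-layer signal strengths are invented, and a closed-form ridge-regression probe stands in for whatever probe the cited study used.

```python
import numpy as np

def fit_linear_probe(X, y, l2=1e-2):
    """Closed-form ridge probe onto +/-1 targets; returns weights including a bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    targets = 2.0 * y - 1.0
    return np.linalg.solve(Xb.T @ Xb + l2 * np.eye(Xb.shape[1]), Xb.T @ targets)

def probe_accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(((Xb @ w > 0).astype(int) == y).mean())

def synthetic_layer(y, signal, rng, dim=16):
    """Hypothetical hidden states: Gaussian noise plus a label-dependent shift."""
    direction = np.zeros(dim)
    direction[0] = 1.0
    return rng.normal(size=(len(y), dim)) + signal * (2.0 * y[:, None] - 1.0) * direction

rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=400)
y_test = rng.integers(0, 2, size=400)

# Assumed signal strength per layer: strong mid-backbone, weak at the final layer.
layer_signal = {"mid": 2.0, "final": 0.2}
accuracy = {}
for name, signal in layer_signal.items():
    w = fit_linear_probe(synthetic_layer(y_train, signal, rng), y_train)
    accuracy[name] = probe_accuracy(w, synthetic_layer(y_test, signal, rng), y_test)
```

With real models, `synthetic_layer` would be replaced by hidden states extracted at each layer for held-out prompts, and the resulting accuracy curve over layers shows where answer-relevant information is lost.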

Empirically validated strategies:

VQA co-training: Yields 5–20 point absolute gains on semantic tasks.
Action head adaptation: Advanced routing (e.g., DAM-VLA) and knowledge-preserving adapters (VLA-GSE) reduce forgetful drift and improve multi-domain generalization and OOD robustness (Jiang et al., 7 May 2026, Peng et al., 1 Mar 2026).
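
One simple way to realize VQA co-training is to interleave batches from the two data streams at a fixed ratio. The sketch below is a hypothetical scheduler, not the procedure of any cited model; the name `interleave_batches` and the `vqa_ratio` value are assumptions.

```python
import random

def interleave_batches(action_batches, vqa_batches, vqa_ratio, seed=0):
    """Yield (task, batch) pairs, drawing a VQA batch with probability vqa_ratio.

    Stops as soon as the selected stream is exhausted, so neither objective
    is silently dropped for the remainder of training.
    """
    rng = random.Random(seed)
    action_iter, vqa_iter = iter(action_batches), iter(vqa_batches)
    while True:
        task = "vqa" if rng.random() < vqa_ratio else "action"
        source = vqa_iter if task == "vqa" else action_iter
        try:
            yield task, next(source)
        except StopIteration:
            return

# Stand-in batches: integers in place of real tensors.
schedule = list(interleave_batches(range(100), range(100), vqa_ratio=0.25))
tasks = [task for task, _ in schedule]
```

In a training loop, each yielded `task` would select the corresponding loss (action loss or VQA loss) for that step. The mixing ratio is a tunable quantity rather than a fixed recipe.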

4. Design Advancements and Representative Models

Prominent models exemplifying contemporary trajectories include:

Model	Key Innovations	Reported Performance
OpenVLA	End-to-end single-system	Baseline generalist, high OOD drop
π₀	Dual-system: VLM+flow-matching	70–85% SR on LIBERO, ≈60% SR in OOD
M²-VLA	Mixture-of-Layers (MoL) + MSM	95.3% avg. SR (LIBERO), 75% real-world
VLA-GSE	SVD-based split adapters	81.2% (LIBERO-Plus, 2.51% params), 82.5% OOD
DAM-VLA	Action routing + dual diffusion	98%–100% on long-horizon manipulation
VITA-VLA	Small-model action distillation	97.3% (LIBERO), 82% real-world
Dream-VLA	Diffusion-LM backbone	97.2% (LIBERO), 71.4% (SimplerEnv-Bridge)
Hierarchical	Planner/controller split	+41 pp (long-horizon) vs. flat, robust in sim & real

Distinct strategies include:

MoL and MSM (M²-VLA): Filters and meta-skill modules for frozen VLM backbones, preserving semantic cues for manipulation (Xiao et al., 27 Apr 2026).
Spectral Adapters (VLA-GSE): Shared-generalized and dynamic-specialized expert adapters initialized by SVD, enhancing adaptation under tight parameter budgets (Jiang et al., 7 May 2026).
Dual-Level Action Representation (iFlyBot-VLA): Joint supervision on latent actions and explicit tokenized kinematics for maximizing alignment of semantic and motor spaces (Zhang et al., 1 Nov 2025).
Diffusion Backbones (Dream-VLA): Bidirectional, parallel language–vision–action diffusion transformers, enabling faster chunked control and low error accumulation (Ye et al., 27 Dec 2025).

5. Hierarchical Control and Integration with LLMs

Options-style and dual-process decompositions enhance both reliability and sample efficiency in complex agentic tasks:

Options Framework: High-level planners (VLM/LLM) operate at slow rates, selecting subgoal instructions $z_t$ to be executed by low-level VLA controllers (Hu et al., 9 Jun 2026). Termination is handled by fixed intervals, explicit LLM-based success detectors, or privileged scene signals.
Dual-Process Models (DP-VLA): These separate heavyweight VLMs for episodic semantic planning (System 2) from lightweight high-rate control policies (System 1), yielding both online efficiency and semantic robustness (57% avg. RoboCasa success, with System 2 invoked once per episode) (Han et al., 2024).
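
The slow-planner, fast-controller pattern shared by both designs can be sketched as a control loop with fixed-interval termination. The planner and controller below are hypothetical stand-ins for a VLM/LLM planner and a VLA policy, and `replan_every` is an assumed parameter name.

```python
from typing import Callable, List

def run_hierarchical_episode(
    planner: Callable[[str, int], str],
    controller: Callable[[str, int], float],
    task: str,
    horizon: int,
    replan_every: int,
) -> List[str]:
    """Run a slow planner with a fast controller; returns the subgoals issued."""
    subgoals: List[str] = []
    subgoal = ""
    for t in range(horizon):
        # Fixed-interval termination: the active option ends every `replan_every` steps.
        if t % replan_every == 0:
            subgoal = planner(task, t)
            subgoals.append(subgoal)
        controller(subgoal, t)  # low-level action for the active subgoal z_t
    return subgoals

# Hypothetical stand-ins for a VLM planner and a VLA controller.
def toy_planner(task: str, t: int) -> str:
    return f"{task}: subgoal at step {t}"

def toy_controller(subgoal: str, t: int) -> float:
    return 0.0  # placeholder action

issued = run_hierarchical_episode(
    toy_planner, toy_controller, "stack blocks", horizon=50, replan_every=10
)
```

Setting `replan_every` equal to `horizon` invokes the planner once per episode, matching the DP-VLA configuration described above. A success detector would replace the modulo test with a learned termination check.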

Scalable frameworks such as VLA Foundry (Mercat et al., 21 Apr 2026) provide unified LLM→VLM→VLA pipelines across language, vision, and action, facilitating controlled architecture swaps and ablation studies.

6. Knowledge Retention, Adaptation Bottleneck, and Best Practices

Systematic measurements in “Does VLA Even Know the Basics?” (Kachaev et al., 17 Jun 2026) and in Lin et al. (25 May 2026) converge on the following adaptation bottlenecks and mitigations:

Findings:

Naive fine-tuning of large VLMs for control causes severe erosion of high-level knowledge, despite backbone scale.
BC/RL objectives alone fail to preserve semantic content; multi-task or continual integration of VQA tasks during training is needed to retain it.
LoRA and other low-rank parameter-efficient schemes are preferable to full fine-tuning, yielding better retention and transfer. Gains from different auxiliary tasks are not additive and can saturate or degrade with over-mixing (Lin et al., 25 May 2026).
The vision encoder is often the bottleneck; control-relevant objectives and fine-tuned adaptation to robot tasks are required.

Recommendations:

Employ multi-stage training, with staged LoRA or specialized adapters.
Add mid-layer L2 distillation losses ($\mathcal{L}_{\rm distill}=\|h^{\rm VLA}_n - h^{\rm teacher}_n\|_2^2$, where $h_n$ is the hidden state at layer $n$).
Decouple knowledge retrieval (semantic/planning) from action generation, especially in dual-stream and options-based designs.
For scalability, retain a frozen, strong VLM for perception, and adapt only minimal architecture (e.g., action heads, structured adapters).
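
The mid-layer distillation term recommended above can be sketched as follows. The batch averaging and the trade-off coefficient `weight` are assumptions that the source does not specify, and the toy hidden states are invented for illustration.

```python
import numpy as np

def distill_loss(h_student: np.ndarray, h_teacher: np.ndarray) -> float:
    """Squared L2 distance between layer-n hidden states, averaged over the batch."""
    diff = h_student - h_teacher
    return float((diff ** 2).sum(axis=-1).mean())

def total_loss(bc: float, distill: float, weight: float = 0.1) -> float:
    # `weight` is a hypothetical trade-off coefficient, not a value from the cited work.
    return bc + weight * distill

# Toy hidden states: the first row differs in one coordinate, the second matches.
h_teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
h_student = np.array([[1.0, 1.0], [0.0, 1.0]])

l_distill = distill_loss(h_student, h_teacher)
l_total = total_loss(bc=0.5, distill=l_distill)
```

Here the teacher would be the frozen pretrained VLM and the student the VLA being fine-tuned, with both hidden states taken at the same layer $n$.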

7. Future Directions, Benchmarks, and Challenges

Current large VLM-based VLA models excel on standardized manipulation datasets and are now extending to:

4D and multimodal perception (e.g., point clouds, spatiotemporal video transformers)
World-model integration and speculative planning (Shao et al., 18 Aug 2025)
Memory-augmented and continual learning for lifelong open-world adaptation
Asynchronous inference, model compression, and adaptive planning-control frequencies for edge deployment (Han et al., 2024, Tang et al., 27 Jul 2025)
Simulation-to-real transfer, leveraging mixed visual–semantic pretraining and real-robot demonstration mixtures

Benchmark development is guided by:

LIBERO (zero-shot & OOD), SimplerEnv, RoboCasa, FurnitureBench, LBM Eval, NAVSIM/Bench2Drive for driving.
Action-grounded commonsense evaluation protocols (Act2Answer (Kachaev et al., 17 Jun 2026)) to decouple knowledge loss from control error.

Addressing ongoing challenges—including semantic–motor alignment, catastrophic forgetting, and robust open-world deployment—remains at the forefront of large VLM-based VLA model research.