
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey (2508.13073v1)

Published 18 Aug 2025 in cs.RO

Abstract: Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation.


Summary

  • The paper presents a comprehensive survey on integrating pre-trained VLMs within VLA frameworks to advance robotic manipulation.
  • It explains two architectural paradigms—monolithic models for unified processing and hierarchical models that decouple planning from execution.
  • The survey highlights advanced integration techniques and diverse datasets that enhance model generalization in dynamic, real-world tasks.

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Introduction

The integration of large Vision-Language Models (VLMs) within the Vision-Language-Action (VLA) framework presents a transformative paradigm for robotic manipulation. Traditional approaches in robotics, often constrained by limited, predefined task specifications, struggle to generalize in dynamic or novel scenarios. In contrast, VLA models built on VLMs pre-trained on vast image-text datasets offer enhanced capabilities such as open-world generalization, hierarchical task planning, knowledge-augmented reasoning, and rich multimodal fusion. These capabilities allow robots to understand high-level instructions, recognize unseen environments, and perform complex tasks (Figure 1).

Figure 1: Illustration of core advantages of large VLM-based Vision-Language-Action (VLA) models for robotic manipulation.

Architectural Paradigms

VLA models can be broadly categorized into two principal architectural paradigms: monolithic models and hierarchical models.

Monolithic Models

Monolithic models integrate environmental comprehension and action generation into either single-system or dual-system architectures:

  • Single-System Models: These use a unified architecture that processes visual perception, language comprehension, and robot states, culminating in action generation. RT-2, for example, co-trains on internet-scale vision-language tasks alongside real robot trajectories, adapting its large VLM backbone for robust action generalization.
  • Dual-System Models: This architecture splits functionality between a slower, reflective VLM for high-level planning (System 2) and a fast, specialized module for low-level control (System 1), combining deep multimodal reasoning with efficient task execution (Figure 2); a minimal sketch of this split follows Figure 2.

    Figure 2: Comparison of the two principal categories of large VLM-based VLA models.
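
To make the dual-system split concrete, the following Python sketch illustrates the pattern of a slow System 2 planner invoked infrequently and a fast System 1 controller run at control frequency. All class names, method signatures, and the fixed replanning interval are hypothetical assumptions introduced here for illustration, not the interface of any model surveyed.

```python
# Minimal sketch of the dual-system pattern; every name below is a
# hypothetical placeholder, not the API of any specific surveyed model.
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np


@dataclass
class Subgoal:
    """Intermediate plan step produced by the slow System 2 planner."""
    instruction: str  # e.g. "move the gripper above the red block"


class System2Planner:
    """Large VLM queried at low frequency for high-level reasoning."""

    def plan(self, image: np.ndarray, instruction: str) -> List[Subgoal]:
        # In practice: prompt a pretrained VLM with the camera image and
        # task instruction, then parse its textual plan into subgoals.
        raise NotImplementedError


class System1Controller:
    """Lightweight policy run at control frequency for low-level actions."""

    def act(self, image: np.ndarray, proprio: np.ndarray,
            subgoal: Subgoal) -> np.ndarray:
        # In practice: a small policy head (e.g. a transformer or diffusion
        # policy) mapping the observation and active subgoal to an action.
        raise NotImplementedError


def control_loop(planner: System2Planner, controller: System1Controller,
                 get_obs: Callable[[], Tuple[np.ndarray, np.ndarray]],
                 send_action: Callable[[np.ndarray], None],
                 instruction: str, replan_every: int = 50) -> None:
    """Run the fast controller while periodically re-invoking the planner."""
    step, subgoals = 0, []
    while True:
        image, proprio = get_obs()
        if step % replan_every == 0 or not subgoals:
            subgoals = planner.plan(image, instruction)
        send_action(controller.act(image, proprio, subgoals[0]))
        step += 1
```

The key design choice is the asymmetric update rate: the expensive VLM call happens only every `replan_every` steps, while the lightweight controller keeps pace with the robot's control loop.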

Hierarchical Models

In hierarchical models, the planning phase is explicitly decoupled from the execution phase. These models include:

  • Planner-Only: These models use VLMs to generate actionable intermediate representations such as keypoints or subtask sequences.
  • Planner+Policy: Such frameworks pair a high-level planner with a low-level policy module, enhancing flexibility and execution fidelity (Figure 3); a sketch of this decoupling follows Figure 3.

    Figure 3: Overview of the hierarchical models covered in this survey.
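
A short sketch can also illustrate the planner/policy decoupling through an interpretable intermediate representation, here 2D keypoints plus ordered subtasks. The function and field names are assumptions made for this illustration and do not correspond to any specific surveyed system.

```python
# Sketch of decoupled planning and execution via an interpretable
# intermediate representation; all names are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np


@dataclass
class KeypointPlan:
    """Interpretable intermediate output of the high-level planner."""
    keypoints: List[Tuple[float, float]]  # pixel coords of goal/contact points
    subtasks: List[str]                   # ordered natural-language subtasks


def plan_with_vlm(image: np.ndarray, instruction: str) -> KeypointPlan:
    """Planner-only usage: the VLM output itself is the actionable plan."""
    # In practice: prompt a pretrained VLM to mark task-relevant keypoints
    # and decompose the instruction into ordered subtasks, then parse them.
    raise NotImplementedError


def execute_with_policy(plan: KeypointPlan,
                        low_level_policy: Callable,
                        get_obs: Callable[[], np.ndarray],
                        send_action: Callable[[np.ndarray], None]) -> None:
    """Planner+Policy usage: a low-level policy grounds each plan element."""
    for subtask, keypoint in zip(plan.subtasks, plan.keypoints):
        done = False
        while not done:
            obs = get_obs()
            # low_level_policy: (obs, subtask, keypoint) -> (action, done flag)
            action, done = low_level_policy(obs, subtask, keypoint)
            send_action(action)
```

Because the intermediate plan is human-readable (keypoints and subtask text), it can be inspected or corrected before execution, which is the main appeal of the hierarchical paradigm described above.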

Advanced Integration Techniques

Beyond core architectures, contemporary research explores techniques such as memory mechanisms, 4D perception, efficient adaptation, and multi-agent cooperation. Training strategies also increasingly incorporate additional modalities such as tactile data or auditory cues, further enriching the VLA model's contextual grounding and interaction scope; a sketch of one common fusion approach appears below.
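
As a concrete illustration of how an extra modality can be folded into such models, the PyTorch sketch below projects raw tactile readings into the backbone's token space and concatenates them with vision and language tokens. The module, dimensions, and token counts are assumptions for illustration, not a description of any surveyed system.

```python
# Hedged sketch: one common way to add an extra modality (e.g. tactile) is
# to project it into the same token space as the vision-language tokens and
# concatenate before the VLA backbone. All names here are illustrative.
import torch
import torch.nn as nn


class TactileAdapter(nn.Module):
    """Project raw tactile readings into backbone token embeddings."""

    def __init__(self, tactile_dim: int, token_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(tactile_dim, token_dim * num_tokens)

    def forward(self, tactile: torch.Tensor) -> torch.Tensor:
        # tactile: (batch, tactile_dim) -> (batch, num_tokens, token_dim)
        b = tactile.shape[0]
        return self.proj(tactile).view(b, self.num_tokens, -1)


def fuse_tokens(vision_tokens: torch.Tensor,
                language_tokens: torch.Tensor,
                tactile_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate all modality tokens along the sequence dimension."""
    return torch.cat([vision_tokens, language_tokens, tactile_tokens], dim=1)
```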

Datasets and Benchmarks

VLA model evaluation depends critically on diverse datasets and robust benchmarking:

  • Real-World Datasets: Essential for training models that must interpret and act in complex, dynamic environments.
  • Simulation Benchmarks: Provide controlled environments for assessing model performance and help bridge sim-to-real deployment gaps.
  • Human Behavior Datasets: Capture rich semantic and contextual nuances from human interactions, enhancing model grounding and adaptability (Figure 4); a sketch of a typical episode record follows Figure 4.

    Figure 4: Illustration of four dataset types that underpin large VLM-based VLA models for robotic manipulation.
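
To ground the discussion of datasets, the sketch below shows one common way such manipulation data is structured: episodes pairing a language instruction with an observation-action trajectory. The field names and shapes are illustrative assumptions, not the schema of any particular dataset or benchmark named in the survey.

```python
# Minimal sketch of an episode record for language-conditioned manipulation
# data; field names and shapes are illustrative, not a real dataset schema.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Step:
    image: np.ndarray    # RGB camera frame, e.g. (H, W, 3)
    proprio: np.ndarray  # robot state, e.g. joint angles + gripper width
    action: np.ndarray   # commanded action, e.g. 7-DoF end-effector delta


@dataclass
class Episode:
    instruction: str     # natural-language task description
    steps: List[Step]    # ordered observation-action pairs
    success: bool        # whether the demonstration completed the task


def to_training_pairs(episode: Episode):
    """Yield (instruction, observation, action) tuples for imitation learning."""
    for step in episode.steps:
        yield episode.instruction, (step.image, step.proprio), step.action
```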

Conclusion

The survey consolidates advances in VLA models built on large VLMs, offering a structured perspective on their architectures and capabilities. Notable trends point towards more generalizable models capable of adeptly handling open-world challenges while being scalable and robust enough for practical deployment. As the field evolves, future research will likely focus on seamless integration across modalities, enhanced cross-domain generalization, and lifelong learning capabilities to navigate continuously changing real-world environments.
