- The paper presents Vlaser, a novel VLA model that integrates high-level embodied reasoning with low-level control through a two-stage training process.
- It employs a vision-language backbone and an action expert module, utilizing InternVL3 and flow matching to enhance robotic task planning and spatial reasoning.
- Experiments show Vlaser significantly outperforms prior models on embodied QA, grounding, spatial reasoning, and planning benchmarks, demonstrating strong potential for real-world deployment.
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Introduction
The paper introduces Vlaser, a Vision-Language-Action (VLA) model that integrates high-level reasoning with low-level control for embodied agents. Vlaser leverages the Vlaser-6M dataset to bridge embodied reasoning with VLA policy learning. The model achieves state-of-the-art performance across a range of embodied reasoning benchmarks, and the paper systematically examines how different Vision-Language Model (VLM) initializations affect supervised VLA fine-tuning.
Figure 1: Overall framework, capabilities, and evaluation of Vlaser.
Methodology
Model Structure
Vlaser consists of two primary components: a vision-language backbone and an action expert module. The backbone, built on InternVL3 for vision and language encoding, supplies the multimodal understanding and embodied reasoning needed for task planning and spatial reasoning.
Figure 2: An illustration of Vlaser architecture.
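As a rough illustration of how these two components could compose, the sketch below wires a backbone encoder to an action expert. The class names, feature dimensions, and fusion scheme are assumptions made for exposition, not the paper's released implementation.

```python
# Minimal composition sketch (class names, dimensions, and the token-fusion scheme are
# assumptions for exposition; the released Vlaser implementation may differ).
import torch
import torch.nn as nn

class VisionLanguageBackbone(nn.Module):
    """Stand-in for an InternVL3-style encoder that fuses image and instruction tokens."""
    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)  # placeholder for the real VLM stack

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Both inputs are (batch, seq, feat_dim); concatenate along the sequence axis.
        return self.proj(torch.cat([image_tokens, text_tokens], dim=1))

class VlaserLikeModel(nn.Module):
    """Backbone produces multimodal context; the action expert decodes actions from it."""
    def __init__(self, backbone: nn.Module, action_expert: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.action_expert = action_expert

    def forward(self, image_tokens, text_tokens, noisy_actions, t):
        context = self.backbone(image_tokens, text_tokens)
        # The action expert maps the context to low-level actions (flow-matching sketch below).
        return self.action_expert(noisy_actions, t, context)
```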
The action expert module uses flow matching to generate low-level control actions, supporting efficient robot policy learning and execution.
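A minimal sketch of how such a flow-matching action head could be trained and sampled is shown below, assuming the standard conditional flow-matching objective on linear noise-to-action paths and Euler integration at inference; the expert architecture, action dimensionality, and chunk horizon are hypothetical.

```python
# Hedged sketch of a flow-matching action head (ActionExpert, action_dim, and horizon
# are assumptions; the paper's exact parameterization may differ).
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Predicts a velocity field v(a_t, t, context) over a chunk of future actions."""
    def __init__(self, action_dim: int = 7, horizon: int = 8, ctx_dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + 1 + ctx_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, noisy_actions, t, context):
        ctx = context.mean(dim=1)                          # pool multimodal tokens
        x = torch.cat([noisy_actions.flatten(1), t[:, None], ctx], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

def flow_matching_loss(expert, actions, context):
    """Conditional flow-matching objective on linear noise-to-action interpolation paths."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0), device=actions.device)
    # Interpolate between noise (t=0) and data (t=1); the target velocity is (data - noise).
    a_t = (1 - t)[:, None, None] * noise + t[:, None, None] * actions
    v_target = actions - noise
    v_pred = expert(a_t, t, context)
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def sample_actions(expert, context, steps: int = 10):
    """Integrate the learned velocity field with Euler steps to produce an action chunk."""
    a = torch.randn(context.size(0), expert.horizon, expert.action_dim, device=context.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((context.size(0),), i * dt, device=context.device)
        a = a + dt * expert(a, t, context)
    return a
```

At inference, the backbone context is computed once per observation and the velocity field is integrated over a small number of steps to produce the next action chunk.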
Data Engine
The Vlaser-6M data engine systematically compiles diverse data from public sources for embodied QA, grounding, spatial reasoning, and planning. This data supports the model's ability to reason across multiple domains and improves both open-loop and closed-loop control performance.
Figure 3: An illustration of the Vlaser-6M data engine for in-domain general QA samples.
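To make the data format concrete, a single sample from such an engine might resemble the dictionary below; the field names and values are illustrative placeholders rather than the actual Vlaser-6M schema.

```python
# Hypothetical schema for a Vlaser-6M-style QA sample (field names and values are
# illustrative only, not the dataset's actual format).
sample = {
    "task_type": "embodied_grounding",     # e.g. general QA, grounding, spatial reasoning, planning
    "image": "frames/kitchen_0042.jpg",    # in-domain observation frame
    "question": "Which object should the gripper grasp to put the spoon in the drawer?",
    "answer": "the spoon on the counter, at bounding box [312, 185, 398, 260]",
    "source": "public_embodied_corpus",    # samples are aggregated from diverse public sources
}
```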
Training Recipe
Vlaser follows a two-stage training approach: multimodal pretraining for embodied reasoning, followed by VLA-specific fine-tuning. This staged recipe adapts the model from broad pretraining data to domain-specific robot control, bridging the gap from perception to actionable control.
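One way to picture this recipe is as a staged configuration, sketched below; the stage contents, trainable modules, and objectives listed are assumptions for illustration, since the exact settings are not specified in this summary.

```python
# Illustrative two-stage recipe as a plain config dict (hypothetical; for exposition only).
training_recipe = {
    "stage_1_embodied_reasoning_pretraining": {
        "data": "Vlaser-6M (embodied QA, grounding, spatial reasoning, planning)",
        "trainable": ["vision_language_backbone"],          # assumed: backbone only
        "objective": "next-token prediction on multimodal QA",
    },
    "stage_2_vla_finetuning": {
        "data": "in-domain robot demonstration trajectories",
        "trainable": ["vision_language_backbone", "action_expert"],  # assumed: joint tuning
        "objective": "flow-matching loss on action chunks",
    },
}
```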
Experiments
Vlaser outperforms existing models on diverse embodied reasoning benchmarks by a significant margin, indicating strong generalization across embodied QA, grounding, spatial reasoning, and planning.
Figure 4: An illustration of the Vlaser-6M data engine for an in-domain embodied grounding QA sample.
Its performance in simulation environments demonstrates high efficacy on complex manipulation tasks, suggesting strong potential for real-world robotic applications.
Figure 5: An illustration of the Vlaser-6M data engine for an in-domain spatial reasoning QA sample.
In-Domain Data Analysis
The importance of in-domain data is evident: strategic data selection strongly influences downstream success on VLA tasks. In particular, in-domain multimodal QA pairs enhance embodied perception and action-planning capabilities.
Figure 6: Qualitative samples in SimplerEnv on WidowX Robot Tasks.
Vlaser builds on prior efforts to extend VLMs into VLA models, incorporating advanced reasoning tasks within the embodied AI paradigm. It combines a state-of-the-art architecture with a carefully curated dataset, achieving strong results across embodied intelligence benchmarks.
Conclusion and Future Work
Vlaser represents a significant step forward in embodied AI, setting a strong baseline across varied reasoning domains. Future work can further reduce the domain gap and improve alignment with real-world datasets. Deploying such VLA models opens avenues for more autonomous and intelligent robotic systems.
Figure 7: Qualitative samples in SimplerEnv on Google Robot Tasks.