
Vlaser-6M: Dataset for Embodied Reasoning

Updated 15 October 2025
  • Vlaser-6M is a large-scale, multi-modal dataset integrating visual, linguistic, spatial, and action data for embodied reasoning and robot control.
  • It comprises six million annotated samples spanning four domains: embodied grounding, general and spatial reasoning, planning, and in-domain VLA data, enhancing both high-level reasoning and low-level policy learning.
  • Its innovative dual-source design bridges internet-scale reasoning data with simulation-driven robotic interactions, significantly improving closed-loop control performance.

The Vlaser-6M dataset is a large-scale, multi-modal corpus explicitly constructed as a foundational resource for embodied reasoning and robot control. Developed in the context of advancing Vision-Language-Action (VLA) models, it integrates high-level reasoning tasks with low-level robotic policy learning, thereby targeting the longstanding gap between internet-scale pretraining and domain-specific robot embodiment. Vlaser-6M encompasses extensive annotations across visual, linguistic, spatial, and action modalities, and underpins state-of-the-art results for the Vlaser model on established benchmarks of embodied reasoning.

1. Dataset Composition and Structure

Vlaser-6M is composed of approximately six million annotated samples organized across four principal domains:

  • Embodied Grounding Data: 1.5 million question–answer pairs with detailed 2D spatial annotations, provided both as bounding boxes and as center points normalized to [0, 1000]. A further 300,000 samples use segmentation masks from SA-1B, converted to these spatial formats for robust anchoring.
  • General and Spatial Reasoning Data: This partition includes 1.2 million robotic visual QA items (e.g., RoboVQA), 500,000 samples on spatial intelligence, and 100,000 spatial reasoning cases curated from 3D scene datasets (ScanNet, ScanNet++, ARKitScenes).
  • Planning Data: 400,000 multimodal planning samples merging natural language instructions with sensorimotor sequences; sources include Alpaca-15k-Instruction, MuEP, and simulator-generated planning trajectories with LLaRP annotations.
  • In-Domain VLA Data: Two million multimodal QA pairs captured in the SimplerEnv simulation framework for the Google Robot and WidowX platforms, pairing instructions and visual observations with sequences of low-level actions in each robot's operational space.

This composition is summarized below:

Component                   Volume   Modalities
Embodied Grounding          1.8M     QA, bounding boxes, points, segmentation masks
General/Spatial Reasoning   1.8M     QA, scene graphs, 3D scenes
Planning                    0.4M     Instructions, multimodal planning sequences
In-Domain VLA               2.0M     QA, visual observations, low-level actions
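For concreteness, records in each partition might look roughly like the sketch below. The field names and values are illustrative assumptions for exposition, not the released schema of Vlaser-6M.

```python
# Illustrative (hypothetical) sample records for the four Vlaser-6M partitions.
# Field names are assumptions for explanation only; the released format may differ.

grounding_sample = {
    "image": "kitchen_0421.jpg",
    "question": "Where is the red mug?",
    "answer_bbox": [312, 540, 455, 688],   # [x1, y1, x2, y2], normalized to [0, 1000]
    "answer_point": [383, 614],            # center point, normalized to [0, 1000]
}

spatial_reasoning_sample = {
    "scene": "scannet_scene0042_00",
    "question": "How many chairs are between the table and the window?",
    "answer": "2",
}

planning_sample = {
    "instruction": "Put the apple in the fridge.",
    "observations": ["frame_000.jpg", "frame_012.jpg", "frame_025.jpg"],
    "plan": ["navigate_to(apple)", "pick(apple)", "navigate_to(fridge)", "place(apple)"],
}

vla_sample = {
    "platform": "WidowX",
    "instruction": "Move the spoon next to the towel.",
    "observation": "simplerenv_frame_078.jpg",
    "actions": [[0.01, -0.02, 0.00, 0.0, 0.0, 0.1, 1.0]],  # end-effector deltas + gripper
}
```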

2. Data Acquisition and Annotation Pipeline

The dataset is constructed via the “Vlaser data engine”, which curates, reorganizes, and annotates both public internet datasets and simulation-generated samples. The pipeline includes:

  • Embodied Grounding: Aggregation from sources such as RoboPoint, ShareRobot, Pixmo-Points, Paco-Lavis, and RefSpatial, complemented by segmentation-mask conversion from SA-1B through a two-stage spatial annotation process: minimal axis-aligned rectangles yield bounding boxes, while high-IoU-threshold sampling generates point coordinates (a conversion sketch follows this list). The captioning pipeline leverages BLIP-2 for initial descriptions, refined by Qwen2.5-VL-7B to ensure high-quality contextual grounding.
  • Spatial Reasoning: Manual annotation over 3D scenes exploits object counts, bounding boxes, and scene graph layouts to engineer diverse queries, emphasizing spatial complexity.
  • Planning Data: Generation leverages the Habitat simulator, annotated with LLaRP task specifications. An LLM agent (GPT-4o) rolls out action sequences, and only trajectories that successfully complete the task are retained.
  • In-Domain VLA Data: SimplerEnv platform records synchronized visual, linguistic, and action modalities during agent-platform interactions, which are then assembled into multimodal tuples for policy learning.
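As a rough illustration of the mask-to-box/point conversion described above, the sketch below turns a binary segmentation mask into a minimal axis-aligned bounding box and a centroid point, both normalized to [0, 1000]. It is a simplified assumption of the pipeline; the Vlaser data engine's IoU-based point sampling and additional filtering are not reproduced here.

```python
import numpy as np

def mask_to_spatial_annotations(mask: np.ndarray):
    """Convert a binary segmentation mask (H, W) into a bounding box and a point,
    both normalized to the [0, 1000] range used by the grounding annotations."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("empty mask")

    # Minimal axis-aligned rectangle enclosing the mask.
    x1, x2 = xs.min(), xs.max()
    y1, y2 = ys.min(), ys.max()

    # Normalize pixel coordinates to [0, 1000].
    bbox = [
        int(round(1000 * x1 / w)), int(round(1000 * y1 / h)),
        int(round(1000 * x2 / w)), int(round(1000 * y2 / h)),
    ]

    # Use the mask centroid as a representative point annotation.
    point = [int(round(1000 * xs.mean() / w)), int(round(1000 * ys.mean() / h))]
    return bbox, point


# Example with a small synthetic mask.
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 300:400] = True
print(mask_to_spatial_annotations(mask))
```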

3. Application in Embodied Reasoning and Control Benchmarks

Vlaser-6M forms the primary training source for the Vlaser model, facilitating advances across the full spectrum of embodied reasoning benchmarks:

  • Embodied QA: The dataset supports high-fidelity question-answering about the agent’s visual context.
  • Visual Grounding: Benchmarks include Where2Place, PointArena, Paco-Lavis, and Pixmo-Points with challenging spatial referencing.
  • Spatial Intelligence: Tasks are assessed on VSI-Bench, RefSpatial, and MMSI-Bench, emphasizing spatial representation and manipulation.
  • Planning: Evaluations conducted on both open-loop and closed-loop protocols in ALFRED, Habitat, and SimplerEnv.
  • Downstream Robot Control: Demonstrated improvements in the rate of convergence and closed-loop control success on both the WidowX and Google Robot platforms, attributable to the rich, in-domain simulation data.

4. Bridging Embodied Reasoning and Policy Learning

The dataset’s dual-source design, combining internet-sourced reasoning data with in-domain, simulation-derived robot data, enables the Vision-Language backbone (initially InternVL3) to acquire both general reasoning and specialized policy capabilities. Embodied reasoning is strengthened through diverse QA and spatial reasoning items, while the domain gap for downstream policy learning is mitigated by exposing the model to robot-specific interaction data.

A fundamental insight from empirical analysis is that while out-of-domain samples foster upstream reasoning, the congruence of in-domain simulation data with actual robotic trajectories is most effective for closing the gap between pretraining and embodied action execution. This is particularly evident in closed-loop robot control metrics, which exhibit substantial improvement on VLA tasks following in-domain pretraining.

5. Training Objectives and Mathematical Formulation

Vision-language pretraining employs the auto-regressive language modeling objective: \mathcal{L}_{lm} = -\log p\left(t_n \mid \mathcal{A}_F^v(x; \theta_v), \mathcal{A}_F^t(y), t_{0:n-1}; \Theta\right), where x is an input image, y is the textual prompt, t_i denotes word tokens, and \mathcal{A}_F^v / \mathcal{A}_F^t denote the vision transformer with MLP projector and the text tokenizer, respectively.
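In code, this objective reduces to standard next-token cross-entropy over tokens conditioned on the visual and textual context. The minimal PyTorch sketch below assumes the backbone has already produced per-position logits; the backbone itself (vision encoder, projector, tokenizer) is not shown, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(logits: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss corresponding to the objective above.

    logits: (batch, seq_len, vocab) predictions conditioned on visual features
            A_F^v(x) and text tokens A_F^t(y), produced by the VLM backbone (not shown).
    target_tokens: (batch, seq_len) ground-truth token ids, already shifted so that
            position n is predicted from tokens 0..n-1.
    """
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1))

# Example with random tensors standing in for model outputs.
logits = torch.randn(2, 16, 32000)
targets = torch.randint(0, 32000, (2, 16))
print(language_modeling_loss(logits, targets))
```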

For vision-language-action finetuning, the flow-matching-based action expert is optimized by minimizing \mathcal{L}_{vla} = \left\| v_\theta(A_t^{\tau}, o_t) - u(A_t^{\tau} \mid A_t) \right\|^2, where A_t is a horizon-wide action chunk, A_t^{\tau} = \tau A_t + (1-\tau)\epsilon its noised version, v_\theta(\cdot) the learned denoising vector field, u(A_t^{\tau} \mid A_t) = \epsilon - A_t the target vector field, and o_t the observation.

Inference iterates A_t^{\tau+\delta} = A_t^{\tau} + \delta \cdot v_\theta(A_t^{\tau}, o_t), with \delta a step-size parameter.
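The sketch below transcribes the loss and update rule above into PyTorch. The action_expert callable, its (A_tau, tau, o_t) signature, and the tensor shapes are assumptions for illustration; the actual Vlaser action-expert interface may differ.

```python
import torch

def flow_matching_loss(action_expert, A_t: torch.Tensor, o_t: torch.Tensor) -> torch.Tensor:
    """Flow-matching loss L_vla for the action expert, transcribing the formulas above.

    A_t: (batch, horizon, action_dim) ground-truth action chunk.
    o_t: conditioning observation features.
    action_expert(A_tau, tau, o_t) is assumed to return the predicted vector field v_theta.
    """
    eps = torch.randn_like(A_t)                              # Gaussian noise epsilon
    tau = torch.rand(A_t.size(0), 1, 1, device=A_t.device)   # flow time tau in [0, 1]
    A_tau = tau * A_t + (1.0 - tau) * eps                    # noised action chunk A_t^tau
    u = eps - A_t                                            # target vector field u(A_t^tau | A_t)
    v = action_expert(A_tau, tau, o_t)
    return ((v - u) ** 2).mean()

@torch.no_grad()
def euler_integrate(action_expert, o_t: torch.Tensor, shape, steps: int = 10) -> torch.Tensor:
    """Iterative inference A^{tau+delta} = A^tau + delta * v_theta(A^tau, o_t), as stated above."""
    delta = 1.0 / steps
    A = torch.randn(shape)                  # initial action chunk
    tau = torch.zeros(shape[0], 1, 1)
    for _ in range(steps):
        A = A + delta * action_expert(A, tau, o_t)
        tau = tau + delta
    return A

# Example with a dummy action expert standing in for the real model.
dummy_expert = lambda A, tau, o: torch.zeros_like(A)
A_t = torch.randn(2, 8, 7)                  # action chunks: horizon 8, 7-DoF actions
o_t = torch.randn(2, 256)                   # placeholder observation features
print(flow_matching_loss(dummy_expert, A_t, o_t))
print(euler_integrate(dummy_expert, o_t, A_t.shape).shape)
```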

6. Innovations and Strategic Insights

The Vlaser-6M dataset demonstrates methodological innovations by merging curated internet-scale reasoning sources with realistic, simulation-driven robot interactions. A plausible implication is that this strategy successfully mitigates domain shift commonly observed in embodied systems trained on general data but evaluated in specialized robotic settings. Further, systematic ablation verifies that the balance of data streams (for example, spatial QA versus in-domain VLA) distinctly influences policy learning efficiency.

Notably, the design of Vlaser-6M substantiates that enhancement in high-level reasoning capabilities does not inherently translate to improved low-level robot control unless the domain gap is explicitly addressed. This suggests ongoing value in developing compositional data pipelines that reflect both broad conceptual tasks and domain-specific action distributions. The authors highlight that careful orchestration of QA, grounding, reasoning, and control samples is required to maximize VLA model performance across all operational axes.

7. Context and Significance in Embodied AI Research

Vlaser-6M’s synthesis of multimodal annotation schemes, multi-stage curation, and its strategic balance between internet-scale and robot-domain data reflect current priorities in embodied AI—namely, achieving synergistic reasoning and control. As the underlying dataset for Vlaser, it supports advancements in spatial reasoning, multi-step planning, and practical robot deployment. Its empirical impact on downstream tasks (e.g., WidowX, Google Robot) demonstrates its utility in bridging academic research and practical robot policy development, and its architecture can inform future large-scale datasets for embodied VLA systems.

In summary, Vlaser-6M represents a comprehensive, systematically engineered resource for Vision-Language-Action modeling, providing critical infrastructure for robust embodied reasoning and precise closed-loop robotic control.
