InternVLA-M1: Spatially Guided VLA for Robotics
- InternVLA-M1 is a unified vision–language–action framework that integrates spatial grounding to explicitly link language instructions with robotic actions.
- The framework uses a two-stage pipeline—spatial grounding pre-training followed by spatially guided action post-training—to boost performance in both simulated and real-world tasks.
- Empirical results demonstrate significant gains in spatial accuracy and multi-task execution efficiency across various robotic benchmarks.
InternVLA-M1 is a unified spatially guided Vision–Language–Action (VLA) framework designed to advance generalist robot instruction following toward scalable, general-purpose intelligence. Its central principle is spatially guided vision–language–action training, in which spatial grounding serves as an explicit, differentiable link between language instructions and robot actions. InternVLA-M1 employs a two-stage pipeline—spatial grounding pre-training followed by spatially guided action post-training—demonstrating substantial gains over baseline and previous models in simulation and real-world robotic environments (Chen et al., 15 Oct 2025).
1. Framework Overview
InternVLA-M1 comprises two tightly coupled modules:
- VLM Planner (“System 2”): This module performs spatial grounding by interpreting free-form language instructions in context with visual observations, extracting spatial cues (such as locations, objects, or regions relevant to task execution). The planner is realized as a sequence model pre-trained on large-scale multimodal datasets for spatial QA tasks, and produces latent planning tokens via cross-attention layers.
- Action Expert (“System 1”): This embodiment-aware module is a diffusion-based policy utilizing a DiT diffusion backbone and a DINOv2 visual encoder. It receives the planner’s latent planning tokens (conditioned by spatial prompts) and generates low-level continuous actions tailored to the physical embodiment.
These systems interact in a pipeline where an image and instruction are converted to latent tokens via the planner, further processed with a querying transformer, and then consumed by the action expert to produce motor commands.
```
[Image + Instruction]
        │
[Spatial Grounding Pre-training (VLM Planner)]
        │   (latent planning tokens / spatial prompts)
[Querying Transformer]
        │
[Action Expert (DiT Diffusion Policy)]
        │
[Embodiment-Aware Motor Commands]
```
This approach unifies spatial reasoning and embodied control, bridging abstract language understanding with direct physical execution.
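The interaction between the two systems can be summarized as a single forward pass. Below is a minimal PyTorch-style sketch; the module and argument names (`planner`, `query_former`, `action_expert`, `proprio`) are illustrative placeholders under stated assumptions, not the released implementation.

```python
# Minimal sketch of the System-2 / System-1 forward pass (illustrative, not the released API).
import torch.nn as nn

class InternVLAM1Sketch(nn.Module):
    def __init__(self, planner: nn.Module, query_former: nn.Module, action_expert: nn.Module):
        super().__init__()
        self.planner = planner              # System 2: VLM planner (spatial grounding)
        self.query_former = query_former    # querying transformer over planner states
        self.action_expert = action_expert  # System 1: DiT diffusion policy with DINOv2 features

    def forward(self, image, instruction_ids, proprio, noisy_actions, timestep):
        # 1) Planner turns (image, instruction) into latent planning tokens / spatial prompts.
        planning_tokens = self.planner(image, instruction_ids)      # (B, T_plan, D)
        # 2) Querying transformer distills a fixed set of conditioning tokens.
        cond_tokens = self.query_former(planning_tokens)            # (B, N_query, D)
        # 3) Diffusion action expert denoises an action chunk under that conditioning.
        return self.action_expert(noisy_actions, timestep, cond_tokens, proprio)  # (B, H, act_dim)
```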
2. Spatial Grounding Pre-training
The first stage trains the VLM planner to “know where to act” by aligning instructions with spatial elements of a scene:
- Data Scale & Diversity: Pre-training leverages over 2.3 million spatial reasoning samples. Data types include Box QA (bounding boxes), Trajectory QA (spatial paths), Point QA (specific location points), and chain-of-thought spatial reasoning, sourced from datasets such as RefCOCO, RoboRefIt, MolmoAct, and Pixmo-Points.
- Unified QA-Style Objective: The model is trained via next-token prediction to generate JSON/XML-formatted spatial coordinates or affordance regions from the language–vision input.
- Spatial Representation: Outputs are latent planning tokens, optimized with next-token objectives over the serialized spatial targets, e.g. $\mathcal{L}_{\text{ground}} = -\sum_{t}\log p_\theta\left(y_t \mid y_{<t},\, I,\, \ell\right)$, where $y$ are the serialized coordinate/affordance tokens, $I$ the image, and $\ell$ the instruction, ensuring proximity to ground-truth spatial localization.
This stage ensures that the planner can generate robust spatial cues across diverse formats and tasks, essential for generalizable robot behavior.
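As a concrete (assumed) instance of this QA-style objective, the sketch below serializes a Box QA answer as JSON and trains it with a masked next-token cross-entropy; the exact prompt and answer schema of the released data may differ.

```python
# Sketch of the unified QA-style grounding objective (assumed JSON schema).
import json
import torch.nn.functional as F

def build_box_qa_pair(instruction: str, box_xyxy):
    # Serialize the spatial answer so it can be learned by plain next-token prediction.
    prompt = f"Where should the robot act for: '{instruction}'? Answer with a bounding box."
    answer = json.dumps({"box": [int(v) for v in box_xyxy]})
    return prompt, answer

def grounding_loss(logits, token_ids, prompt_len):
    # Autoregressive cross-entropy, masked so that only the answer tokens are supervised.
    shift_logits = logits[:, :-1]                  # predict token t from tokens < t
    labels = token_ids[:, 1:].clone()
    labels[:, : prompt_len - 1] = -100             # ignore the prompt portion
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```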
3. Spatially Guided Action Post-training
Spatial grounding is then specialized and integrated for control tasks in the action post-training phase:
- Spatial Prompting: Task instructions are augmented with explicit spatial cues. These cues activate and direct the planner’s spatial reasoning to enhance downstream motor execution, e.g., augmenting “store all toys into the toy box” with “identify all relevant toys and their spatial relationships to the container.”
- Joint Co-training and Gradient Flow: Both the planner and action expert are jointly trained on demonstration data; for trajectory data (from closed-loop demonstrations), a regression loss is applied between predicted and ground-truth end-effector trajectories. During updates from spatial grounding data, only the planner is updated to avoid distorting spatial priors. A gradient decay factor regulates the backward flow from the action expert to the planner, protecting the original spatial knowledge (a minimal sketch of this mechanism follows below).
- Plug-and-Play Spatiality: The action expert is conditioned on “spatially rich” tokens, allowing rapid adaptation to new embodiments, tasks, or object arrangements.
This results in a policy model that is adaptively grounded in spatial awareness, producing continuous control signals robustly informed by multimodal, instruction-driven understanding.
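One minimal way to realize the gradient decay described above is an identity operation that attenuates gradients on the backward pass; the `GradScale` helper and the decay value below are assumptions meant to illustrate the mechanism, not the released code.

```python
# Sketch of gradient-decayed conditioning from the action expert back into the planner.
import torch

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, scale: float):
        ctx.scale = scale
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None

def condition_action_expert(planning_tokens, grad_decay: float = 0.1):
    # Action-expert gradients reach the planner attenuated by `grad_decay`,
    # protecting the spatial priors acquired during grounding pre-training.
    return GradScale.apply(planning_tokens, grad_decay)
```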
4. Empirical Performance and Metrics
InternVLA-M1’s effectiveness is demonstrated by consistent improvements across simulation and real-robot benchmarks:
| Benchmark/Scenario | Improvement over Baseline |
|---|---|
| SimplerEnv (Google Robot) | +14.6% |
| WidowX (robot) | +17.0% |
| LIBERO (Franka robot) | +4.3% (success rate) |
| Real-world Clustered Pick/Place | +20.6% (novel objects/config.) |
| Long-horizon Reasoning Tasks | >10% over existing works |
Additional improvements are observed in spatial metrics such as box IoU, point accuracy, and trajectory mean absolute error.
- In simulation, InternVLA-M1 is trained with 244,000 closed-loop pick-and-place episodes covering 200 tasks and over 3,000 objects, resulting in a 6.2% average gain.
- In the real world, with synthetic co-training, performance on unseen object types and arrangements increased by 20.6%.
These gains are consistent across intra- and cross-robot evaluations, novel instructions, clutter, and dynamic scene changes.
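For reference, the spatial metrics cited above (box IoU, point accuracy, trajectory mean absolute error) can be computed as in the following sketch; treating a point as correct when it falls inside the ground-truth box is an assumption about the evaluation protocol.

```python
# Reference implementations of the reported spatial metrics (evaluation conventions assumed).
import numpy as np

def box_iou(a, b):
    # a, b: [x1, y1, x2, y2]
    def area(r):
        return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def point_accuracy(pred_points, gt_boxes):
    # A predicted point counts as correct if it lands inside its ground-truth box.
    hits = [x1 <= px <= x2 and y1 <= py <= y2
            for (px, py), (x1, y1, x2, y2) in zip(pred_points, gt_boxes)]
    return float(np.mean(hits))

def trajectory_mae(pred_traj, gt_traj):
    # Mean absolute error between predicted and ground-truth waypoints.
    return float(np.mean(np.abs(np.asarray(pred_traj) - np.asarray(gt_traj))))
```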
5. Simulation Engine and Data Collection
InternVLA-M1’s scalability is facilitated by an automated simulation and data pipeline:
- Simulation Infrastructure: Built on GenManip and Isaac Sim, the engine separates physics simulation from rendering, enabling diverse sampling of object layouts, lighting, and camera perspectives.
- Data Generation: 244,000 pick-and-place episodes are generated by randomizing environment factors and verifying task completion with a scene-graph solver.
- Privileged Signals: Trajectories are annotated with detailed privileged information (object poses, candidate grasps), supporting high-fidelity imitation learning.
This systematic approach enables rapid expansion to new domains and tasks while preserving annotation quality and diversity.
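The generation loop can be pictured as below; the simulator interface (`sim.randomize`, `sim.rollout_scripted_policy`, `scene_graph().satisfies`) is hypothetical shorthand for the GenManip/Isaac Sim pipeline rather than its actual API.

```python
# Illustrative episode-generation loop with scene-graph verification (hypothetical simulator API).
def generate_episode(sim, task_spec, seed):
    sim.seed(seed)
    sim.randomize(layout=True, lighting=True, camera=True)   # domain randomization
    traj = sim.rollout_scripted_policy(task_spec)            # privileged expert rollout
    # Keep the episode only if the resulting scene graph satisfies the task goal.
    if not sim.scene_graph().satisfies(task_spec.goal):
        return None
    return {
        "observations": traj.observations,
        "actions": traj.actions,
        "privileged": {"object_poses": traj.object_poses, "grasps": traj.candidate_grasps},
    }

# episodes = [ep for ep in (generate_episode(sim, spec, s) for s in range(244_000)) if ep]
```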
6. Real-world and Long-horizon Applications
InternVLA-M1 demonstrates strong performance in real-world deployments and in complex, long-horizon scenarios:
- Clustered Pick-and-Place: Using a Franka robot with wrist-mounted and third-person cameras, the model achieves a 7.3% success-rate boost in clustered settings and a +20.6% gain on unseen objects and configurations.
- Multiple Task Types: Robust performance is seen in drawer opening/closing, object sorting, and sandwich assembly, handling cluttered and dynamic scenes.
- Long-horizon Reasoning: The explicit spatial priors and latent planning tokens allow the model to perform sequential decomposition of multi-step tasks (such as sorting multiple objects or physical math problem solving) into atomic actions. The model accommodates disturbances by recomputing spatial cues and adapting plans online, resulting in over 10% improvement on long-horizon tests compared to prior approaches.
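The long-horizon behavior can be summarized as a loop that re-grounds spatial cues at every control step; the `planner`, `action_expert`, and `robot` interfaces below are hypothetical and only illustrate the control flow.

```python
# Sketch of long-horizon execution with online re-grounding (hypothetical interfaces).
def run_long_horizon(task, planner, action_expert, robot, max_steps=500):
    subtasks = planner.decompose(task)          # split the instruction into atomic steps
    for sub in subtasks:
        for _ in range(max_steps):
            obs = robot.observe()
            # Re-compute spatial cues each step so disturbances (e.g. moved objects) are absorbed.
            spatial_tokens = planner.ground(obs.image, sub)
            action_chunk = action_expert.sample(obs, spatial_tokens)
            robot.execute(action_chunk)
            if planner.check_done(robot.observe(), sub):
                break
        else:
            raise RuntimeError(f"Subtask did not complete: {sub}")
```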
7. Open Resources and Community Impact
InternVLA-M1’s code, models, and datasets are made openly available:
- Code: https://github.com/InternRobotics/InternVLA-M1
- Model Checkpoints: https://huggingface.co/collections/InternRobotics/internvla-m1-68c96eaebcb5867786ee6cf3
- Datasets: https://huggingface.co/datasets/InternRobotics/InternData-M1
The release of these resources enables replication, extension, and deployment of the spatially guided VLA paradigm for diverse research and industrial applications.
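As a starting point, the code can be cloned with git and the dataset fetched with the standard `huggingface_hub` client; the snippet below assumes only those stock tools and an arbitrary local cache location.

```python
# Fetch the released dataset with the standard Hugging Face client.
# Code:   git clone https://github.com/InternRobotics/InternVLA-M1
from huggingface_hub import snapshot_download

data_dir = snapshot_download(repo_id="InternRobotics/InternData-M1", repo_type="dataset")
print("InternData-M1 downloaded to:", data_dir)
```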
InternVLA-M1 establishes spatially guided training—specifically, explicit spatial grounding and post-trained spatial action conditioning—as a unifying approach to scalable, general-purpose robot policy learning. The framework’s empirical success demonstrates its efficacy for robust, instruction-conditioned control in both simulated and real-world settings, with significant improvements for spatial reasoning, cross-task generalization, and reasoning under long-horizon or perturbed conditions (Chen et al., 15 Oct 2025).