Spatially Guided VLA Training
- The paper introduces a dual-stage training approach that combines spatial pre-training with spatially guided post-training to coherently link language instructions to embodied actions.
- It leverages large-scale, diverse datasets and scalable simulation pipelines to embed explicit spatial cues, resulting in significant performance improvements on various robotic benchmarks.
- This framework enhances long-horizon planning and generalization in robotic systems, enabling adaptable manipulation in dynamic, real-world environments.
Spatially guided vision-language-action (VLA) training refers to methodologies and frameworks that explicitly incorporate spatial grounding and reasoning into the pipeline linking perception (vision), language instruction (language), and sequential robot or agent action (action). This approach represents a unifying principle in the development of instruction-following robots and embodied intelligence, aiming to achieve scalable, general-purpose robotic manipulation by tightly coupling spatial understanding with both high-level language interpretation and low-level embodiment-aware control (Chen et al., 15 Oct 2025).
1. Conceptual Foundations of Spatially Guided VLA Training
Spatially guided VLA training is premised on the hypothesis that robust grounding of language to actions in the physical world fundamentally requires explicit spatial representations. The paradigm is distinguished from conventional VLA models by the adoption of spatial grounding at both pre-training and action-planning stages:
- Spatial pre-training focuses on aligning language and visual observations to explicit spatial constructs—such as points, bounding boxes, and trajectories—across diverse embodiments and scenarios.
- Spatially guided post-training tunes the model to generate actions that are physically and semantically consistent with these spatial cues, taking embodiment-specific constraints into account.
This two-stage approach ensures that language instructions, visual features, and motor outputs are coherently linked via shared spatial reasoning, addressing key challenges in generalization, transfer, and robustness for real-world robotic control (Chen et al., 15 Oct 2025).
2. Spatial Grounding Pre-Training
The first stage of spatially guided VLA training acquires strong spatial priors through large-scale pre-training on diverse spatial reasoning datasets. In InternVLA-M1, this involves over 2.3 million samples drawn from sources such as RefCOCO, RoboRefIt, A0, MolmoAct, and PixMo-Points, all reformatted into a unified QA-style format. The pre-training objective is to teach the vision-language model (VLM) to map modality-agnostic language instructions to spatial outputs (an illustrative sample format is sketched after the list)—specifically:
- Bounding-box detection
- Point localization
- Visual trajectory prediction
- Affordance recognition
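For concreteness, a hypothetical QA-style sample for each of the first three output types might look as follows; the field names and the absolute pixel-coordinate convention are illustrative assumptions rather than the released InternVLA-M1 schema.

```python
# Hypothetical QA-style samples for spatial grounding pre-training.
# Field names and the absolute-coordinate convention are assumptions for illustration.
point_sample = {
    "image": "scenes/kitchen_0042.jpg",
    "question": "Point to the red mug closest to the sink.",
    "answer": {"point": [412, 287]},           # absolute (x, y) pixel coordinates
}

box_sample = {
    "image": "scenes/desk_0017.jpg",
    "question": "Detect the blue marker on the desk.",
    "answer": {"bbox": [103, 221, 168, 260]},  # absolute [x_min, y_min, x_max, y_max]
}

trajectory_sample = {
    "image": "scenes/table_0003.jpg",
    "question": "Trace a path from the gripper to the bowl.",
    "answer": {"trajectory": [[520, 400], [470, 360], [415, 330], [380, 315]]},
}
```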
The training objective can be formally described as minimizing the distance between language and spatial feature embeddings:

$$\mathcal{L}_{\text{spatial}} = \left\lVert f_{\text{lang}}(\ell) - f_{\text{spatial}}(s) \right\rVert^{2},$$

where $f_{\text{lang}}(\ell)$ encodes the instruction, $f_{\text{spatial}}(s)$ encodes the spatial feature (point, box, or trajectory), and spatial positions are standardized as absolute coordinates.
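Read literally, this objective admits a very simple implementation; the sketch below is a minimal PyTorch rendering under that reading, with the encoder modules and the plain squared-L2 distance taken as assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def spatial_alignment_loss(lang_encoder, spatial_encoder, instruction_tokens, spatial_targets):
    """Squared-L2 distance between the instruction embedding f_lang(l) and the
    spatial-feature embedding f_spatial(s), per the objective above.
    Both encoders are placeholders for whatever modules the VLM exposes."""
    z_lang = lang_encoder(instruction_tokens)     # (B, D) instruction embeddings
    z_spatial = spatial_encoder(spatial_targets)  # (B, D) embeddings of points/boxes/trajectories
    return F.mse_loss(z_lang, z_spatial)
```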
Such pre-training is “embodiment-agnostic”—it decouples spatial reasoning from embodiment-specific kinematics or dynamics, instead providing transferable spatial priors essential for the downstream robotic action policy (Chen et al., 15 Oct 2025).
3. Spatially Guided Action Post-Training and Plug-and-Play Spatial Prompting
Post-training integrates the abstracted spatial priors from pre-training with embodiment-specific action generation. The key innovation is spatial prompting, wherein explicit spatial cues are provided as part of the action-generation prompt. For instance, task instructions are augmented to include sub-prompts such as, "identify all relevant toys and their spatial relationships to the container." This enables the planner to generate spatially explicit, latent planning tokens that are then consumed by the action expert module.
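As an illustration, a spatially augmented planning prompt could be composed along the following lines; the template wording and the hand-off to the action expert are assumptions for illustration, not the framework's exact interface.

```python
# Hypothetical composition of a spatially guided planning prompt.
task_instruction = "Put the toy dinosaur into the green container."
spatial_subprompt = (
    "First identify all relevant toys and their spatial relationships to the container, "
    "reporting bounding boxes and a coarse end-effector trajectory."
)
planner_prompt = f"{task_instruction}\n{spatial_subprompt}"
# The planner's spatially explicit response is encoded as latent planning tokens
# that condition the downstream action expert.
```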
Optimization is performed with a dual supervision strategy:
- L2 loss for matching predicted and ground-truth noise perturbations in robot trajectories.
- Next-token prediction loss for the vision-language module on spatial grounding tasks.
A gradient decay factor $\lambda$ (e.g., $\lambda = 0.5$) modulates the influence of the action expert on the perception module by attenuating the gradients that flow from the action loss back into the VLM:

$$\nabla_{\theta_{\text{VLM}}} \mathcal{L} = \nabla_{\theta_{\text{VLM}}} \mathcal{L}_{\text{NTP}} + \lambda\, \nabla_{\theta_{\text{VLM}}} \mathcal{L}_{\text{action}}.$$
This architectural decoupling maintains the planner's high-level semantic reasoning while enabling effective joint optimization of spatial planning and motor control (Chen et al., 15 Oct 2025).
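A minimal sketch of how such gradient decay can be realized in PyTorch is given below; the module interfaces, the noise-prediction action loss, and the 0.5 scale follow the description above but are otherwise assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

class GradScale(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by `scale` in the backward
    pass, so the action expert only weakly influences the perception module."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None


def training_step(vlm, action_expert, batch, grad_decay=0.5):
    # Vision-language module emits latent planning tokens plus a
    # next-token-prediction loss on spatial grounding QA.
    plan_tokens, ntp_loss = vlm(batch["images"], batch["prompts"], batch["grounding_targets"])

    # Attenuate gradients flowing from the action expert back into the VLM.
    plan_tokens = GradScale.apply(plan_tokens, grad_decay)

    # Action expert predicts the noise added to perturbed robot trajectories (L2 supervision).
    pred_noise = action_expert(plan_tokens, batch["noisy_actions"], batch["timesteps"])
    action_loss = F.mse_loss(pred_noise, batch["noise"])

    return ntp_loss + action_loss
```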
4. Performance Metrics and Empirical Assessment
Spatially guided VLA training leads to significant empirical gains over “vanilla” VLA approaches:
- On the SimplerEnv Google Robot Visual Matching benchmark: up to +14.6% absolute improvement.
- On the WidowX robot benchmark: up to +17.0% absolute improvement.
- On LIBERO Franka tasks: +4.3% gain.
Spatial reasoning ability is further validated using box Intersection-over-Union (IoU) and point accuracy metrics, as well as Projection-space Similarity (PSS), which increased from ~0.25 (vanilla) to 0.42 (spatially guided training), signaling improved gradient alignment and convergence.
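For reference, box IoU and point accuracy are standard grounding metrics; the short sketch below shows one typical way to compute them (the point-in-box criterion is a common convention and may differ from the paper's exact protocol).

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union for boxes given as [x_min, y_min, x_max, y_max]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def point_hit(pred_point, gt_box):
    """A predicted point counts as correct if it lies inside the ground-truth box."""
    x, y = pred_point
    return gt_box[0] <= x <= gt_box[2] and gt_box[1] <= y <= gt_box[3]
```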
In real-world clustered pick-and-place tasks, InternVLA-M1 achieved a +7.3% improvement, and with synthetic-data co-training, a +20.6% improvement on previously unseen objects and spatial configurations. In long-horizon, multistep scenarios, its success rates surpass those of competing methods by over 10% (Chen et al., 15 Oct 2025).
5. Data Generation and Simulation Infrastructure
A critical enabler for spatially guided training is the use of scalable simulation pipelines for data collection. The InternVLA-M1 framework incorporates a custom simulation engine (based on GenManip and Isaac Sim) that:
- Randomizes object placement, lighting, and scene geometry.
- Uses privileged information (object pose, mesh, and robot state) for candidate grasp computation and verification.
- Decouples physics simulation from rendering for high throughput.
This pipeline generated 244,000 pick-and-place episodes across more than 3,000 objects. The resulting large-scale dataset boosts generalization across 200 tasks by 6.2% on average (Chen et al., 15 Oct 2025).
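A highly simplified sketch of such an episode-generation loop is shown below; every API name here (`randomize_scene`, `compute_grasps`, `verify_grasp`, and so on) is a placeholder rather than the GenManip or Isaac Sim interface.

```python
import random

def generate_episode(sim, objects, robot):
    """One hypothetical pick-and-place episode with domain randomization and
    privileged-information grasp verification; all sim calls are placeholders."""
    # 1. Randomize object placement, lighting, and scene geometry.
    scene = sim.randomize_scene(
        objects=random.sample(objects, k=random.randint(3, 8)),
        lighting=random.uniform(0.3, 1.0),
    )

    # 2. Use privileged state (object pose, mesh, robot state) to compute and
    #    verify candidate grasps before any rendering.
    target = random.choice(scene.objects)
    grasps = sim.compute_grasps(target.mesh, target.pose, robot.state)
    grasp = next((g for g in grasps if sim.verify_grasp(g)), None)
    if grasp is None:
        return None  # discard episodes with no feasible grasp

    # 3. Run physics, then render observations separately
    #    (physics simulation and rendering are decoupled for throughput).
    trajectory = sim.execute_pick_and_place(grasp, place_pose=scene.random_free_pose())
    observations = sim.render(trajectory)
    return {
        "instruction": f"pick up the {target.name} and place it in the bin",
        "observations": observations,
        "actions": trajectory.actions,
    }
```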
6. Generalization, Long-Horizon Reasoning, and Practical Applications
Spatially guided VLA models demonstrate strong generalization to real-world, clustered, and novel-object scenarios, largely due to the explicit spatial grounding learned in pre-training. The end-to-end architecture, with spatial prompting and temporal consistency strategies, supports robust, long-horizon planning under dynamic and perturbed conditions.
In practical deployments, the framework supports instruction following over extended task sequences (e.g., desktop sorting, drawer manipulation, sandwich assembly) and exhibits adaptability to previously unseen objects and layouts.
Open-source code and pretrained models for InternVLA-M1 are available, supporting reproducibility and broader adoption in research and applied settings (see https://github.com/InternRobotics/InternVLA-M1).
7. Summary and Significance
Spatially guided vision-language-action training, as exemplified by the InternVLA-M1 framework, introduces spatial grounding as a critical linking mechanism between language understanding and robotic action. The combination of embodiment-agnostic spatial pre-training, spatially guided action post-training, scalable simulation-based data generation, and explicit spatial prompting provides substantial and consistent gains across diverse benchmarks, advancing the state of scalable, robust, and generalist robot manipulation (Chen et al., 15 Oct 2025). This approach sets a template for future research on resilient and adaptable embodied AI systems.