GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Published 7 Jun 2026 in cs.RO and cs.AI | (2606.08530v2)

Abstract: Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.


Authors (14)

Summary

The paper introduces a two-stage, coarse-to-fine training pipeline that leverages latent action tokens and continuous action predictions to achieve robust manipulation.
The methodology integrates a frozen 2D visual pathway with a trainable 3D spatial encoder to capture explicit geometric cues for improved object localization.
Embodiment canonicalization enables the model to generalize across heterogeneous robot embodiments, achieving state-of-the-art results in both simulation and real-world benchmarks.

Geometry-Aware Action Representations for Generalizable Robotic Manipulation: An Analysis of GEAR-VLA

Introduction

The "GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation" (2606.08530) paper addresses persistent limitations of Vision-Language-Action (VLA) models in robotic manipulation, especially concerning generalization across unseen objects, environmental conditions, and heterogeneous robot embodiments. While state-of-the-art VLA systems demonstrate strong benchmark results, they exhibit substantial gaps in robustness and transferability in real-world deployment. The core thesis is that these limitations stem from a lack of unified action representations that are both geometry-aware and embodiment-invariant, leading to poor cross-embodiment transfer and susceptibility to distribution shifts.

Methodological Framework

Coarse-to-Fine Action Learning

GEAR-VLA introduces a two-stage training pipeline called coarse-to-fine action learning. First, an embodied VLM backbone is pretrained autoregressively on a large corpus spanning vision-language data, spatial grounding, trajectory reasoning, and manipulation videos. Crucially, two discrete action supervision signals are combined: FAST-style action tokens from robot trajectories, and latent action IDs distilled from action-free videos via a causal VQ-VAE tokenizer. This lets the model internalize high-level action semantics from both annotated robot data and unannotated visual dynamics, producing action-relevant latent representations intended to generalize across visual domains.
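To make the notion of a "latent action ID" concrete, the following is a minimal sketch of the vector-quantization lookup alone: a feature describing the change between two video frames is mapped to the index of its nearest codebook entry. The codebook values, the 2-D feature size, and the function name `quantize` are hypothetical and chosen for illustration. The paper's tokenizer is a learned causal VQ-VAE, which this sketch does not reproduce.

```python
from typing import List, Sequence


def quantize(feature: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to `feature` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))


# Hypothetical 2-D codebook with three entries (a real codebook is learned).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical encoder outputs for three consecutive frame pairs.
frame_pair_features = [(0.9, 0.1), (0.1, 0.8), (0.05, -0.02)]

# Each frame pair becomes one discrete ID usable as a prediction target.
latent_action_ids = [quantize(f, CODEBOOK) for f in frame_pair_features]
print(latent_action_ids)
```

Because the IDs are discrete, they can serve as next-token targets alongside ordinary text tokens, which is presumably what allows action-free video to supervise an autoregressive backbone.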

Next, continuous action chunk prediction is decoupled from the VLM backbone using a gradient-stopped DiT-based action expert. Only the cache of latent action tokens from the VLM serves as the input for continuous prediction, preventing error propagation and semantic drift due to low-level trajectory fitting. The flow-matching objective ensures efficient mapping from semantic action intent to robot-executable trajectories without perturbing the learned representation.
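As a rough illustration of a flow-matching objective, the sketch below builds one training pair on a straight path from noise to a target action chunk and then integrates an ideal velocity field back to the action. The linear interpolation convention (noise at t=0, action at t=1), the 3-D action vector, and all function names are assumptions; the summary does not state the paper's exact formulation. Gradient decoupling is not shown, since plain Python has no autodiff. In a real framework the VLM features would be detached with a stop-gradient operation before conditioning the expert.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def flow_matching_pair(action: Vector, noise: Vector, t: float) -> Tuple[Vector, Vector]:
    """Build (x_t, target_velocity) on a straight path from noise (t=0) to action (t=1)."""
    x_t = [(1.0 - t) * n + t * a for a, n in zip(action, noise)]
    target_velocity = [a - n for a, n in zip(action, noise)]
    return x_t, target_velocity


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 noise: Vector, steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
action = [0.2, -0.1, 0.05]  # hypothetical flattened action chunk
noise = [rng.gauss(0.0, 1.0) for _ in action]

x_t, target = flow_matching_pair(action, noise, t=0.3)

# A perfect expert would predict `target`, giving zero loss.
perfect_loss = mse(target, target)

# With the ideal (constant) velocity, Euler integration recovers the action.
recovered = euler_sample(lambda x, t: target, noise)
```

At inference time a trained expert would replace the ideal velocity with its own prediction, conditioned on the cached latent action tokens.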

Semantic-Aligned 3D Integration

GEAR-VLA augments 2D VLM representations with a trainable 3D spatial encoder (VGGT), utilizing multi-view consistency for explicit geometric structure modeling. To avoid disrupting pretrained vision-language alignment, the architecture freezes the 2D semantic encoder, zero-initializes the 3D branch, and fuses features through an expanded visual projector. Gradual integration ensures stable optimization: 2D features preserve language grounding, while 3D features contribute geometry-aware cues essential for manipulation in varied and cluttered environments.
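One way to read the zero-initialization idea is sketched below: if the projector is widened to accept concatenated 2D and 3D features and the new 3D columns start at zero, the fused output at initialization equals the 2D-only output. This linear, concatenation-based form is an assumption; the summary does not specify the fusion operator. The weights, feature sizes, and the helper name `expand_projector` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix `w` (rows x cols) by vector `x` (cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def expand_projector(w_2d: Matrix, dim_3d: int) -> Matrix:
    """Append `dim_3d` zero columns so the projector accepts [f_2d ; f_3d]."""
    return [row + [0.0] * dim_3d for row in w_2d]


w_2d = [[0.5, -1.0], [2.0, 0.25]]  # hypothetical pretrained 2D projector
f_2d = [1.0, 2.0]                  # hypothetical frozen 2D feature
f_3d = [3.0, -4.0, 5.0]            # hypothetical 3D spatial feature

w_expanded = expand_projector(w_2d, dim_3d=len(f_3d))

baseline = matvec(w_2d, f_2d)            # output before adding the 3D branch
fused = matvec(w_expanded, f_2d + f_3d)  # output with zero-initialized 3D columns

print(baseline, fused)
```

Under this assumption the pretrained vision-language alignment is untouched at the start of training, and the 3D columns only begin to contribute as they receive gradient updates.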

Embodiment Canonicalization

A core challenge in large-scale robot policy learning is handling kinematic and state-space heterogeneity. GEAR-VLA introduces embodiment canonicalization by structuring inputs as embodiment-aware state embeddings (end-effector pose and joint angles) projected to a unified representation and outputs as relative end-effector actions (SE(3) deltas anchored to current pose). Embodiment differences are thus confined to a lightweight, robot-specific state projector, and all high-level policy learning operates in an embodiment-agnostic space. This design obviates the need for robot-specific policy heads or prompt engineering, allowing efficient adaptation to unseen robots with minimal data and fine-tuning.
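A relative end-effector action can be sketched with 4x4 homogeneous transforms: the delta is the target pose expressed in the current end-effector frame, and composing it with the current pose recovers the target. The frame convention (delta = inverse(current) · target), the yaw-only rotation, and all function names are assumptions made for illustration, not the paper's stated parameterization.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t: Matrix) -> Matrix:
    """Invert a rigid transform: rotation R^T, translation -R^T p."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    p_inv = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [p_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose(yaw: float, x: float, y: float, z: float) -> Matrix:
    """Build a pose with rotation about z by `yaw` and translation (x, y, z)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x], [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def relative_action(current: Matrix, target: Matrix) -> Matrix:
    """SE(3) delta taking `current` to `target`, expressed in the current frame."""
    return matmul(rigid_inverse(current), target)


def apply_action(current: Matrix, delta: Matrix) -> Matrix:
    """Compose the current pose with a relative action."""
    return matmul(current, delta)


current = pose(0.3, 0.40, 0.00, 0.20)  # hypothetical end-effector pose
target = pose(0.5, 0.45, 0.10, 0.20)   # hypothetical next waypoint

delta = relative_action(current, target)
reached = apply_action(current, delta)
```

Because the delta is anchored to the current pose rather than to a robot base frame, the same action describes the same local motion regardless of where the arm is mounted, which is the property the canonicalization relies on.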

Empirical Evaluation

GEAR-VLA achieves state-of-the-art generalization on a comprehensive suite of simulation (LIBERO, LIBERO-Plus, RoboTwin 2.0) and real-world manipulation benchmarks. The system consistently outperforms leading baselines across several dimensions:

Simulation Performance: Achieved 98.7% on LIBERO, 88.7% zero-shot on LIBERO-Plus, and 91.1%/89.9% on RoboTwin 2.0 (clean/randomized), surpassing ACoT, X-VLA, and other previous methods.
Bimanual Manipulation: In three real-world tasks on AgileX (14-DoF dual-arm), reached 85.9% success (200 demos/task, tested on unseen object appearances). On the previously unseen LDT-01 robot (16-DoF), achieved 81.0% success, evidencing strong cross-embodiment transfer.
Universal Object Grasping: On a large-scale benchmark (6,360 trials over 212 unseen objects), obtained 90.1% average success, outperforming π0.5 (79.1%) and DexGraspVLA (84.4%). In particular, the system excelled on irregular and tool-like objects under dense clutter and changing lighting and backgrounds.

Ablation studies substantiate that each key component—latent action supervision, 3D geometry, and embodiment canonicalization—contributes significantly to overall robustness and transfer.

Technical Implications and Insights

The empirical evidence suggests that geometry-aware visual reasoning and disentangled embodiment interfaces are important for scalable, general-purpose robotic policy learning. GEAR-VLA’s architecture decouples semantic and low-level physical priors, allowing it to:

Exploit multi-source and multimodal training signals, incorporating latent dynamics from raw videos and concrete robot supervision with minimal manual annotation.
Leverage 3D spatial understanding as a core inductive bias, improving object localization and manipulation in dynamic and cluttered contexts.
Generalize across robot morphologies with minimal architecture or data modifications due to canonicalization, thus reducing data imbalance and platform specificity issues common in prior VLA designs.

The results challenge the efficacy of approaches that use only quantized action tokens, robot-specific prompts, or naively fused 3D representations, showing marked performance drops when these paradigms are substituted for canonicalized, geometry-aware learning.

Broader Impact and Future Research Directions

Practically, GEAR-VLA enables scalable deployment of robotic manipulation policies across fleets of heterogeneous robots, in unstructured settings, and with little adaptation overhead. The universal grasping experiments demonstrate applicability to open-vocabulary, object-centric tasks, indicating utility for service robotics, logistics, and home automation domains.

Theoretically, the findings motivate future work on:

More advanced geometric perception, such as tighter coupling between metric spatial awareness and symbolic task reasoning.
Further leveraging unlabeled human and web video data for action semantics distillation, reducing reliance on robot-specific annotations.
Extending cross-embodiment generalization to dynamically reconfigurable systems, multi-agent settings, or direct sim2real transfer at scale.
Exploring continual and online adaptation without catastrophic forgetting via the coarse-to-fine policy interface.

Conclusion

GEAR-VLA represents a significant advance in generalizable robotic manipulation by tightly integrating geometry-aware representations, coarse-to-fine semantic-action policy learning, and embodiment-canonicalized interfaces. The framework achieves high levels of real-world robustness and adaptability, with strong numerical results across simulation and challenging physical environments. These outcomes provide a compelling foundation for the development of universally deployable robotic control policies leveraging vision-language-action pretraining paradigms.
