Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

Published 14 Apr 2026 in cs.RO | (2604.12908v1)

Abstract: At its core, robotic manipulation is a problem of vision-to-geometry mapping ($f(v) \rightarrow G$). Physical actions are fundamentally defined by geometric properties like 3D positions and spatial relationships. Consequently, we argue that the foundation for generalizable robotic control should be a vision-geometry backbone, rather than the widely adopted vision-language or video models. Conventional VLA and video-predictive models rely on backbones pretrained on large-scale 2D image-text or temporal pixel data. While effective, their representations are largely shaped by semantic concepts or 2D priors, which do not intrinsically align with the precise 3D geometric nature required for physical manipulation. Driven by this insight, we propose the Vision-Geometry-Action (VGA) model, which directly conditions action generation on pretrained native 3D representations. Specifically, VGA replaces conventional language or video backbones with a pretrained 3D world model, establishing a seamless vision-to-geometry mapping that translates visual inputs directly into physical actions. To further enhance geometric consistency, we introduce a Progressive Volumetric Modulation module and adopt a joint training strategy. Extensive experiments validate the effectiveness of our approach. In simulation benchmarks, VGA outperforms top-tier VLA baselines including $\pi_{0.5}$ and GeoVLA, demonstrating its superiority in precise manipulation. More importantly, VGA exhibits remarkable zero-shot generalization to unseen viewpoints in real-world deployments, consistently outperforming $\pi_{0.5}$. These results highlight that operating on native 3D representations, rather than translating through language or 2D video priors, is a highly promising direction for achieving generalizable physical intelligence.

Authors (7)

Summary

The paper introduces a Vision-Geometry-Action (VGA) model that maps multi-view visual inputs directly to 3D representations for precise robotic manipulation.
It reports a 98.1% average success rate on the LIBERO simulation benchmark and zero-shot generalization to unseen camera viewpoints in real-world tests.
The study challenges traditional vision-language backbones by emphasizing native 3D spatial reasoning and parameter-efficient fine-tuning via LoRA.

Rethinking Robotic Manipulation: Vision-to-Geometry Mapping via Vision-Geometry Backbones

Problem Statement and Motivation

The paper “Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models” (2604.12908) scrutinizes the backbone paradigms used for robotic policy learning. It argues that physical robotic manipulation is fundamentally a vision-to-geometry problem, in which geometric consistency, rather than semantic or 2D pixel-space correlation, should underpin action generation. This view contrasts directly with the prevalent vision-language-action (VLA) and video-predictive architectures, which build on vision-language models (VLMs) or video diffusion transformers. Both rest on large-scale 2D or spatio-temporal pretraining data, so their representations tend to inherit 2D priors and may fail to capture the 3D spatial reasoning that manipulation requires.

Figure 1: The vision-to-geometry mapping conceptualization, illustrating that geometric entities (positions, orientations, spatial structure) prescribe robot action, motivating a geometry-grounded model foundation.

VGA: A Vision-Geometry-Action Architecture

The heart of the proposed solution is the Vision-Geometry-Action (VGA) model. VGA abandons language and video backbones entirely, instead employing a pretrained 3D world model—specifically, VGGT—to map multi-view visual observations directly into comprehensive, native 3D representations. Action heads condition on these representations, giving rise to a seamless, geometrically grounded policy pipeline.

VGA’s multi-modal input comprises multi-view RGB observations, language instructions, and robot proprioception. These signals are tokenized and processed by a transformer backbone that alternates between frame-wise intra-modality attention and cross-modality attention, producing a token grid aligned with the scene’s geometric structure. In the decoding stage, a Progressive Volumetric Modulation (PVM) module fuses these features for action prediction. VGA is trained with a joint loss over the physical actions and auxiliary 3D properties (camera parameters, depth maps) to enforce consistent spatial reasoning.
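The joint objective over actions and auxiliary 3D properties can be illustrated with a minimal Python sketch. The choice of L1 and MSE terms, the weights `w_depth` and `w_cam`, and all function names are assumptions made for illustration; this summary does not give the paper's actual loss formulation.

```python
def l1(pred, target):
    """Mean absolute error between two equal-length lists of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(action_pred, action_gt,
               depth_pred, depth_gt,
               cam_pred, cam_gt,
               w_depth=0.1, w_cam=0.1):
    """Weighted sum of an action loss and two auxiliary geometry losses.

    Assumed form: L = L_action + w_depth * L_depth + w_cam * L_cam.
    The loss types and weights are hypothetical placeholders.
    """
    action_term = l1(action_pred, action_gt)
    depth_term = l1(depth_pred, depth_gt)
    cam_term = mse(cam_pred, cam_gt)
    total = action_term + w_depth * depth_term + w_cam * cam_term
    parts = {"action": action_term, "depth": depth_term, "camera": cam_term}
    return total, parts
```

Returning the individual terms alongside the total makes it easy to log whether the auxiliary geometry losses or the action loss dominate during training.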

Figure 2: Workflow and structural overview of VGA, illustrating multi-modal tokenization, geometric cross-attention, progressive volumetric modulation, and unified representation for downstream action and 3D attribute prediction.
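The alternating attention pattern described for the backbone can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single head, identity query/key/value projections, no residual connections or normalization, and "cross-modality attention" interpreted as attention over the tokens of all views jointly. The function names are hypothetical and do not come from the VGA or VGGT code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def frame_attention(views):
    """Each view's tokens attend only to tokens from the same view."""
    return [self_attention(view) for view in views]


def global_attention(views):
    """Tokens from all views attend to each other, then are split back per view."""
    flat = [tok for view in views for tok in view]
    mixed = self_attention(flat)
    out, start = [], 0
    for view in views:
        out.append(mixed[start:start + len(view)])
        start += len(view)
    return out


def alternating_block(views, num_layers=2):
    """Alternate frame-wise and global attention for num_layers rounds."""
    for _ in range(num_layers):
        views = frame_attention(views)
        views = global_attention(views)
    return views
```

Here `views` is a list of views, each a list of token vectors. Frame-wise attention keeps views independent, while the global step is what lets information flow between camera views.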

Key architectural innovations include:

Native 3D backbone: Direct use of a pretrained geometric transformer (VGGT) for all perception-to-action information flow.
Progressive Volumetric Modulation (PVM): Layer-wise injection and alignment of geometric priors for action generation.
Joint training: Simultaneous prediction of 3D geometric attributes and robot actions to maximize cross-modal geometric consistency.
LoRA parameter-efficient fine-tuning: Selective adaptation preserving backbone priors.
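To make the LoRA item concrete, here is a minimal Python sketch of the standard low-rank update, y = W x + (alpha / r) * B (A x), where W stays frozen and only A and B are trained. The layer sizes, rank, and scaling used below are hypothetical; this summary does not state which layers VGA adapts or with what rank.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Forward pass of a LoRA-adapted linear layer.

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (rank, d_in)
    B: trainable up-projection, shape (d_out, rank), commonly initialized to zero
    """
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * l for b, l in zip(base, low_rank)]


def trainable_params(d_out, d_in, rank):
    """Return (full fine-tuning parameter count, LoRA parameter count) for one layer."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora
```

With B initialized to zero the adapted layer reproduces the pretrained output exactly, which is one reason this style of adaptation is described as preserving backbone priors. For a hypothetical 1024 by 1024 layer at rank 16, `trainable_params(1024, 1024, 16)` gives 1,048,576 parameters for full fine-tuning against 32,768 for LoRA.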

Simulated and Real-World Empirical Evaluation

Simulation Results: LIBERO Benchmark

VGA’s capabilities are analyzed on the LIBERO benchmark, which entails diverse manipulation tasks spanning spatial, object, compositional goal, and long-horizon reasoning. Evaluation focuses on task success rates, geometric prediction quality, and robustness to design ablations.

Numerical results show that VGA achieves the highest average success rate (98.1%) across the LIBERO suites, outperforming state-of-the-art VLA models ($\pi_{0.5}$, GeoVLA, OpenVLA-OFT) as well as the strongest 3D-VLA and World Action Model (WAM) baselines.

Figure 3: Simulation rollouts with corresponding VGA depth predictions highlight the model’s precise geometric scene understanding and strong manipulation performance.

Ablation studies validate the necessity of each design component: removal of PVM or joint training leads to up to 2.4% and 0.9% performance drops, respectively. The pretrained 3D backbone is essential; random initialization (even with LoRA) leads to catastrophic degradation (down to 6.4% success).

Auxiliary Geometric Prediction

Depth and camera parameter prediction is demonstrably accurate, especially for task-relevant scene regions (robot gripper, target objects), affirming that VGA’s representations retain high-fidelity 3D information critical for real-world transfer.

Real-World Experiments: Zero-Shot Generalization

Real-world validation is performed on a Franka Panda platform with multiple camera configurations. Three tasks (cube pick, button press, stack cube) measure both in-distribution performance and out-of-distribution performance under unseen camera viewpoints.

VGA’s zero-shot generalization is particularly notable: when deployed with camera configurations unseen during training, VGA outperforms all baselines, including $\pi_{0.5}$. While ACT and OpenVLA collapse under OOD conditions (7% and 3% success, respectively), VGA achieves a 58% average success rate, surpassing $\pi_{0.5}$ by 6%. This finding challenges the assumption that generalist VLMs suffice for robust robotic transfer.
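The 6% margin over $\pi_{0.5}$ can be read either as percentage points or as a relative improvement, and this summary does not say which. The short Python sketch below works out the baseline implied by each reading. The function names are hypothetical, and the resulting baseline values are arithmetic consequences of each reading, not figures reported by the paper.

```python
def baseline_if_absolute(score, gap_points):
    """Baseline success rate if the gap is measured in percentage points."""
    return score - gap_points


def baseline_if_relative(score, gap_percent):
    """Baseline success rate if the gap is a relative improvement over the baseline."""
    return score / (1.0 + gap_percent / 100.0)


vga_ood = 58.0  # VGA average OOD success rate (%) as stated in the text
gap = 6.0       # stated margin over pi_0.5

absolute_reading = baseline_if_absolute(vga_ood, gap)
relative_reading = baseline_if_relative(vga_ood, gap)
```

Under the percentage-point reading the implied $\pi_{0.5}$ success rate is 52%; under the relative reading it is roughly 54.7%. The gap to ACT and OpenVLA is large under either interpretation.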

Figure 4: Real-world experiment configuration illustrating in-distribution and out-of-distribution camera placements for rigorous spatial generalization tests.

Figure 5: Visualized real-world VGA rollouts, demonstrating geometric robustness under both seen and novel observation configurations.

Language-Grounded Manipulation

VGA robustly grounds semantic language instructions to spatial actions, as evidenced by precise object selection in layouts with visually similar distractors.

Figure 6: Robustness of VGA in grounded grasping tasks across various object layouts, confirming effective visual-language-geometry integration.

Theoretical and Practical Implications

This work provides evidence that geometry-grounded architectures can outperform those built on semantic or video-centric representations for spatially sensitive embodied control. Several implications follow:

Elimination of 2D bottleneck: By avoiding flattening the scene into 2D representations and then reconstructing it, VGA maintains metric and volumetric consistency, yielding policies more robust to viewpoint and spatial configuration shifts.
Separation of perception modalities: Results question the necessity of video or language backbones except when explicit commonsense reasoning is required—geometry alignment takes clear precedence for manipulation.
Data efficiency and parameter economy: Fine-tuning with LoRA on pretrained geometric representations enables rapid convergence with a fraction of the trainable parameters and strong data efficiency.

On the practical side, these findings support the design of future generalist robots with robust, sensor-invariant manipulation capabilities without reliance on specialized 3D sensors or massive-scale video or language pretraining. A remaining limitation of VGA is its relatively weaker performance on tasks requiring long-horizon memory or high-level semantic reasoning, due to the focus on 3D scene understanding rather than sequence modeling at scale.

Future Directions

Further scaling VGGT-style 3D backbones with more diverse, temporally-extended datasets could close existing gaps on long-horizon tasks. Additionally, modular fusion with VLMs or explicit commonsense/semantic planning layers could re-introduce reasoning capabilities without sacrificing geometric fidelity. The study suggests that for embodied AI, pretraining on synthetic or recorded 3D environments may be a more fruitful path to generalization than mining ever-larger language or action datasets.

Conclusion

Shifting from vision-language or video-centric pretraining to a strict vision-to-geometry mapping, built on a native 3D world model backbone, enables robust and generalizable manipulation policies. The empirical and architectural evidence presented positions geometry-anchored models as a strong backbone choice for embodied manipulation.