
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Published 1 Apr 2026 in cs.CV, cs.AI, and cs.RO | (2604.00813v1)

Abstract: End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite its faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.

Summary

  • The paper presents DVGT-2, a novel model that leverages dense metric 3D geometry as a core representation for precise trajectory planning in autonomous driving.
  • It employs a sliding-window streaming approach to achieve O(1) per-frame latency (~260ms) and memory usage, enabling real-time inference.
  • Empirical evaluations demonstrate state-of-the-art performance in local ray depth estimation and closed-loop planning across diverse benchmarks.

DVGT-2: A Vision-Geometry-Action Model for Scalable Autonomous Driving

Paradigm Shift: From VLA to VGA in Autonomous Driving

Recent end-to-end autonomous driving models have shifted from sparse perception-driven pipelines to vision-language-action (VLA) architectures that leverage VLMs for semantic context and decision making. However, language-centric models suffer from ambiguity and lack the spatial precision required for robust planning. "DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale" (2604.00813) advocates a transition to Vision-Geometry-Action (VGA), positing dense metric 3D geometry as the critical intermediate representation. VGA integrates fine-grained, pixel-aligned spatial information as the primary driver for trajectory planning, providing a more exhaustive and precise counterpart to VLA's high-level but inherently lossy scene abstraction (Figure 1).

Figure 1: Comparative overview of paradigms in autonomous driving; VGA reconstructs dense 3D geometry, surpassing sparse and language-based intermediates.

DVGT-2 Architecture: Efficient Streaming Geometry and Planning

The core innovation in DVGT-2 is a streaming visual geometry transformer that processes multi-frame, multi-view inputs online and in real time, jointly predicting dense 3D pointmaps, relative ego poses, and future trajectories for each frame (Figure 2).

Figure 2: Overview of DVGT-2: a streaming architecture that jointly predicts dense geometry, ego state, and planned trajectory for each timestep from multi-camera inputs.
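
As a minimal sketch of the per-frame streaming interface this implies, the class below encodes one synchronized multi-camera frame, attends over a bounded feature cache, and decodes geometry, pose, and plan. All class, method, and attribute names, the window size, and the tensor shapes are illustrative assumptions, not the authors' released API.

```python
# Minimal sketch of the per-frame streaming interface implied above.
# Names, window size, and shapes are illustrative assumptions.
import torch

class DVGT2Stream:
    def __init__(self, model, window_size: int = 8):
        self.model = model          # encoder + geometry transformer + heads
        self.window = window_size   # fixed cache length W (assumed value)
        self.cache = []             # cached historical intermediate features

    @torch.no_grad()
    def step(self, frames: torch.Tensor):
        """Process one synchronized multi-camera frame.

        frames: (num_views, 3, H, W) images for the current timestep.
        Returns a dense pointmap, a relative ego pose, and a planned trajectory.
        """
        tokens = self.model.encode(frames)                # ViT-L visual tokens
        fused = self.model.transform(tokens, self.cache)  # causal attention over cache
        self.cache.append(fused.detach())
        if len(self.cache) > self.window:                 # sliding window: O(1) memory
            self.cache.pop(0)
        return (self.model.point_head(fused),             # dense 3D pointmap
                self.model.pose_head(fused),              # relative ego pose
                self.model.plan_head(fused))              # future ego trajectory
```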

The model comprises:

  • Image Encoding: A ViT-L (DINOv3 pre-trained) backbone extracts visual tokens from synchronized multi-camera frames.
  • Geometry Transformer: Factorized attention operates across intra-view, cross-view, and temporal axes, using relative positional encoding (MRoPE-I) to facilitate robust long-range aggregation without growing the cache (a single-timestep sketch follows Figure 3).
  • Prediction Heads: Specialized heads decode (i) dense 3D pointmaps (DPT head), (ii) relative ego pose (anchor-based diffusion), and (iii) future ego trajectory (anchor-based diffusion) (Figure 3).

    Figure 3: DVGT-2 architecture: image encoder, geometry transformer, and heads for joint geometry and action prediction.
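
To make the factorized attention concrete, the following is a single-timestep sketch: tokens attend within their own view, then across cameras, then causally over the cached history. The tensor layout, residual wiring, and the omission of MRoPE-I and layer norms are simplifying assumptions for illustration, not the paper's implementation.

```python
# Single-timestep sketch of factorized attention over intra-view,
# cross-view, and temporal axes. Layout and residual wiring are assumed;
# MRoPE-I and layer norms are omitted for brevity.
import torch
import torch.nn as nn

class FactorizedBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, cache: torch.Tensor) -> torch.Tensor:
        # x: (V, N, D) current-frame tokens (V views, N tokens per view);
        # cache: (V*N, T_past, D) features from up to W earlier timesteps.
        V, N, D = x.shape
        # 1) intra-view: tokens attend within their own camera image
        x = x + self.intra(x, x, x, need_weights=False)[0]
        # 2) cross-view: each token position attends across cameras
        h = x.transpose(0, 1)                      # (N, V, D)
        h = h + self.cross(h, h, h, need_weights=False)[0]
        x = h.transpose(0, 1)                      # (V, N, D)
        # 3) temporal: the current step attends to the cached history; with a
        # single query timestep, past-only keys make the attention causal.
        q = x.reshape(V * N, 1, D)
        kv = torch.cat([cache, q], dim=1)          # history + current step
        q = q + self.temporal(q, kv, kv, need_weights=False)[0]
        return q.reshape(V, N, D)

block = FactorizedBlock(dim=256)
x = torch.randn(6, 1024, 256)             # e.g., 6 cameras, 1024 tokens each
cache = torch.randn(6 * 1024, 7, 256)     # 7 cached timesteps
out = block(x, cache)                     # (6, 1024, 256)
```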

Sliding-Window Streaming: Constant-Cost Real-Time Inference

In contrast to prior global and streaming geometry models (e.g., DVGT, StreamVGGT), DVGT-2 employs a fixed-size window to cache historical intermediate features, eliminating the computational and memory bloat of global and full-history paradigms. By reconstructing local geometry (per-frame, ego-centric) and accumulating only relative pose estimates, the model achieves:

  • O(1) per-frame memory and latency, independent of sequence length or drive duration.
  • Elimination of redundant reprocessing of historical frames, a key limitation of batch and naïve streaming paradigms (Figure 4; a toy cache demonstration follows the figure).

    Figure 4: Efficient online inference with sliding-window caching: only the most recent W frames are processed, yielding constant compute and memory.
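
The toy demonstration below shows the constant-cost property: a bounded cache keeps memory flat no matter how long the stream runs. The window size and feature shapes are placeholders, not values from the paper.

```python
# Toy demonstration of constant-cost sliding-window caching: memory is
# bounded by the window size W regardless of stream length. W and the
# feature shape are placeholders, not values from the paper.
from collections import deque
import torch

W = 8                                    # assumed window size
cache = deque(maxlen=W)                  # deque evicts the oldest entry itself

def cached_elements(frame_feats: torch.Tensor) -> int:
    """Add the current frame's features and report total cached elements."""
    cache.append(frame_feats)
    return sum(f.numel() for f in cache)

for t in range(1000):                    # simulate a long drive
    n = cached_elements(torch.zeros(6, 1024, 256))  # 6 cams x 1024 tokens x 256-d
    assert n <= W * 6 * 1024 * 256       # flat after the first W frames
```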

Empirical measurements confirm that DVGT-2 maintains stable latency (~260ms/frame) and flat memory usage across hundreds of frames, while alternatives suffer quadratic or linear growth that ultimately leads to out-of-memory failures (Figure 5).


Figure 5: Memory usage comparison: DVGT-2 sustains constant cost versus catastrophic growth in prior models.

Geometry and Planning Performance: Quantitative and Qualitative Analysis

DVGT-2 achieves state-of-the-art performance in local ray depth estimation (Abs Rel, δ < 1.25; see the sketch below) across multiple datasets (OpenScene, nuScenes, Waymo, KITTI, DDAD), with competitive results in global point reconstruction despite lacking access to full trajectory context during inference.

  • On OpenScene, DVGT-2 attains Abs Rel = 0.040 and δ < 1.25 = 0.977, far surpassing both full-sequence and previous streaming methods.
  • On nuScenes, DVGT-2 outperforms all but the heaviest batch models in both accuracy and latency.
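
For reference, the two reported depth metrics follow the standard definitions from the depth-estimation literature; the code below is generic evaluation logic, not the paper's scripts.

```python
# Standard definitions of the reported depth metrics: absolute relative error
# (lower is better) and the delta < 1.25 threshold accuracy (higher is better).
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, min_depth: float = 1e-3):
    """Return (Abs Rel, delta < 1.25) over valid ground-truth pixels."""
    valid = gt > min_depth
    pred, gt = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = float(np.mean(ratio < 1.25))
    return abs_rel, delta1

# A perfect prediction gives Abs Rel = 0.0 and delta = 1.0; DVGT-2's reported
# OpenScene numbers (0.040 / 0.977) are close to that ideal.
```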

In closed-loop planning on NAVSIM v2, DVGT-2 achieves EPDMS = 88.9, outperforming all published state-of-the-art models, including VLA-based, occupancy-based, and multi-modal approaches. Open-loop planning on nuScenes yields an average L2 error of 0.78m and a collision rate of 0.19%, setting new marks for safety and reliability (Figure 6).

Figure 6: Qualitative results: DVGT-2 reconstructs high-fidelity geometry and robust future trajectories from complex multi-view driving scenes.
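
The open-loop L2 figure above is the usual waypoint-distance metric: mean Euclidean distance between predicted and ground-truth future ego positions. The sketch below shows the generic form; the exact nuScenes horizon and averaging protocol vary across papers and are an assumption here.

```python
# Generic sketch of the open-loop L2 metric: mean Euclidean distance between
# predicted and ground-truth future ego waypoints. Horizon/averaging protocol
# is an assumption, not the authors' evaluation code.
import numpy as np

def average_l2(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, 2) future BEV waypoints in meters."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

pred = np.array([[0.0, 1.9], [0.0, 3.8], [0.1, 5.9]])   # e.g., 1s/2s/3s waypoints
gt   = np.array([[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]])
print(average_l2(pred, gt))                              # ~0.15 m
```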

Theoretical and Practical Implications

Adopting dense geometry as the central bridge between perception and action has significant implications:

  • Theoretical: VGA dispenses with heuristic, application-specific sparse representations and language auxiliaries, leveraging continuous geometry for end-to-end optimization. The anchor-based diffusion for both pose and planning integrates uncertainty-aware, multi-modal distribution modeling into a metric-geometric pipeline.
  • Practical: Sliding-window streaming supports unbounded real-world deployment without expensive fine-tuning or manual calibration across platforms. The annotation efficiency of geometry-based supervision reduces dependency on scarce labeled data.

DVGT-2's architecture can be adapted to alternative sensor configurations, extended camera setups, and rapidly varying environmental contexts with minimal modification, demonstrating robust out-of-domain generalization via direct transfer.

Limitations and Future Developments

The local-to-global aggregation for pose inference introduces cumulative drift, limiting long-term global consistency. Heavier predictors or global correction modules could help, but at the cost of real-time operation. Furthermore, the strong reliance on geometric consistency may underperform in situations where non-metric, high-level semantics are indispensable (e.g., reasoning about invisible agents or intent).
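
The drift issue can be seen in a toy SE(2) simulation: chaining noisy relative poses lets small per-step heading errors compound into large global error. The noise level below is arbitrary, chosen only to make the effect visible; it is not derived from the paper.

```python
# Toy SE(2) illustration of cumulative drift from chained relative poses.
# Noise level is arbitrary, chosen only to make the effect visible.
import numpy as np

rng = np.random.default_rng(0)

def integrate(heading_noise: float) -> np.ndarray:
    """Chain 500 unit forward steps whose relative headings carry Gaussian noise."""
    theta, xy = 0.0, np.zeros(2)
    for _ in range(500):
        theta += rng.normal(0.0, heading_noise)        # per-step heading error
        xy = xy + np.array([np.cos(theta), np.sin(theta)])
    return xy

clean = integrate(0.0)                    # drift-free reference endpoint
noisy = integrate(0.002)                  # 2 mrad/step heading noise
print(np.linalg.norm(noisy - clean))      # typically meters to tens of meters
```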

Future work may focus on:

  • Integrating semantic priors with dense geometry for improved reasoning about occluded agents.
  • Hybridizing world model architectures with VGA for more expressive, controllable simulation.
  • Extending diffusion-based planners to longer horizons with self-correcting mechanisms for pose drift.

Conclusion

DVGT-2 operationalizes the Vision-Geometry-Action paradigm for scalable, annotation-efficient autonomous driving. Through online, streaming inference with fixed resource cost and joint prediction of geometry and plan, it achieves leading accuracy and safety across industry-standard benchmarks. The methodology marks a clear trajectory toward geometry-centric autonomous systems with robust, transferable, efficient planning modules, and opens new frontiers for integration with world models and semantic reasoning engines.
