VGGT: Visual Geometry Grounded Transformer (2503.11651v1)

Published 14 Mar 2025 in cs.CV

Abstract: We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.

Summary

  • The paper introduces VGGT, a feed-forward Transformer that unifies multiple 3D vision tasks without iterative optimization.
  • It leverages cross-view and spatial attention mechanisms to extract rich geometric information from varied image inputs.
  • Its end-to-end design enables state-of-the-art performance in camera pose, depth, point map estimation, and 3D tracking in under one second.

VGGT (Visual Geometry Grounded Transformer) is a feed-forward neural network architecture designed to infer a comprehensive set of 3D scene attributes directly from monocular, few-view, or multi-view image inputs (VGGT: Visual Geometry Grounded Transformer, 14 Mar 2025). Unlike specialized models targeting individual 3D vision tasks, VGGT proposes a unified framework capable of simultaneously estimating camera parameters, dense point maps, depth maps, and establishing 3D point tracks across views. A key characteristic is its purely feed-forward nature during inference, eschewing the iterative optimization or explicit geometric bundle adjustment steps common in traditional Structure-from-Motion (SfM) or Multi-View Stereo (MVS) pipelines, while still achieving competitive or superior performance.

Architecture and Processing Pipeline

The VGGT model is built upon a Transformer architecture, enabling it to process a variable number of input views, ranging from a single image up to hundreds. The core idea is to leverage attention mechanisms to effectively aggregate information across different views and spatial locations to build a unified 3D representation.
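
As a rough illustration of this idea, the following PyTorch sketch alternates within-view (spatial) self-attention with attention across all views' tokens; the layer sizes, normalization placement, and alternation pattern are assumptions for illustration rather than the paper's exact architecture.

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Illustrative block: self-attention within each view, then attention
    across all views' tokens jointly. Dimensions are placeholder assumptions."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (num_views, tokens_per_view, dim)
        V, T, C = tokens.shape
        # Spatial (within-view) attention: each view attends over its own tokens.
        x = self.norm1(tokens)
        spatial, _ = self.spatial_attn(x, x, x)
        tokens = tokens + spatial
        # Cross-view attention: flatten all views into one joint token sequence.
        x = self.norm2(tokens).reshape(1, V * T, C)
        global_out, _ = self.cross_view_attn(x, x, x)
        tokens = tokens + global_out.reshape(V, T, C)
        return tokens

# Example: 4 views, 196 patch tokens each, 512-dim features.
block = AlternatingAttentionBlock()
out = block(torch.randn(4, 196, 512))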

  1. Input Encoding: Each input image is initially processed by a feature extractor (e.g., a CNN like ResNet or a Vision Transformer) to obtain 2D feature maps. Positional encodings are added to these features to retain spatial information.
  2. Cross-View Attention: A central component of VGGT is likely a form of cross-view attention. This allows tokens (representing spatial locations or patches) from one view to attend to tokens in other views. This mechanism is crucial for establishing correspondences and understanding the epipolar geometry implicitly. The attention mechanism computes pairwise interactions between feature representations across different images, effectively identifying corresponding points or regions.
  3. Spatial Self-Attention: Within each view, spatial self-attention layers can refine the feature representations by capturing intra-image context and spatial relationships.
  4. Unified Representation: Through stacked layers of cross-view and spatial attention, the model builds an intermediate representation that encodes the geometric relationships between views and the 3D structure of the scene.
  5. Prediction Heads: This unified representation is then fed into multiple task-specific prediction heads, typically implemented as small Multi-Layer Perceptrons (MLPs) or linear layers:
    • Camera Parameter Head: Predicts relative or absolute camera poses (rotation $\mathbf{R}$ and translation $\mathbf{t}$) for each input view. For a pair of views $(i, j)$, it might predict the relative pose $\mathbf{P}_{ij} = [\mathbf{R}_{ij} \mid \mathbf{t}_{ij}]$; for multiple views, it could predict poses relative to a canonical frame or pairwise relative poses (see the pose-composition sketch after this list).
    • Depth Map Head: Predicts a dense depth map $D_i$ for each input view $i$, likely formulated as a per-pixel regression or classification task.
    • Point Map Head: Infers a "point map" giving the 3D coordinates $\mathbf{X}_p$ for each pixel $p$, either in a canonical coordinate system or relative to the camera, derived implicitly from depth and camera parameters or predicted directly.
    • 3D Point Track Head: Identifies and predicts the 3D trajectories $\mathbf{X}_k(t)$ of a set of sparse keypoints $k$ across the sequence of views $t$. This involves establishing long-range correspondences and predicting a consistent 3D location for each point.
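
For concreteness, the sketch below composes the relative pose $\mathbf{P}_{ij}$ from two absolute world-to-camera poses using standard rigid-body algebra; the world-to-camera convention and the helper name relative_pose are illustrative assumptions, not code from the paper.

import numpy as np

def relative_pose(R_i, t_i, R_j, t_j):
    """Compose the relative pose P_ij mapping view-i camera coordinates to
    view-j camera coordinates, given world-to-camera extrinsics (R_i, t_i)
    and (R_j, t_j). The world-to-camera convention is an assumption."""
    R_ij = R_j @ R_i.T          # relative rotation
    t_ij = t_j - R_ij @ t_i     # relative translation
    return R_ij, t_ij

# Example: identity pose for view i, a 10-degree yaw plus a small shift for view j.
theta = np.deg2rad(10.0)
R_i, t_i = np.eye(3), np.zeros(3)
R_j = np.array([[np.cos(theta), 0, np.sin(theta)],
                [0, 1, 0],
                [-np.sin(theta), 0, np.cos(theta)]])
t_j = np.array([0.1, 0.0, 0.0])
R_ij, t_ij = relative_pose(R_i, t_i, R_j, t_j)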

The feed-forward nature implies that a single pass through the network yields all outputs simultaneously without requiring iterative refinement or external geometric solvers during inference.

# Illustrative pseudocode of the feed-forward inference flow; the helper
# functions (backbones, attention layers, prediction heads) are placeholders.
def VGGT_inference(images):
    # images: list of input images [img_1, img_2, ..., img_N]

    # 1. Feature Extraction
    features = []
    for img in images:
        feature_map = CNN_or_ViT_backbone(img)
        features.append(add_positional_encoding(feature_map))

    # 2. Transformer Encoding (Cross-View and Spatial Attention)
    # Simplified representation of the attention layers
    encoded_representation = features
    for layer in transformer_layers:
        # Cross-view attention aggregates information across views
        cross_view_out = cross_view_attention(encoded_representation)
        # Spatial attention refines features within each view
        spatial_out = spatial_attention(cross_view_out)
        # Residual connection and normalization
        encoded_representation = layer_norm(spatial_out + encoded_representation)

    # 3. Prediction Heads
    camera_params = []
    depth_maps = []
    point_maps = []
    point_tracks = None  # may depend on input type (sequence vs. unordered set)

    for i in range(len(images)):
        view_representation = encoded_representation[i]

        # Predict camera parameters (e.g., relative to view 0 or absolute)
        cam_param = camera_head(view_representation)
        camera_params.append(cam_param)

        # Predict depth map
        depth_map = depth_head(view_representation)
        depth_maps.append(depth_map)

        # Predict point map (derived or direct); the dependency shown is an example
        point_map = point_map_head(view_representation, cam_param, depth_map)
        point_maps.append(point_map)

    # Predict point tracks if applicable (across all views' representations)
    if is_sequence(images):
        point_tracks = point_track_head(encoded_representation)

    return camera_params, depth_maps, point_maps, point_tracks
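
To make the point_map_head dependency in the sketch above concrete, the following function derives a point map from a predicted depth map and pinhole camera parameters by standard unprojection; this is one plausible formulation under assumed conventions, and the paper's head may instead predict the point map directly.

import numpy as np

def depth_to_point_map(depth, K, R, t):
    """Unproject a depth map (H, W) into a 3D point map (H, W, 3) in world
    coordinates, given intrinsics K and a world-to-camera pose (R, t).
    Standard pinhole unprojection; the conventions are assumptions."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # homogeneous pixels (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                           # camera-frame rays
    points_cam = rays * depth[..., None]                      # scale rays by depth
    points_world = (points_cam - t) @ R                       # invert x = R X + t
    return points_world

# Example with a flat depth plane and an identity pose.
K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
point_map = depth_to_point_map(np.full((480, 640), 2.0), K, np.eye(3), np.zeros(3))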

Training Paradigm

Training VGGT involves optimizing the network parameters using a composite loss function that combines objectives for each of the output tasks.

  • Camera Pose Loss: Typically involves minimizing the difference between predicted camera poses $(\mathbf{R}_{pred}, \mathbf{t}_{pred})$ and ground-truth poses $(\mathbf{R}_{gt}, \mathbf{t}_{gt})$. This could use metrics like the geodesic distance for rotations and the L2 distance for translations, possibly weighted. For relative poses, the loss is computed on $\Delta \mathbf{P} = \mathbf{P}_{pred} \mathbf{P}_{gt}^{-1}$.
  • Depth Loss: Standard depth estimation losses are applicable, such as the scale-invariant logarithmic error (SILog), L1 loss, or multi-scale gradient-matching losses, computed between the predicted depth map $D_{pred}$ and the ground-truth depth $D_{gt}$.
  • Point Map / Reconstruction Loss: A Chamfer distance or L1/L2 loss between the predicted 3D points (derived from depth and camera poses, or directly predicted) and the ground-truth point cloud could be used, e.g. $L_{recon} = \mathcal{L}(\pi(P_{pred}, K), \pi(P_{gt}, K))$, where $\pi$ projects 3D points to 2D. Alternatively, reprojection errors based on predicted correspondences could be minimized.
  • Point Track Loss: For point tracking, the loss would penalize deviations between the predicted 3D track locations $\mathbf{X}_{k,pred}(t)$ and the ground-truth locations $\mathbf{X}_{k,gt}(t)$ over time $t$. This might involve an L2 distance averaged over points $k$ and time steps $t$. Visibility prediction might also be part of the objective.

The model is trained end-to-end, allowing the different tasks to benefit from shared representations learned within the Transformer backbone. The specific weighting of these loss components is a critical hyperparameter. Training requires large-scale datasets providing ground truth for camera poses, depth maps, and potentially point tracks (e.g., synthetic datasets like TartanAir or real-world datasets with SfM/SLAM ground truth like ScanNet or CO3D).
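
The sketch below shows one plausible way to combine such terms into a weighted multi-task objective in PyTorch; the specific loss forms and weights are illustrative assumptions, not the paper's exact training recipe.

import torch
import torch.nn.functional as F

def rotation_geodesic_loss(R_pred, R_gt):
    """Geodesic distance between batched rotation matrices of shape (B, 3, 3)."""
    R_rel = R_pred @ R_gt.transpose(-1, -2)
    cos = (R_rel.diagonal(dim1=-2, dim2=-1).sum(-1) - 1.0) / 2.0
    return torch.acos(cos.clamp(-1.0 + 1e-6, 1.0 - 1e-6)).mean()

def composite_loss(pred, gt, w_pose=1.0, w_trans=1.0, w_depth=1.0, w_track=0.1):
    """Weighted multi-task loss; the weights are illustrative hyperparameters."""
    pose_loss = rotation_geodesic_loss(pred["R"], gt["R"])
    trans_loss = F.mse_loss(pred["t"], gt["t"])
    # L1 depth loss restricted to pixels with valid ground truth.
    valid = gt["depth"] > 0
    depth_loss = F.l1_loss(pred["depth"][valid], gt["depth"][valid])
    # L2 distance between predicted and ground-truth 3D tracks.
    track_loss = F.mse_loss(pred["tracks"], gt["tracks"])
    return (w_pose * pose_loss + w_trans * trans_loss
            + w_depth * depth_loss + w_track * track_loss)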

Performance and Efficiency

VGGT demonstrates strong performance across multiple 3D vision benchmarks (VGGT: Visual Geometry Grounded Transformer, 14 Mar 2025). It achieves state-of-the-art results in camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction from multiple views, and 3D point tracking. Notably, these results are obtained without resorting to post-processing steps involving explicit geometric optimization techniques like bundle adjustment or patch matching refinement, which are common in traditional MVS pipelines (e.g., COLMAP) or some learning-based methods.

A significant practical advantage highlighted is its inference efficiency. The paper claims reconstruction (implying the generation of depth/point clouds and potentially camera poses) can be performed in under one second. This rapid inference is a direct consequence of its feed-forward design, contrasting sharply with optimization-based methods that can take minutes or hours depending on scene complexity and the number of views.

Task | VGGT Performance Claim (VGGT: Visual Geometry Grounded Transformer, 14 Mar 2025) | Comparison Point
Camera Estimation | State-of-the-art | Methods with/without geometric optimization
Multi-View Depth Estimation | State-of-the-art | Optimization-based MVS (e.g., COLMAP), Learned MVS
Dense Reconstruction | State-of-the-art | Optimization-based SfM/MVS, Learned Reconstruction
3D Point Tracking | State-of-the-art | Classical KLT, Learned trackers
Inference Speed | < 1 second for reconstruction | Optimization methods (minutes/hours)

Downstream Task Enhancement

Beyond its direct outputs, VGGT serves as a potent feature backbone for other vision tasks. The intermediate representations learned by a pretrained VGGT model encapsulate rich geometric and semantic information about the 3D scene. The paper demonstrates this by applying pretrained VGGT features to enhance performance on:

  1. Non-Rigid Point Tracking: By providing robust initializations or features that encode scene geometry and appearance consistency across views, VGGT aids in tracking points on surfaces undergoing non-rigid deformations. The features likely provide better correspondence cues compared to purely 2D appearance features.
  2. Feed-Forward Novel View Synthesis (NVS): VGGT's ability to predict geometry (depth, point maps) and camera poses directly enables its use in NVS frameworks. The predicted geometry can be used for warping and blending features from input views to synthesize novel views, potentially within a feed-forward NVS architecture, avoiding per-scene optimization common in methods like NeRF.

This capability suggests that VGGT learns a general-purpose representation of multi-view geometry that is transferable to related problems.
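
As a rough illustration of the depth-based warping mentioned for novel view synthesis, the sketch below forward-projects source-view pixels into a target view using a predicted depth map and a relative pose; it is a generic warping routine under assumed pinhole conventions, not the NVS pipeline evaluated in the paper.

import numpy as np

def warp_to_target_view(depth_src, K, R_rel, t_rel):
    """Project source-view pixels into a target view using predicted depth and
    the relative pose (R_rel, t_rel) from source to target camera. Returns the
    target-view pixel coordinates of each source pixel. The pinhole model and
    source-to-target convention are assumptions."""
    H, W = depth_src.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)              # (H, W, 3)
    points_src = (pix @ np.linalg.inv(K).T) * depth_src[..., None]
    points_tgt = points_src @ R_rel.T + t_rel                     # source -> target camera
    proj = points_tgt @ K.T                                       # project with intrinsics
    uv_tgt = proj[..., :2] / np.clip(proj[..., 2:3], 1e-6, None)
    return uv_tgt  # usable for feature warping/blending in a feed-forward NVS model

# Example: warp a constant-depth source view under a small lateral translation.
K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
uv = warp_to_target_view(np.full((480, 640), 2.0), K, np.eye(3), np.array([0.05, 0.0, 0.0]))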

Implementation Considerations

  • Model Size and Compute: As a Transformer-based model, VGGT's computational requirements (memory and FLOPs) likely scale with the number of input views and the resolution of internal representations. Handling hundreds of views efficiently might require techniques like sparse attention or hierarchical processing. Training such a model demands significant GPU resources and large datasets.
  • Feed-Forward Limitation: While efficient, the purely feed-forward nature means the model cannot perform test-time refinement or incorporate geometric constraints explicitly during inference (e.g., enforcing bundle adjustment consistency). Performance might degrade in scenarios with severe ambiguities or violations of learned priors, where optimization can sometimes help.
  • Generalization: Performance relies on the diversity and scale of the training data. Generalization to out-of-distribution scenes, camera configurations, or object types remains a standard challenge for learning-based methods.
  • Code Availability: The authors have made the code and pretrained models publicly available at https://github.com/facebookresearch/vggt, facilitating replication and downstream use. This allows practitioners to directly apply VGGT or fine-tune it for specific applications.
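
A minimal usage sketch in the spirit of the repository README is shown below; the module paths, checkpoint identifier, and preprocessing helper are recalled from that README and should be treated as assumptions to verify against the current repository.

import torch
# Assumed entry points; confirm module paths and checkpoint name in the repo README.
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)  # assumed checkpoint id

# Load a few views and run a single feed-forward pass.
images = load_and_preprocess_images(["example/frame1.png", "example/frame2.png"]).to(device)
with torch.no_grad():
    predictions = model(images)  # cameras, depth maps, point maps, tracks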

In conclusion, VGGT presents a significant step towards unified, efficient, and high-performance 3D scene understanding directly from images. Its feed-forward Transformer architecture simultaneously tackles multiple core 3D vision tasks, achieving strong results without relying on traditional geometric optimization during inference. Furthermore, its learned representations show promise as a foundation for enhancing various downstream applications requiring geometric reasoning.
