OmniVGGT: Multimodal VGGT Extension

Updated 3 July 2026

OmniVGGT is a multimodal extension of VGGT that fuses RGB data with auxiliary depth and camera parameters to enable joint prediction of 3D scene outputs.
It employs a lightweight GeoAdapter with distinct branches for camera and depth inputs, using progressive zero-initialized convolutions for stable fusion.
Stochastic multimodal training and ablation studies demonstrate that combining auxiliary cues improves dense depth estimation, pose estimation, and sparse-view reconstruction.

OmniVGGT is a multimodal extension of the Visual Geometry Grounded Transformer (VGGT) that generalizes the original RGB-only feed-forward 3D foundation model to consume an arbitrary subset of auxiliary geometric modalities during both training and inference. In addition to RGB images, it explicitly supports depth maps, depth validity masks, camera intrinsics, and camera extrinsics or poses, while preserving the core VGGT interface of jointly predicting camera parameters, point maps, depth maps, and confidence maps in one forward pass. Its central design combines a lightweight adapter mechanism, termed GeoAdapter, with a stochastic multimodal fusion regimen so that a single model can operate with RGB only, with partial geometric annotations on some frames, or with dense auxiliary inputs across a sequence (Peng et al., 13 Nov 2025).

1. Conceptual position within VGGT-based 3D modeling

VGGT established the underlying paradigm: a feed-forward transformer that directly infers camera parameters, point maps, depth maps, and 3D point tracks from one, a few, or hundreds of views, replacing optimization-heavy multi-stage pipelines with joint scene-level prediction (Wang et al., 14 Mar 2025). OmniVGGT retains this foundation-model framing but addresses a limitation that the OmniVGGT paper identifies as pervasive in general 3D foundation models: most assume RGB-only inputs and ignore auxiliary geometric cues that are often available in practice, such as camera calibration and depth observations (Peng et al., 13 Nov 2025).

The problem setting is therefore not merely multimodal fusion in the abstract. OmniVGGT is designed for partially observed geometric side information. For a sequence of $N$ images, some frames may have camera intrinsics and poses, some may have depth and validity masks, and others may have neither. The model is explicitly trained to accept arbitrary modality sparsity patterns at test time. This means that the multimodal extension is defined over both modality type and modality coverage.
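The notion of coverage can be made concrete with a small data-structure sketch. The `FrameInputs` container and its field names are hypothetical, introduced only for illustration; the paper's actual data layout is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FrameInputs:
    """Hypothetical per-frame container; field names are illustrative only."""
    rgb: str                        # stand-in for an image tensor
    camera: Optional[dict] = None   # intrinsics and pose, if available
    depth: Optional[list] = None    # depth map, if available
    mask: Optional[list] = None     # validity mask paired with the depth map


def availability(frames: List[FrameInputs]) -> List[Tuple[int, int]]:
    """Return (has_camera, has_depth) binary indicators for every frame."""
    return [(int(f.camera is not None), int(f.depth is not None)) for f in frames]


# One sequence mixing all three situations described in the text.
sequence = [
    FrameInputs(rgb="frame0", camera={"K": "intrinsics0", "G": "pose0"}),
    FrameInputs(rgb="frame1", depth=[[1.0]], mask=[[1]]),
    FrameInputs(rgb="frame2"),
]
pattern = availability(sequence)
print(pattern)
```

Each tuple records which auxiliary cues a frame carries, so one sequence can mix camera-only, depth-only, and RGB-only frames.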

The architecture remains anchored in VGGT’s spatial foundation model. Images are patchified into spatial tokens using a DINO backbone and processed jointly with learnable camera tokens and register tokens by a transformer encoder with Alternating-Attention blocks. The encoder is written as

$(\hat{\mathbf{e}_c},\hat{\mathbf{e}_r},\hat{\mathbf{e}_f})=\mathcal{E}(\mathbf{e}_c,\mathbf{e}_r,\mathbf{e}_f).$

Here $\mathbf{e}_f$ denotes image spatial tokens, $\mathbf{e}_c$ camera tokens, and $\mathbf{e}_r$ register tokens. OmniVGGT preserves this backbone and prediction setup, then augments it with modality-specific geometric injection rather than replacing the core transformer.

A common misconception is that “omni-modality” implies unrestricted sensor generality. The implemented model supports two auxiliary geometric pathways only: camera parameters and depth. The broader framing is about arbitrary combinations of these supported geometric cues, not arbitrary sensor types.

2. GeoAdapter architecture and modality injection

OmniVGGT introduces GeoAdapter as the sole architectural addition to the pretrained VGGT backbone. GeoAdapter has two branches: a Camera Adapter for intrinsics and poses, and a Depth Adapter for depth maps and masks (Peng et al., 13 Nov 2025).

Modality branch	Input representation	Injection mechanism
Camera Adapter	$C=\{K,G\}$, then $\mathbf{g}=\{\mathbf{q},\mathbf{t},\mathbf{f}\}$	Added to camera tokens through layer-specific zero-initialized convolutions
Depth Adapter	$X=[D;M] \in \mathbb{R}^{2\times H\times W}$	Added directly to spatial tokens after depth encoding
Missing modality	Placeholder token	Selected by binary availability indicator

The Camera Adapter first normalizes pose. The origin is aligned to the first camera, and translations are normalized by the average distance of the remaining cameras to that first camera:

$s = \frac{1}{Q-1} \sum_{j=2}^{Q} \| t_j - t_1 \|_2,$

$G_{j}^\prime = G_{j}G_{1}^{-1};\ t^\prime_1 = \mathbf{0};\ t^\prime_j = \frac{t_j - t_1}{s};\ j=2,\dots,Q.$
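The translation part of this normalization can be sketched in a few lines. This is a minimal illustration that covers only the translations, not the rotation alignment by $G_1^{-1}$. The function name and the fallback for a single view or zero scale are assumptions of the sketch.

```python
import math


def normalize_translations(translations):
    """Shift translations so camera 1 is the origin, then divide by the mean
    distance of the remaining cameras to camera 1 (the scale s in the text)."""
    t1 = translations[0]
    shifted = [[a - b for a, b in zip(t, t1)] for t in translations]
    others = shifted[1:]
    if not others:
        return shifted, 1.0   # assumption: a single view is left unscaled
    s = sum(math.sqrt(sum(c * c for c in t)) for t in others) / len(others)
    if s == 0.0:
        return shifted, 1.0   # assumption: coincident cameras are left unscaled
    return [[c / s for c in t] for t in shifted], s


# Made-up camera centers for three views.
ts = [[1.0, 0.0, 0.0], [4.0, 4.0, 0.0], [1.0, 0.0, 10.0]]
normalized, s = normalize_translations(ts)
print(s, normalized)
```

After normalization the first translation is the zero vector and the remaining cameras sit at an average distance of one from it, which removes the arbitrary metric scale of the input poses.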

The normalized camera is then parameterized as

$\mathbf{g}=\{\mathbf{q},\mathbf{t},\mathbf{f}\}$

with quaternion $\mathbf{q}$, translation $\mathbf{t}$, and field of view $\mathbf{f}$.

Each Alternating-Attention layer has its own camera encoder, which maps the normalized camera parameters to an auxiliary camera token for that layer. If a frame lacks camera information, the model substitutes a camera placeholder token. The auxiliary camera token is then passed through a zero-initialized convolution and added to the current camera token $\mathbf{e}_c$.

This progressive zero-initialized injection is a central design choice. The paper argues that direct camera-token replacement or aggressive early perturbation destabilizes optimization by disrupting the pretrained representation space. Zero-initialized convolutions allow the model to begin near the original VGGT solution and learn how much camera information to inject layer by layer.
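The effect of zero initialization can be illustrated with a toy residual injection. In this sketch a 1x1 convolution on a single token is modeled as a plain linear map; the token values, the dimension, and the "trained" weights are all made up, and nothing here reproduces the paper's implementation.

```python
def linear(weight, bias, x):
    """Apply y = W x + b; a 1x1 convolution on a single token reduces to this."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def inject(camera_token, aux_token, weight, bias):
    """Residual injection: camera_token + conv(aux_token)."""
    update = linear(weight, bias, aux_token)
    return [c + u for c, u in zip(camera_token, update)]


dim = 4
zero_w = [[0.0] * dim for _ in range(dim)]   # zero-initialized weights
zero_b = [0.0] * dim                          # zero-initialized bias

camera_token = [0.5, -1.0, 2.0, 0.25]   # pretrained token (made-up values)
aux_token = [3.0, 1.0, -2.0, 7.0]       # encoded auxiliary camera (made-up values)

# At initialization the injection is an exact no-op on the pretrained token.
at_init = inject(camera_token, aux_token, zero_w, zero_b)

# Hypothetical weights after some training: the auxiliary signal now leaks in.
trained_w = [[0.1 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
after_training = inject(camera_token, aux_token, trained_w, zero_b)
print(at_init, after_training)
```

Because the update is exactly zero at the start of fine-tuning, the backbone initially sees the same camera tokens as pretrained VGGT, and any deviation has to be learned gradually.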

The Depth Adapter is structurally simpler. For a frame with depth supervision, OmniVGGT concatenates the depth map and validity mask into a two-channel tensor, $X=[D;M] \in \mathbb{R}^{2\times H\times W}$. A dedicated depth encoder, implemented as a single convolutional layer with kernel size 14, patchifies this depth-plus-mask tensor into depth tokens. If depth is absent, a placeholder token is used instead. The depth tokens are then injected directly into the spatial token stream by adding them to the spatial tokens $\mathbf{e}_f$.

The paper is explicit that zero-conv is beneficial for the camera branch but harmful for the depth branch. Because depth is dense and spatially aligned to the image tokens, direct additive fusion works better than progressive suppression. This establishes an asymmetry in the multimodal design: camera information is treated as a global perturbation requiring gradual adaptation, whereas depth is treated as a dense aligned feature field.
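The depth pathway can be sketched at the level of tensor shapes. The helper names are hypothetical, and several details are assumptions of this sketch rather than facts from the text: invalid depths are zeroed before stacking, the kernel-14 convolution is taken to use stride 14 with no padding, and the 518-pixel image size is only an example.

```python
def stack_depth_and_mask(depth, mask):
    """Build the two-channel input X = [D; M] from nested lists.
    Zeroing invalid depths is an assumption of this sketch."""
    masked = [[d if m else 0.0 for d, m in zip(drow, mrow)]
              for drow, mrow in zip(depth, mask)]
    mask_f = [[1.0 if m else 0.0 for m in mrow] for mrow in mask]
    return [masked, mask_f]


def token_grid(height, width, patch=14):
    """Token grid of a kernel-14 convolution, assuming stride 14 and no padding."""
    return height // patch, width // patch


depth = [[2.0, 0.0], [1.5, 3.0]]   # made-up depth values
mask = [[1, 0], [1, 1]]            # 0 marks an invalid depth pixel
x = stack_depth_and_mask(depth, mask)

grid = token_grid(518, 518)        # example image size, not from the text
print(len(x), grid)
```

Under these assumptions each depth token covers the same image patch as one spatial token, which is what makes plain addition to the spatial tokens well defined.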

3. Stochastic multimodal fusion, supervision, and optimization

OmniVGGT is trained with a stochastic multimodal fusion regimen intended to make the model robust to arbitrary missing-modality patterns (Peng et al., 13 Nov 2025). For a training sequence of length $N$, the procedure uniformly samples how many images receive ground-truth camera parameters and assigns those camera annotations to that many images at the start of the sequence. Independently, it samples how many images receive ground-truth depth and assigns those depth annotations to randomly selected frame indices. In addition, with a fixed probability, the full batch is trained using RGB only.

This regimen exposes the model to RGB-only, depth-only, camera-only, mixed, sparse-depth, sparse-camera, and dense multimodal cases without changing the model architecture. The intended effect is not only missing-modality robustness, but also a stronger shared spatial representation that does not overfit to a fixed modality template.
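A minimal sketch of such a sampling procedure is given below. The function name, the inclusive sampling range from zero to the sequence length, and the example RGB-only probability are assumptions; the sketch only mirrors the structure described above (a camera prefix, random depth indices, and an occasional RGB-only draw).

```python
import random


def sample_modalities(num_frames, p_rgb_only, rng):
    """Return per-frame (has_camera, has_depth) flags for one training sequence."""
    if rng.random() < p_rgb_only:
        # RGB-only draw: every auxiliary input is dropped.
        return [(False, False)] * num_frames
    num_cam = rng.randint(0, num_frames)     # inclusive range is an assumption
    num_depth = rng.randint(0, num_frames)
    depth_idx = set(rng.sample(range(num_frames), num_depth))
    # Camera annotations go to the first num_cam frames, depth to random frames.
    return [(i < num_cam, i in depth_idx) for i in range(num_frames)]


rng = random.Random(0)
flags = sample_modalities(num_frames=8, p_rgb_only=0.1, rng=rng)  # 0.1 is illustrative
print(flags)
```

Repeated draws cover camera-only, depth-only, mixed, and RGB-only sequences with the same network, which is how the regimen exposes the model to varying coverage without architectural changes.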

The input and output interface follows from this setup. The input is a sequence of $N$ RGB images, each optionally accompanied by camera parameters $C=\{K,G\}$ and by a depth map with its validity mask, $X=[D;M]$. The outputs are predicted camera parameters, point maps, depth maps, and confidence maps.

Training uses the same multitask objective structure as VGGT, combining camera, depth, and point-map terms. Camera supervision is a regression loss on the normalized camera parameter vector. Depth and point-map supervision are confidence-aware and include gradient consistency.

Ground-truth depth maps, point maps, and camera translations are normalized by dividing by the average Euclidean distance of all 3D points in the point map to the origin. The paper explicitly notes that this normalization differs from the normalization used inside GeoAdapter for camera injection.
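The supervision-side normalization can be sketched as follows. The function names and all numeric values are made up for illustration; the sketch only shows a single shared scale, the mean point norm, being applied to points, depths, and translations.

```python
import math


def mean_distance_to_origin(points):
    """Average Euclidean norm of a list of (x, y, z) points."""
    return sum(math.sqrt(x * x + y * y + z * z) for x, y, z in points) / len(points)


def normalize_targets(points, depths, translations):
    """Divide points, depths, and camera translations by one shared scale."""
    scale = mean_distance_to_origin(points)
    points_n = [tuple(c / scale for c in p) for p in points]
    depths_n = [d / scale for d in depths]
    translations_n = [tuple(c / scale for c in t) for t in translations]
    return points_n, depths_n, translations_n, scale


points = [(3.0, 4.0, 0.0), (0.0, 0.0, 5.0)]          # made-up point-map entries
depths = [5.0, 10.0]                                  # made-up depth values
translations = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]    # made-up camera translations
points_n, depths_n, translations_n, scale = normalize_targets(points, depths, translations)
print(scale, depths_n, translations_n)
```

The scale here comes from the scene points, whereas the GeoAdapter scale comes from camera-to-camera distances, so the two normalizations generally produce different values for the same scene.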

Optimization details are concrete. OmniVGGT is initialized from pretrained VGGT weights, keeps VGGT's stack of Alternating-Attention blocks, and is fine-tuned for 10 epochs of 12M iterations each with AdamW. Separate learning rates are used for the prediction heads and the backbone, with a 5K-step linear warmup and a cosine weight decay schedule. Training uses 32 NVIDIA A100 GPUs, gradient checkpointing, and ColorJitter augmentation (Peng et al., 13 Nov 2025).
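A warmup-plus-cosine schedule of this general kind can be sketched as below. Reading the warmup and cosine schedule as acting on the learning rate is an assumption of this sketch, and the peak learning rate and total step count are placeholder values rather than the paper's settings; only the 5K warmup length comes from the text.

```python
import math


def learning_rate(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


BASE_LR = 1e-4         # placeholder value
WARMUP_STEPS = 5_000   # warmup length stated in the text
TOTAL_STEPS = 120_000  # placeholder value

for step in (0, 2_500, 5_000, 60_000, 120_000):
    print(step, learning_rate(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS))
```

Using separate rates for heads and backbone would amount to calling the same function with two different `base_lr` values, one per parameter group.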

4. Task coverage and empirical behavior

OmniVGGT is evaluated on monocular depth estimation, multi-view depth estimation or multi-view stereo, camera pose estimation, and sparse-view 3D reconstruction (Peng et al., 13 Nov 2025). The most distinctive empirical pattern is that RGB-only performance is largely preserved, while auxiliary geometric inputs produce substantial gains on harder geometric regimes.

A zero-shot study on Sintel illustrates the model’s operating modes. With RGB only, OmniVGGT improves over VGGT: VGGT reports depth Abs Rel 0.722, depth accuracy $\delta$ 70.81, and camera AUC@30° 70.55, whereas OmniVGGT reports 0.558, 71.46, and 70.83. With 100% depth injection, Abs Rel improves to 0.106, $\delta$ to 85.95, and camera AUC@30° to 77.16. With 100% camera injection, depth changes little, but camera pose improves strongly: RRA@5° 99.97, RTA@5° 75.83, and AUC@30° 85.35. With both 100% depth and 100% camera, the model reports Abs Rel 0.106, $\delta$ 85.95, RRA@5° 99.97, RTA@5° 76.33, and AUC@30° 85.99.
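For readers unfamiliar with the depth metrics, the following sketch computes Abs Rel and a threshold accuracy on toy values. The formulas are the standard definitions; the 1.25 threshold is the common convention and is assumed here, since the exact threshold is not stated above. The numbers are made up and unrelated to the reported results.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Percentage of pixels whose ratio max(pred/gt, gt/pred) is below threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return 100.0 * hits / len(pairs)


pred = [1.0, 2.2, 3.0, 8.0]   # made-up predicted depths
gt = [1.0, 2.0, 4.0, 4.0]     # made-up ground-truth depths
print(abs_rel(pred, gt), delta_accuracy(pred, gt))
```

Lower is better for Abs Rel and higher is better for the threshold accuracy, which is why the two numbers move in opposite directions when depth input is injected.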

On monocular depth estimation, the RGB-only model is competitive with VGGT and improves on some benchmarks. OmniVGGT reports RGB-only results of 0.250 / 68.2 on Sintel, 0.064 / 95.5 on Bonn, and 0.058 / 95.8 on NYU-v2, compared with VGGT’s 0.271 / 67.7, 0.053 / 97.3, and 0.060 / 94.8. With full depth input, OmniVGGT w/ D reaches 0.107 / 90.2 on Sintel, 0.008 / 99.9 on Bonn, and 0.008 / 99.9 on NYU-v2, outperforming Pow3R w/ D across all three reported benchmarks.

On multi-view depth estimation under the RobustMVD protocol, RGB-only OmniVGGT remains essentially on par with VGGT, with average 2.1 / 85.9 versus 2.0 / 85.8. Auxiliary cues change the regime more substantially: OmniVGGT w/ D reports an average 1.0 / 94.8, and a second auxiliary-input configuration reports 1.0 / 95.1. Reported benchmark highlights include 0.5 / 98.7 on ETH3D, 0.3 / 99.5 on DTU, and 0.9 / 95.5 on Tanks and Temples.

On camera pose estimation, RGB-only OmniVGGT slightly surpasses VGGT on RealEstate10K unseen and CO3Dv2, reporting 85.9 / 88.4 versus 85.3 / 88.2. With depth, OmniVGGT w/ D reaches 85.5 / 91.3; with camera intrinsics plus relative pose, it reaches 88.5 / 93.4. Runtime is also emphasized: VGGT and OmniVGGT are reported at the same inference time, while Pow3R is reported as substantially slower, so the paper characterizes OmniVGGT as about 30× faster than Pow3R while remaining flexible in modality count.
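The rotation-accuracy metrics used in pose evaluation can be illustrated with a short sketch. It computes the geodesic angle between a predicted and a reference rotation and the share of errors below a threshold, which is the general idea behind RRA@5°. The helper names and test rotations are made up, and benchmark protocols differ in how pairs are formed, so this is not the paper's evaluation code.

```python
import math


def matmul_t(a, b):
    """Return a @ b^T for 3x3 matrices, i.e. the relative rotation a * b^{-1}."""
    return [[sum(a[i][k] * b[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two rotation matrices."""
    rel = matmul_t(r_pred, r_gt)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def accuracy_at(errors_deg, threshold_deg):
    """Percentage of errors strictly below the threshold."""
    return 100.0 * sum(e < threshold_deg for e in errors_deg) / len(errors_deg)


def rot_z(deg):
    """Rotation about the z axis, used only to build toy inputs."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


errors = [rotation_error_deg(rot_z(a), rot_z(0.0)) for a in (1.0, 3.0, 10.0, 40.0)]
rra5 = accuracy_at(errors, 5.0)
print(errors, rra5)
```

AUC@30° summarizes the same kind of accuracy across thresholds up to 30 degrees rather than at a single cutoff, so it rewards small errors more smoothly than a single threshold does.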

Sparse-view reconstruction on 7-Scenes exposes the main caveat in RGB-only mode. VGGT reports Acc mean 0.087, Comp mean 0.091, and NC mean 0.787, while RGB-only OmniVGGT reports 0.104, 0.112, and 0.763. With auxiliary inputs, however, performance changes sharply: OmniVGGT w/ D reports Acc 0.085, Comp 0.085, NC 0.789, and two configurations that include camera input report Acc 0.037, Comp 0.049, NC 0.778 and Acc 0.036, Comp 0.036, NC 0.810, respectively. The paper highlights a 65.4% gain from camera injection on 7-Scenes, with Acc improving from 0.104 to 0.036.
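The quoted 65.4% figure is a relative reduction of the Acc error and can be checked directly from the two Acc values above. The helper name is hypothetical.

```python
def relative_gain(baseline, improved):
    """Relative reduction of a lower-is-better metric, in percent."""
    return 100.0 * (baseline - improved) / baseline


gain = relative_gain(0.104, 0.036)   # RGB-only Acc versus best Acc with camera input
print(f"{gain:.1f}%")
```

The same formula applied to the Acc 0.037 configuration gives roughly 64.4%, so the headline number corresponds to the 0.036 result.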

The general empirical pattern is therefore asymmetric across tasks. Depth input most strongly improves dense geometry tasks, camera input most strongly improves pose estimation and sparse-view reconstruction, and the combination is most effective on geometrically ambiguous regimes.

5. Ablation studies and design interpretation

The OmniVGGT ablations are unusually diagnostic because they test alternative adapter designs rather than only reporting end-task metrics (Peng et al., 13 Nov 2025). On Sintel, the paper compares direct token replacement, one-layer camera injection, a depth branch with zero-conv, and the final OmniVGGT design.

The reported conclusion is threefold. First, replacing camera tokens directly with auxiliary tokens is harmful. The “Replace” variant is worst by far, including under full auxiliary input, where it reports Abs Rel 0.655 and AUC 77.83. Second, a one-layer adapter is better than replacement but weaker than progressive multi-layer injection, reporting Abs Rel 0.133 and AUC 81.66 under full auxiliary input. Third, zero-conv on the depth branch underperforms the final design, with Abs Rel 0.505 and AUC 84.12, whereas the final OmniVGGT reports Abs Rel 0.106 and AUC 85.99.

These ablations support the paper’s specific architectural claims. Multi-layer progressive camera injection is better than one-shot injection; the camera branch benefits from zero-initialized convolution; the depth branch does not. The resulting interpretation is that OmniVGGT is not a symmetric multimodal transformer. It uses different fusion rules because camera and depth interact with VGGT’s token space in qualitatively different ways.

A second ablation-level conclusion concerns modality value. Depth strongly improves depth estimation and often also improves pose. Camera input strongly improves pose estimation and sparse-view reconstruction. Both together give the best overall results on the hardest settings. At the same time, RGB-only behavior is largely preserved: OmniVGGT usually matches or slightly exceeds VGGT in RGB-only operation, though 7-Scenes reconstruction is an explicit exception.

This leads to a second common misconception. OmniVGGT is not simply a model that improves whenever more modalities are added, independent of task. The paper shows task-dependent asymmetry. Dense depth cues have the clearest effect on dense geometry metrics, while camera cues are particularly consequential when view overlap is limited and geometric ambiguity is dominated by pose uncertainty.

6. Robotics integration, broader significance, and limitations

The OmniVGGT paper extends the model into embodied control by integrating it into a vision-language-action pipeline built on Kosmos-VLA 1.6B with an FCN action head (Peng et al., 13 Nov 2025). OmniVGGT contributes spatial tokens, which are injected and fused with the VLM tokens. On CALVIN, the reported results are modest but consistent. For ABCD$\rightarrow$D, Kosmos-VLA (rgb) reports Avg Len 4.00, the rgb-d point-cloud baseline 4.04, OmniVGGT-based rgb 4.07, and OmniVGGT-based rgb-d 4.08. For zero-shot ABC$\rightarrow$D, Kosmos-VLA (rgb) reports 3.49, Kosmos-VLA (rgb-d) 3.97, OmniVGGT-based rgb 3.92, and OmniVGGT-based rgb-d 3.96. The paper highlights that on zero-shot ABC$\rightarrow$D, OmniVGGT RGB-only surpasses the Kosmos RGB baseline by 0.43 Avg Len.

This VLA result sits within a broader VGGT ecosystem in which geometry tokens are increasingly treated as reusable downstream state. A separate study, “3D-Mix for VLA” (Yu et al., 25 Mar 2026), systematically compared nine VGGT integration schemes and found semantic-conditioned gated fusion to be strongest on SIMPLER and LIBERO. That work does not study OmniVGGT directly, but it reinforces a broader interpretation already suggested by OmniVGGT’s VLA experiment: VGGT-family geometric tokens are useful as an intermediate spatial interface for action models.

Two limitations define the present scope. First, despite the name, the supported auxiliary modalities are limited to depth and camera parameters. The paper does not implement LiDAR, IMU, point clouds, semantics, or language-conditioned geometric inputs. Second, OmniVGGT is explicitly a finetuning extension of pretrained VGGT, not a from-scratch multimodal geometry foundation model. A plausible implication is that system-level work on scaling and deploying VGGT-like models remains relevant to OmniVGGT because OmniVGGT preserves the core VGGT backbone and adds only a lightweight GeoAdapter of 26.8M parameters (Zhang et al., 28 Jan 2026).

A further practical limitation is task-dependent RGB-only tradeoff. OmniVGGT usually matches or slightly exceeds VGGT in RGB-only operation, but 7-Scenes reconstruction remains stronger for VGGT in pure RGB-only evaluation. This should be read as a boundary condition on the multimodal extension rather than a contradiction of its main claim. The architecture is optimized to preserve RGB-only capability while benefitting from auxiliary geometry when available; it is not presented as uniformly dominating VGGT on every RGB-only benchmark.

In the context of VGGT-family research, OmniVGGT represents a shift from unifying output tasks to unifying input geometry. Its main contribution is not a new transformer backbone, but a practical interface by which a pretrained spatial foundation model can absorb optional geometric side information without losing its original feed-forward, multi-view, and RGB-only behavior.