GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

Published 11 May 2026 in cs.CV | (2605.10525v1)

Abstract: Video depth estimation extends monocular prediction into the temporal domain to ensure coherence. However, existing methods often suffer from spatial blurring in fine-detail regions and temporal inconsistencies. We argue that current approaches, which primarily rely on temporal smoothing via Transformers, struggle to maintain strict 3D geometric consistency, particularly under rotations or drastic view changes. To address this, we propose GemDepth, a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency. Furthermore, GemDepth employs a data-efficient training strategy, effectively bridging the gap between high efficiency and robust geometric consistency. As shown in Fig. 2, comprehensive evaluations demonstrate that GemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios. The code is publicly available at: https://github.com/Yuecheng919/GemDepth


Authors (6)

Summary

The paper introduces GemDepth, integrating a Geometry-Embedding Module (GEM) with an Alternating Spatio-Temporal Transformer (ASTT) to explicitly model 3D structure and resolve scale ambiguity.
The methodology fuses geometric cues with visual features, achieving up to 46.8% lower AbsRel error and significantly improved temporal consistency on challenging video benchmarks.
The experimental results demonstrate robust, flicker-free depth estimation with minimal computational overhead and resistance to moderate pose noise in real-world scenarios.

Geometry-Embedded Features for Robust 3D-Consistent Video Depth: An Expert Analysis of GemDepth

Motivation and Context

Video depth estimation, the process of recovering dense depth maps from video streams, underpins numerous real-world applications such as autonomous navigation, 3D reconstruction, and AR. While single-image monocular depth estimation (MDE) has seen substantial progress, most solutions fail when extended naively to video sequences, largely due to temporal flickering, geometric inconsistency under camera motion, and scale ambiguity. Existing video depth estimation methods are generally partitioned into generative approaches (e.g., video diffusion models), which preserve high-frequency spatial structure at the expense of extreme computational cost and poor temporal coherence, and discriminative methods, which extend single-frame estimation to videos via frame-wise or lightweight temporal modules but lack true 3D geometric awareness. Both paradigms inadequately address the central requirement: explicit modeling of global 3D structure and camera motion to guarantee geometric consistency, particularly in dynamic or challenging scenes with ego-motion or large viewpoint changes.

Methodology

Architectural Innovations

GemDepth introduces two synergistic modules within a generalized plug-and-play pipeline suitable for diverse discriminative backbones (e.g., DepthAnything V2 [Yang et al., 2024b], VideoDepthAnything [Chen et al., 2025]):

Geometry-Embedding Module (GEM): Built upon a lightweight EfficientPoseNet, GEM predicts per-frame 6-DoF camera poses and a global scale factor. It processes learnable "camera tokens" injected into high-level feature maps and encodes the resulting rotation, translation, and scale estimates into geometric feature embeddings via geometric MLPs. These explicit metric-aware cues are fused with primary visual features, establishing a unified coordinate frame through physically grounded supervision. This approach resolves scale ambiguity and provides intrinsic geometric priors to drive downstream spatio-temporal modeling.
Alternating Spatio-Temporal Transformer (ASTT): ASTT alternates explicitly between two specialized attention mechanisms:
- Temporal Attention: Applied first, it leverages GEM's geometric priors to find precise trajectory-based point correspondences and aligns temporally adjacent features at the point level, which is crucial for mitigating temporal inconsistencies and flicker.
- Spatial Attention: Applied second, it aggregates local (intra-frame) and long-range (inter-frame) 3D spatial relations to support high-frequency detail refinement without degrading structural integrity.
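
To make the GEM description above more concrete, here is a minimal sketch of how a pose estimate might be turned into a feature embedding and fused with visual tokens. It is not the authors' implementation: the axis-angle rotation format, the two-layer MLP, the layer sizes, the additive fusion, and every function name (`pose_embedding`, `fuse`, `init_linear`) are assumptions made for illustration, and the random weights stand in for learned parameters.

```python
import random

def linear(x, weight, bias):
    """Compute y = W x + b for one vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(n_in, n_out, rng):
    """Small random weights; a stand-in for learned parameters (assumption)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def pose_embedding(rotation, translation, scale, params):
    """Map a 7-D pose vector (3 rotation + 3 translation + 1 scale) to a C-D embedding."""
    (w1, b1), (w2, b2) = params
    pose = rotation + translation + [scale]               # concatenate into 7 numbers
    hidden = [max(0.0, h) for h in linear(pose, w1, b1)]  # ReLU
    return linear(hidden, w2, b2)

def fuse(tokens, embedding):
    """Add the frame's geometric embedding to every visual token (hypothetical fusion)."""
    return [[f + e for f, e in zip(token, embedding)] for token in tokens]

rng = random.Random(0)
C = 8                                                     # feature width (assumption)
params = (init_linear(7, 16, rng), init_linear(16, C, rng))
emb = pose_embedding([0.0, 0.02, 0.0], [0.1, 0.0, 0.5], 1.0, params)
tokens = [[0.0] * C for _ in range(4)]                    # 4 visual tokens of one frame
fused = fuse(tokens, emb)
print(len(fused), len(fused[0]))
```

Additive fusion is used here only because it is the simplest option; concatenation or cross-attention would fit the same description equally well.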

Importantly, ASTT is applied to early-stage features immediately post-extraction (contrasting with late-decoder temporal heads as in contemporary baselines), allowing fine-grained geometric detail preservation before excessive abstraction.
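
The alternation pattern can be illustrated with a small, dependency-free sketch. It makes several simplifying assumptions that are not from the paper: attention uses identity projections with a single head, residual connections and normalization are omitted, temporal attention links tokens that share the same index across frames (rather than geometry-guided trajectories), and spatial attention is restricted to tokens within one frame. The names `attend`, `temporal_attention`, `spatial_attention`, and `astt` are hypothetical.

```python
import math

def attend(vectors):
    """Single-head softmax self-attention over a list of vectors (identity projections)."""
    dim = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        peak = max(scores)                                # subtract max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, vectors)) / total
                    for d in range(dim)])
    return out

def temporal_attention(x):
    """x[t][n] is one token; attend across frames t for each fixed token index n."""
    T, N = len(x), len(x[0])
    per_token = [attend([x[t][n] for t in range(T)]) for n in range(N)]
    return [[per_token[n][t] for n in range(N)] for t in range(T)]

def spatial_attention(x):
    """Attend across the tokens inside each frame."""
    return [attend(frame) for frame in x]

def astt(x, num_blocks=2):
    """Alternate temporal and spatial attention, temporal first."""
    for _ in range(num_blocks):
        x = temporal_attention(x)
        x = spatial_attention(x)
    return x

# Toy input: T=2 frames, N=3 tokens per frame, C=2 channels.
x = [[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
     [[0.9, 0.1], [0.1, 0.9], [0.4, 0.6]]]
y = astt(x)
print(len(y), len(y[0]), len(y[0][0]))
```

Each pass only changes which axis the tokens are grouped along before attention, so the output keeps the input's frame, token, and channel layout.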

Training Paradigm

GemDepth adopts a two-stage training process to circumvent the scarcity of pose-annotated video data:

Geometric Optimization Stage: Jointly supervises pose and depth outputs using a composite, pose-labeled dataset, ensuring strong geometric-alignment capacity.
Depth Refinement Stage: With the trained GEM frozen, the remaining network components are fine-tuned on large datasets lacking pose annotations. This adaptation enhances robustness to noisy or imperfect pose guidance while enabling efficient supervision.

Losses include: a Huber-based pose loss for the GEM outputs, scale- and shift-invariant depth losses, gradient consistency terms, and a temporal geometric consistency loss.
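
Two of the loss terms listed above have well-known generic forms, sketched below. This is an illustration under assumptions, not the paper's exact formulation: the Huber threshold `delta`, the closed-form least-squares alignment, the mean-absolute-error residual, and the function names are all choices made here. The gradient and temporal consistency terms are not reproduced because the summary does not specify them.

```python
def huber(residual, delta=1.0):
    """Standard Huber penalty: quadratic near zero, linear in the tails."""
    a = abs(residual)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

def pose_loss(pred, target, delta=1.0):
    """Mean Huber penalty over the components of a pose vector."""
    return sum(huber(p - t, delta) for p, t in zip(pred, target)) / len(pred)

def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t minimizing sum((s * pred + t - gt) ** 2)."""
    n = len(pred)
    mean_p, mean_g = sum(pred) / n, sum(gt) / n
    var = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var if var > 0 else 0.0      # constant prediction: fall back to shift only
    return s, mean_g - s * mean_p

def ssi_depth_loss(pred, gt):
    """Mean absolute error after optimal scale/shift alignment of the prediction."""
    s, t = align_scale_shift(pred, gt)
    return sum(abs(s * p + t - g) for p, g in zip(pred, gt)) / len(pred)

pred = [1.0, 2.0, 3.0, 4.0]
gt = [2.0 * p + 0.5 for p in pred]         # same depth up to an affine change
print(ssi_depth_loss(pred, gt))            # close to zero: the affine gap is absorbed
print(pose_loss([0.5, 3.0], [0.0, 0.0]))
```

The alignment step is what makes the depth term insensitive to an unknown global scale and offset: any prediction that differs from the ground truth only by an affine map incurs no penalty.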

Experimental Results

Quantitative Performance

GemDepth demonstrates consistently superior performance across four challenging video depth benchmarks (KITTI, Sintel, ScanNet, Bonn):

Spatial Precision: Both instantiations (GemDepth-DAv2, GemDepth-VDA) set new state-of-the-art results in AbsRel error and $d_1$ accuracy; for example, GemDepth-VDA achieves up to 46.8% lower AbsRel than VideoDepthAnything on Sintel.
Temporal Consistency: GemDepth substantially reduces Temporal Alignment Error (TAE) relative to prior methods, surpassing discriminative baselines by 56.1% (DAv2) and 17.5% (VDA) and halving TAE compared to DA3 [Lin et al., 2025].
3D Structural Fidelity: Outperforms parameter-heavy 3D foundation models (VGGT, DA3) in F1-score and ATE, despite using only ~0.58B parameters.
Robustness: Ablation and noise-injection studies confirm that GemDepth’s predictions degrade minimally under moderate pose noise and that the ASTT+GEM combination provides complementary improvements in spatial and temporal metrics.
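
For readers unfamiliar with the spatial metrics quoted above, the sketch below shows the standard definitions of AbsRel and the 1.25-threshold accuracy, plus how a "percent lower" figure is computed. The depth values and the baseline/ours error pair are made-up illustrative numbers, not results from the paper, and the code assumes every ground-truth and predicted depth is strictly positive (real evaluations mask invalid pixels first).

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels whose ratio max(pred/gt, gt/pred) is below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Illustrative depths in metres (not from any benchmark).
gt = [1.0, 2.0, 4.0, 5.0]
pred = [1.1, 2.0, 3.0, 5.0]
print(abs_rel(pred, gt))
print(delta1(pred, gt))
print(relative_reduction(0.20, 0.15))   # hypothetical error values
```

Lower is better for AbsRel and higher is better for the threshold accuracy, which is why improvements on the former are reported as reductions.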

Computational Efficiency

GemDepth matches or surpasses SOTA discriminative models at a marginal computational overhead (<20%) while requiring an order of magnitude fewer FLOPs than generative models. The system is practical for real-time or near-real-time deployment.

Qualitative Analysis

Visualizations confirm that GemDepth yields sharper boundary reconstruction, more accurate fine structures, reduced background artifacts, and flicker-free temporal evolution, even under large-scale ego-motion or in dynamic scenes (e.g., reconstructing airborne objects).

Implications and Future Directions

The injection of explicit geometric context (through self-predicted or externally provided camera poses) addresses fundamental weaknesses in prior temporal smoothing or 2D attention-based approaches. By leveraging both geometric and visual cues, the model achieves stronger spatio-temporal coherence and maintains high-frequency detail. The demonstrated robustness to noisy pose predictions suggests that GemDepth can operate reliably in real-world, imperfect conditions, a key step toward deployment in autonomous driving, SLAM, 3D mapping, and AR/VR pipelines where camera extrinsics may be estimated implicitly or subject to error.

Practically, GemDepth's design allows rapid adaptation to novel backbones and data-efficient training in pose-limited regimes. Theoretically, the methodology encourages deeper exploration of holistic geometric-feature fusion (extending beyond pose to, potentially, full scene-graph or motion-field integration) for video perception tasks. Future work could integrate dynamic object tracking, end-to-end SLAM feature coupling, or multi-camera fusion, or extend the approach to fully unsupervised pose and depth learning in unconstrained natural videos.

Conclusion

GemDepth establishes a new state of the art in video depth estimation by integrating a Geometry-Embedding Module and an Alternating Spatio-Temporal Transformer for explicit 3D-aware feature construction and refinement. The resultant framework resolves major open challenges in temporal and geometric consistency, delivers robust results under real-world noise, and maintains computational tractability. This approach constitutes a generalizable pathway for extending monocular depth estimation solutions into video and streaming 3D perception domains with higher accuracy and reliability than previously possible (2605.10525).
