Monocular Foundation Models
- Monocular Foundation Models are large-scale, transformer-based vision architectures that extract depth, semantics, and 3D structure from single RGB images.
- They employ efficient cross-attention mechanisms and adapter modules to fuse geometric cues with semantic segmentation for enhanced zero-shot performance.
- These models are applied in practical tasks like monocular depth estimation, SLAM, and digital elevation mapping with parameter-efficient fine-tuning strategies.
Monocular Foundation Models refer to large-scale, pre-trained vision models that provide geometric, semantic, or multi-modal visual understanding directly from a single RGB image, typically in the context of depth estimation, 3D scene understanding, or comprehensive scene representations. These models are trained on vast, heterogeneous datasets using transformer-based architectures, leveraging strong relative, metric, and category-level cues to deliver outstanding zero-shot generalization and efficiency across a wide range of downstream tasks. Monocular foundation models constitute a paradigm shift from specialized network training toward parameter-efficient transfer, explicit fusion of semantic and geometric cues, and robust adaptation to task and domain via structured losses, distillation, or adapter modules.
1. Architectures and Model Families
Central to monocular foundation models are transformer-based encoders such as DINOv2 and ViT, coupled with dense prediction heads (e.g., DPT-style decoders), and sometimes multi-task branches (for segmentation, normal estimation, or surface parsing) (Ma et al., 29 May 2025, Landgraf et al., 14 Jan 2025, Hu et al., 2024). Canonical representatives include:
- DepthAnything: A ViT-based encoder paired with a DPT-style dense prediction decoder, trained for affine-invariant (relative) depth on large-scale mixed datasets following the MiDaS protocol (see the alignment sketch after this list). Metric3Dv2 extends this concept, integrating a canonical camera-space transformation module to resolve metric ambiguity and enable joint depth-normal prediction (Hu et al., 2024).
- SegmentAnything (SAM): ViT backbone, prompt encoder, and mask decoder for universal semantic segmentation (Ma et al., 29 May 2025). Integration with depth foundation models yields semantically aware geometric features, critical for detailed boundary recovery and scene parsing (e.g., via the Bridging Gate in BriGeS).
- Hybrid/Distilled Models: Fusion or distillation schemes transfer geometric knowledge from large, slow models into lightweight student models (e.g., HRDepth, Monodepth2) for real-time full-surround estimation or low-latency edge inference (Hwang et al., 9 Dec 2025).
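To make the affine-invariant protocol concrete, the sketch below shows the MiDaS-style per-image scale-and-shift alignment that underlies it: a least-squares affine fit between predicted and reference depth before the error is measured. The function names and the MAE metric are illustrative assumptions, not the exact objective of any specific model.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Least-squares scale s and shift t such that s * pred + t ≈ gt on valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)      # [N, 2] design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)  # closed-form affine fit
    return s * pred + t

def affine_invariant_mae(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute error after optimal per-image affine alignment."""
    aligned = align_scale_shift(pred, gt, mask)
    return float(np.abs(aligned[mask] - gt[mask]).mean())
```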
Several works extend these foundation models beyond plain depth estimation:
- Prompt2DEM uses pre-trained DINOv2 and DPT components to fuse global low-resolution elevation prompts with high-resolution imagery for digital elevation model (DEM) generation, achieving large upsampling factors and robust absolute geospatial alignment (Rafaeli et al., 13 Jul 2025).
- FoundationSLAM and related pipelines inject frozen depth or stereo foundation models into the frontend of end-to-end SLAM systems, delivering geometry-aware flow fields and enabling bi-consistent multi-view bundle adjustment at scale (Wu et al., 31 Dec 2025).
2. Cross-Modal and Attention-Based Fusion
An essential advance in recent monocular foundation models is the principled fusion of complementary visual modalities, particularly geometric (depth) and semantic (mask/category) cues. BriGeS (Ma et al., 29 May 2025) introduces a dual-attention “Bridging Gate” module:
- Cross-Attention Block: Aligns depth features $F_d$ with semantic features $F_s$, projecting queries from depth and keys/values from semantics into a joint subspace:

$$\mathrm{CA}(F_d, F_s) = \mathrm{softmax}\!\left(\frac{(F_d W_Q)(F_s W_K)^{\top}}{\sqrt{d_k}}\right) F_s W_V$$

- Self-Attention Block: Refines the fused features, enforcing coherence within the joint space.
This architecture admits parameter-efficient training by freezing the underlying foundation models and updating only the gating projections, yielding state-of-the-art fine-structure recovery and resistance to overfitting with minimal additional data.
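A minimal PyTorch sketch of such a gate is shown below: cross-attention takes queries from depth tokens and keys/values from semantic tokens, and a self-attention block refines the fused result, while both foundation backbones stay frozen. Token dimensions, normalization placement, and the frozen-encoder handles are illustrative assumptions, not the published BriGeS configuration.

```python
import torch
import torch.nn as nn

class BridgingGate(nn.Module):
    """Minimal sketch of a dual-attention fusion gate in the spirit of BriGeS."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, depth_tokens: torch.Tensor, sem_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attention: queries from depth, keys/values from semantics.
        fused, _ = self.cross_attn(depth_tokens, sem_tokens, sem_tokens)
        fused = self.norm1(depth_tokens + fused)
        # Self-attention: enforce coherence within the fused token space.
        refined, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + refined)

# Parameter-efficient training: freeze both foundation backbones (hypothetical handles)
# and update only the gate.
# for p in depth_encoder.parameters(): p.requires_grad_(False)
# for p in sem_encoder.parameters():   p.requires_grad_(False)
```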
Additional compositional strategies appear in multi-task or distillation pipelines, where monocular features guide multi-view or cross-modal attention in MVS, 3D occupancy, or SLAM (Lin et al., 10 Mar 2025, Jiang et al., 15 Jul 2025).
3. Training Protocols and Parameter Efficiency
Monocular foundation models are characterized by strict separation of (i) massive pre-training, usually on synthetic, semi-supervised, or pseudo-labeled images (tens of millions), and (ii) highly targeted downstream adaptation.
Parameter-Efficient Fine-Tuning (PEFT):
- Only small adapter modules or attention gates are tuned, keeping the vast majority of foundation weights frozen (Ma et al., 29 May 2025, Zeng et al., 24 Jul 2025, Yao et al., 25 Aug 2025). For example, BriGeS trains just 244M of its 500M parameters, using roughly 1% of the data and 8–18 h of total compute.
- Adapter strategies such as Random Vector LoRA (RVLoRA) are used to facilitate specialization in domain-specific contexts (e.g., endoscopy, low light), increasing adaptation flexibility while maintaining efficient gradient flows (Yao et al., 25 Aug 2025, Zeng et al., 24 Jul 2025).
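As a concrete illustration of the adapter idea, the sketch below wraps a frozen linear layer with a plain LoRA-style low-rank update; only the two small matrices are trained. This is generic LoRA under assumed rank and scaling values, not the RVLoRA variant or the exact BriGeS gating recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style adapter around a frozen linear layer."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze foundation weights
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank trainable update.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale
```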
Synthetic Data Pipelines:
- Realistic simulation modules produce high-variance paired data for specialized regimes—e.g., low-light (DepthDark (Zeng et al., 24 Jul 2025)), adverse weather, or nighttime imagery—enabling robust adaptation with no need for costly ground-truth annotation.
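To illustrate how paired data for such regimes can be synthesized, the toy simulator below darkens a clean daytime frame and adds shot and read noise, yielding a (clean, degraded) pair that inherits the original depth labels. The gain, noise, and gamma values are generic assumptions, not the DepthDark pipeline.

```python
import numpy as np

def simulate_low_light(img: np.ndarray, gain: float = 0.15, read_noise: float = 0.02,
                       gamma: float = 2.2, rng=None) -> np.ndarray:
    """Degrade a daytime RGB image (floats in [0, 1]) into a pseudo-nighttime frame."""
    rng = np.random.default_rng() if rng is None else rng
    linear = img ** gamma                        # undo display gamma
    dark = linear * gain                         # reduce scene illumination
    shot = rng.poisson(dark * 255.0) / 255.0     # photon (shot) noise
    noisy = shot + rng.normal(0.0, read_noise, img.shape)  # sensor read noise
    return np.clip(noisy, 0.0, 1.0) ** (1.0 / gamma)
```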
Distillation and Augmentation:
- Structured knowledge distillation schemes align scale-invariant representations via cross-bin interaction losses, or transfer inter-view relations to maintain geometric and temporal consistency across multiple cameras (Hwang et al., 9 Dec 2025, Liang et al., 21 Mar 2025).
- Test-time adaptation exploits external anchoring cues (e.g., sparse LiDAR) for affine rescaling, achieving high-accuracy metric recovery without explicit fine-tuning (Marsal et al., 2024).
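The anchoring step can be illustrated with a simple, fine-tuning-free rescaling: fit a robust scale and shift between the foundation model's relative depth and a handful of sparse LiDAR returns, then apply it to the whole map. The median-based estimator below is a simplification; the cited method's exact procedure may differ.

```python
import numpy as np

def anchor_to_lidar(rel_depth: np.ndarray, lidar_depth: np.ndarray,
                    lidar_mask: np.ndarray) -> np.ndarray:
    """Rescale a relative depth map to metric units using sparse LiDAR returns."""
    ratios = lidar_depth[lidar_mask] / np.clip(rel_depth[lidar_mask], 1e-6, None)
    scale = np.median(ratios)                     # robust to outlier returns
    metric = scale * rel_depth
    shift = np.median(lidar_depth[lidar_mask] - metric[lidar_mask])
    return metric + shift
```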
4. Applications and Downstream Generalization
The deployment breadth of monocular foundation models spans:
- Generalized Monocular Depth Estimation: Zero-shot and fine-tuned depth predictions across domains—urban, indoor, low-light, and high-resolution satellite data—often achieving state-of-the-art performance and strong cross-dataset generalization (Ma et al., 29 May 2025, Landgraf et al., 14 Jan 2025, Hu et al., 2024).
- 3D Occupancy and Semantics: Weakly supervised occupancy networks "lift" VFM-predicted semantics and metric depth into 3D grids for BEV mapping without any ground-truth 3D supervision (Lin et al., 10 Mar 2025); performance approaches that of fully supervised detectors (see the unprojection sketch after this list).
- SLAM and 3D Fusion: Foundation model-guided flow and geometry allow dense monocular SLAM, integrating global geometric awareness with robust optimization for real-time, accurate mapping (Wu et al., 31 Dec 2025).
- 3D Object Detection: Models such as VFMM3D and MonoDINO-DETR generate pseudo-LiDAR or depth-enhanced features from monocular inputs, providing state-of-the-art results for monocular 3D object detection on datasets such as KITTI (Ding et al., 2024, Kim et al., 1 Feb 2025).
- Digital Elevation Models (DEMs): Prompt2DEM leverages globally aligned low-resolution DEMs as prompts and fine-tunes on high-resolution input, achieving substantial reductions in mean absolute error versus SRTM (Rafaeli et al., 13 Jul 2025).
- Structural and Volumetric Reconstruction: MonoSplat and related architectures inject monocular priors for generalizable, real-time 3D Gaussian splatting, facilitating high-fidelity multi-view rendering across novel domains (Liu et al., 21 May 2025).
- Domain-Specific Adaptation: EndoUFM demonstrates adaptation to surgical scenes, exploiting dual foundation models and efficient adapters for robust depth estimation under severe lighting, texture, and semantic domain gaps (Yao et al., 25 Aug 2025).
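The "lifting" step shared by the occupancy and pseudo-LiDAR pipelines above boils down to pinhole unprojection of per-pixel metric depth, optionally tagged with predicted semantics; voxelization into a BEV or occupancy grid then follows. The sketch below shows only that generic step, with the intrinsics matrix K assumed known.

```python
import numpy as np

def lift_to_points(depth: np.ndarray, semantics: np.ndarray, K: np.ndarray):
    """Unproject per-pixel metric depth (H, W) into a semantically labelled point cloud.

    K is the 3x3 pinhole intrinsic matrix; points are returned in the camera frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))          # pixel coordinates
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)     # camera-frame XYZ
    labels = semantics.reshape(-1)                           # per-point class id
    valid = points[:, 2] > 0                                 # drop invalid depth
    return points[valid], labels[valid]
```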
5. Uncertainty Quantification and Reliability
A core criterion for foundation model deployment is trustworthy estimation of prediction confidence:
- Aleatoric and Epistemic Uncertainty: Methods such as GNLL (Gaussian negative log-likelihood), learned confidence heads, Monte Carlo dropout, sub-ensembles, and test-time augmentation inject per-pixel uncertainty estimates into depth maps with no loss of baseline accuracy (Landgraf et al., 14 Jan 2025); a minimal GNLL sketch follows this list.
- Operational Value: Uncertainty maps allow for risk-aware planning, safety triggers (e.g., fallback to stereo or slow-down in robotics), and improved interpretability, critical for real-world and safety-critical systems.
- Computational Efficiency: GNLL and confidence-based heads incur negligible additional parameter or runtime costs, making them preferable for deployment.
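For concreteness, the per-pixel GNLL objective for depth can be written in a few lines: the network predicts a mean depth and a log-variance map, and the loss balances the squared residual against the predicted (aleatoric) uncertainty. Masking and reduction conventions below are illustrative; PyTorch's built-in torch.nn.GaussianNLLLoss provides an equivalent variance-parameterized form.

```python
import torch

def gaussian_nll(pred_depth: torch.Tensor, log_var: torch.Tensor,
                 gt_depth: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Per-pixel Gaussian negative log-likelihood (GNLL) for depth regression."""
    residual_sq = (pred_depth - gt_depth) ** 2
    # 0.5 * (log sigma^2 + residual^2 / sigma^2), up to an additive constant.
    nll = 0.5 * (log_var + residual_sq / log_var.exp())
    return nll[mask].mean()
```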
6. Limitations and Future Directions
While monocular foundation models deliver significant gains, several open issues remain:
- Memory and Compute Footprint: Simultaneous loading of multiple large ViT-based encoders can be prohibitive (Ma et al., 29 May 2025). Distillation and model compression are active research directions.
- Metric and Camera Ambiguity: Foundation models require explicit modules (e.g., camera-space normalization, test-time affine scaling, prompt conditioning) to resolve the ambiguity that arises from training on mixed-intrinsic datasets (Hu et al., 2024, Marsal et al., 2024, Rafaeli et al., 13 Jul 2025); a minimal de-normalization sketch follows this list.
- Modality Alignment: Uniform resizing or naive feature fusion may underutilize fine-grained details from semantic models or be insensitive to target-task requirements.
- Domain Adaptation: Extreme conditions—adverse weather, medical imagery, or low-light—continue to challenge generalist models; simulation, PEFT, and cross-modal adaptation are effective but not universally robust (Zeng et al., 24 Jul 2025, Yao et al., 25 Aug 2025).
- High-Fidelity Geometry: Volumetric and multi-view models based on foundation priors are increasingly competitive but remain constrained by the availability of pose information, by adaptive depth discretization, and by challenges in unposed scenes (Liu et al., 21 May 2025).
- Ablation Findings: Optimally integrating monocular priors is non-trivial—improper fusion or alignment can reduce geometric precision or fail to reduce scale ambiguity meaningfully (Hwang et al., 9 Dec 2025, Liang et al., 21 Mar 2025).
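One common way to resolve the focal-length part of this ambiguity, in the spirit of the canonical camera-space normalization discussed above, is to predict depth as if every image were captured with a fixed canonical focal length and then rescale by the true focal length at inference. The sketch below assumes an illustrative canonical focal length and is not the exact Metric3Dv2 transform.

```python
def to_metric_depth(canonical_depth, focal_px: float,
                    canonical_focal_px: float = 1000.0):
    """De-normalize depth predicted in a canonical camera space.

    Depth is predicted as if taken with a canonical focal length, then rescaled
    by the ratio of the true focal length to the canonical one.
    """
    return canonical_depth * focal_px / canonical_focal_px
```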
Proposed future developments include knowledge distillation into joint semantic-geometric encoders, adaptive prompt and temperature mechanisms, improved camera metadata extraction, and extension to complex scenes (dynamic, non-Lambertian, large-scale/unstructured environments) (Ma et al., 29 May 2025, Hu et al., 2024). The generalization of structured uncertainty quantification to all downstream monocular tasks—segmentation, pose estimation, dense mapping—remains an active area, with efficient strategies such as GNLL, sub-ensembles, and domain-aware adapters providing compelling research pathways (Landgraf et al., 14 Jan 2025).
7. Summary Table: Representative Monocular Foundation Models
| Model / Paper | Core Architecture | Key Innovations | Typical Applications |
|---|---|---|---|
| DepthAnything (Ma et al., 29 May 2025) | ViT + DPT decoder | Affine-invariant depth modeling | MDE, pseudo-LiDAR, 3D detection |
| Metric3Dv2 (Hu et al., 2024) | ViT or ConvNeXt | Camera-space normalization, joint normals | Metric 3D recovery, normals |
| BriGeS (Ma et al., 29 May 2025) | Dual ViTs + Bridging Gate | Parameter-efficient cross attention | Semantic-aware depth, fine structure |
| Prompt2DEM (Rafaeli et al., 13 Jul 2025) | DINOv2 ViT + DPT | Global elevation prompting, edge loss | DEM super-resolution, geospatial analysis |
| MonoSplat (Liu et al., 21 May 2025) | DAMv2 backbone + adapters | Mono-multi feature fusion, 3D splatting | Real-time 3D rendering, multi-view synthesis |
| FoundationSLAM (Wu et al., 31 Dec 2025) | FlowNet + frozen featurenet | Bi-consistent BA, reliability-aware update | Monocular dense SLAM, real-time mapping |
| EndoUFM (Yao et al., 25 Aug 2025) | DepthAnything + SAM/MedSAM | Dual foundation, PEFT, RVLoRA, Res-DSC | Endoscopic 3D perception, surgical AR |
| DepthDark (Zeng et al., 24 Jul 2025) | DepthAnythingV2 + LLPEFT | Synthetic night data, PEFT, multiscale fusion | Low-light, nighttime MDE |
Monocular foundation models are thus an emergent class of vision architectures, characterized by pre-training at unprecedented scale, principled fusion of geometric and semantic modalities, and broad applicability from classical depth estimation to high-level 3D scene understanding, with increasing attention to robustness, efficiency, and reliable uncertainty quantification.