Monocular Foundation Models
- Monocular Foundation Models are large-scale, transformer-based vision architectures that extract depth, semantics, and 3D structure from single RGB images.
- They employ efficient cross-attention mechanisms and adapter modules to fuse geometric cues with semantic segmentation for enhanced zero-shot performance.
- These models are applied in practical tasks like monocular depth estimation, SLAM, and digital elevation mapping with parameter-efficient fine-tuning strategies.
Monocular Foundation Models refer to large-scale, pre-trained vision models that provide geometric, semantic, or multi-modal visual understanding directly from a single RGB image, typically in the context of depth estimation, 3D scene understanding, or comprehensive scene representations. These models are trained on vast, heterogeneous datasets using transformer-based architectures, leveraging strong relative, metric, and category-level cues to deliver outstanding zero-shot generalization and efficiency across a wide range of downstream tasks. Monocular foundation models constitute a paradigm shift from specialized network training toward parameter-efficient transfer, explicit fusion of semantic and geometric cues, and robust adaptation to task and domain via structured losses, distillation, or adapter modules.
1. Architectures and Model Families
Central to monocular foundation models are transformer-based encoders such as DINOv2 and ViT, coupled with dense prediction heads (e.g., DPT-style decoders), and sometimes multi-task branches (for segmentation, normal estimation, or surface parsing) (Ma et al., 29 May 2025, Landgraf et al., 14 Jan 2025, Hu et al., 2024). Canonical representatives include:
- DepthAnything: A ViT-based encoder paired with a DPT-style dense prediction decoder, trained for affine-invariant (relative) depth on large-scale mixed datasets following the MiDaS protocol (see the alignment sketch after this list). Metric3Dv2 extends this concept, integrating a canonical camera-space transformation module to resolve metric ambiguity and enable joint depth-normal prediction (Hu et al., 2024).
- SegmentAnything (SAM): ViT backbone, prompt encoder, and mask decoder for universal semantic segmentation (Ma et al., 29 May 2025). Integration with depth foundation models yields semantically aware geometric features, critical for detailed boundary recovery and scene parsing (e.g., via the Bridging Gate in BriGeS).
- Hybrid/Distilled Models: Fusion or distillation schemes transfer geometric knowledge from large, slow models into lightweight student models (e.g., HRDepth, Monodepth2) for real-time full-surround estimation or low-latency edge inference (Hwang et al., 9 Dec 2025).
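To make the affine-invariant protocol concrete, the sketch below shows the MiDaS-style per-image scale-and-shift alignment that underlies it: a least-squares affine fit between predicted and reference depth before the error is measured. The function names and the MAE metric are illustrative assumptions, not the exact objective of any specific model.

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Least-squares scale s and shift t such that s * pred + t ≈ gt on valid pixels."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)      # [N, 2] design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)  # closed-form affine fit
    return s * pred + t

def affine_invariant_mae(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Mean absolute error after optimal per-image affine alignment."""
    aligned = align_scale_shift(pred, gt, mask)
    return float(np.abs(aligned[mask] - gt[mask]).mean())
```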
Several works extend these foundation models beyond plain depth estimation:
- Prompt2DEM uses pre-trained DINOv2 and DPT components to fuse global low-resolution elevation prompts with high-resolution imagery for digital elevation model (DEM) generation, achieving large upsampling factors and robust absolute geospatial alignment (Rafaeli et al., 13 Jul 2025).
- FoundationSLAM and related pipelines inject frozen depth or stereo foundation models into the frontend of end-to-end SLAM systems, delivering geometry-aware flow fields and enabling bi-consistent multi-view bundle adjustment at scale (Wu et al., 31 Dec 2025).
2. Cross-Modal and Attention-Based Fusion
An essential advance in recent monocular foundation models is the principled fusion of complementary visual modalities, particularly geometric (depth) and semantic (mask/category) cues. BriGeS (Ma et al., 29 May 2025) introduces a dual-attention “Bridging Gate” module:
- Cross-Attention Block: Aligns depth features $F_d$ with semantic features $F_s$, projecting queries from depth and keys/values from semantics into a joint subspace:

$$\mathrm{CA}(F_d, F_s) = \mathrm{softmax}\!\left(\frac{(F_d W_Q)(F_s W_K)^{\top}}{\sqrt{d_k}}\right) F_s W_V$$

- Self-Attention Block: Refines the fused features, enforcing coherence within the joint space.
This architecture admits parameter-efficient training by freezing the underlying foundation models and updating only the gating projections, yielding state-of-the-art fine-structure recovery and resistance to overfitting with minimal additional data.
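A minimal PyTorch sketch of such a gate is shown below: cross-attention takes queries from depth tokens and keys/values from semantic tokens, and a self-attention block refines the fused result, while both foundation backbones stay frozen. Token dimensions, normalization placement, and the frozen-encoder handles are illustrative assumptions, not the published BriGeS configuration.

```python
import torch
import torch.nn as nn

class BridgingGate(nn.Module):
    """Minimal sketch of a dual-attention fusion gate in the spirit of BriGeS."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, depth_tokens: torch.Tensor, sem_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attention: queries from depth, keys/values from semantics.
        fused, _ = self.cross_attn(depth_tokens, sem_tokens, sem_tokens)
        fused = self.norm1(depth_tokens + fused)
        # Self-attention: enforce coherence within the fused token space.
        refined, _ = self.self_attn(fused, fused, fused)
        return self.norm2(fused + refined)

# Parameter-efficient training: freeze both foundation backbones (hypothetical handles)
# and update only the gate.
# for p in depth_encoder.parameters(): p.requires_grad_(False)
# for p in sem_encoder.parameters():   p.requires_grad_(False)
```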
Additional compositional strategies appear in multi-task or distillation pipelines, where monocular features guide multi-view or cross-modal attention in MVS, 3D occupancy, or SLAM (Lin et al., 10 Mar 2025, Jiang et al., 15 Jul 2025).
3. Training Protocols and Parameter Efficiency
Monocular foundation models are characterized by strict separation of (i) massive pre-training, usually on synthetic, semi-supervised, or pseudo-labeled images (tens of millions), and (ii) highly targeted downstream adaptation.
Parameter-Efficient Fine-Tuning (PEFT):
- Only small adapter modules or attention gates are tuned, keeping the vast majority of foundation weights frozen (Ma et al., 29 May 2025, Zeng et al., 24 Jul 2025, Yao et al., 25 Aug 2025). For example, BriGeS trains just 244M of its 500M parameters, using roughly 1% of the data and 8–18 h of total compute.
- Adapter strategies such as Random Vector LoRA (RVLoRA) are used to facilitate specialization in domain-specific contexts (e.g., endoscopy, low light), increasing adaptation flexibility while maintaining efficient gradient flows (Yao et al., 25 Aug 2025, Zeng et al., 24 Jul 2025).
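As a concrete illustration of the adapter idea, the sketch below wraps a frozen linear layer with a plain LoRA-style low-rank update; only the two small matrices are trained. This is generic LoRA under assumed rank and scaling values, not the RVLoRA variant or the exact BriGeS gating recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style adapter around a frozen linear layer."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze foundation weights
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank trainable update.
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale
```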
Synthetic Data Pipelines:
- Realistic simulation modules produce high-variance paired data for specialized regimes—e.g., low-light (DepthDark (Zeng et al., 24 Jul 2025)), adverse weather, or nighttime imagery—enabling robust adaptation with no need for costly ground-truth annotation.
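To illustrate how paired data for such regimes can be synthesized, the toy simulator below darkens a clean daytime frame and adds shot and read noise, yielding a (clean, degraded) pair that inherits the original depth labels. The gain, noise, and gamma values are generic assumptions, not the DepthDark pipeline.

```python
import numpy as np

def simulate_low_light(img: np.ndarray, gain: float = 0.15, read_noise: float = 0.02,
                       gamma: float = 2.2, rng=None) -> np.ndarray:
    """Degrade a daytime RGB image (floats in [0, 1]) into a pseudo-nighttime frame."""
    rng = np.random.default_rng() if rng is None else rng
    linear = img ** gamma                        # undo display gamma
    dark = linear * gain                         # reduce scene illumination
    shot = rng.poisson(dark * 255.0) / 255.0     # photon (shot) noise
    noisy = shot + rng.normal(0.0, read_noise, img.shape)  # sensor read noise
    return np.clip(noisy, 0.0, 1.0) ** (1.0 / gamma)
```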
Distillation and Augmentation:
- Structured knowledge distillation schemes align scale-invariant representations via cross-bin interaction losses, or transfer inter-view relations to maintain geometric and temporal consistency across multiple cameras (Hwang et al., 9 Dec 2025, Liang et al., 21 Mar 2025).
- Test-time adaptation exploits external anchoring cues (e.g., sparse LiDAR) for affine rescaling, achieving high-accuracy metric recovery without explicit fine-tuning (Marsal et al., 2024).
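The anchoring step can be illustrated with a simple, fine-tuning-free rescaling: fit a robust scale and shift between the foundation model's relative depth and a handful of sparse LiDAR returns, then apply it to the whole map. The median-based estimator below is a simplification; the cited method's exact procedure may differ.

```python
import numpy as np

def anchor_to_lidar(rel_depth: np.ndarray, lidar_depth: np.ndarray,
                    lidar_mask: np.ndarray) -> np.ndarray:
    """Rescale a relative depth map to metric units using sparse LiDAR returns."""
    ratios = lidar_depth[lidar_mask] / np.clip(rel_depth[lidar_mask], 1e-6, None)
    scale = np.median(ratios)                     # robust to outlier returns
    metric = scale * rel_depth
    shift = np.median(lidar_depth[lidar_mask] - metric[lidar_mask])
    return metric + shift
```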
4. Applications and Downstream Generalization
The deployment breadth of monocular foundation models spans:
- Generalized Monocular Depth Estimation: Zero-shot and fine-tuned depth predictions across domains—urban, indoor, low-light, and high-resolution satellite data—often achieving state-of-the-art performance and strong cross-dataset generalization (Ma et al., 29 May 2025, Landgraf et al., 14 Jan 2025, Hu et al., 2024).
- 3D Occupancy and Semantics: Weakly supervised occupancy networks "lift" VFM-predicted semantics and metric depth into 3D grids for BEV mapping without any ground-truth 3D supervision (Lin et al., 10 Mar 2025); performance approaches that of fully supervised detectors (see the unprojection sketch after this list).
- SLAM and 3D Fusion: Foundation model-guided flow and geometry allow dense monocular SLAM, integrating global geometric awareness with robust optimization for real-time, accurate mapping (Wu et al., 31 Dec 2025).
- 3D Object Detection: Models such as VFMM3D and MonoDINO-DETR generate pseudo-LiDAR or depth-enhanced features from monocular inputs, providing state-of-the-art results for monocular 3D object detection on datasets such as KITTI (Ding et al., 2024, Kim et al., 1 Feb 2025).
- Digital Elevation Models (DEMs): Prompt2DEM leverages globally aligned low-resolution DEMs as prompts and fine-tunes on high-resolution input, achieving substantial reductions in mean absolute error versus SRTM (Rafaeli et al., 13 Jul 2025).
- Structural and Volumetric Reconstruction: MonoSplat and related architectures inject monocular priors for generalizable, real-time 3D Gaussian splatting, facilitating high-fidelity multi-view rendering across novel domains (Liu et al., 21 May 2025).
- Domain-Specific Adaptation: EndoUFM demonstrates adaptation to surgical scenes, exploiting dual foundation models and efficient adapters for robust depth estimation under severe lighting, texture, and semantic domain gaps (Yao et al., 25 Aug 2025).
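The "lifting" step shared by the occupancy and pseudo-LiDAR pipelines above boils down to pinhole unprojection of per-pixel metric depth, optionally tagged with predicted semantics; voxelization into a BEV or occupancy grid then follows. The sketch below shows only that generic step, with the intrinsics matrix K assumed known.

```python
import numpy as np

def lift_to_points(depth: np.ndarray, semantics: np.ndarray, K: np.ndarray):
    """Unproject per-pixel metric depth (H, W) into a semantically labelled point cloud.

    K is the 3x3 pinhole intrinsic matrix; points are returned in the camera frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))          # pixel coordinates
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)     # camera-frame XYZ
    labels = semantics.reshape(-1)                           # per-point class id
    valid = points[:, 2] > 0                                 # drop invalid depth
    return points[valid], labels[valid]
```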
5. Uncertainty Quantification and Reliability
A core criterion for foundation model deployment is trustworthy estimation of prediction confidence:
- Aleatoric and Epistemic Uncertainty: Methods such as GNLL (Gaussian negative log-likelihood), learned confidence heads, Monte Carlo dropout, sub-ensembles, and test-time augmentation inject per-pixel uncertainty estimates into depth maps with no loss of baseline accuracy (Landgraf et al., 14 Jan 2025); a minimal GNLL sketch follows this list.
- Operational Value: Uncertainty maps allow for risk-aware planning, safety triggers (e.g., fallback to stereo or slow-down in robotics), and improved interpretability, critical for real-world and safety-critical systems.
- Computational Efficiency: GNLL and confidence-based heads incur negligible additional parameter or runtime costs, making them preferable for deployment.
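For concreteness, the per-pixel GNLL objective for depth can be written in a few lines: the network predicts a mean depth and a log-variance map, and the loss balances the squared residual against the predicted (aleatoric) uncertainty. Masking and reduction conventions below are illustrative; PyTorch's built-in torch.nn.GaussianNLLLoss provides an equivalent variance-parameterized form.

```python
import torch

def gaussian_nll(pred_depth: torch.Tensor, log_var: torch.Tensor,
                 gt_depth: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Per-pixel Gaussian negative log-likelihood (GNLL) for depth regression."""
    residual_sq = (pred_depth - gt_depth) ** 2
    # 0.5 * (log sigma^2 + residual^2 / sigma^2), up to an additive constant.
    nll = 0.5 * (log_var + residual_sq / log_var.exp())
    return nll[mask].mean()
```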
6. Limitations and Future Directions
While monocular foundation models deliver significant gains, several open issues remain:
- Memory and Compute Footprint: Simultaneous loading of multiple large ViT-based encoders can be prohibitive (Ma et al., 29 May 2025). Distillation and model compression are active research directions.
- Metric and Camera Ambiguity: Foundation models require explicit modules (e.g., camera-space normalization, test-time affine scaling, prompt conditioning) to resolve the ambiguity that arises from training on mixed-intrinsic datasets (Hu et al., 2024, Marsal et al., 2024, Rafaeli et al., 13 Jul 2025); a minimal de-normalization sketch follows this list.
- Modality Alignment: Uniform resizing or naive feature fusion may underutilize fine-grained details from semantic models or be insensitive to target-task requirements.
- Domain Adaptation: Extreme conditions—adverse weather, medical imagery, or low-light—continue to challenge generalist models; simulation, PEFT, and cross-modal adaptation are effective but not universally robust (Zeng et al., 24 Jul 2025, Yao et al., 25 Aug 2025).
- High-Fidelity Geometry: Volumetric and multi-view models based on foundation priors are increasingly competitive but remain constrained by the availability of pose information, by adaptive depth discretization, and by challenges in unposed scenes (Liu et al., 21 May 2025).
- Ablation Findings: Optimally integrating monocular priors is non-trivial—improper fusion or alignment can reduce geometric precision or fail to reduce scale ambiguity meaningfully (Hwang et al., 9 Dec 2025, Liang et al., 21 Mar 2025).
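One common way to resolve the focal-length part of this ambiguity, in the spirit of the canonical camera-space normalization discussed above, is to predict depth as if every image were captured with a fixed canonical focal length and then rescale by the true focal length at inference. The sketch below assumes an illustrative canonical focal length and is not the exact Metric3Dv2 transform.

```python
def to_metric_depth(canonical_depth, focal_px: float,
                    canonical_focal_px: float = 1000.0):
    """De-normalize depth predicted in a canonical camera space.

    Depth is predicted as if taken with a canonical focal length, then rescaled
    by the ratio of the true focal length to the canonical one.
    """
    return canonical_depth * focal_px / canonical_focal_px
```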
Proposed future developments include knowledge distillation into joint semantic-geometric encoders, adaptive prompt and temperature mechanisms, improved camera metadata extraction, and extension to complex scenes (dynamic, non-Lambertian, large-scale/unstructured environments) (Ma et al., 29 May 2025, Hu et al., 2024). The generalization of structured uncertainty quantification to all downstream monocular tasks—segmentation, pose estimation, dense mapping—remains an active area, with efficient strategies such as GNLL, sub-ensembles, and domain-aware adapters providing compelling research pathways (Landgraf et al., 14 Jan 2025).
7. Summary Table: Representative Monocular Foundation Models
| Model / Paper | Core Architecture | Key Innovations | Typical Applications |
|---|---|---|---|
| DepthAnything (Ma et al., 29 May 2025) | ViT + DPT decoder | Affine-invariant depth modeling | MDE, pseudo-LiDAR, 3D detection |
| Metric3Dv2 (Hu et al., 2024) | ViT or ConvNeXt | Camera-space normalization, joint normals | Metric 3D recovery, normals |
| BriGeS (Ma et al., 29 May 2025) | Dual ViTs + Bridging Gate | Parameter-efficient cross attention | Semantic-aware depth, fine structure |
| Prompt2DEM (Rafaeli et al., 13 Jul 2025) | DINOv2 ViT + DPT | Global elevation prompting, edge loss | DEM super-resolution, geospatial analysis |
| MonoSplat (Liu et al., 21 May 2025) | DAMv2 backbone + adapters | Mono-multi feature fusion, 3D splatting | Real-time 3D rendering, multi-view synthesis |
| FoundationSLAM (Wu et al., 31 Dec 2025) | FlowNet + frozen featurenet | Bi-consistent BA, reliability-aware update | Monocular dense SLAM, real-time mapping |
| EndoUFM (Yao et al., 25 Aug 2025) | DepthAnything + SAM/MedSAM | Dual foundation, PEFT, RVLoRA, Res-DSC | Endoscopic 3D perception, surgical AR |
| DepthDark (Zeng et al., 24 Jul 2025) | DepthAnythingV2 + LLPEFT | Synthetic night data, PEFT, multiscale fusion | Low-light, nighttime MDE |
Monocular foundation models are thus an emergent class of vision architectures, characterized by pre-training at unprecedented scale, principled fusion of geometric and semantic modalities, and broad applicability from classical depth estimation to high-level 3D scene understanding, with increasing attention to robustness, efficiency, and reliable uncertainty quantification.