Panoramic Metric Depth Foundation Model
- Panoramic Metric Depth Foundation Models are neural architectures that predict accurate metric depth from equirectangular, fisheye, and full-surround imagery using geometry-aware learning.
- They utilize transformer-based backbones, panoramic attention mechanisms, and adaptive depth-binning strategies to effectively mitigate distortion and enforce geometric consistency.
- Optimization through sharpness-centric and geometry-centric losses, combined with multi-domain and pseudo-label training, enables robust zero-shot performance in varied applications.
A panoramic metric depth foundation model is a large-scale, pre-trained neural architecture designed to infer metric-scale depth for equirectangular, fisheye, or full-surround imagery across indoor and outdoor domains. These models extend the principle of vision foundation modeling to full-sphere camera inputs, leveraging broad, heterogeneous data pipelines and geometry-aware learning frameworks. Recent state-of-the-art panoramic depth foundation models facilitate zero-shot generalization, enforce geometric consistency, and support metric depth output suitable for robotic navigation, scene reconstruction, autonomous driving, and mixed reality applications.
1. Dataset Construction and Domain Bridging
Panoramic metric depth foundation models require panoramic RGB-D datasets that span varied domains and sensor types. The “Depth Any Panoramas (DAP)” framework (Lin et al., 18 Dec 2025) constructs a training corpus of over 2 million images by:
- Aggregating public RGB-D panoramas (e.g., Structured3D: 18,298 indoor panoramas with pixel-aligned ground truth).
- Rendering synthetic outdoor panoramas from UE5 (AirSim360) with city and park scenes up to 100 m range (90,000 labeled).
- Generating synthetic indoor panoramas (DiT-360: 200,000) using text-to-image models.
- Scraping 1.7 million real panoramic frames from web videos, classified into indoor/outdoor via scene classifiers.
A progressive three-stage pseudo-label curation pipeline is used to mitigate domain gaps between synthetic/real and indoor/outdoor imagery:
1. Train a scene-invariant model on the labeled data.
2. Use a PatchGAN discriminator to select realism-consistent pseudo-labels for high-confidence samples.
3. Train a realism-invariant model on both pseudo-labeled and real data, then use it to label the remaining images.
This supervised and pseudo-supervised data integration enables robust representation learning across diverse panoramic domains, reducing generalization errors due to scene diversity and sensor idiosyncrasies.
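A minimal sketch of stage 2 of this curation loop is given below, assuming a trained scene-invariant teacher and a PatchGAN-style discriminator that scores RGB-depth realism; the stub networks, threshold, and scoring interface are illustrative and not the exact DAP implementation.

```python
import torch
import torch.nn as nn

def curate_pseudo_labels(teacher, discriminator, unlabeled_rgb, threshold=0.5):
    """Stage 2 of the (assumed) curation loop: run the scene-invariant teacher on
    unlabeled panoramas, score each (RGB, predicted depth) pair with a
    PatchGAN-style discriminator, and keep only high-confidence pseudo-labels."""
    kept = []
    teacher.eval()
    discriminator.eval()
    with torch.no_grad():
        for rgb in unlabeled_rgb:                        # rgb: (3, H, W) ERP frame
            depth = teacher(rgb.unsqueeze(0))            # (1, 1, H, W) pseudo-depth
            pair = torch.cat([rgb.unsqueeze(0), depth], dim=1)
            realism = discriminator(pair).mean().item()  # average patch score in [0, 1]
            if realism >= threshold:
                kept.append((rgb, depth.squeeze(0)))
    return kept

# Tiny stand-in networks so the sketch runs end to end.
teacher = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Softplus())      # depth > 0
discriminator = nn.Sequential(nn.Conv2d(4, 1, 4, stride=2), nn.Sigmoid())  # patch scores

frames = [torch.rand(3, 64, 128) for _ in range(4)]  # dummy 2:1 ERP frames
pseudo_labeled = curate_pseudo_labels(teacher, discriminator, frames)
print(f"kept {len(pseudo_labeled)} / {len(frames)} pseudo-labeled frames")
```

In a full pipeline, the retained pairs would be merged with the labeled corpus before training the realism-invariant model of stage 3.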
2. Model Architectures for Panoramic Depth
Modern panoramic depth foundation models employ transformer-based backbones, panoramic-specific attention, and geometric binning strategies:
- DINOv3-Large is used in DAP (Lin et al., 18 Dec 2025) for robust feature extraction under heavy equirectangular distortion, outperforming Swin, ResNet, and ViT backbones in cross-domain benchmarks.
- PanoFormer (Shen et al., 2022) introduces Panoramic Structure-guided Transformer (PST) blocks: self-attention is redefined over spherical tangent patches, and token flows are learned to mitigate ERP distortion.
- The FSMDE paradigm (Hwang et al., 9 Dec 2025) augments lightweight backbones (e.g., Monodepth2, MonoViT) with MetricBins depth-binning heads that predict adaptive per-pixel bin centers and probabilities.
All models process ERP-format images; a spherical token locating model (STLM) maps ERP pixel coordinates to tangent-plane neighborhoods or icosahedrally sampled perspective patches. Feature extraction and multi-head attention are adapted to account for panorama-specific distortion and spatial continuity requirements.
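A minimal sketch of an adaptive depth-binning head in the spirit of the MetricBins heads mentioned above, following the widely used AdaBins recipe (predict normalized bin widths per image, convert them to bin centers over a metric range, and decode depth as the probability-weighted expectation); the feature dimension, bin count, and 0.1-100 m range are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveBinHead(nn.Module):
    """Predict per-image bin centers and per-pixel bin probabilities, then decode
    metric depth as the expectation over bin centers (AdaBins-style sketch)."""

    def __init__(self, feat_dim=64, n_bins=64, min_depth=0.1, max_depth=100.0):
        super().__init__()
        self.min_depth, self.max_depth = min_depth, max_depth
        self.bin_widths = nn.Sequential(                  # global bin-width logits
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, n_bins)
        )
        self.bin_logits = nn.Conv2d(feat_dim, n_bins, 1)  # per-pixel bin logits

    def forward(self, feats):                             # feats: (B, C, H, W)
        widths = torch.softmax(self.bin_widths(feats), dim=1)   # (B, N), sums to 1
        widths = widths * (self.max_depth - self.min_depth)
        edges = self.min_depth + torch.cumsum(widths, dim=1)
        centers = edges - 0.5 * widths                           # (B, N) bin centers
        probs = torch.softmax(self.bin_logits(feats), dim=1)     # (B, N, H, W)
        depth = (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)
        return depth                                             # (B, 1, H, W) metric depth

feats = torch.rand(2, 64, 32, 64)
print(AdaptiveBinHead()(feats).shape)  # torch.Size([2, 1, 32, 64])
```

Because the decoded depth is a convex combination of metric bin centers, such a head outputs metric-scale values directly rather than relative disparities.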
3. Optimization Strategies and Training Objectives
Panoramic metric depth models optimize for geometric consistency, sharpness, and metric-scale accuracy:
- Sharpness-centric losses: Densely fused Gram loss (DF) ensures consistency across icosahedron-decomposed perspective patches; Sobel-based gradient losses focus learning on object boundaries (Lin et al., 18 Dec 2025).
- Geometry-centric losses: Scale-invariant log (SILog) penalizes multiplicative errors in metric depth (Lin et al., 18 Dec 2025, Guo et al., 5 Jan 2025); surface normal and point cloud losses enforce 3D geometric consistency and accurate scene shape.
- Range mask heads: Plug-and-play binary heads for distance thresholds (e.g., 10/20/50/100 m) mask depth outputs, restricting predictions to the applicable spatial range; the loss combines a distance term with Dice similarity (see the sketch after this list).
- View-relational Knowledge Distillation: FSMDE (Hwang et al., 9 Dec 2025) applies cross-interaction distillation of scale-invariant bin probabilities and adds view-relational Huber potentials to maintain cross-camera consistency.
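The range mask heads above admit a compact formulation. Below is a minimal sketch, assuming the distance term is a per-pixel binary cross-entropy against the ground-truth in-range mask and taking a standard soft Dice formulation; the threshold and unit weighting are illustrative rather than DAP's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RangeMaskHead(nn.Module):
    """Binary head predicting whether each pixel lies within a distance threshold T."""

    def __init__(self, feat_dim=64, threshold_m=100.0):
        super().__init__()
        self.threshold_m = threshold_m
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats):        # feats: (B, C, H, W)
        return self.head(feats)      # logits, (B, 1, H, W)

def range_mask_loss(logits, gt_depth, threshold_m, eps=1e-6):
    """BCE + soft Dice against the ground-truth in-range mask (assumed formulation)."""
    target = (gt_depth <= threshold_m).float()
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)
    return bce + dice

head = RangeMaskHead(threshold_m=100.0)
feats, gt = torch.rand(2, 64, 32, 64), torch.rand(2, 1, 32, 64) * 150.0
print(range_mask_loss(head(feats), gt, head.threshold_m))
```

At inference, the sigmoid output would be thresholded and used to mask depth predictions beyond the head's spatial range.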
The final loss function in DAP is a weighted sum over these components, modulated by a precomputed ERP distortion map. Curriculum training incorporates both synthetic and pseudo-labeled real samples to stabilize learning and attenuate domain-specific artifacts.
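A minimal sketch of two of these loss components and an assumed latitude-based distortion weighting (cos-latitude down-weighting of over-sampled polar rows) follows; λ = 0.85 and the 0.5 gradient weight are common defaults, not DAP's published values.

```python
import math
import torch
import torch.nn.functional as F

def silog_loss(pred, gt, weight=None, lam=0.85, eps=1e-6):
    """Scale-invariant log (SILog) loss: penalizes multiplicative metric-depth errors.
    The optional per-pixel weight implements distortion-aware averaging."""
    valid = (gt > eps).float()
    w = valid if weight is None else valid * weight
    d = torch.log(pred + eps) - torch.log(gt + eps)
    wsum = w.sum() + eps
    mean_d = (w * d).sum() / wsum
    mean_d2 = (w * d ** 2).sum() / wsum
    return torch.sqrt(mean_d2 - lam * mean_d ** 2 + eps)

def sobel_gradient_loss(pred, gt):
    """Sobel-based gradient loss emphasizing depth discontinuities at object boundaries."""
    kx = torch.tensor([[[[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]]])
    ky = kx.transpose(2, 3)
    gx_p, gy_p = F.conv2d(pred, kx, padding=1), F.conv2d(pred, ky, padding=1)
    gx_g, gy_g = F.conv2d(gt, kx, padding=1), F.conv2d(gt, ky, padding=1)
    return (gx_p - gx_g).abs().mean() + (gy_p - gy_g).abs().mean()

def erp_distortion_map(h, w):
    """cos(latitude) weights: an assumed form of the precomputed ERP distortion map
    that down-weights the over-sampled polar rows of the equirectangular grid."""
    lat = (torch.arange(h, dtype=torch.float32) + 0.5) / h * math.pi - math.pi / 2
    return torch.cos(lat).clamp(min=0.0).view(1, 1, h, 1).expand(1, 1, h, w)

pred = torch.rand(1, 1, 32, 64) * 10 + 0.1
gt = torch.rand(1, 1, 32, 64) * 10 + 0.1
w = erp_distortion_map(32, 64)
total = silog_loss(pred, gt, weight=w) + 0.5 * sobel_gradient_loss(pred, gt)
print(total.item())
```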
4. Evaluation Procedures and Benchmark Comparisons
Evaluation of panoramic metric depth foundation models is conducted on indoor and outdoor panoramic benchmarks, using both standard and panorama-specific metrics:
| Benchmark | Method | AbsRel (↓) | RMSE (↓) | δ₁ (↑) |
|---|---|---|---|---|
| Stanford2D3D | DAP | 0.0921 | 0.3820 | 0.9135 |
| Matterport3D | DAP | 0.0921 | 0.3820 | 0.9135 |
| Deep360 | DAP | 0.0659 | -- | -- |
| DAP-Test (Outdoor) | DAP | 0.0781 | 6.804 | 0.9370 |
On indoor sets, DAP demonstrates robust zero-shot metric accuracy (AbsRel = 0.0921, δ₁ = 0.9135), outperforming DAC (AbsRel = 0.1366) and UniK3D (Lin et al., 18 Dec 2025). Ablation studies indicate that the sharpness losses and the distortion map markedly improve accuracy. On outdoor sets, range mask heads with T = 100 m give the best trade-off between near- and far-range error. FSMDE student networks, trained with scale-invariant and view-relational distillation, achieve real-time inference rates (>80 FPS) and outperform basic supervised baselines (Hwang et al., 9 Dec 2025).
Panorama-specific metrics introduced by PanoFormer (Shen et al., 2022) diagnose errors particular to equirectangular formats: Pole RMSE quantifies errors in polar regions, and Left-Right Consistency Error (LRCE) assesses seam continuity.
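A minimal sketch of the standard metrics in the table above, plus hedged versions of the panorama-specific ones: here Pole RMSE is taken as RMSE restricted to the top and bottom latitude bands, and LRCE as the discrepancy between depth differences at the left and right ERP seam columns; PanoFormer's exact definitions may differ.

```python
import numpy as np

def standard_metrics(pred, gt):
    """AbsRel, RMSE, and delta_1 over valid (gt > 0) pixels."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return abs_rel, rmse, delta1

def pole_rmse(pred, gt, band_frac=0.15):
    """RMSE restricted to the top/bottom latitude bands of the ERP image
    (assumed reading of PanoFormer's Pole RMSE)."""
    h = gt.shape[0]
    band = int(h * band_frac)
    rows = np.r_[0:band, h - band:h]
    return standard_metrics(pred[rows], gt[rows])[1]

def lrce(pred, gt):
    """Seam continuity: left-right column discrepancy of the prediction,
    compared against the same quantity for ground truth (assumed formulation)."""
    seam = lambda d: np.abs(d[:, 0] - d[:, -1])
    return np.mean(np.abs(seam(pred) - seam(gt)))

pred = np.random.rand(64, 128) * 10 + 0.1
gt = np.random.rand(64, 128) * 10 + 0.1
print(standard_metrics(pred, gt), pole_rmse(pred, gt), lrce(pred, gt))
```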
5. Key Techniques: ERP Mapping, Attention, Distillation
A unified geometric representation is central to panoramic metric depth modeling. DAC (Guo et al., 5 Jan 2025) and DAP (Lin et al., 18 Dec 2025) convert input images (pinhole, fisheye, 360°) to ERP using closed-form geometry: pitch-aware ERP conversion computes spherical latitude and longitude per pixel, then applies gnomonic projection, the camera's distortion model, and field-of-view alignment.
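A minimal sketch of the core of this closed-form geometry: mapping ERP pixel coordinates to spherical latitude/longitude and projecting points around a tangent point onto a local perspective (gnomonic) patch. The pitch correction, lens distortion, and field-of-view alignment steps are omitted.

```python
import numpy as np

def erp_pixel_to_latlon(u, v, width, height):
    """Map ERP pixel coordinates to spherical longitude/latitude (radians)."""
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi    # [-pi, pi)
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi   # +pi/2 at top, -pi/2 at bottom
    return lat, lon

def gnomonic_project(lat, lon, lat0, lon0):
    """Project spherical points onto the plane tangent at (lat0, lon0);
    returns local perspective-patch coordinates (x, y)."""
    cos_c = np.sin(lat0) * np.sin(lat) + np.cos(lat0) * np.cos(lat) * np.cos(lon - lon0)
    x = np.cos(lat) * np.sin(lon - lon0) / cos_c
    y = (np.cos(lat0) * np.sin(lat) - np.sin(lat0) * np.cos(lat) * np.cos(lon - lon0)) / cos_c
    return x, y

# Example: locate a 3x3 ERP pixel neighborhood on the tangent plane at its center pixel.
W, H = 2048, 1024
u, v = np.meshgrid(np.arange(1000, 1003), np.arange(500, 503))
lat, lon = erp_pixel_to_latlon(u, v, W, H)
lat0, lon0 = erp_pixel_to_latlon(1001, 501, W, H)
print(gnomonic_project(lat, lon, lat0, lon0))
```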
Panoramic attention mechanisms (e.g., PST blocks in PanoFormer (Shen et al., 2022)) use tangent-patch extraction and learnable token flows for local context, while cross-interaction distillation (FSMDE) and view-relational losses enforce scale consistency and cross-view geometric structure.
Multi-resolution augmentation and scale-equivariant feature learning further enhance zero-shot generalization, particularly on test-time images with varied resolution or field of view (Guo et al., 5 Jan 2025).
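A minimal sketch of such a multi-resolution augmentation, assuming the 2:1 ERP aspect ratio is preserved and depth is resampled with nearest-neighbor interpolation; the candidate heights are illustrative.

```python
import random
import torch
import torch.nn.functional as F

def multi_resolution_augment(rgb, depth, heights=(256, 384, 512)):
    """Resize an ERP RGB/depth pair to a randomly chosen resolution, keeping the
    2:1 panoramic aspect ratio; depth uses nearest interpolation so metric values
    are not blended across object boundaries."""
    h = random.choice(heights)
    w = 2 * h
    rgb_r = F.interpolate(rgb[None], size=(h, w), mode="bilinear", align_corners=False)[0]
    depth_r = F.interpolate(depth[None], size=(h, w), mode="nearest")[0]
    return rgb_r, depth_r

rgb, depth = torch.rand(3, 512, 1024), torch.rand(1, 512, 1024) * 50
print([t.shape for t in multi_resolution_augment(rgb, depth)])
```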
6. Limitations and Future Directions
Existing panoramic metric depth foundation models face persistent challenges:
- Real-world outdoor data scarcity: Limited outdoor panoramic RGB-D datasets restrict validation in extreme conditions (e.g., weather, lighting) (Lin et al., 18 Dec 2025).
- High computational costs: Large transformer backbones (e.g., DINOv3-Large) are resource-intensive—efficiency-oriented distillation frameworks (FSMDE) are an active area of research (Hwang et al., 9 Dec 2025).
- ERP pole distortion: Severe distortion at polar regions is only partially mitigated by tangent-patch and cubemap hybrid approaches; further work will focus on equivariant convolutions or mixed representations (Guo et al., 5 Jan 2025, Shen et al., 2022).
- Temporal and multi-modal consistency: Extension to panoramic video, semantic fusion (surface reflectance, segmentation), and self-supervised learning across vast 360° video data are proposed future directions (Lin et al., 18 Dec 2025).
This suggests that model scalability, domain adaptation, and representation robustness will remain central research questions. A plausible implication is that hybrid supervision (combining synthetic, real, and pseudo-labeled data) and efficient backbone architectures will be crucial for widespread deployment in real-time autonomous platforms.
7. Applications and Practical Impact
Panoramic metric depth foundation models are enabling technologies for:
- 360° robotic navigation (obstacle avoidance, SLAM)
- Virtual/augmented reality scene reconstruction
- Real estate and urban planning workflows
- Fully autonomous driving scenarios leveraging Full Surround Monocular Depth Estimation (FSMDE) at real-time speeds (>80 FPS)
Zero-shot generalization across camera types and scene domains facilitates broad adoption, while plug-and-play architecture components (e.g., range mask heads) offer operational flexibility. Continuous improvements in geometric consistency and domain adaptation are expanding application scope in large-scale visual environments.