
Self-Supervised Surround-View Depth Estimation

Updated 22 November 2025
  • Self-Supervised Surround-View Depth Estimation is a family of methods that predict dense metric depth from synchronized multi-camera images without ground-truth depth supervision, relying on geometric consistency.
  • It utilizes shared encoders and cross-view fusion, including attention mechanisms and 3D feature lifting, to integrate information from overlapping cameras effectively.
  • Recent approaches narrow the gap to LiDAR-derived depth by combining photometric, geometric, and cross-view consistency losses, evaluated on datasets such as DDAD and nuScenes.

Self-supervised surround-view depth estimation comprises a class of techniques that estimate dense 3D structure for an entire 360° field-of-view using only multiple synchronized images from a multi-camera rig, without requiring ground-truth depth for supervision. These methods are critical for autonomous driving and robotics, where expansive depth perception is required in a cost-effective manner. By leveraging the geometric relationships among multiple cameras with overlapping or complementary views, cross-view and temporal constraints, and various forms of self-supervised photometric consistency, recent systems approach the scale and consistency of active sensors such as LiDAR while relying only on vision.

1. Problem Definition and Dataset Structures

Self-supervised surround-view depth estimation methods operate on a set of N synchronized RGB images \{I^i\}_{i=1}^N captured from a rig of calibrated cameras. Each image is associated with known intrinsics K_i and extrinsics T_i^{\text{rig}}. The core task is to predict dense, metric depth maps \{D^i\}_{i=1}^N corresponding to the per-pixel scene depths in each image's coordinate frame, while enforcing global scene-level consistency.
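As a concrete illustration of this notation, the minimal PyTorch sketch below lifts a dense depth map D^i into 3D points expressed in the shared rig frame using the intrinsics K_i and extrinsics T_i^rig; the tensor layouts, image size, and camera-to-rig convention are assumptions made for the example, not taken from any particular paper.

```python
import torch

def backproject_to_rig(depth, K, T_cam_to_rig):
    """Lift a dense depth map into 3D points expressed in the rig frame.

    depth:         (H, W) metric depth D^i for camera i
    K:             (3, 3) camera intrinsics K_i
    T_cam_to_rig:  (4, 4) extrinsics T_i^rig mapping camera-i coords to rig coords
    returns:       (H, W, 3) 3D points in the rig coordinate frame
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # homogeneous pixel coords (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T                              # K^-1 p for every pixel
    pts_cam = rays * depth.unsqueeze(-1)                            # scale rays by depth
    pts_hom = torch.cat([pts_cam, torch.ones(H, W, 1)], dim=-1)     # homogeneous 3D points
    pts_rig = pts_hom @ T_cam_to_rig.T                              # apply T_i^rig
    return pts_rig[..., :3]

# Example with dummy (hypothetical) calibration values:
depth = torch.rand(192, 320) * 50 + 1
K = torch.tensor([[500., 0., 160.], [0., 500., 96.], [0., 0., 1.]])
T = torch.eye(4)
points_rig = backproject_to_rig(depth, K, T)   # (192, 320, 3)
```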

The most widely used datasets for surround-view depth estimation are DDAD (Dense Depth for Automated Driving) and nuScenes. These datasets provide sequences from six-camera rigs capturing the environment around a vehicle, with LiDAR-based ground-truth depth available for quantitative evaluation (up to 200 m for DDAD, 80 m for nuScenes). Typical data preparation includes resizing images, masking out self-occlusion regions, and aligning camera intrinsics/extrinsics for all surround cameras (Wei et al., 2022, Shi et al., 2023, Ding et al., 4 Jul 2024, Abualhanud et al., 20 Nov 2025).

2. Network Architectures and Cross-View Feature Fusion

Early systems like FSM (Guizilini et al., 2021) treat each camera independently and enforce multi-view consistency primarily at the loss level. Modern self-supervised surround-view depth estimation frameworks integrate cross-view information within the network architecture to improve consistency and accuracy.

Key architectural elements:

  • Shared Encoder: A single backbone (often ResNet-18/34/50) processes all N camera images in parallel, extracting multi-scale feature maps (Wei et al., 2022, Shi et al., 2023).
  • Cross-View Attention or Fusion: Joint processing is performed using explicit mechanisms for global or local cross-camera fusion:
    • Cross-View Transformer (CVT): Aggregates features from all views with learnable positional encodings, multi-head self-attention, and residual upsampling, capturing long-range dependencies (Wei et al., 2022).
    • Guided Local Attention: Restricts attention to a subset of spatially overlapping cameras for efficiency (e.g., EGA-Depth's neighbor-guided attention using linear projections for scalability) (Shi et al., 2023).
    • Geometry-Guided Non-Learned Attention: Instead of learned attention, CylinderDepth projects all preliminary depth points from all views onto a common unit cylinder, then aggregates features among pixels based on geodesic neighborhood to enforce cross-view fusion at the feature level, with explicit spatial correspondence (Abualhanud et al., 20 Nov 2025).
    • 3D Feature Lifting: Methods like SelfOcc lift 2D features from all cameras to a shared 3D voxel grid or TPV volume, then aggregate via deformable attention across the 3D space (Huang et al., 2023).

Network output: Per-view dense depth maps, optionally fused with occupancy or semantic predictions, ensuring cross-camera consistency and metric scale.
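A minimal sketch of this shared-encoder-plus-cross-view-attention pattern follows, assuming a torchvision ResNet-18 backbone, a single multi-head attention layer over flattened per-view tokens, and a toy depth head; the layer sizes and fusion placement are illustrative simplifications, not a reproduction of any specific published architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SurroundDepthSketch(nn.Module):
    """Shared encoder over N views + one cross-view attention block + per-view depth head."""
    def __init__(self, feat_dim=512, heads=8):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])   # shared CNN -> (B*N, 512, h, w)
        self.cross_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.head = nn.Sequential(                                       # toy depth decoder head
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1), nn.Softplus(),               # positive depth values
        )

    def forward(self, images):                                           # images: (B, N, 3, H, W)
        B, N, C, H, W = images.shape
        feats = self.encoder(images.flatten(0, 1))                       # (B*N, 512, h, w)
        _, D, h, w = feats.shape
        tokens = feats.view(B, N, D, h * w).permute(0, 1, 3, 2).reshape(B, N * h * w, D)
        fused, _ = self.cross_attn(tokens, tokens, tokens)               # attend jointly across all views
        fused = fused.reshape(B, N, h * w, D).permute(0, 1, 3, 2).reshape(B * N, D, h, w)
        depth = self.head(fused)                                         # (B*N, 1, h, w) low-resolution depth
        return depth.view(B, N, 1, h, w)

# Six surround cameras at a tiny resolution, purely for illustration:
model = SurroundDepthSketch()
out = model(torch.rand(1, 6, 3, 128, 224))    # -> (1, 6, 1, 4, 7)
```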

3. Self-Supervised Supervision: Photometric, Geometric, and Cross-View Losses

The self-supervised surround-view paradigm is grounded in leveraging geometric reprojection between views and over time. The principal self-supervision losses include:

3.1. Photometric Reprojection Loss

For each target image I^i_t, pixels are reprojected into temporally adjacent frames (same camera, at t±1) or into spatially overlapping neighbor views (camera j at time t), using the predicted depth D^i_t and known or predicted poses. The photometric error combines L1 and SSIM distances:

\ell_{\mathrm{photo}}(p) = \alpha\,\frac{1-\mathrm{SSIM}\left(I^i_t(p),\, I^j_s(p_s)\right)}{2} + (1-\alpha)\,\left\lVert I^i_t(p) - I^j_s(p_s) \right\rVert_1

These projections rely on analytic camera intrinsics/extrinsics or learned camera models if non-pinhole optics (fisheye, equirectangular) are present (Wei et al., 2022, Ding et al., 4 Jul 2024, Hirose et al., 2021, Hasegawa et al., 2022).
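A hedged sketch of the photometric term is given below, assuming images in (B, 3, H, W) layout, a standard 3x3-window SSIM approximation, and the commonly used weighting alpha = 0.85; these choices are assumptions for illustration, not values fixed by the papers cited above.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM over 3x3 neighbourhoods; x, y: (B, C, H, W) in [0, 1]."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return (num / den).clamp(0, 1)

def photometric_loss(target, warped, alpha=0.85):
    """l_photo(p) = alpha * (1 - SSIM)/2 + (1 - alpha) * |I_t(p) - I_s(p_s)| per pixel."""
    l1 = (target - warped).abs().mean(1, keepdim=True)
    ssim_term = ((1 - ssim(target, warped)) / 2).mean(1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1          # (B, 1, H, W) per-pixel error map

# target frame and a view synthesized by warping a source frame with predicted depth + pose:
I_t, I_s_warped = torch.rand(2, 2, 3, 96, 160)
per_pixel = photometric_loss(I_t, I_s_warped)
```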

3.2. Smoothness and Edge-Aware Regularization

Predicted depth maps are regularized using edge-aware smoothness on disparities:

\mathcal{L}_{\mathrm{smooth}} = \sum_p \left|\partial_x d(p)\right|\, e^{-\left|\partial_x I(p)\right|} + \left|\partial_y d(p)\right|\, e^{-\left|\partial_y I(p)\right|}

This penalizes large spatial gradients except at image discontinuities, favoring physically plausible geometry.
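A direct translation of this regularizer into code is sketched below, assuming disparity in (B, 1, H, W) and RGB in (B, 3, H, W); the mean-normalization of disparity is a common but optional convention.

```python
import torch

def edge_aware_smoothness(disp, img):
    """L_smooth = sum_p |dx d| * exp(-|dx I|) + |dy d| * exp(-|dy I|).

    disp: (B, 1, H, W) predicted disparity, img: (B, 3, H, W) matching RGB image.
    """
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)         # common normalization (optional)
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

loss = edge_aware_smoothness(torch.rand(2, 1, 96, 160), torch.rand(2, 3, 96, 160))
```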

3.3. Scale-Aware Supervision

Metric scale is not directly observable in purely monocular self-supervised learning. Recent surround-view approaches recover it by exploiting the known rig geometry, i.e., spatio-temporal photometric constraints between cameras whose extrinsic baselines are metric, and, in some systems, by using structure-from-motion pseudo-labels as weak scale supervision.

3.4. Cross-View Consistency Losses

To ensure that depth maps from overlapping fields of view make consistent predictions for shared 3D points, cross-view consistency penalties are added:

  • Dense Depth Consistency Loss (DDCL): Penalizes L1 differences between a depth map and its projection from a neighbor's depth prediction using known camera transformations (Ding et al., 4 Jul 2024, Abualhanud et al., 20 Nov 2025).
  • Multi-View Reconstruction Consistency Loss (MVRCL): Enforces that reconstructions from spatial and spatial-temporal warps are in agreement, adding robustness to dynamic objects and occlusions (Ding et al., 4 Jul 2024).

Geometry-guided explicit attention, as in CylinderDepth, further operationalizes consistency by aggregating features only among corresponding points projected to a shared cylindrical coordinate system (Abualhanud et al., 20 Nov 2025).
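The dense depth consistency idea can be sketched as follows: the neighbor camera's depth is lifted to 3D, transformed with the known relative extrinsics, and projected into the target view, where the induced depth is compared with the target prediction. The bilinear sampling and validity masking below are implementation choices made for illustration, not the exact loss of any cited paper.

```python
import torch
import torch.nn.functional as F

def depth_consistency_loss(depth_i, depth_j, K_i, K_j, T_j_to_i):
    """L1 penalty between camera i's predicted depth and the depth induced by
    projecting camera j's prediction into camera i (DDCL-style sketch).

    depth_i, depth_j: (H, W); K_i, K_j: (3, 3); T_j_to_i: (4, 4) camera-j -> camera-i.
    """
    H, W = depth_j.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], -1).float()             # (H, W, 3)
    pts_j = (pix @ torch.linalg.inv(K_j).T) * depth_j.unsqueeze(-1)       # back-project in camera j
    pts_j = torch.cat([pts_j, torch.ones(H, W, 1)], -1) @ T_j_to_i.T      # move to camera i frame
    z_i = pts_j[..., 2].clamp(min=1e-3)                                   # depth induced in camera i
    uv = (pts_j[..., :3] @ K_i.T)[..., :2] / z_i.unsqueeze(-1)            # project into i's image plane
    grid = torch.stack([uv[..., 0] / (W - 1), uv[..., 1] / (H - 1)], -1) * 2 - 1
    sampled = F.grid_sample(depth_i[None, None], grid[None], align_corners=True)[0, 0]
    valid = (grid.abs() <= 1).all(-1) & (pts_j[..., 2] > 0)               # in-bounds, in-front points only
    return (sampled[valid] - z_i[valid]).abs().mean()
```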

4. Pose Estimation and Multi-View Geometry

Accurate ego-motion estimation underpins temporal supervision. Two main strategies exist:

  • Per-camera pose networks: each view predicts its own relative pose, with consistency to the known rig geometry enforced through additional constraints.
  • Canonical pose estimation: a single network predicts the vehicle's ego-motion from one reference camera (typically the front view), and the motion of every other camera is derived through the fixed rig extrinsics.

Front-view-only pose estimation further reduces computation and memory, and is justified empirically by the higher reliability of front-camera depth estimates for ego-motion recovery (Ding et al., 4 Jul 2024).
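A small sketch of the canonical-pose strategy: a single predicted ego-motion (e.g., from the front camera's pose network) is propagated to every other camera through the fixed rig extrinsics. The transform conventions below (camera-to-rig matrices, motion expressed in each camera's own frame) are assumptions made for the example.

```python
import torch

def propagate_ego_motion(E_front, T_front, T_others):
    """Derive each camera's frame-to-frame motion from a single predicted ego-motion.

    E_front:   (4, 4) relative motion of the front camera between t and t+1,
               expressed in the front camera's own frame (pose-network output).
    T_front:   (4, 4) front-camera-to-rig extrinsics.
    T_others:  list of (4, 4) camera-to-rig extrinsics for the remaining cameras.
    Returns a list of (4, 4) relative motions, one per remaining camera.
    """
    E_rig = T_front @ E_front @ torch.linalg.inv(T_front)            # express motion in the rig frame
    return [torch.linalg.inv(T_i) @ E_rig @ T_i for T_i in T_others]  # re-express in each camera frame
```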

5. Explicit Geometry Fusion and Non-Pinhole Camera Models

Surround-view systems may include fisheye, equirectangular, or spherical cameras. Specialized geometric models are essential in these contexts:

  • Learnable Camera Models: Axisymmetric or parameterized projection surfaces are learned end-to-end, generalizing across pinhole and distorted optics (Hirose et al., 2021). Pixel mappings are fully differentiable and support real/simulated supervision.
  • Cylinder Projections: For perspective rigs, CylinderDepth directly projects all depth maps into a shared cylindrical surface, aligning features across cameras regardless of their initial field of view. Attention is then applied in cylindrical coordinates, exploiting explicit spatial correspondence (Abualhanud et al., 20 Nov 2025).
  • Cubemap and Spherical Methods: For 360° cameras, cube-padding combined with spherical photometric loss enables seamless feature flow across all faces, mitigating equirectangular distortion (Wang et al., 2018, Hasegawa et al., 2022).

Adaptive camera-geometry-aware convolutions, camera-geometry tensors, and feature concatenation ensure generalization and consistent behavior even under manufacturing variations in lens geometry (Kumar et al., 2021).
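To make the cylinder-projection step concrete, the sketch below maps 3D points already expressed in a common rig frame onto a unit cylinder around the vertical axis, yielding an azimuth/height coordinate per point; the axis convention and coordinate ranges are assumptions for illustration, not the exact parameterization used by CylinderDepth.

```python
import torch

def project_to_unit_cylinder(points_rig):
    """Map 3D points in the rig frame onto a shared unit cylinder.

    points_rig: (..., 3) with x right, y down, z forward (assumed convention).
    Returns (..., 2) cylindrical coordinates: azimuth in [-pi, pi] and height y/r.
    """
    x, y, z = points_rig.unbind(-1)
    azimuth = torch.atan2(x, z)                        # angle around the vertical axis
    radius = torch.sqrt(x ** 2 + z ** 2).clamp(min=1e-6)
    height = y / radius                                # vertical coordinate on the unit cylinder
    return torch.stack([azimuth, height], dim=-1)

# Points from different cameras that observe the same 3D location land at the same
# cylinder coordinate, which is what enables geometry-guided cross-view attention.
coords = project_to_unit_cylinder(torch.randn(6, 192, 320, 3))
```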

6. Quantitative Results and Comparative Performance

Performance is measured by standard depth metrics (Abs Rel, Sq Rel, RMSE, δ<1.25) and, increasingly, by explicit cross-view consistency metrics. Key results on DDAD and nuScenes (evaluation ranges up to 200 m and 80 m, respectively) include:

Dataset | Method | Abs Rel | Sq Rel | RMSE | δ<1.25 | Consistency (overlap)
DDAD | FSM* | 0.228 | 4.409 | 13.43 | 68.7% | n/a
DDAD | SurroundDepth | 0.208 | 3.371 | 12.97 | 69.3% | 7.86 m
DDAD | EGA-Depth (MR) | 0.191 | 3.155 | 12.55 | 74.7% | n/a
DDAD | CylinderDepth | 0.208 | 3.480 | 12.85 | 70.2% | 5.61 m
nuScenes | FSM* | 0.319 | 7.534 | 7.860 | 71.6% | n/a
nuScenes | SurroundDepth | 0.280 | 4.401 | 7.467 | 66.1% | 6.33 m
nuScenes | EGA-Depth (MR) | 0.228 | 1.987 | n/a | 73.2% | n/a
nuScenes | CylinderDepth | 0.238 | 5.662 | 6.732 | 80.5% | 2.85 m
nuScenes | SelfOcc (TPV) | 0.215 | 2.743 | 6.706 | 75.3% | n/a

State-of-the-art methods such as EGA-Depth (Shi et al., 2023), SelfOcc (Huang et al., 2023), CVCDepth (Ding et al., 4 Jul 2024), and CylinderDepth (Abualhanud et al., 20 Nov 2025) demonstrate consistent improvements both in overall error and in cross-view consistency, particularly in regions where cameras overlap. Geometry-guided or explicit consistency methods (CylinderDepth, CVCDepth) achieve the highest overlap-region alignment.
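For reference, the standard error and accuracy metrics reported above can be computed as in the sketch below; the depth cap and the optional per-image median scaling (shown commented out) follow common evaluation practice rather than any single paper's exact protocol.

```python
import torch

def depth_metrics(pred, gt, max_depth=80.0):
    """Standard depth metrics on valid ground-truth pixels (cap: 80 m nuScenes, 200 m DDAD)."""
    mask = (gt > 0) & (gt < max_depth)                 # LiDAR ground truth is sparse
    pred, gt = pred[mask].clamp(1e-3, max_depth), gt[mask]
    # pred = pred * gt.median() / pred.median()        # optional median scaling for non-metric models
    abs_rel = ((pred - gt).abs() / gt).mean()
    sq_rel = ((pred - gt) ** 2 / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    delta = torch.max(pred / gt, gt / pred)
    a1 = (delta < 1.25).float().mean()
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "delta<1.25": a1}
```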

7. Current Best Practices, Limitations, and Future Directions

Research advances have converged on several effective methodological practices:

  • Joint feature fusion across views, ideally geometry-guided.
  • Self-supervised metric scale recovery using rig geometry and SfM pseudo-labels.
  • Cross-view consistency enforcement, explicit at either the feature or loss level.
  • Efficient, rig-wide pose estimation, frequently performed from the front view only or propagated via rig-level transformations.

Limitations remain in dynamic scenes due to motion or lighting inconsistencies, and in extremely wide-baseline or minimally overlapping camera setups. Geometry-guided (non-learned) attention, as in CylinderDepth, is robust to local intensity outliers but might over-smooth fine details by operating at coarse resolution (Abualhanud et al., 20 Nov 2025). Learning geometry-aware attention at multiple scales or fusing non-learned and learned approaches offers a plausible path forward.

Emerging research also points to explicit 3D fusion in BEV/TPV volumes (Huang et al., 2023), semantic segmentation integration, and further emphasis on cross-domain generalization (e.g., unknown camera models (Hirose et al., 2021), new lens designs). Efficient handling of multi-frame context, real-time performance under memory constraints, and end-to-end multi-task training with auxiliary outputs (occupancy, semantics) continue to drive active development in the field.


Key References:

  • "SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation" (Wei et al., 2022)
  • "EGA-Depth: Efficient Guided Attention for Self-Supervised Multi-Camera Depth Estimation" (Shi et al., 2023)
  • "SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction" (Huang et al., 2023)
  • "Towards Cross-View-Consistent Self-Supervised Surround Depth Estimation" (Ding et al., 4 Jul 2024)
  • "CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation" (Abualhanud et al., 20 Nov 2025)