Self-Supervised Surround-View Depth Estimation
- Self-Supervised Surround-View Depth Estimation is a class of methods that predicts dense 3D depth from synchronized multi-camera images without ground-truth depth, relying on photometric and geometric consistency for supervision.
- It utilizes shared encoders and cross-view fusion, including attention mechanisms and 3D feature lifting, to integrate overlapping camera information effectively.
- Recent systems approach the metric scale and consistency of active sensors such as LiDAR, trained with photometric, geometric, and cross-view consistency losses on datasets such as DDAD and nuScenes.
Self-supervised surround-view depth estimation comprises a class of techniques that estimate dense 3D structure for an entire 360° field-of-view using only multiple synchronized images from a multi-camera rig, without requiring ground-truth depth for supervision. These methods are critical for autonomous driving and robotics, where expansive depth perception is required in a cost-effective manner. By leveraging the geometric relationships among multiple cameras with overlapping or complementary views, cross-view and temporal constraints, and various forms of self-supervised photometric consistency, recent systems approach the scale and consistency of active sensors such as LiDAR while relying only on vision.
1. Problem Definition and Dataset Structures
Self-supervised surround-view depth estimation methods operate on a set of synchronized RGB images captured from a rig of calibrated cameras. Each image is associated with known intrinsics $K_i$ and extrinsics $T_i$. The core task is to predict dense, metric depth maps corresponding to the per-pixel scene depths in each image's coordinate frame, while enforcing global scene-level consistency.
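As a point of reference for the geometry these methods exploit, below is a minimal numpy sketch of back-projecting one camera's depth map to 3D with $K_i$ and re-projecting it into a neighboring camera via the rig extrinsics; the helper names and the camera-to-rig convention for the 4×4 homogeneous matrices are assumptions for illustration.

```python
# Minimal sketch: back-project pixels of camera i to 3D, then re-project into camera j.
# K_i, T_i follow the notation above; T_* are assumed to be camera-to-rig transforms.
import numpy as np

def backproject(depth, K):
    """Lift an (H, W) depth map to 3D points in the camera frame, shape (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                   # unit-depth rays
    return rays * depth.reshape(-1, 1)                                # scale rays by depth

def reproject(points_i, T_i, T_j, K_j):
    """Map 3D points from camera i's frame into camera j's image (pixels and depths)."""
    pts_h = np.concatenate([points_i, np.ones((len(points_i), 1))], axis=1)
    pts_j = (np.linalg.inv(T_j) @ T_i @ pts_h.T).T[:, :3]             # cam i -> rig -> cam j
    proj = pts_j @ K_j.T
    return proj[:, :2] / proj[:, 2:3], pts_j[:, 2]                    # pixel coords, depth in j
```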
The most widely used datasets for surround-view depth estimation are DDAD (Dense Depth for Automated Driving) and nuScenes. Both provide sequences from six-camera rigs capturing the environment around a vehicle, with ground-truth depth available for quantitative evaluation (up to 200 m for DDAD, 80 m for nuScenes). Typical data preparation includes resizing images, masking out self-occlusion regions, and aligning camera intrinsics/extrinsics across all surround cameras (Wei et al., 2022, Shi et al., 2023, Ding et al., 4 Jul 2024, Abualhanud et al., 20 Nov 2025).
2. Network Architectures and Cross-View Feature Fusion
Early systems like FSM (Guizilini et al., 2021) treat each camera independently and enforce multi-view consistency primarily at the loss level. Modern self-supervised surround-view depth estimation frameworks integrate cross-view information within the network architecture to improve consistency and accuracy.
Key architectural elements:
- Shared Encoder: A single backbone (often ResNet-18/34/50) processes all camera images in parallel, extracting multi-scale feature maps (Wei et al., 2022, Shi et al., 2023).
- Cross-View Attention or Fusion: Joint processing is performed using explicit mechanisms for global or local cross-camera fusion:
- Cross-View Transformer (CVT): Aggregates features from all views with learnable positional encodings, multi-head self-attention, and residual upsampling, capturing long-range dependencies (Wei et al., 2022); a minimal attention sketch appears at the end of this section.
- Guided Local Attention: Restricts attention to a subset of spatially overlapping cameras for efficiency (e.g., EGA-Depth's neighbor-guided attention using linear projections for scalability) (Shi et al., 2023).
- Geometry-Guided Non-Learned Attention: Instead of learned attention, CylinderDepth projects all preliminary depth points from all views onto a common unit cylinder, then aggregates features among pixels based on geodesic neighborhood to enforce cross-view fusion at the feature level, with explicit spatial correspondence (Abualhanud et al., 20 Nov 2025).
- 3D Feature Lifting: Methods like SelfOcc lift 2D features from all cameras to a shared 3D voxel grid or TPV volume, then aggregate via deformable attention across the 3D space (Huang et al., 2023).
Network output: Per-view dense depth maps, optionally fused with occupancy or semantic predictions, ensuring cross-camera consistency and metric scale.
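A minimal PyTorch sketch of the joint cross-view attention idea is shown below. It is a generic illustration rather than any specific published architecture; the module name, the per-view embedding (used in place of a full spatial positional encoding), and the default dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Joint multi-head self-attention over tokens from all cameras at a coarse scale."""
    def __init__(self, channels=256, num_heads=8, num_views=6):
        super().__init__()
        # learnable per-view embedding so attention can tell the cameras apart
        self.view_embed = nn.Parameter(torch.zeros(num_views, channels))
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats):                               # feats: (B, V, C, H, W)
        B, V, C, H, W = feats.shape
        tokens = feats.flatten(3).permute(0, 1, 3, 2)       # (B, V, H*W, C)
        tokens = tokens + self.view_embed[:V, None, :]      # broadcast over spatial tokens
        tokens = tokens.reshape(B, V * H * W, C)
        fused, _ = self.attn(tokens, tokens, tokens)        # every view attends to every view
        fused = self.norm(tokens + fused)                   # residual connection + layer norm
        return fused.reshape(B, V, H * W, C).permute(0, 1, 3, 2).reshape(B, V, C, H, W)
```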
3. Self-Supervised Supervision: Photometric, Geometric, and Cross-View Losses
The self-supervised surround-view paradigm is grounded in leveraging geometric reprojection between views and over time. The principal self-supervision losses include:
3.1. Photometric Reprojection Loss
For each target image $I_t^i$ (camera $i$ at time $t$), pixels are reprojected into temporally adjacent frames of the same camera ($t \pm 1$) or into spatially overlapping neighbor views (camera $j$ at time $t$), using the predicted depth and known or predicted poses. The photometric error between the target $I_t^i$ and the synthesized (warped) image $\hat{I}_t^i$ combines L1 and SSIM distances:

$$\mathcal{L}_{ph} = \alpha \, \frac{1 - \mathrm{SSIM}(I_t^i, \hat{I}_t^i)}{2} + (1 - \alpha)\, \lVert I_t^i - \hat{I}_t^i \rVert_1, \qquad \alpha \approx 0.85.$$
These projections rely on analytic camera intrinsics/extrinsics or learned camera models if non-pinhole optics (fisheye, equirectangular) are present (Wei et al., 2022, Ding et al., 4 Jul 2024, Hirose et al., 2021, Hasegawa et al., 2022).
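A minimal sketch of the photometric term above, assuming the common α = 0.85 weighting and a simplified average-pooled SSIM; tensor shapes and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified per-pixel SSIM on (B, 3, H, W) images using 3x3 average pooling."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(target, warped, alpha=0.85):
    """Per-pixel reprojection error: alpha * (1 - SSIM)/2 + (1 - alpha) * L1."""
    l1 = (target - warped).abs()
    return alpha * ssim(target, warped) + (1 - alpha) * l1
```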
3.2. Smoothness and Edge-Aware Regularization
Predicted depth maps are regularized using edge-aware smoothness on mean-normalized disparities $d^*$:

$$\mathcal{L}_{smooth} = \lvert \partial_x d^* \rvert \, e^{-\lvert \partial_x I \rvert} + \lvert \partial_y d^* \rvert \, e^{-\lvert \partial_y I \rvert}.$$
This penalizes large spatial gradients except at image discontinuities, favoring physically plausible geometry.
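A minimal sketch of this edge-aware smoothness term; mean-normalizing the disparity before taking gradients is a common convention assumed here.

```python
import torch

def smoothness_loss(disp, image):
    """disp: (B, 1, H, W) predicted disparity, image: (B, 3, H, W) target view."""
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)       # scale-normalize
    dx_d = (disp[:, :, :, 1:] - disp[:, :, :, :-1]).abs()
    dy_d = (disp[:, :, 1:, :] - disp[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    # down-weight disparity gradients where the image itself has strong gradients
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```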
3.3. Scale-Aware Supervision
Metric scale is not directly observable in monocular self-supervised learning. Recent approaches solve this problem by:
- Leveraging multi-camera rig geometry and joint pose estimation to anchor scale (Wei et al., 2022, Guizilini et al., 2021, Huang et al., 2023).
- Using two-view structure-from-motion (SfM) to triangulate sparse correspondences, then pretraining with sparse, metric pseudo-depths (Wei et al., 2022).
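A minimal sketch of the sparse pseudo-depth supervision idea from the second bullet: depths triangulated by SfM on overlapping views act as metric anchors at a sparse set of valid pixels. The tensor layout and the plain L1 form are assumptions for illustration.

```python
import torch

def sparse_pseudo_depth_loss(pred_depth, sfm_depth, valid_mask):
    """pred_depth, sfm_depth: (B, 1, H, W); valid_mask marks triangulated pixels."""
    diff = (pred_depth - sfm_depth).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1)   # average only over SfM anchor points
```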
3.4. Cross-View Consistency Losses
To ensure that depth maps from overlapping fields of view make consistent predictions for shared 3D points, cross-view consistency penalties are added:
- Dense Depth Consistency Loss (DDCL): Penalizes L1 differences between a depth map and its projection from a neighbor's depth prediction using known camera transformations (Ding et al., 4 Jul 2024, Abualhanud et al., 20 Nov 2025); a minimal sketch appears at the end of this subsection.
- Multi-View Reconstruction Consistency Loss (MVRCL): Enforces that reconstructions from spatial and spatial-temporal warps are in agreement, adding robustness to dynamic objects and occlusions (Ding et al., 4 Jul 2024).
Geometry-guided explicit attention, as in CylinderDepth, further operationalizes consistency by aggregating features only among corresponding points projected to a shared cylindrical coordinate system (Abualhanud et al., 20 Nov 2025).
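A minimal sketch of the dense depth-consistency idea: the neighbor's depth map is lifted to 3D, transformed into the target camera with the known extrinsics, and compared against the target's prediction at the projected locations. The bilinear sampling choice and the helper layout are assumptions.

```python
import torch
import torch.nn.functional as F

def depth_consistency_loss(depth_i, depth_j, K_i, K_j, T_j_to_i):
    """depth_*: (1, 1, H, W); K_*: (3, 3); T_j_to_i: (4, 4) camera-j-to-camera-i transform."""
    _, _, H, W = depth_j.shape
    v, u = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).reshape(3, -1)          # (3, H*W)
    pts_j = torch.linalg.inv(K_j) @ pix * depth_j.reshape(1, -1)             # 3D in cam j
    pts_i = T_j_to_i[:3, :3] @ pts_j + T_j_to_i[:3, 3:]                      # 3D in cam i
    proj = K_i @ pts_i
    uv = proj[:2] / proj[2:].clamp(min=1e-6)                                 # pixels in cam i
    grid = torch.stack([2 * uv[0] / (W - 1) - 1, 2 * uv[1] / (H - 1) - 1], -1)
    sampled = F.grid_sample(depth_i, grid.reshape(1, H, W, 2), align_corners=True)
    valid = (grid.abs() <= 1).all(-1).reshape(1, 1, H, W) & (pts_i[2] > 0).reshape(1, 1, H, W)
    return ((sampled - pts_i[2].reshape(1, 1, H, W)).abs() * valid).sum() / valid.sum().clamp(min=1)
```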
4. Pose Estimation and Multi-View Geometry
Accurate ego-motion estimation underpins temporal supervision. Two main strategies exist:
- Universal Rig Pose: A single 6-DoF motion for the entire camera rig is regressed and transferred to each camera via extrinsics (Wei et al., 2022, Ding et al., 4 Jul 2024, Abualhanud et al., 20 Nov 2025). This approach is both memory- and compute-efficient while ensuring consistency; a sketch of the pose transfer appears at the end of this section.
- Per-Camera Poses with Consistency Constraints: Each camera's ego-motion is independently predicted, then constrained to agree with others via extrinsics, penalizing deviations in translation and rotation (Guizilini et al., 2021, Wang et al., 2018).
Estimating pose from the front camera alone further reduces GPU memory and compute, a choice justified empirically by the higher reliability of front-camera depth estimates for ego-motion recovery (Ding et al., 4 Jul 2024).
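As a concrete illustration of the universal-rig-pose strategy, the sketch below conjugates a single predicted rig motion by each camera's extrinsics to obtain per-camera motions; the camera-to-rig matrix convention is an assumption (some codebases store the inverse).

```python
import numpy as np

def per_camera_pose(T_rig, extrinsics):
    """T_rig: (4, 4) rig motion between frames; extrinsics: list of (4, 4) camera-to-rig transforms.

    The per-camera motion is the rig motion expressed in each camera frame:
    T_cam^{-1} @ T_rig @ T_cam.
    """
    return [np.linalg.inv(T_cam) @ T_rig @ T_cam for T_cam in extrinsics]
```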
5. Explicit Geometry Fusion and Non-Pinhole Camera Models
Surround-view systems may include fisheye, equirectangular, or spherical cameras. Specialized geometric models are essential in these contexts:
- Learnable Camera Models: Axisymmetric or parameterized projection surfaces are learned end-to-end, generalizing across pinhole and distorted optics (Hirose et al., 2021). Pixel mappings are fully differentiable and support real/simulated supervision.
- Cylinder Projections: For perspective rigs, CylinderDepth projects all depth maps onto a shared cylindrical surface, aligning features across cameras regardless of their individual fields of view. Attention is then applied in cylindrical coordinates, exploiting explicit spatial correspondence (Abualhanud et al., 20 Nov 2025); a projection sketch appears at the end of this section.
- Cubemap and Spherical Methods: For 360° cameras, cube-padding combined with spherical photometric loss enables seamless feature flow across all faces, mitigating equirectangular distortion (Wang et al., 2018, Hasegawa et al., 2022).
Adaptive camera-geometry-aware convolutions, camera-geometry tensors, and feature concatenation ensure generalization and consistent behavior even under manufacturing variations in lens geometry (Kumar et al., 2021).
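To illustrate the cylindrical correspondence idea used by CylinderDepth, the sketch below maps 3D points expressed in a shared rig frame onto a discretized unit cylinder; the axis convention (cylinder axis along y) and the discretization parameters are assumptions.

```python
import numpy as np

def to_cylinder(points, n_azimuth=1024, n_height=256, h_range=(-2.0, 2.0)):
    """points: (N, 3) in rig coordinates -> integer (azimuth, height) cylinder cells."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    radius = np.sqrt(x ** 2 + z ** 2) + 1e-9
    azimuth = np.arctan2(x, z)                        # angle around the vertical axis
    height = y / radius                               # intersection with the unit cylinder
    a_idx = ((azimuth + np.pi) / (2 * np.pi) * n_azimuth).astype(int) % n_azimuth
    h_norm = (height - h_range[0]) / (h_range[1] - h_range[0])
    h_idx = np.clip((h_norm * n_height).astype(int), 0, n_height - 1)
    return np.stack([a_idx, h_idx], axis=1)           # shared cylindrical cell per 3D point
```

Points from different cameras that land in the same cylindrical cell can then exchange features directly, which is how explicit spatial correspondence replaces learned attention weights in this setting.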
6. Quantitative Results and Comparative Performance
Performance is measured by standard depth metrics (Abs Rel, Sq Rel, RMSE, δ<1.25) and, increasingly, by explicit cross-view consistency metrics. Key results on DDAD and nuScenes (depth evaluated up to 200 m and 80 m, respectively) include:
| Dataset | Method | Abs Rel | Sq Rel | RMSE | δ<1.25 | Consistency (overlap) |
|---|---|---|---|---|---|---|
| DDAD | FSM* | 0.228 | 4.409 | 13.43 | 68.7% | n/a |
| DDAD | SurroundDepth | 0.208 | 3.371 | 12.97 | 69.3% | 7.86 m |
| DDAD | EGA-Depth (MR) | 0.191 | 3.155 | 12.55 | 74.7% | n/a |
| DDAD | CylinderDepth | 0.208 | 3.480 | 12.85 | 70.2% | 5.61 m |
| nuScenes | FSM* | 0.319 | 7.534 | 7.860 | 71.6% | n/a |
| nuScenes | SurroundDepth | 0.280 | 4.401 | 7.467 | 66.1% | 6.33 m |
| nuScenes | EGA-Depth (MR) | 0.228 | 1.987 | n/a | 73.2% | n/a |
| nuScenes | CylinderDepth | 0.238 | 5.662 | 6.732 | 80.5% | 2.85 m |
| nuScenes | SelfOcc (TPV) | 0.215 | 2.743 | 6.706 | 75.3% | n/a |
State-of-the-art methods such as EGA-Depth (Shi et al., 2023), SelfOcc (Huang et al., 2023), CVCDepth (Ding et al., 4 Jul 2024), and CylinderDepth (Abualhanud et al., 20 Nov 2025) demonstrate consistent improvements both in overall error and in cross-view consistency, particularly in regions where cameras overlap. Geometry-guided or explicit consistency methods (CylinderDepth, CVCDepth) achieve the highest overlap-region alignment.
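For reference, the error metrics reported in the table can be computed as in the minimal numpy sketch below; clamping depths to the per-dataset range (e.g. 200 m on DDAD) is assumed to happen beforehand.

```python
import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: 1-D arrays of depths at pixels with valid ground truth."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)            # Abs Rel
    sq_rel = np.mean((pred - gt) ** 2 / gt)              # Sq Rel
    rmse = np.sqrt(np.mean((pred - gt) ** 2))            # RMSE
    delta = np.maximum(pred / gt, gt / pred)
    a1 = np.mean(delta < 1.25)                           # accuracy under threshold 1.25
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "delta<1.25": a1}
```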
7. Current Best Practices, Limitations, and Future Directions
Research advances have converged on several effective methodological practices:
- Joint feature fusion across views, ideally geometry-guided.
- Self-supervised metric scale recovery using rig geometry and SfM pseudo-labels.
- Cross-view consistency enforcement, explicit at either the feature or loss level.
- Efficient and universal pose estimation, frequently from a front-view only or using rig-level transformations.
Limitations remain in dynamic scenes due to motion or lighting inconsistencies, and in extremely wide-baseline or minimally overlapping camera setups. Geometry-guided (non-learned) attention, as in CylinderDepth, is robust to local intensity outliers but might over-smooth fine details by operating at coarse resolution (Abualhanud et al., 20 Nov 2025). Learning geometry-aware attention at multiple scales or fusing non-learned and learned approaches offers a plausible path forward.
Emerging research also points to explicit 3D fusion in BEV/TPV volumes (Huang et al., 2023), semantic segmentation integration, and further emphasis on cross-domain generalization (e.g., unknown camera models (Hirose et al., 2021), new lens designs). Efficient handling of multi-frame context, real-time performance under memory constraints, and end-to-end multi-task training with auxiliary outputs (occupancy, semantics) continue to drive active development in the field.
Key References:
- "SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation" (Wei et al., 2022)
- "EGA-Depth: Efficient Guided Attention for Self-Supervised Multi-Camera Depth Estimation" (Shi et al., 2023)
- "SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction" (Huang et al., 2023)
- "Towards Cross-View-Consistent Self-Supervised Surround Depth Estimation" (Ding et al., 4 Jul 2024)
- "CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation" (Abualhanud et al., 20 Nov 2025)