
Self-Supervised Depth & Egomotion

Updated 25 February 2026
  • Self-supervised learning of depth and egomotion is a method that infers 3D scene geometry and camera motion from videos without explicit labels by leveraging geometric view synthesis.
  • Key training objectives include photometric reconstruction, edge-aware depth smoothness, and adversarial losses to enforce high-quality depth and pose estimates.
  • Modern architectures integrate multi-modal sensor data and dynamic scene disentanglement to enhance robustness and scalability in robotics, AR/VR, and autonomous navigation.

Self-supervised learning of depth and egomotion refers to a family of methods that jointly estimate the 3D scene geometry (dense depth) and 6-DoF camera motion from video without requiring explicit ground-truth annotations. These methodologies exploit geometric and photometric consistencies in video sequences and have led to rapid advances in visual odometry (VO), visual-inertial odometry (VIO), and scene understanding with data from monocular RGB, RGB-D, thermal, and other sensor modalities. The field is motivated by the need to reduce annotation requirements, improve adaptability to new domains, and enable scalable, in situ deployment for robotics, AR/VR, and autonomous navigation.

1. Core Principles and Training Objectives

Self-supervised depth and egomotion pipelines are built upon the principle of geometric view synthesis. Given consecutive frames (and, in some approaches, inertial measurements), depth and pose networks are trained to reconstruct a reference frame by warping context frames into the reference view using predicted depths and camera poses. This process yields a photometric reconstruction loss, which, along with additional geometric and regularization penalties, acts as a supervisory signal. Typically, the core loss can be summarized as

$$L_{\mathrm{photo}} = \sum_{s \in \{t-1,\, t+1\}} \sum_{p} \left| I_t(p) - I_s\!\left(\pi\!\left(K\, T_{t \to s}\, D_t(p)\, K^{-1} p\right)\right) \right|$$

where $I_t$ and $I_s$ are image frames, $D_t$ is the predicted depth, $T_{t \to s}$ the SE(3) camera motion, $K$ the camera intrinsics, and $\pi$ the projection function (Almalioglu et al., 2019, Gao et al., 2020).
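
The warp inside this loss can be sketched directly. The following is a minimal NumPy illustration, assuming grayscale images and nearest-neighbour sampling for brevity (real pipelines use differentiable bilinear sampling); the function name and interface are illustrative, not from any cited system:

```python
import numpy as np

def photometric_loss(I_t, I_s, D_t, T, K):
    """Photometric reconstruction loss for one source frame (minimal sketch).

    I_t, I_s : (H, W) grayscale reference and source frames
    D_t      : (H, W) predicted depth for the reference frame
    T        : (4, 4) predicted SE(3) transform from frame t to frame s
    K        : (3, 3) camera intrinsics
    """
    H, W = I_t.shape
    # Pixel grid in homogeneous coordinates, shape (3, H*W)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])
    # Back-project to 3D camera coordinates: X = D_t(p) * K^{-1} p
    X = np.linalg.inv(K) @ p * D_t.ravel()
    # Rigid transform into the source camera frame: T_{t->s} X
    X_s = T[:3, :3] @ X + T[:3, 3:4]
    # Project with pi(K X_s) and dehomogenize
    p_s = K @ X_s
    us = p_s[0] / p_s[2]
    vs = p_s[1] / p_s[2]
    # Nearest-neighbour sampling of the source frame at the projected pixels
    ui = np.clip(np.round(us).astype(int), 0, W - 1)
    vi = np.clip(np.round(vs).astype(int), 0, H - 1)
    warped = I_s[vi, ui].reshape(H, W)
    # L1 photometric distance between reference and reconstructed frame
    return np.mean(np.abs(I_t - warped))
```

With an identity pose and identical frames the reconstruction is exact and the loss vanishes, which is a useful sanity check when wiring up such a pipeline.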

Additional loss terms include:

  • Edge-aware depth smoothness: Encourages depth gradients to align with image edges.
  • Geometric consistency: Enforces agreement between depths estimated from different viewpoints.
  • Adversarial loss: Used to improve depth realism via a discriminator (e.g., PatchGAN).
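
The edge-aware smoothness term admits a very compact implementation. The sketch below uses the common formulation that down-weights the disparity-gradient penalty by the exponential of the image gradient, so depth discontinuities are tolerated at image edges; the function name is illustrative:

```python
import numpy as np

def edge_aware_smoothness(disp, img):
    """Edge-aware smoothness (minimal sketch): penalize disparity gradients
    except where the image itself has strong gradients (likely object edges).

    disp : (H, W) predicted disparity
    img  : (H, W) grayscale image
    """
    # First-order finite differences along x and y
    d_dx = np.abs(disp[:, 1:] - disp[:, :-1])
    d_dy = np.abs(disp[1:, :] - disp[:-1, :])
    i_dx = np.abs(img[:, 1:] - img[:, :-1])
    i_dy = np.abs(img[1:, :] - img[:-1, :])
    # Down-weight the penalty where image gradients are large
    return np.mean(d_dx * np.exp(-i_dx)) + np.mean(d_dy * np.exp(-i_dy))
```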

Self-supervised methods extend naturally to new sensor modalities. For example, for thermal images, temperature consistency and multi-spectral photometric losses are used (Shin et al., 2021, Shin et al., 2022).

2. Network Architectures and Sensor Modalities

Latest frameworks employ modular, end-to-end trainable systems composed of depth and pose networks, often with additional branches for handling inertial data or dynamic scene decomposition:

  • Depth Estimation: Encoder-decoder architectures (U-Net or ResNet backbones) output dense per-pixel disparity/depth maps. Temporal context can be as little as a single frame; recent work leverages transformers for spatio-temporal aggregation (Wu et al., 2024, Boulahbal et al., 2022).
  • Pose Estimation: PoseNet modules (shallow CNNs, sometimes with LSTM or attention mechanisms) regress 6-DoF transformation parameters between frames. Multi-stream architectures can fuse RGB with inferred depth, IMU, or thermal data (Almalioglu et al., 2019, Jiang et al., 2022).
  • Visual-Inertial Fusion: Some systems eschew explicit temporal and extrinsic calibration, ingesting raw accelerometer and gyroscope data without timestamps and fusing it with visual features via learned attention and sequential models, e.g., bidirectional LSTMs and attention gating (Almalioglu et al., 2019, Qu et al., 2022).
  • Dynamic Motion Modeling: Object motion disentanglement modules decompose global camera and per-instance object motion via rigid and non-rigid (CNN-based deformation) components, enabling robust depth/motion estimation in non-static scenes (Wu et al., 2024, Gao et al., 2020).
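
For concreteness, a pose head's 6-DoF output is typically converted to an SE(3) matrix before it can drive the view-synthesis warp. A minimal sketch, assuming a translation-plus-Euler-angle parameterization (axis-angle is equally common in practice):

```python
import numpy as np

def pose_vec_to_mat(vec):
    """Convert a 6-DoF pose vector (tx, ty, tz, rx, ry, rz; angles in radians)
    into a 4x4 SE(3) matrix, as a PoseNet-style head's output is typically
    used (minimal sketch)."""
    tx, ty, tz, rx, ry, rz = vec
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    # Elementary rotations about the x, y, and z axes
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # composed rotation
    T[:3, 3] = [tx, ty, tz]    # translation
    return T
```

A zero vector maps to the identity transform, which is why these heads are often initialized near zero so training starts from a "no motion" hypothesis.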

Thermal modalities require additional pre-processing, including clipping, colorization, and histogram-based normalization for temporal and edge consistency (Shin et al., 2021, Shin et al., 2022).

3. Self-Supervised Losses and Calibration Handling

A key strength is the generality of self-supervised objectives:

  • Photometric Reconstruction: Based on L1/SSIM image distance between warped and reference frames.
  • Geometric/Depth Consistency: Expressed as pixelwise or log-scale differences between co-visible depth maps, warped into the same viewpoint.
  • Adversarial Losses: Used in SelfVIO to sharpen depth predictions via a local PatchGAN discriminator; the generator is trained to make re-synthesized images indistinguishable from real ones (Almalioglu et al., 2019).
  • Occlusion and Negative-Depth Handling: Frameworks employ differentiable Z-buffering and explicit penalties for negative reprojected depths (Ziwen et al., 2021).
  • Sensor Misalignment and Calibration: Systems avoid hand-specified IMU intrinsics or extrinsics, allowing learned attention masks and temporal fusion (e.g., LSTM) to absorb systematic offsets. This has demonstrated robustness to temporal misalignments of tens of ms and spatial misalignments up to 30° or 0.3 m (Almalioglu et al., 2019).
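
The L1/SSIM photometric distance in the first bullet is commonly combined with an 0.85/0.15 weighting. Below is a simplified sketch that computes SSIM from global image statistics for brevity; production code evaluates SSIM over local windows, and the function name and `alpha` default are illustrative:

```python
import numpy as np

def photometric_distance(a, b, alpha=0.85):
    """Weighted SSIM + L1 photometric distance between two images in [0, 1]
    (simplified sketch using global statistics rather than local windows)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2  # stabilizing constants from SSIM
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    l1 = np.abs(a - b).mean()
    # SSIM in [-1, 1] is mapped to a dissimilarity in [0, 1]
    return alpha * (1 - ssim) / 2 + (1 - alpha) * l1
```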

Calibration-free approaches are also seen for non-pinhole camera models via neural ray surface prediction (Vasiljevic et al., 2020).

4. Metric Scale Recovery and Robustness in Challenging Conditions

A major challenge in monocular self-supervision is inherent scale ambiguity. Several strategies have emerged:

  • Ground Plane Priors: Enforcing agreement between estimated and known camera height above the ground via differentiable proxy losses ensures recovery of metric scale (Wagstaff et al., 2020).
  • Coarse-to-Fine Refinement: Hybrid systems introduce a two-stage process, using sparse LiDAR for initial scaling, then bi-directional photometric/geometric losses for fine-tuning without reliance on dense ground-truth (Qu et al., 2022).
  • Inertial Fusion: Incorporating raw IMU data via attention-based fusion reduces drift, especially in low-light or highly dynamic scenarios.
  • Thermal and Multi-Spectral Consistency: Temperature-based and photometric losses across thermal-visible aligned pairs enable self-supervised depth/pose estimation robust to illumination, adverse weather, and zero-light conditions (Shin et al., 2021, Shin et al., 2022).
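
The ground-plane strategy reduces to a simple rescaling: back-project pixels labelled as ground, estimate the camera height from their vertical coordinate, and scale depth and translation by the ratio to the known height. The sketch below assumes the camera y-axis points toward the ground plane and a ground mask is available; the interface is hypothetical:

```python
import numpy as np

def rescale_to_metric(depth, t, ground_mask, K, cam_height):
    """Metric scale recovery from a known camera height (minimal sketch).

    depth       : (H, W) up-to-scale depth
    t           : (3,) up-to-scale translation
    ground_mask : (H, W) boolean mask of ground pixels
    K           : (3, 3) camera intrinsics
    cam_height  : known metric camera height above the ground (meters)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p = np.stack([u[ground_mask], v[ground_mask], np.ones(ground_mask.sum())])
    # Back-project ground pixels; their Y coordinate approximates camera height
    X = np.linalg.inv(K) @ p * depth[ground_mask]
    h_est = np.median(X[1])
    # One global scale factor corrects both depth and translation
    s = cam_height / h_est
    return s * depth, s * t
```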

5. Dynamic Scenes and Motion Disentanglement

Recent research has addressed modeling scene dynamics:

  • Attentional Separation: Networks split features into static and dynamic channels, with separate decoders for ego-motion (static) and object motion fields (dynamic), using attention-based masks to determine per-pixel trust in rigid or non-rigid flow (Gao et al., 2020).
  • Object-Level Decomposition: Modules such as DO3D predict per-instance 6-DoF object transforms and pixel-level non-rigid deformations, improving foreground depth and flow estimation in highly dynamic scenes (Wu et al., 2024).
  • Auto-Selection Masks: Pixels are automatically assigned to dynamic or static models based on local reprojection error, enabling selective warping for accurate photometric/geometric training (Gao et al., 2020).
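
The auto-selection idea reduces to a per-pixel comparison of reprojection errors between the two candidate warps; a minimal sketch (names illustrative):

```python
import numpy as np

def auto_select_mask(err_static, err_dynamic):
    """Assign each pixel to the model with the lower reprojection error
    (minimal sketch). True means the static (rigid ego-motion) warp wins."""
    return err_static <= err_dynamic

def combine(warp_static, warp_dynamic, mask):
    """Assemble the reconstruction pixelwise from the selected model."""
    return np.where(mask, warp_static, warp_dynamic)
```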

6. Quantitative Performance and Benchmarks

Across standard benchmarks, state-of-the-art self-supervised methods demonstrate consistent improvement in both depth and odometry metrics:

  • On KITTI (Eigen split, cap 80 m): SelfVIO achieves AbsRel 0.127 with δ<1.25=0.844, outperforming GeoNet and GANVO (Almalioglu et al., 2019).
  • KITTI Odometry (09/10): SelfVIO achieves mean t_rel/r_rel 1.88%/1.23°, robust to extrinsic miscalibrations. SelfOdom reports t_rel=7.23% for VO and t_rel=5.43% for VIO, with significant gains from incorporating inertial measurements (Qu et al., 2022).
  • EuRoC MAV: SelfVIO achieves ATE down to 0.11–0.19 m, outperforming classical VIO pipelines under calibration errors.
  • Thermal: On ViViD, thermal-only self-supervised methods achieve indoor AbsRel=0.152 (well-lit) and outdoor AbsRel=0.109 (night), with consistent ATE and RE (Shin et al., 2022).
  • Dynamic Scenes: Object-aware decomposition achieves AbsRel=0.099 and EPE_fg=9.84 px on KITTI for foreground, outperforming prior static-scene models (Wu et al., 2024).

7. Extensions, Limitations, and Ongoing Directions

Current research directions address these limitations and seek broader generalization.

Open problems include residual scale drift or flicker in certain regimes, sensitivity to domain-shift, and efficient scaling to resource-constrained platforms. Recent work demonstrates consistent gains but identifies continued opportunities for enhanced spatio-temporal coherence, more robust egomotion estimation, and seamless deployment on diverse sensors (Almalioglu et al., 2019, Wu et al., 2024).
