RT-MonoDepth: Real-Time Depth Estimation

Updated 13 February 2026
  • RT-MonoDepth is a real-time monocular depth estimation framework designed for embedded systems using a streamlined 4-stage encoder-decoder architecture.
  • It optimizes speed and accuracy by leveraging multi-scale upsampling, lightweight element-wise fusion, and self-supervised loss functions inspired by Monodepth2.
  • Benchmark results on the KITTI Eigen split show lower AbsRel error (0.125) and higher throughput (253-364 FPS) compared to previous models.

RT-MonoDepth refers to a class of real-time monocular depth estimation architectures specifically tailored for efficient deployment on embedded or resource-constrained platforms. These methods combine compact encoder–decoder designs, streamlined upsampling, and self-supervised or supervised loss formulations to minimize both computational and memory overhead while maintaining or improving upon the depth accuracy of prior fast and standard models (Feng et al., 2023).

1. Architectural Foundations and Design Principles

RT-MonoDepth architectures are characterized by a shallow encoder–decoder framework engineered to minimize GPU latency per layer. The canonical RT-MonoDepth pipeline consists of a 4-stage convolutional encoder and a corresponding decoder with multi-scale upsampling, lightweight fusion, and per-scale prediction heads (Feng et al., 2023).

  • Encoder: A 4-level “pyramid” is built from ConvBlocks, each comprising three 3×3 convolutions. The first convolution of each block downsamples with stride 2, and the channel count increases stage by stage. Batch normalization and depth-wise convolutions are omitted to avoid compute bottlenecks on embedded inference engines.
  • Decoder: Upsampling at each stage is implemented by nearest-neighbor interpolation (×2), preceded by a 3×3 convolution to halve the feature channels. Feature fusion across scales utilizes element-wise addition at intermediate stages and channel concatenation at the finest scale.
  • Prediction Heads: Each decoder stage appends a shallow two-layer predictor: one 3×3 convolution with LeakyReLU and a second 3×3 with sigmoid activation, producing a single-channel depth estimation per pixel per scale. Secondary heads can be omitted at inference to reduce latency.

This design is streamlined for embedded GPU implementation, eschewing costlier modules such as transposed convolutions, large-kernel convolutions, or deep multi-path fusion (Feng et al., 2023).
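Below is a minimal PyTorch sketch of this layout. It is not the authors' released implementation: the channel widths (32–256), the LeakyReLU activations in the encoder, and the omission of the secondary per-scale heads (which the paper notes can be dropped at inference) are illustrative assumptions; only the overall structure — 4-stage stride-2 encoder, nearest-neighbor ×2 decoder with channel-halving 3×3 convolutions, add/add/concat fusion, and a sigmoid depth head — follows the description above.

```python
# Hedged sketch of an RT-MonoDepth-style encoder-decoder in PyTorch.
# Channel widths, activations, and layer details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """Three 3x3 convolutions; the first downsamples with stride 2.
    Batch norm and depth-wise convolutions are deliberately omitted."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class DepthHead(nn.Module):
    """Two-layer prediction head: 3x3 conv + LeakyReLU, then 3x3 conv + sigmoid."""
    def __init__(self, in_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_ch, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.head(x)


class RTMonoDepthSketch(nn.Module):
    def __init__(self, widths=(32, 64, 128, 256)):  # assumed channel widths
        super().__init__()
        chans = [3] + list(widths)
        self.encoder = nn.ModuleList([ConvBlock(chans[i], chans[i + 1]) for i in range(4)])
        # Decoder: a 3x3 convolution halves the channels before each nearest-neighbor x2 upsample.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(widths[i], widths[i - 1], 3, padding=1) for i in range(3, 0, -1)]
        )
        self.head = DepthHead(widths[0] * 2)  # concatenation doubles channels at the finest scale

    def forward(self, x):
        feats = []
        for enc in self.encoder:
            x = enc(x)
            feats.append(x)
        f1, f2, f3, f4 = feats  # strides 2, 4, 8, 16
        # Element-wise addition at intermediate scales (cheap fusion).
        d = F.interpolate(self.reduce[0](f4), scale_factor=2, mode="nearest") + f3
        d = F.interpolate(self.reduce[1](d), scale_factor=2, mode="nearest") + f2
        # Channel concatenation only at the finest scale.
        d = torch.cat([F.interpolate(self.reduce[2](d), scale_factor=2, mode="nearest"), f1], dim=1)
        return self.head(d)  # single-channel, sigmoid-normalized disparity


if __name__ == "__main__":
    net = RTMonoDepthSketch()
    disp = net(torch.randn(1, 3, 192, 640))
    print(disp.shape)  # torch.Size([1, 1, 96, 320]), i.e. half the input resolution
```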

2. Loss Functions and Self-Supervision Scheme

RT-MonoDepth inherits its self-supervised training scheme from Monodepth2, based on photometric reconstruction and edge-aware smoothness (Feng et al., 2023).

  • Photometric Reconstruction Loss ($L_{\text{photo}}$): At each scale, reconstruction of the target frame $I_t$ is penalized via a sum of a per-pixel $\ell_1$ difference and an SSIM-based similarity term against the source-view reconstruction $I_{s\rightarrow t}$, using a mixing hyperparameter $\alpha = 0.85$.

$$L_{\text{photo}} = \alpha \sum_i \frac{1-\mathrm{SSIM}\big(I_t(i), I_{s\rightarrow t}(i)\big)}{2} + (1-\alpha)\sum_i \big|I_t(i) - I_{s\rightarrow t}(i)\big|$$

  • Edge-Aware Smoothness Loss ($L_{\text{smooth}}$): Penalizes spatial gradients of the mean-normalized predicted inverse depth $d(i)$, weighted by the corresponding image gradients:

$$L_{\text{smooth}} = \sum_{i} \left[ |\partial_x d(i)|\, e^{-|\partial_x I_t(i)|} + |\partial_y d(i)|\, e^{-|\partial_y I_t(i)|} \right]$$

  • Total Loss: $L_{\text{total}} = L_{\text{photo}} + \lambda_{\text{smooth}} L_{\text{smooth}}$

This framework supports self-supervision from monocular video (M) or joint monocular plus stereo (MS) inputs (Feng et al., 2023).
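A hedged sketch of these losses is given below. It assumes the common Monodepth2-style average-pooled SSIM and an illustrative smoothness weight of $10^{-3}$ (a typical Monodepth2 default; the paper's exact value is not restated here). Following common practice, the sketch averages over pixels rather than summing, which only rescales the loss.

```python
# Hedged sketch of Monodepth2-style self-supervised losses (photometric + edge-aware smoothness).
# Function names and the lambda_smooth default are illustrative assumptions.
import torch
import torch.nn.functional as F


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM using 3x3 average pooling (Monodepth2-style)."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)


def photometric_loss(target, warped, alpha=0.85):
    """Per-pixel L_photo = alpha * (1 - SSIM)/2 + (1 - alpha) * |I_t - I_{s->t}|."""
    l1 = (target - warped).abs().mean(1, keepdim=True)
    ssim_term = (1 - ssim(target, warped)).mean(1, keepdim=True) / 2
    return alpha * ssim_term + (1 - alpha) * l1


def smoothness_loss(disp, image):
    """Edge-aware smoothness on mean-normalized inverse depth (disparity)."""
    disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx_d = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    dy_d = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    dx_i = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()


def total_loss(target, warped, disp, lambda_smooth=1e-3):
    """L_total = L_photo + lambda_smooth * L_smooth (weight value assumed)."""
    return photometric_loss(target, warped).mean() + lambda_smooth * smoothness_loss(disp, target)
```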

3. Quantitative Performance and Benchmark Evaluation

RT-MonoDepth achieves a favorable speed–accuracy trade-off, delivering higher throughput and lower error than prior fast monocular depth estimation models on the KITTI Eigen split (Feng et al., 2023). The table below summarizes representative results at input resolution 640×192:

| Model | #Params (M) | AbsRel ↓ | FPS (Jetson Orin) ↑ |
|---|---|---|---|
| FastDepth | 4.0 | 0.168 | – |
| GuideDepth | 5.8 | 0.142 | 155.0 |
| Monodepth2 (M) | 14.3 | 0.132 | 142.3 |
| RT-MonoDepth | 2.8 | 0.125 | 253.0 |
| RT-MonoDepth-S | 1.2 | 0.132 | 364.1 |

RT-MonoDepth achieves an AbsRel of 0.125 at 2.8M parameters and 253 FPS, outperforming FastDepth and GuideDepth by 3–5% in AbsRel/SqRel with fewer parameters while running 60–100% faster. The lightweight RT-MonoDepth-S variant (1.2M parameters) matches GuideDepth-S in accuracy and substantially exceeds it in throughput, attaining >350 FPS on NVIDIA Jetson Orin (Feng et al., 2023).

4. Implementation, Deployment, and Ablation

The design is fully implemented in PyTorch with FP32 training and TensorRT conversion for FP16 inference. All experiments use the KITTI Eigen split, input rescaling to 640×192, and standard Monodepth2-style augmentations (Feng et al., 2023). Reported latencies are measured after GPU warm-up and averaged over thousands of frames.
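As a hedged illustration of that deployment path (the paper specifies TensorRT FP16 inference but not the exact export pipeline), one common route is ONNX export followed by an FP16 engine build with trtexec. File names, the opset version, and the reuse of the RTMonoDepthSketch module from the earlier architecture sketch are assumptions for illustration.

```python
# Hedged deployment sketch: export the trained FP32 PyTorch model to ONNX,
# then build an FP16 TensorRT engine on the Jetson target with trtexec.
import torch

model = RTMonoDepthSketch().eval()      # architecture sketch from Section 1 (assumed trained)
dummy = torch.randn(1, 3, 192, 640)     # KITTI input rescaled to 640x192

torch.onnx.export(
    model, dummy, "rt_monodepth.onnx",
    input_names=["image"], output_names=["disparity"],
    opset_version=13,
)

# On the Jetson device, build and benchmark an FP16 engine, e.g.:
#   trtexec --onnx=rt_monodepth.onnx --fp16 --saveEngine=rt_monodepth.plan
```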

Key ablation findings:

  • Supervision at Multiple Scales: Using side output heads for deeper loss supervision yields a marginal improvement in AbsRel at zero inference cost (0.127 vs. 0.128).
  • Pyramid Depth: The 4-level encoder provides a strong speed–accuracy balance (0.127 AbsRel); dropping to 2 levels degrades accuracy (0.146), while 5 levels gives minimal gain (0.123) at a significant speed loss (16.2 vs. 18.4 FPS).
  • Fusion Strategy: The combination “add, add, concat” at successive decoder stages best balances accuracy (0.127) and speed (18.4 FPS); fully concatenation-based fusion yields a small accuracy gain at a notable speed cost (see the fusion sketch after this list).
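A minimal sketch of the two fusion options compared in that ablation follows; the helper name and the optional 1×1 reduction convolution are illustrative assumptions, not the paper's code.

```python
# Illustrative contrast between the fusion choices in the ablation:
# element-wise addition (cheap) vs. channel concatenation (more accurate, slower).
import torch
import torch.nn as nn


def fuse(decoder_feat, skip_feat, mode="add", reduce_conv=None):
    if mode == "add":
        return decoder_feat + skip_feat                   # no extra parameters or memory traffic
    fused = torch.cat([decoder_feat, skip_feat], dim=1)   # doubles the channel count
    return reduce_conv(fused) if reduce_conv is not None else fused

# "add, add, concat": addition at the two coarser stages, concatenation only at
# the finest scale, matching the configuration reported as the best trade-off.
```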

5. Comparison with Related Real-Time Architectures

RT-MonoDepth and its variants are distinct from other embedded or real-time monocular depth estimators in their mechanism for upsampling and feature fusion. Unlike models that rely on deep residual backbones or extensive multi-scale concatenation, RT-MonoDepth employs early element-wise fusion (computationally cheap) and only a single channel concatenation at the finest scale, thus minimizing memory movement and maximizing embedded GPU occupancy (Feng et al., 2023). Compared to DRNet (“Double Refinement Network”) (Durasov et al., 2018), which achieves up to 18× speedups and 10× lower RAM on NYU Depth v2 via coarse-to-fine refinement with PixelShuffle upsampling, RT-MonoDepth targets embedded automotive benchmarks with a fully convolutional, nearest-neighbor upsampling pipeline.
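For concreteness, the upsampling contrast drawn above can be shown in a few lines; the 64-channel feature map and its spatial size are assumed values for illustration.

```python
# Nearest-neighbor interpolation (RT-MonoDepth) vs. PixelShuffle (DRNet-style):
# the former is parameter-free, the latter trades channels for spatial resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 24, 80)
up_nn = F.interpolate(x, scale_factor=2, mode="nearest")  # (1, 64, 48, 160), no parameters
up_ps = nn.PixelShuffle(2)(x)                             # (1, 16, 48, 160), channels -> space
```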

Other recently introduced architectures such as RTS-Mono (Cheng et al., 18 Nov 2025) employ hybrid CNN–Transformer encoders and more complex multi-scale sparse fusion, attaining further speed/efficiency gains (e.g., 49 FPS on Jetson Orin, 3.0 M parameters at AbsRel 0.101). A plausible implication is that the field continues to optimize for both architectural simplicity and hardware alignment, trading manual backbone design for greater use of hybrid and block-sparse connectivity in the encoder and decoder.

6. Limitations and Future Research Directions

RT-MonoDepth exhibits a degradation in high-resolution regimes; at 1024×320, AbsRel increases to ≈ 0.14, attributed to the limited receptive field of the lightweight encoder (Feng et al., 2023). This suggests that while the architecture is effective for low-to-moderate resolutions, extending performance to fine-scale edge prediction remains challenging. Potential improvements include adoption of larger convolution kernels or atrous convolutions, though such modifications must not compromise real-time embedded speed—a principal constraint of the framework. Further research may also explore hybrid CNN-transformer encoders, hardware-aware neural architecture search, and incorporation of cross-scale consistency loss terms as utilized in recent works (Cheng et al., 18 Nov 2025).

7. Summary

RT-MonoDepth frameworks, exemplified by (Feng et al., 2023), demonstrate that careful architectural pruning—via standard convolutions, shallow encoders, early simple fusion schemes, and nearest-neighbor upsampling—can yield monocular depth estimation networks that simultaneously achieve competitive accuracy and real-time throughput on embedded systems. These models outperform many past fast approaches in both error and speed, with parameter counts an order of magnitude lower than standard backbones. The field continues to evolve toward hybrid architectures and further optimizations for deployment in autonomous robotics, UAV navigation, and other latency- and power-constrained AI applications.
