RT-MonoDepth: Real-Time Depth Estimation
- RT-MonoDepth is a real-time monocular depth estimation framework designed for embedded systems using a streamlined 4-stage encoder-decoder architecture.
- It optimizes speed and accuracy by leveraging multi-scale upsampling, lightweight element-wise fusion, and self-supervised loss functions inspired by Monodepth2.
- Benchmark results on the KITTI Eigen split show lower AbsRel error (0.125) and higher throughput (253–364 FPS) compared to previous models.
RT-MonoDepth refers to a class of real-time monocular depth estimation architectures specifically tailored for efficient deployment on embedded or resource-constrained platforms. These methods combine compact encoder–decoder designs, streamlined upsampling, and self-supervised or supervised loss formulations to minimize both computational and memory overhead while maintaining or improving upon the depth accuracy of prior fast and standard models (Feng et al., 2023).
1. Architectural Foundations and Design Principles
RT-MonoDepth architectures are characterized by a shallow encoder–decoder framework engineered to minimize GPU latency per layer. The canonical RT-MonoDepth pipeline consists of a 4-stage convolutional encoder and a corresponding decoder with multi-scale upsampling, lightweight fusion, and per-scale prediction heads (Feng et al., 2023).
- Encoder: A 4-level “pyramid” is constructed from ConvBlocks, each comprising three 3×3 convolutions. Downsampling is performed by the first convolution of each block (stride 2), and the channel count increases from stage to stage. Batch normalization and depth-wise convolutions are omitted to avoid compute bottlenecks on embedded inference engines.
- Decoder: Upsampling at each stage is implemented by nearest-neighbor interpolation (×2), preceded by a 3×3 convolution to halve the feature channels. Feature fusion across scales utilizes element-wise addition at intermediate stages and channel concatenation at the finest scale.
- Prediction Heads: Each decoder stage appends a shallow two-layer predictor: a 3×3 convolution with LeakyReLU followed by a second 3×3 convolution with sigmoid activation, producing a single-channel depth prediction at each scale. The secondary (coarser-scale) heads can be omitted at inference to reduce latency.
This design is streamlined for embedded GPU implementation, eschewing costlier modules such as transposed convolutions, large-kernel convolutions, or deep multi-path fusion (Feng et al., 2023).
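The PyTorch sketch below illustrates this layout. It is a minimal approximation under stated assumptions: the channel widths, the LeakyReLU activations in the encoder, and the exact placement of the side heads are illustrative choices, not the authors' released configuration.

```python
# Minimal sketch of an RT-MonoDepth-style encoder-decoder (illustrative; channel
# widths, activations, and head placement are assumptions, not the paper's exact config).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """Three 3x3 convolutions; the first downsamples with stride 2. No BatchNorm."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class DispHead(nn.Module):
    """Two-layer head: 3x3 conv + LeakyReLU, then 3x3 conv + sigmoid (one channel)."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_ch, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)


class RTMonoDepthSketch(nn.Module):
    """Encoder: 4 ConvBlocks. Decoder: 3x3 conv halves channels, x2 nearest-neighbour
    upsample, element-wise add at intermediate scales, channel concat at the finest."""
    def __init__(self, chs=(32, 64, 128, 256)):        # channel widths are assumptions
        super().__init__()
        self.enc1, self.enc2 = ConvBlock(3, chs[0]), ConvBlock(chs[0], chs[1])
        self.enc3, self.enc4 = ConvBlock(chs[1], chs[2]), ConvBlock(chs[2], chs[3])
        self.red3 = nn.Conv2d(chs[3], chs[2], 3, padding=1)   # 256 -> 128
        self.red2 = nn.Conv2d(chs[2], chs[1], 3, padding=1)   # 128 -> 64
        self.red1 = nn.Conv2d(chs[1], chs[0], 3, padding=1)   # 64  -> 32
        self.head3, self.head2 = DispHead(chs[2]), DispHead(chs[1])  # side outputs
        self.head1 = DispHead(2 * chs[0])                     # after concat fusion

    def forward(self, x):
        f1 = self.enc1(x)                            # H/2
        f2 = self.enc2(f1)                           # H/4
        f3 = self.enc3(f2)                           # H/8
        f4 = self.enc4(f3)                           # H/16
        up = lambda t: F.interpolate(t, scale_factor=2, mode="nearest")
        x3 = up(self.red3(f4)) + f3                  # element-wise add fusion
        x2 = up(self.red2(x3)) + f2                  # element-wise add fusion
        x1 = torch.cat([up(self.red1(x2)), f1], 1)   # channel concat at the finest scale
        # Coarser heads serve as training-time side outputs; they can be dropped at inference.
        return self.head1(x1), self.head2(x2), self.head3(x3)


if __name__ == "__main__":
    outs = RTMonoDepthSketch()(torch.randn(1, 3, 192, 640))
    print([tuple(o.shape) for o in outs])            # finest prediction at half input resolution
```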
2. Loss Functions and Self-Supervision Scheme
RT-MonoDepth inherits its self-supervised training scheme from Monodepth2, based on photometric reconstruction and edge-aware smoothness (Feng et al., 2023).
- Photometric Reconstruction Loss ($L_p$): At each scale, reconstruction of the target frame $I_t$ is penalized via a combination of per-pixel $L_1$ difference and SSIM-based similarity with the source-view reconstructions $I_{t' \to t}$, using a mixing hyperparameter $\alpha$:
  $$pe(I_a, I_b) = \frac{\alpha}{2}\bigl(1 - \mathrm{SSIM}(I_a, I_b)\bigr) + (1 - \alpha)\,\lVert I_a - I_b \rVert_1$$
- Edge-Aware Smoothness Loss ($L_s$): Penalizes spatial gradients of the predicted (mean-normalized) inverse depth $d^{*} = d/\bar{d}$, weighted by the corresponding image gradients:
  $$L_s = \bigl|\partial_x d^{*}\bigr|\, e^{-\lvert\partial_x I_t\rvert} + \bigl|\partial_y d^{*}\bigr|\, e^{-\lvert\partial_y I_t\rvert}$$
- Total Loss: $L = L_p + \lambda L_s$, averaged over the prediction scales, where $\lambda$ weights the smoothness term.
This framework supports self-supervision from monocular video (M) or joint monocular plus stereo (MS) inputs (Feng et al., 2023).
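A minimal PyTorch sketch of these Monodepth2-style terms follows. The SSIM weighting α = 0.85, the smoothness weight λ ≈ 10⁻³, and the per-pixel minimum over source reconstructions are Monodepth2's published defaults and should be read as assumptions with respect to RT-MonoDepth's exact training recipe.

```python
# Sketch of the Monodepth2-style losses referenced above (alpha = 0.85 and
# smooth_weight ~ 1e-3 are Monodepth2's defaults; treat them as assumptions here).
import torch
import torch.nn.functional as F


def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM with 3x3 average pooling; returns the dissimilarity (1 - SSIM)/2."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)


def photometric_error(pred, target, alpha=0.85):
    """pe = alpha/2 * (1 - SSIM) + (1 - alpha) * L1; ssim_loss already returns (1 - SSIM)/2."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return alpha * ssim_loss(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1


def smoothness_loss(disp, img):
    """Edge-aware smoothness on mean-normalized inverse depth (disparity)."""
    d = disp / (disp.mean([2, 3], keepdim=True) + 1e-7)
    dx = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    ix = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()


def total_loss(reconstructions, target, disp, smooth_weight=1e-3):
    """L = L_p + lambda * L_s, using Monodepth2's per-pixel minimum over source views."""
    pe = torch.stack([photometric_error(r, target) for r in reconstructions], dim=0)
    l_p = pe.min(dim=0).values.mean()
    return l_p + smooth_weight * smoothness_loss(disp, target)
```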
3. Quantitative Performance and Benchmark Evaluation
RT-MonoDepth achieves a favorable speed–accuracy trade-off, delivering higher throughput and lower error than prior fast monocular depth estimation models on the KITTI Eigen split (Feng et al., 2023). The table below summarizes representative results at input resolution 640×192:
| Model | #Params (M) | AbsRel ↓ | FPS Jetson Orin ↑ |
|---|---|---|---|
| FastDepth | 4.0 | 0.168 | – |
| GuideDepth | 5.8 | 0.142 | 155.0 |
| Monodepth2 (M) | 14.3 | 0.132 | 142.3 |
| RT-MonoDepth | 2.8 | 0.125 | 253.0 |
| RT-MonoDepth-S | 1.2 | 0.132 | 364.1 |
RT-MonoDepth achieves an AbsRel of 0.125 with 2.8 M parameters at 253 FPS, outperforming FastDepth and GuideDepth by 3–5% in AbsRel/SqRel with fewer parameters while running 60–100% faster. The lightweight RT-MonoDepth-S variant (1.2 M parameters) matches GuideDepth-S in accuracy and substantially exceeds it in throughput, attaining >350 FPS on NVIDIA Jetson Orin (Feng et al., 2023).
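For reference, the AbsRel metric reported above is the mean absolute relative error over valid ground-truth pixels. A minimal sketch follows; masking and depth-capping conventions vary by benchmark and are simplified here.

```python
# AbsRel = mean_i |d_i - d*_i| / d*_i over pixels with valid ground truth.
import torch


def abs_rel(pred_depth: torch.Tensor, gt_depth: torch.Tensor) -> torch.Tensor:
    valid = gt_depth > 0                      # KITTI LiDAR ground truth is sparse
    pred, gt = pred_depth[valid], gt_depth[valid]
    return ((pred - gt).abs() / gt).mean()
```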
4. Implementation, Deployment, and Ablation
The design is fully implemented in PyTorch with FP32 training and TensorRT conversion for FP16 inference. All experiments leverage the KITTI Eigen split, input rescaling to 640×192, and standard Monodepth2-style augmentations (Feng et al., 2023). Measured latencies account for warm-up and are averaged over thousands of frames.
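One plausible version of this deployment path is sketched below. It reuses the RTMonoDepthSketch module from Section 1; the file names, ONNX opset, and trtexec invocation are illustrative assumptions rather than the authors' documented toolchain.

```python
# Plausible FP32-train / FP16-infer recipe (the paper does not detail its exact
# conversion pipeline): export the trained network to ONNX, then build an FP16
# TensorRT engine with NVIDIA's trtexec tool on the target Jetson.
import torch

model = RTMonoDepthSketch().eval()        # sketch model from Section 1; load trained weights here
dummy = torch.randn(1, 3, 192, 640)       # KITTI input resolution used in the paper
torch.onnx.export(
    model, dummy, "rt_monodepth.onnx",
    input_names=["image"],
    output_names=["disp_s1", "disp_s2", "disp_s3"],
    opset_version=13,
)
# On the Jetson device (shell):
#   trtexec --onnx=rt_monodepth.onnx --fp16 --saveEngine=rt_monodepth.engine
# Latency is then measured after warm-up and averaged over many frames, as noted above.
```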
Key ablation findings:
- Supervision at Multiple Scales: Using side output heads for deeper loss supervision yields a marginal improvement in AbsRel at zero inference cost (0.127 vs. 0.128).
- Pyramid Depth: The 4-level encoder provides strong speed–accuracy balance (0.127 AbsRel); dropping to 2 levels degrades accuracy (0.146), while 5 levels gives minimal gain (0.123) at significant speed loss (16.2 FPS vs. 18.4).
- Fusion Strategy: The combination “add, add, concat” at successive stages optimally balances accuracy (0.127) and speed (18.4 FPS). Fully concatenation-based fusion yields a small accuracy gain at notable speed expense.
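The fusion ablation can be expressed as a single configurable step, sketched below; the function and schedule names are illustrative, not part of the published code.

```python
# "add, add, concat" vs. all-concat fusion: concat widens the next convolution's
# input, which is what costs speed on the embedded GPU; add keeps channels fixed.
import torch


def fuse(decoder_feat: torch.Tensor, encoder_feat: torch.Tensor, mode: str) -> torch.Tensor:
    if mode == "add":                      # cheap: no extra channels or memory traffic
        return decoder_feat + encoder_feat
    if mode == "concat":                   # slightly more accurate, but slower downstream
        return torch.cat([decoder_feat, encoder_feat], dim=1)
    raise ValueError(f"unknown fusion mode: {mode}")


# Reported sweet spot, coarse to fine:
FUSION_SCHEDULE = ("add", "add", "concat")
```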
5. Comparison with Related Real-Time Solutions
RT-MonoDepth and its variants are distinct from other embedded or real-time monocular depth estimators in their mechanism for upsampling and feature fusion. Unlike models that rely on deep residual backbones or extensive multi-scale concatenation, RT-MonoDepth employs early element-wise fusion (computationally cheap) and only a single channel concatenation at the finest scale, thus minimizing memory movement and maximizing embedded GPU occupancy (Feng et al., 2023). Compared to DRNet (“Double Refinement Network”) (Durasov et al., 2018), which achieves up to 18× speedups and 10× lower RAM on NYU Depth v2 via a coarse-to-fine refinement with PixelShuffle upsampling, RT-MonoDepth targets embedded automotive benchmarks with a fully convolutional, nearest-neighbor upsampling pipeline.
Other recently introduced architectures such as RTS-Mono (Cheng et al., 18 Nov 2025) employ hybrid CNN–Transformer encoders and more complex multi-scale sparse fusion, attaining further accuracy and efficiency gains (e.g., AbsRel 0.101 with 3.0 M parameters at 49 FPS on Jetson Orin). A plausible implication is that the field continues to optimize for both architectural simplicity and hardware alignment, trading manual backbone design for greater use of hybrid and block-sparse connectivity in the encoder and decoder.
6. Limitations and Future Research Directions
RT-MonoDepth exhibits a degradation in high-resolution regimes; at 1024×320, AbsRel increases to ≈ 0.14, attributed to the limited receptive field of the lightweight encoder (Feng et al., 2023). This suggests that while the architecture is effective for low-to-moderate resolutions, extending performance to fine-scale edge prediction remains challenging. Potential improvements include adoption of larger convolution kernels or atrous convolutions, though such modifications must not compromise real-time embedded speed—a principal constraint of the framework. Further research may also explore hybrid CNN-transformer encoders, hardware-aware neural architecture search, and incorporation of cross-scale consistency loss terms as utilized in recent works (Cheng et al., 18 Nov 2025).
7. Summary
RT-MonoDepth frameworks, exemplified by (Feng et al., 2023), demonstrate that careful architectural pruning—via standard convolutions, shallow encoders, early simple fusion schemes, and nearest-neighbor upsampling—can yield monocular depth estimation networks that simultaneously achieve competitive accuracy and real-time throughput on embedded systems. These models outperform many past fast approaches in both error and speed, with parameter counts an order of magnitude lower than standard backbones. The field continues to evolve toward hybrid architectures and further optimizations for deployment in autonomous robotics, UAV navigation, and other latency- and power-constrained AI applications.