
Monocular Depth Estimation on Embedded Systems

Updated 15 July 2025
  • Monocular depth estimation on embedded systems uses lightweight, hardware-aware neural networks to infer dense depth maps from a single RGB image.
  • Architectural strategies include pyramid-based feature extraction, MobileNet backbones, and efficient upsampling techniques, enabling low memory usage and real-time performance on resource-constrained devices.
  • Training and compression techniques leverage self-supervised learning, multi-scale loss functions, pruning, and 8-bit quantization to balance accuracy and energy efficiency for diverse applications.

Monocular depth estimation on embedded systems refers to the inference of dense depth maps from single RGB images using algorithms and models that are specifically optimized for real-time performance, low memory consumption, and power efficiency on resource-constrained platforms. This research area sits at the intersection of computer vision, deep learning, and hardware-aware model design, and is pivotal for enabling cost-effective 3D perception in applications such as robotics, autonomous navigation, assistive technologies, and augmented reality where hardware budgets are limited and real-time inference is critical.

1. Core Architectural Strategies for Embedded Monocular Depth Estimation

Monocular depth estimation methods for embedded systems are predominantly characterized by lightweight encoder–decoder architectures, pyramid-based feature extractors, efficient upsampling blocks, and an increasing trend toward hardware-aware neural operations.

  • Pyramidal and Hierarchical Encoders: Architectures such as PyD-Net use a pyramidal hierarchy where multi-scale features are extracted at successively coarser resolutions and progressively refined through a lightweight series of decoders (Poggi et al., 2018). This strategy reduces the number of model parameters to as little as 6% of a typical large network and minimizes memory usage, allowing inference even on devices like Raspberry Pi.
  • MobileNet Backbones and Depthwise Separable Convolutions: FastDepth and other modern networks employ MobileNet as the encoder, leveraging depthwise separable convolutions to reduce multiply–accumulate operations (MACs) by an order of magnitude compared to conventional convolutions (Wofk et al., 2019). This greatly decreases computational cost and model size; a minimal sketch of such a block follows this list.
  • Efficient Decoding and Upsampling: Techniques such as nearest-neighbor interpolation, lightweight convolutional upsampling, and guided image filter-inspired upsampling blocks (GUB) enable the production of high-resolution depth maps without computationally intensive transposed convolutions (Rudolph et al., 2022).
  • Edge Guidance, Context Fusion, and Attention Mechanisms: Some recent designs incorporate explicit edge guidance or transformer-inspired modules. For example, an Edge Guided Depth Estimation Network integrates edge-attention branches and a transformer-based feature aggregation module to improve depth estimation around object boundaries, while maintaining a total parameter count in the low millions (Dong et al., 2022).
  • Hybrid, Recurrent, and Token-Efficient Designs: MiniNet (Liu et al., 2020) employs a recurrent module to simulate a deep network with parameter reuse, while the Token-Sharing Transformer (TST) (Lee et al., 2023) shares a global context token across local features, offering transformer-level accuracy at a fraction of the computational cost.
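
To make the efficiency argument concrete, the sketch below shows a MobileNet-style depthwise separable convolution block together with a nearest-neighbor upsampling block of the kind used in lightweight decoders. It is a minimal PyTorch illustration under stated assumptions, not a reproduction of any cited architecture; the class names and channel counts are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by pointwise 1x1 conv (MobileNet-style).

    Compared with a standard 3x3 convolution, this factorization cuts
    multiply-accumulate operations by roughly an order of magnitude for
    typical channel counts.
    """
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class UpsampleBlock(nn.Module):
    """Nearest-neighbor upsampling followed by a cheap convolution,
    avoiding transposed convolutions in the decoder."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = DepthwiseSeparableConv(in_ch, out_ch)

    def forward(self, x):
        x = nn.functional.interpolate(x, scale_factor=2, mode="nearest")
        return self.conv(x)

if __name__ == "__main__":
    x = torch.randn(1, 3, 224, 224)                 # single RGB frame
    feat = DepthwiseSeparableConv(3, 32, stride=2)(x)
    up = UpsampleBlock(32, 16)(feat)
    print(feat.shape, up.shape)                     # (1, 32, 112, 112), (1, 16, 224, 224)
```

The split into depthwise and pointwise convolutions is what yields the MAC reduction noted above, while nearest-neighbor upsampling plus a light convolution stands in for the more expensive transposed convolutions of conventional decoders.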

Table: Typical Parameter and Throughput Ranges

| Network/Approach | Parameters (M) | Throughput (FPS unless noted; platform in parentheses) |
|---|---|---|
| PyD-Net | 1.9 | 8 Hz (CPU); 1.7 s/frame (Raspberry Pi 3) |
| FastDepth | 1.34 | 27 (TX2 CPU); 178 (TX2 GPU) |
| MiniNet | 0.217 | 2 (Raspberry Pi 3); 37 (CPU); 110 (GPU) |
| Edge Guided Net | 2.21 | 96 (GTX 1080 GPU) |
| RT-MonoDepth | 2.8 | 18.4 (Jetson Nano); 253 (AGX Orin) |
| GuideDepth | 1.9 | 35.1 (Jetson Nano); 144.5 (Xavier NX) |
| TST | 1.27–1.8 | 63.4 (Jetson Nano); 142.6 (TX2) |
| LMDepth | 2.9 | 122 (Xavier, INT8 quantized) |

(Sources: (Poggi et al., 2018, Wofk et al., 2019, Liu et al., 2020, Dong et al., 2022, Feng et al., 2023, Rudolph et al., 2022, Lee et al., 2023, Long et al., 2 May 2025))

2. Training Techniques and Loss Functions

Embedded-focused monocular depth estimation models frequently exploit unsupervised or self-supervised learning frameworks to avoid dependence on dense ground truth depth labels, further streamlining the pipeline for practical scenarios.

  • Image Reconstruction as Supervision: Many systems cast depth estimation as an image reconstruction task, where a predicted disparity map is used to warp one view to reconstruct another. The loss functions typically integrate a combination of pixel-wise L1 losses, SSIM (Structural Similarity Index Measure), and edge- or smoothness-aware regularization terms (Poggi et al., 2018).
  • Multi-Scale and Hierarchical Losses: Loss terms are often computed at multiple output resolutions, enabling early supervision and mitigating vanishing gradients. A generic multi-scale loss for scale $s$ can be written as:

$$\mathcal{L}_s = \alpha_{ap}\,(\mathcal{L}_{ap}^{l} + \mathcal{L}_{ap}^{r}) + \alpha_{ds}\,(\mathcal{L}_{ds}^{l} + \mathcal{L}_{ds}^{r}) + \alpha_{lr}\,(\mathcal{L}_{lr}^{l} + \mathcal{L}_{lr}^{r})$$

where the appearance loss $\mathcal{L}_{ap}$, disparity smoothness loss $\mathcal{L}_{ds}$, and left-right consistency loss $\mathcal{L}_{lr}$ are computed for the left ($l$) and right ($r$) views and weighted by the coefficients $\alpha_{ap}$, $\alpha_{ds}$, and $\alpha_{lr}$; a minimal sketch of such a composite loss follows this list.

  • Self-Supervision and Proxy Labels: Some frameworks generate proxy ground truth via classical stereo algorithms (e.g., Semi-Global Matching) or inject geometric priors from sparse visual odometry output using autoencoders that densify these priors for network guidance (Tosi et al., 2019, Andraghetti et al., 2019).
  • Balanced and Structure-Aware Losses: Recent networks adopt multi-term losses to simultaneously enforce pixelwise accuracy, gradient consistency, normal vector alignment, and perceptual similarity via SSIM, leading to better preservation of object boundaries and fine structures (Papa et al., 13 Mar 2024).
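
To illustrate how these terms are combined in practice, the following is a minimal PyTorch sketch of an appearance loss (SSIM mixed with L1) and an edge-aware disparity smoothness loss, the two most common ingredients of the per-scale loss above. The 3x3 SSIM window, the 0.85 mixing weight, and the function names are illustrative assumptions rather than the exact formulation of any cited paper; the left-right consistency term and per-scale weighting follow the equation given earlier.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 windows, returned as a per-pixel dissimilarity map."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return torch.clamp((1 - ssim_map) / 2, 0, 1)

def appearance_loss(pred, target, alpha=0.85):
    """Weighted mix of SSIM dissimilarity and per-pixel L1, as used in many
    self-supervised monocular depth pipelines."""
    return (alpha * ssim(pred, target) + (1 - alpha) * (pred - target).abs()).mean()

def smoothness_loss(disp, img):
    """Edge-aware smoothness: penalize disparity gradients, downweighted at image edges."""
    grad_disp_x = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
    grad_disp_y = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
    grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
           (grad_disp_y * torch.exp(-grad_img_y)).mean()
```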

3. Model Compression, Quantization, and Deployment Strategies

To ensure real-time operation under severe memory and power constraints, state-of-the-art depth models employ extensive optimization and deployment techniques:

  • Network Pruning: Pruning methods such as NetAdapt automatically remove channels/layers while monitoring accuracy and latency directly on target hardware, yielding up to 2–3× parameter reduction without significant accuracy loss (Wofk et al., 2019).
  • Quantization: Static (post-training) quantization is widely adopted; 8-bit INT quantization reduces model size (e.g., from 26 MB to 2.63 MB) and doubles inference speed with minimal impact on accuracy (Long et al., 2 May 2025). A hedged pruning-and-quantization sketch follows this list.
  • Inference Optimization Frameworks: Toolchains such as TensorRT (NVIDIA) and TVM are used for operator fusion, memory layout optimization, and hardware-specific kernel selection, further improving runtime performance (Wofk et al., 2019, An et al., 2021).
  • BatchNorm and Convolution Choices: For some devices, standard convolutions outperform depthwise convolutions due to better hardware support, and batch normalization layers are often omitted to further reduce latency and improve stability with small batch sizes (Feng et al., 2023).
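
As a concrete illustration of the compression step, the sketch below applies structured magnitude pruning and eager-mode post-training static INT8 quantization to a toy network using PyTorch's torch.nn.utils.prune and torch.ao.quantization APIs. This is a simplified stand-in under stated assumptions: NetAdapt-style pruning is hardware-in-the-loop rather than magnitude-based, and the cited deployments typically go through TensorRT or TVM rather than PyTorch's own quantized kernels.

```python
import torch
import torch.nn as nn
from torch.nn.utils import prune
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

class TinyDepthNet(nn.Module):
    """Toy stand-in for a depth network, small enough to quantize end to end."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # fp32 -> int8 at the network input
        self.dequant = DeQuantStub()  # int8 -> fp32 at the network output
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.dequant(self.body(self.quant(x)))

model = TinyDepthNet().eval()

# Structured pruning: zero out 25% of the first layer's output channels by L2 norm.
# This only zeroes weights; physically removing channels (as NetAdapt does) requires
# rebuilding the layer with fewer filters.
prune.ln_structured(model.body[0], name="weight", amount=0.25, n=2, dim=0)
prune.remove(model.body[0], "weight")

# Post-training static quantization: observe activation ranges on calibration data,
# then fold them into INT8 kernels.
model.qconfig = get_default_qconfig("fbgemm")    # use "qnnpack" on ARM boards
prepared = prepare(model)
with torch.no_grad():
    for _ in range(8):                           # calibration pass on representative frames
        prepared(torch.randn(1, 3, 224, 224))
int8_model = convert(prepared)
print(int8_model(torch.randn(1, 3, 224, 224)).shape)
```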

4. Performance Metrics and Real-World Evaluation

Common evaluation protocols for monocular depth estimation on embedded systems use the following quantitative metrics, with a minimal reference implementation sketched after the list:

  • Absolute Relative Error (Abs Rel):

$$\text{Abs Rel} = \frac{1}{N} \sum_{i} \frac{|d_i - \hat{d}_i|}{d_i}$$

  • Root Mean Squared Error (RMSE):

$$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i} \left(d_i - \hat{d}_i\right)^2}$$

  • Threshold Accuracy ($\delta$):

$$\delta = \text{percentage of pixels } i \text{ satisfying } \max\!\left(\frac{d_i}{\hat{d}_i}, \frac{\hat{d}_i}{d_i}\right) < \text{thr}$$

with the thresholds commonly set at 1.25, 1.25², and 1.25³.

  • Frame Rate and Energy Efficiency: Models report throughput (frames per second) and often "images/second/watt" (Wang et al., 2020), demonstrating real-time capabilities (e.g., 122 FPS at INT8 precision on NVIDIA Xavier (Long et al., 2 May 2025)) and high energy efficiency.
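
The metrics above are straightforward to compute; the snippet below is a minimal NumPy reference implementation. The masking convention and epsilon are assumptions, and published evaluations often add dataset-specific cropping and scale alignment.

```python
import numpy as np

def depth_metrics(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-6) -> dict:
    """Abs Rel, RMSE, and threshold accuracies over valid (positive) ground-truth pixels."""
    mask = gt > eps
    gt, pred = gt[mask], np.clip(pred[mask], eps, None)
    abs_rel = float(np.mean(np.abs(gt - pred) / gt))
    rmse = float(np.sqrt(np.mean((gt - pred) ** 2)))
    ratio = np.maximum(gt / pred, pred / gt)
    deltas = {f"delta_{k}": float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "rmse": rmse, **deltas}
```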

5. Applications across Robotics, AR/VR, and Assistive Devices

The models and methodologies developed for embedded monocular depth estimation are applied in diverse domains:

  • Mobile Robotics and UAVs: Real-time depth maps are indispensable for obstacle avoidance, 3D mapping, and localization, with typical requirements of >20 FPS for smooth operation (Wofk et al., 2019, Lee et al., 2023).
  • Assistive Technologies: Integrated vision systems combining depth estimation and object detection (e.g., DPT Hybrid MiDaS and YOLOv8m) provide real-time audio or haptic feedback for visually impaired users, with quantized models executed on platforms such as Raspberry Pi (Anjom et al., 10 Jul 2025).
  • Augmented/Virtual Reality: Low-latency, energy-efficient models enable immersive scene understanding and object anchoring on wearable headsets and mobile devices.
  • Underwater and Adverse Domains: Specialized network designs (UDepth) introduce domain priors, such as underwater light attenuation, and deploy highly compact models with custom input spaces (e.g., RMI: red, max of green/blue, intensity channels) for use on low-cost underwater robots (Yu et al., 2022); a sketch of this input construction follows.
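
The RMI input construction mentioned for UDepth can be sketched in a few lines. The channel ordering matches the description above, while the normalization and the use of the channel mean as intensity are assumptions that may differ from the paper.

```python
import numpy as np

def rgb_to_rmi(rgb: np.ndarray) -> np.ndarray:
    """Map an HxWx3 RGB image (float in [0, 1]) to the RMI space: red, max of
    green/blue (the least-attenuated channels underwater), and intensity."""
    r = rgb[..., 0]
    m = np.maximum(rgb[..., 1], rgb[..., 2])
    i = rgb.mean(axis=-1)   # intensity approximated here as the channel mean
    return np.stack([r, m, i], axis=-1)
```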

6. Limitations, Trade-offs, and Open Challenges

Despite significant advances, there remain several challenges and trade-offs:

  • Accuracy vs. Efficiency: Model simplification, pruning, and quantization can marginally reduce accuracy (e.g., up to 2% SSIM reduction with 75% parameter reduction (Patwari et al., 2022)), necessitating careful design to meet application-specific requirements.
  • Decoder Latency Bottleneck: While encoders have been extensively optimized, upsampling and dense decoding remain the main contributors to runtime; further innovation in decoder architectures may yield additional gains (Wofk et al., 2019).
  • Hardware Constraints: Operations such as depthwise convolution and post-training quantization may not always yield optimal speedups on all embedded hardware, requiring platform-specific adjustments (Feng et al., 2023).

7. Recent Innovations and Future Research Directions

Recent years have seen the integration of components from transformers (e.g., token sharing, lightweight vision transformers), linear state space models (Mamba blocks), and biologically motivated cues (semantic segmentation, size priors, and language embeddings) for further gains in generalization and edge case handling (Lee et al., 2023, Long et al., 2 May 2025, Auty et al., 2022). Advanced data augmentation, structure-aware distillation, and proxy signal exploitation (e.g., SLAM for metric scaling (Choi et al., 2022)) represent promising avenues for further enhancing robustness and scaling to more challenging and varied environments, including across domain shifts and sensor modalities.

In conclusion, monocular depth estimation on embedded systems is a rapidly evolving research area characterized by architectural innovations, robust training frameworks, and practical deployment strategies that enable high-fidelity 3D perception within the compute and energy constraints of modern edge platforms. The field continues to progress toward greater efficiency, accuracy, and adaptability, broadening the availability of 3D vision capabilities across increasingly accessible hardware platforms.
