XD-RCDepth: Efficient Radar–Camera Depth Estimation
- XD-RCDepth is a lightweight radar-camera depth estimation framework that uses MobileNetV2 backbones and FiLM modules to achieve a 29.7% reduction in parameters while preserving competitive accuracy.
- The method incorporates explainability-aligned distillation using Grad-CAM and depth-distribution distillation to enhance interpretability and calibration of the depth predictions.
- Empirical evaluations on benchmarks like nuScenes show up to a 7.97% reduction in MAE and real-time inference at ~15 ms per frame, making it well suited to autonomous driving applications.
XD-RCDepth refers to a lightweight radar–camera depth estimation framework that combines advanced fusion architecture with novel distillation strategies to achieve both high accuracy and real-time efficiency for autonomous driving and related applications (Sun et al., 15 Oct 2025). By leveraging efficient feature fusion and explainability-aligned distillation, XD-RCDepth maintains competitive prediction quality while significantly reducing model complexity relative to prior state-of-the-art methods.
1. Model Architecture and Fusion Design
XD-RCDepth adopts MobileNetV2 backbones for both radar and camera image streams, enabling substantial parameter reduction compared to heavier networks such as ResNet-34 used in teacher architectures (e.g., CaFNet). Features are extracted at multiple resolutions (1/2, 1/4, ..., 1/32 scale), with spatial correspondence ensured at each stage.
Radar–camera fusion is mediated by compact Feature-wise Linear Modulation (FiLM) modules at each scale. For input radar features F_R and image features F_I, the fusion is achieved as

F̃_I = γ(F_R) ⊙ F_I + β(F_R),

where ⊙ denotes element-wise multiplication and γ(·), β(·) are learned scale and shift functions of the radar features. This scheme conditions image features directly on radar characteristics, allowing the fused representation to respond adaptively to geometric cues.
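The FiLM conditioning can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the pooling of radar features to a per-channel descriptor and the linear maps `W_gamma`/`W_beta` are assumptions standing in for whatever small subnetwork predicts the scale and shift parameters.

```python
import numpy as np

def film_fuse(img_feat, radar_feat, W_gamma, W_beta):
    """FiLM-style fusion sketch: radar features modulate image features.

    img_feat:   (C, H, W) image feature map at one scale
    radar_feat: (C, H, W) radar feature map at the same scale
    W_gamma, W_beta: (C, C) linear maps predicting scale/shift (hypothetical)
    """
    # Pool radar features to a per-channel descriptor (one common choice).
    r = radar_feat.mean(axis=(1, 2))             # (C,)
    gamma = W_gamma @ r                          # per-channel scale
    beta = W_beta @ r                            # per-channel shift
    # Broadcast the per-channel scale/shift over the spatial grid.
    return gamma[:, None, None] * img_feat + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
fused = film_fuse(rng.normal(size=(C, H, W)), rng.normal(size=(C, H, W)),
                  rng.normal(size=(C, C)), rng.normal(size=(C, C)))
print(fused.shape)  # (8, 4, 4)
```

Because γ and β are functions of the radar stream only, the image features are reweighted wherever the radar provides geometric evidence, at the cost of just two small linear maps per scale.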
The decoder employs a point-wise Dense Atrous Spatial Pyramid Pooling (DASPP) block, entailing parallel convolutions at various dilation rates across stages. This approach ensures large receptive fields and effective context aggregation with minimal parameter overhead.
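The effect of parallel dilated branches can be illustrated with a 1-D toy version. This is only a sketch of the receptive-field idea behind DASPP: the real block uses 2-D point-wise/dilated convolutions with dense cross-branch connections, and the kernel and dilation rates below are arbitrary choices for illustration.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid-mode 1-D convolution with kernel taps spread apart by `dilation`."""
    k = len(w)
    span = (k - 1) * dilation
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span)
    ])

def daspp_1d(x, kernels, dilations):
    """Toy DASPP: run parallel dilated branches, pad to a common length, stack.

    Each branch reuses the same small kernel, so enlarging the dilation
    enlarges the receptive field at zero extra parameter cost.
    """
    branches = []
    for w, d in zip(kernels, dilations):
        y = dilated_conv1d(x, w, d)
        branches.append(np.pad(y, (0, len(x) - len(y))))  # zero-pad tail
    return np.stack(branches)            # (num_branches, len(x))

x = np.arange(16, dtype=float)
w = np.array([1.0, 0.0, -1.0])           # same 3-tap kernel in every branch
out = daspp_1d(x, [w, w, w], [1, 2, 4])
print(out.shape)  # (3, 16)
```

On the ramp input, the three branches compute finite differences over spans of 2, 4, and 8 samples with identical weights, which is exactly the parameter-free receptive-field growth the decoder exploits.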
This design attains a 29.7% reduction in parameter count (down to 8.89M parameters) relative to lightweight benchmarks such as LiRCDepth, yet preserves competitive accuracy.
2. Explainability-Aligned and Distribution-Aware Distillation
XD-RCDepth introduces two knowledge-distillation techniques to address the challenges of performance degradation due to model compression and to promote interpretability of student predictions.
A. Explainability-Aligned Distillation
This strategy aligns the student's internal saliency structure with that of the teacher via feature-driven Grad-CAM–style maps. For each selected intermediate layer l, the feature maps A^l are used to generate a saliency map

M^l = ReLU(Σ_c α_c^l A_c^l),

with channel weights α_c^l derived from the features themselves. Student and teacher maps are flattened, normalized, and compared via a cosine loss,

L_expl^l = 1 − ⟨m_S^l, m_T^l⟩ / (‖m_S^l‖ ‖m_T^l‖),

and the overall explainability distillation loss averages this alignment over the selected layers. This encourages the student to attend to the same geometrically relevant regions as the teacher, yielding more interpretable outputs and facilitating diagnostic transparency.
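A minimal numpy sketch of this alignment follows. The channel weighting inside `saliency_map` (global-average activations) is an assumption; the paper's exact feature-driven weighting may differ, but the flatten-normalize-cosine comparison is the part the loss hinges on.

```python
import numpy as np

def saliency_map(feats):
    """Grad-CAM-style map from a (C, H, W) feature tensor.

    Sketch: channel weights are taken as global-average activations
    (a stand-in for the paper's feature-driven weighting).
    """
    weights = feats.mean(axis=(1, 2))                               # (C,)
    return np.maximum((weights[:, None, None] * feats).sum(0), 0.0)  # ReLU

def explainability_loss(student_feats, teacher_feats):
    """Cosine misalignment between flattened, L2-normalized saliency maps."""
    eps = 1e-8
    s = saliency_map(student_feats).ravel()
    t = saliency_map(teacher_feats).ravel()
    s = s / (np.linalg.norm(s) + eps)
    t = t / (np.linalg.norm(t) + eps)
    return 1.0 - float(s @ t)        # -> ~0 when the maps align

rng = np.random.default_rng(0)
f = rng.normal(size=(8, 6, 6))
print(explainability_loss(f, f))     # ~0 for identical features
```

Averaging this quantity over the selected layers gives the overall explainability distillation term; it is zero exactly when student and teacher highlight the same regions with the same relative intensity.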
B. Depth-Distribution Distillation
To recast continuous depth prediction as structured soft classification, depth values are discretized into K bins with centers {c_k} over the interval [d_min, d_max]. For each pixel i, the predicted depth is softened into a distribution p_i over the bins, and the KL divergence from teacher to student bin distributions is minimized:

L_dist = (1/N) Σ_i KL(p_i^T ‖ p_i^S).

Treating depth regression as structured probabilistic inference in this way improves optimization and calibration, enhancing spatial consistency and lowering local error.
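The binning-plus-KL step can be sketched as follows. The softmax-over-negative-squared-distance softening and the temperature `tau` are assumptions for illustration; the paper's exact binning scheme may differ, but the teacher-to-student KL objective is as described above.

```python
import numpy as np

def depth_to_bins(depth, centers, tau=1.0):
    """Soften per-pixel depths into distributions over K bin centers.

    Sketch: softmax over negative squared distance to each center.
    depth: (N,) depths, centers: (K,) centers -> (N, K) distributions.
    """
    logits = -((depth[:, None] - centers[None, :]) ** 2) / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def distribution_loss(teacher_depth, student_depth, centers, tau=1.0):
    """Mean KL(teacher || student) over pixels."""
    eps = 1e-12
    q = depth_to_bins(teacher_depth, centers, tau)   # teacher
    p = depth_to_bins(student_depth, centers, tau)   # student
    return float((q * np.log((q + eps) / (p + eps))).sum(axis=1).mean())

centers = np.linspace(0.0, 50.0, 32)                 # bins over [0, 50] m
teacher = np.array([5.0, 12.5, 40.0])
print(distribution_loss(teacher, teacher, centers))  # 0.0 when they match
```

Because each depth spreads mass over neighboring bins, the student receives gradient signal about how close it is to the teacher rather than a hard right/wrong label, which is the calibration benefit claimed above.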
3. Empirical Performance and Efficiency
On public benchmarks (nuScenes, ZJU-4DRadarCam), XD-RCDepth reduces Mean Absolute Error (MAE) by up to 7.97% relative to direct training and outperforms lightweight competitors on several metrics, including RMSE, AbsRel, and threshold accuracy. For instance, on nuScenes at 50m:
- MAE: Reduced from 1.746 (no distillation) to 1.608 (with both distillation strategies).
- On ZJU-4DRadarCam at 50m: MAE improved from 1.218 to 1.155.
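As a quick sanity check, the relative improvements implied by the table numbers above can be computed directly (the headline up-to-7.97% figure may come from a different split or metric configuration than these two rows):

```python
# Relative MAE reduction implied by the reported nuScenes (50 m) numbers.
baseline, distilled = 1.746, 1.608
reduction = (baseline - distilled) / baseline
print(f"{reduction:.1%}")                 # ~7.9% relative MAE reduction

# Same calculation for ZJU-4DRadarCam (50 m): 1.218 -> 1.155.
print(f"{(1.218 - 1.155) / 1.218:.1%}")   # ~5.2%
```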
Real-time efficiency is achieved with an inference time of 0.015 seconds (15 ms) per frame, i.e., roughly 66 FPS on standard hardware.
4. Interpretability and Visual Explanation
Explainability-aligned distillation endows the lightweight student network with the teacher’s discriminative geometry focus, as evidenced by convergent Grad-CAM maps in both intermediate and output layers. The distilled student produces sharper, more localizable depth saliencies, elucidating critical spatial features such as object boundaries and occlusion regions. Visual analysis demonstrates improved depth discontinuities and spatial consistency over both raw and non-distilled students.
5. Comparative Advantages and Practical Scope
Relative to state-of-the-art teacher models (e.g., CaFNet) and competitive lightweights (LiRCDepth), XD-RCDepth delivers:
- Comparable or improved estimation accuracy at a substantial reduction in computational and storage requirements.
- Enhanced interpretability via saliency transfer, facilitating regulatory and diagnostic deployment.
- Real-time capability suitable for autonomous vehicle and ADAS scenarios where computational and latency constraints are critical.
Its design allows broad applicability for both end-to-end fusion depth pipelines and modular integration into existing radar–camera systems, with explainability-aligned outputs supporting regulatory compliance and post-hoc auditing.
6. Impact, Limitations, and Future Directions
The fusion strategy and dual distillation methods set a template for radar–camera depth estimation optimized for resource-constrained, high-throughput environments. While the depth-distribution distillation yields clear optimization benefits and the explainability-aligned distillation (X-KD) fosters interpretable networks, further investigation may refine bin partitioning and multi-modal saliency propagation for more complex urban scenarios or adverse environments.
A plausible implication is that explainability-aware distillation can be generalized to related sensor fusion domains, providing a pathway to trustworthy, real-time perception networks. Potential limitations include dependence on accurate saliency extraction from the teacher and the need for further robustness studies under extreme weather or occlusion.
In sum, XD-RCDepth presents an efficient, explainable, and accurate solution for radar-camera depth estimation, validated on representative benchmarks and supported by detailed architectural and empirical analysis (Sun et al., 15 Oct 2025).