
Uncertainty-Aware Cross-Modality Vehicle Detection

Updated 19 December 2025
  • UA-CMDet frameworks model both aleatoric and epistemic uncertainties to enhance vehicle detection robustness.
  • Methodologies fuse LiDAR, RGB, radar, and IR data using probabilistic approaches, improving AP by up to 13% and mitigating sensor failures.
  • Empirical results demonstrate significant gains in accuracy and reduced uncertainty mass, ensuring reliable performance in adverse environments.

Uncertainty-Aware Cross-Modality Vehicle Detection (UA-CMDet) refers to a family of object detection frameworks that integrate sensor measurements (e.g., LiDAR, RGB cameras, radar, and infrared) using learned or modeled uncertainty estimates at each detection and fusion stage. These frameworks target robust, calibrated, and accurate vehicle detection under challenging conditions such as sensor misalignment, varying environmental factors, and modality-specific failure. The field encompasses both end-to-end deep architectures with probabilistic components and modular late/post-hoc fusion systems guided by uncertainty, across applications in autonomous driving, drone-based traffic monitoring, and mobile robotics (Feng et al., 2020, Yang et al., 11 Oct 2024, Wirges et al., 2019, Lou et al., 2023, Kowol et al., 2020, Sun et al., 2020).

1. Motivation and Problem Context

UA-CMDet arises from the limitations of single-sensor detection systems, which suffer from inherent modality-specific weaknesses (e.g., LiDAR sparsity in adverse weather, camera unreliability at night, radar's poor spatial resolution, or infrared's redundancy in high illumination). Cross-modality fusion capitalizes on complementary cues, but naïve approaches are vulnerable to sensor miscalibration and inconsistent evidence. Traditional deterministic fusion cannot quantify detection reliability, resulting in overconfident predictions for ambiguous or conflicting inputs. UA-CMDet addresses this by explicitly quantifying both aleatoric (data/sensor-driven) and epistemic (model-driven) uncertainties and incorporating these estimates into the detector's learning objective, proposal selection, fusion weights, or suppression strategies (Feng et al., 2020, Sun et al., 2020, Lou et al., 2023).

2. Uncertainty Modeling Techniques

UA-CMDet frameworks model predictive uncertainty at multiple levels:

  • Aleatoric Uncertainty: Originating from sensor noise and intrinsic ambiguities, typically modeled as a heteroscedastic variance $\sigma^2$ learned by the regression head. For example, bounding box outputs $b_z$ are assumed to follow $b_z \sim \mathcal{N}(u_{b_z}, \sigma^2_{b_z})$ (Feng et al., 2020, Lou et al., 2023, Wirges et al., 2019).
  • Epistemic Uncertainty: Captured by Bayesian model approximations such as Monte Carlo Dropout, where $N$ stochastic forward passes yield empirical variances or, e.g., the classification entropy $u_\mathrm{cls} = -\sum_c \bar{s}_{k,c} \log \bar{s}_{k,c}$ of the averaged softmax scores $\bar{s}_{k,c}$ per class (Lou et al., 2023, Wirges et al., 2019).
  • Classification Uncertainty: Modeled by treating logit outputs as Gaussian random variables, e.g., $l_z \sim \mathcal{N}(u_{l_z}, \sigma^2_{l_z})$, and propagating them via stochastic sampling or entropy-based criteria (Feng et al., 2020, Lou et al., 2023).
  • Cross-Modal Uncertainty Metrics: On specific platforms, such as RGB-infrared drone detection, uncertainty weights are derived from cross-modal IoU, image-level illumination, and annotation completeness to reweight losses and guide NMS decisions (Sun et al., 2020).

Uncertainty metrics guide not only loss weighting but also proposal filtering and score fusion to mitigate modality failure or ambiguous evidence (Yang et al., 11 Oct 2024, Kowol et al., 2020).
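
To make the two uncertainty types concrete, the following is a minimal sketch (assuming PyTorch) of a detection head that learns a per-coordinate aleatoric variance alongside box regression, plus an MC-Dropout routine that estimates epistemic classification uncertainty as the entropy of averaged softmax scores. The module and function names are illustrative and not taken from the cited papers.

```python
# Hedged sketch (assumed PyTorch); UncertainBoxHead and mc_dropout_predict are illustrative names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UncertainBoxHead(nn.Module):
    """Predicts class logits, box means u_b, and per-coordinate log-variances (aleatoric)."""
    def __init__(self, feat_dim=256, num_classes=3, box_dim=7):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Dropout(p=0.2))
        self.cls_logits = nn.Linear(256, num_classes)
        self.box_mean = nn.Linear(256, box_dim)      # u_b
        self.box_logvar = nn.Linear(256, box_dim)    # log sigma^2 (aleatoric)

    def forward(self, x):
        h = self.trunk(x)
        return self.cls_logits(h), self.box_mean(h), self.box_logvar(h)

@torch.no_grad()
def mc_dropout_predict(head, x, n_passes=20):
    """Epistemic uncertainty via MC-Dropout: keep dropout active at test time,
    average the softmax scores over n_passes, and report their entropy u_cls."""
    head.train()  # keeps Dropout stochastic during the forward passes
    probs = torch.stack([F.softmax(head(x)[0], dim=-1) for _ in range(n_passes)])
    mean_probs = probs.mean(dim=0)  # averaged softmax scores \bar{s}_{k,c}
    u_cls = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)
    return mean_probs, u_cls

if __name__ == "__main__":
    head = UncertainBoxHead()
    feats = torch.randn(8, 256)            # features of 8 region proposals
    logits, u_b, logvar = head(feats)
    print(u_b.shape, logvar.exp().shape)   # per-coordinate mean and aleatoric variance
    _, u_cls = mc_dropout_predict(head, feats)
    print(u_cls.shape)                     # per-proposal epistemic classification uncertainty
```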

3. Representative UA-CMDet Architectures

UA-CMDet frameworks span early-fusion, proposal-level fusion, and decision-level fusion designs:

  • Probabilistic Two-Stage Detector: A LiDAR BEV backbone proposes 3D boxes and associated uncertainties, which are sampled to train a fusion head with both LiDAR and RGB region features. The fusion head refines box regression and classification under uncertainty-aware supervision. Explicit sampling from $p(z|x)$ implements uncertainty-weighted data augmentation during fusion training. Inference requires only predicted means for efficiency. This design is demonstrated to improve $AP_{3D}$ by up to 7% and show markedly higher robustness to temporal misalignment ($\leq 5\%$ AP drop vs. $>20\%$ for naive fusion) (Feng et al., 2020).
  • Uncertainty-Encoded Mixture-of-Experts (UMoE) Fusion: Each modality is processed by an expert (residual CNN) that ingests proposal confidence, normalized regression/classification uncertainty, and a classification deviation ratio. A gating network produces uncertainty-encoded fusion weights, which are fed into a proposal-level fusion backbone (e.g., CLOCs). UMoE improves $AP_{3D}$ by 3–10 points under adverse weather, camera attacks, and blinding scenarios due to uncertainty normalization and selective expert combination (Lou et al., 2023).
  • Late Fusion via Subjective Logic and Dempster’s Rule: Independently trained 2D and 3D detectors' outputs are paired via IoU, and class evidences are mapped to Dirichlet parameters. Belief masses and uncertainty masses are fused with Dempster's rule and used to suppress high-uncertainty hypotheses (a minimal sketch of this combination step follows the list). This post-hoc scheme provides up to $+13\%$ $AP_{3D}$ (Easy split) and reduces mean uncertainty mass by $\sim 75\%$ (Yang et al., 11 Oct 2024).
  • Early Fusion in BEV Grid Maps: All relevant modalities populate a dense BEV grid, processed by a unified convolutional backbone. Uncertainty is estimated per proposal via MC-Dropout and direct modeling, and predictive hulls are constructed to guarantee bounded collision probabilities for planning (Wirges et al., 2019).
  • Illumination-Aware and Cross-Modal NMS: For RGB/IR aerial detection, uncertainty-aware modules derive per-object, per-head weights reflecting cross-modal alignment (IoU), RGB illumination, and annotation presence. IA-CM-NMS adaptively reduces the impact of unreliable RGB detections in adverse illumination, further improving mAP (Sun et al., 2020).
  • Meta-Classifier on Modality Features: Modular fusions such as YOdar aggregate per-detection evidential features (e.g., YOLO box confidence, radar-based slice-mean belief, radar variance) and train a gradient boosting decision system to filter and consolidate proposals. This method reduces false positives by $>50\%$ at matched TP and boosts mAP by 8 points at night (Kowol et al., 2020).
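
The subjective-logic late fusion above can be sketched compactly: class evidence from each detector is mapped to Dirichlet belief masses plus a vacuity (uncertainty) mass, and IoU-matched detections are combined with the reduced Dempster rule. This is a hedged NumPy illustration; the helper names and example evidence values are assumptions, and the exact parameterization in (Yang et al., 11 Oct 2024) may differ.

```python
# Hedged NumPy sketch of Dirichlet/subjective-logic late fusion; names and values are illustrative.
import numpy as np

def dirichlet_opinion(evidence):
    """Map non-negative class evidence e_k to belief masses b_k and vacuity u:
    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S."""
    evidence = np.asarray(evidence, dtype=float)
    K = evidence.size
    S = evidence.sum() + K
    return evidence / S, K / S

def dempster_combine(b_a, u_a, b_b, u_b):
    """Reduced Dempster's rule for two subjective opinions over the same classes."""
    conflict = float(np.outer(b_a, b_b).sum() - np.dot(b_a, b_b))  # mass on disagreeing classes
    norm = 1.0 - conflict
    b = (b_a * b_b + b_a * u_b + b_b * u_a) / norm
    u = (u_a * u_b) / norm
    return b, u

# Example: a 2D and a 3D detection matched by IoU, each with per-class evidence.
b2d, u2d = dirichlet_opinion([9.0, 0.5, 0.2])   # confident "car" evidence from the 2D detector
b3d, u3d = dirichlet_opinion([6.0, 0.4, 0.2])   # agreeing evidence from the 3D detector
b_f, u_f = dempster_combine(b2d, u2d, b3d, u3d)
print(b_f, u_f)        # fused vacuity u_f drops below both inputs when the detectors agree
keep = u_f <= 0.1      # suppress hypotheses whose fused uncertainty mass stays above the threshold
```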

4. Losses, Training Protocols, and Inference Procedures

UA-CMDet frameworks use loss functions that incorporate uncertainty for more informative and robust optimization:

  • Regression: heteroscedastic loss $\mathcal{L}_\mathrm{reg} = \frac{1}{2\sigma^2}\|b^* - u_b\|^2 + \frac{1}{2}\log\sigma^2$ with per-box, per-coordinate parameterization (Feng et al., 2020, Lou et al., 2023, Wirges et al., 2019).
  • Classification: Cross-entropy on logits sampled from the predictive Gaussian as $l'_z = u_{l_z} + \sigma_{l_z}\epsilon$ and passed through softmax (Feng et al., 2020).
  • Fusion and Expert/Gating Losses: Detection loss on fused proposals, often with sensor-specific detectors frozen, and only expert/gating/fusion weights updated (Lou et al., 2023).
  • Weighted Loss (Cross-Modal, Cross-Illumination): Regression losses are weighted per sample and per modality by uncertainty weights derived from cross-modal IoU, illumination, and annotation completeness (Sun et al., 2020).
  • NMS and Uncertainty Thresholding: High-uncertainty detections are suppressed post-fusion, e.g., those with uncertainty mass $u^F > 0.1$ in subjective-logic fusion (Yang et al., 11 Oct 2024) or via illumination-adjusted scores in IA-CM-NMS (Sun et al., 2020).

Inference generally collapses the probabilistic output to mean parameter values; no MC sampling is necessary (Feng et al., 2020, Lou et al., 2023).
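
As a concrete instance of the losses above, a minimal PyTorch-style sketch of the heteroscedastic regression term and the sampled-logit cross-entropy follows; tensor shapes and variable names are illustrative assumptions.

```python
# Hedged sketch (assumed PyTorch) of the uncertainty-aware losses described above.
import torch
import torch.nn.functional as F

def heteroscedastic_reg_loss(b_pred, b_logvar, b_gt):
    """L_reg = ||b* - u_b||^2 / (2 sigma^2) + 0.5 * log sigma^2, per box coordinate."""
    inv_var = torch.exp(-b_logvar)
    return (0.5 * inv_var * (b_gt - b_pred) ** 2 + 0.5 * b_logvar).mean()

def sampled_logit_ce_loss(logit_mean, logit_logvar, labels, n_samples=10):
    """Average cross-entropy over logits perturbed by their predicted std: l' = u_l + sigma_l * eps."""
    std = torch.exp(0.5 * logit_logvar)
    losses = []
    for _ in range(n_samples):
        l = logit_mean + std * torch.randn_like(std)
        losses.append(F.cross_entropy(l, labels))
    return torch.stack(losses).mean()

# Toy usage: 4 proposals, 7 box parameters, 3 classes.
b_pred = torch.randn(4, 7, requires_grad=True)
b_logvar = torch.zeros(4, 7, requires_grad=True)
b_gt = torch.randn(4, 7)
logit_mean = torch.randn(4, 3, requires_grad=True)
logit_logvar = torch.zeros(4, 3, requires_grad=True)
labels = torch.randint(0, 3, (4,))
loss = heteroscedastic_reg_loss(b_pred, b_logvar, b_gt) + sampled_logit_ce_loss(logit_mean, logit_logvar, labels)
loss.backward()
```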

5. Experimental Performance and Ablation Insights

UA-CMDet is validated across diverse real-world datasets and conditions:

  • KITTI, Bosch, nuScenes, STF: Gains in $AP_{3D}$ (typically 3–10 points) over deterministic or uncertainty-naive baselines. Notable robustness under temporal misalignment—UA-CMDet loses $<5\%$ AP vs. a $>20\%$ drop for naive fusion under noise (Feng et al., 2020, Lou et al., 2023, Yang et al., 11 Oct 2024).
  • DroneVehicle: UA-CMDet achieves $64.01\%$ mAP, outperforming RGB-only by $16.1\%$ and IR-only by $4.86\%$, with strong results for both day and night scenarios (Sun et al., 2020).
  • Nighttime, adverse weather, or sensor attack settings: UMoE and YOdar yield up to 10 points of mAP gain, reduce high-uncertainty false positives, and maintain detection fidelity when one modality degrades (Lou et al., 2023, Kowol et al., 2020).

Ablation studies confirm that each stage—uncertainty modeling, encoding/normalization, gating/weighting, and adaptive NMS—contributes non-trivially to robustness and systematic uncertainty reduction. For example, removing UAM or IA-NMS from (Sun et al., 2020) yields up to a $1.5\%$ mAP drop.

6. Limitations, Open Problems, and Future Directions

UA-CMDet frameworks commonly share several limitations:

  • Single-Modality Proposal Dependence: Systems where proposals are from only one sensor (e.g., LiDAR) may miss fully occluded or modality-specific objects (Feng et al., 2020). Generating proposals from all available modalities could mitigate this.
  • Partial Uncertainty Modeling: Epistemic uncertainty is frequently omitted or limited to the detection heads or backbone. Extending uncertainty quantification to the fusion head, employing Bayesian weights or deep ensembles, or considering uncertainty mass across all stages (as in subjective logic fusion) are active research topics (Yang et al., 11 Oct 2024).
  • Uncalibrated Uncertainty Measures: Modality-dependent confidence scales are not directly comparable. Approaches such as normalized deviation ratios (Lou et al., 2023) and learned Dirichlet fusion (Yang et al., 11 Oct 2024) partially address this, but cross-dataset calibration remains challenging.
  • Failure to Salvage Missed Detections: If one modality entirely fails to propose an object, late decision-level fusion cannot rescue the detection (Yang et al., 11 Oct 2024).

Scalability to multi-class detection, adaptation to new sensor types (radar, ultrasonic, event cameras), and integration with end-to-end planning systems represent future directions.

7. Summary of Core Approaches and Empirical Results

The following table summarizes major UA-CMDet approaches and their salient characteristics and empirical improvements:

| Approach | Fusion Level | Uncertainty Modeled | Key Gains |
| --- | --- | --- | --- |
| Two-Stage Probabilistic (LiDAR+RGB) (Feng et al., 2020) | Proposal + Feature | Aleatoric (LiDAR), entropy on classification | +7% AP, +20% robustness to misalignment |
| Late Dirichlet/Subjective Fusion (Yang et al., 11 Oct 2024) | Output/Score | Evidence-theory vacuity, softmax entropy | +13% AP (Easy), ~4x uncertainty reduction |
| Mixture-of-Experts (UMoE) (Lou et al., 2023) | Proposal | Aleatoric, epistemic, normalized deviation | +10 pts AP in harsh weather/attack scenarios |
| Early Fusion BEV Map (Wirges et al., 2019) | Backbone | Aleatoric + epistemic via MC-Dropout | ~82% mAP (Car, KITTI), improved calibration |
| RGB-IR Oriented Detection (Sun et al., 2020) | Multi-Head | Cross-modal, illumination, annotation-based | +16.1% mAP over RGB-only |
| Meta-Classifier (YOdar) (Kowol et al., 2020) | Post-Hoc | Confidence, mean/var on radar-slice evidence | +8 pts mAP (night), halved false positives |

Each method demonstrates that principled uncertainty modeling and propagation across modalities and fusion stages is crucial for safe and robust vehicle detection in real-world, safety-critical environments.
