UA-CMDet: Uncertainty-Aware Cross-Modality Detector
- The paper introduces a framework that fuses multiple data modalities while quantifying uncertainty at both feature and decision levels for robust performance.
- It employs diverse cross-modality fusion strategies such as early feature fusion and late decision fusion enhanced by uncertainty-aware weighting mechanisms.
- Empirical evaluations demonstrate improved detection reliability in applications like autonomous driving, aerial surveillance, and medical imaging under noisy conditions.
An Uncertainty-Aware Cross-Modality Detector (UA-CMDet) is a detection framework that fuses the complementary information from multiple data modalities while explicitly quantifying and leveraging uncertainty at both the feature and decision levels. This design primarily addresses the challenges that arise from heterogeneous data sources (distinct noise profiles, incomplete or corrupted modalities, and ambiguous supervision), enabling more reliable, calibrated, and interpretable detection performance. UA-CMDet systems have been instantiated across diverse application domains including cross-modal vehicle detection with drone RGB/infrared imagery, medical image analysis, remote sensing, autonomous driving, and open-vocabulary 3D scene understanding.
1. Key Principles of Uncertainty Modeling
UA-CMDet approaches integrate uncertainty modeling either within the feature extraction and attention modules or at the output level of prediction heads. Uncertainty is most often framed as a data-dependent variable that modulates information from each modality according to signal reliability, ambiguity, and completeness.
A typical uncertainty-aware attention mechanism, for example, defines attention weights as random variables (a minimal sketch follows the list):
- Each latent attention value $z$ is modeled as a sample from a Gaussian, $z \sim \mathcal{N}(\mu(x), \sigma^2(x))$,
- where $\mu(x)$ and $\sigma^2(x)$ are learned, input-dependent mean and variance functions that explicitly decouple uncertainty from attention strength,
- The final attention is then obtained by passing the sample through a nonlinearity, $\alpha = g(z)$ with $g$ e.g. a sigmoid or tanh (Heo et al., 2018).
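A minimal PyTorch sketch of such a stochastic attention gate, using the reparameterization trick; the module and head names here are illustrative assumptions, not the exact parameterization of Heo et al. (2018):

```python
import torch
import torch.nn as nn

class UncertainAttention(nn.Module):
    """Sketch of an input-dependent stochastic attention gate.

    Hypothetical module: the mu/log_var heads and their sizes are
    illustrative, not taken verbatim from Heo et al. (2018).
    """
    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, 1)        # attention strength
        self.log_var = nn.Linear(dim, 1)   # decoupled uncertainty

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu, log_var = self.mu(x), self.log_var(x)
        if self.training:
            eps = torch.randn_like(mu)           # reparameterization trick
            z = mu + torch.exp(0.5 * log_var) * eps
        else:
            z = mu                               # use the mean at test time
        return torch.sigmoid(z)                  # squash to a [0, 1] gate
```

At test time the mean is used, so inference stays deterministic; during training the sampled gates propagate attention uncertainty into downstream fusion.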
In the context of object detection, bounding box coordinates may be represented as distributions rather than point estimates. For example, for each object parameter $y$ (e.g., a box coordinate), the network predicts a distribution such as a Gaussian $\mathcal{N}(\mu(x), \sigma^2(x))$ or a Laplace $\mathrm{Laplace}(\mu(x), b(x))$ (Meyer et al., 2019, Park et al., 2021). Modeling the box output as a distribution enables the explicit quantification of aleatoric uncertainty, often through direct parameterization of the regression head; a minimal sketch follows.
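A minimal sketch of a distributional box head with a Laplace negative log-likelihood; layer names and sizes are hypothetical, loosely following the Laplace parameterization discussed above:

```python
import torch
import torch.nn as nn

class LaplaceBoxHead(nn.Module):
    """Sketch: predict a Laplace distribution per box coordinate.

    Hypothetical head; Meyer et al. (2019) use a related Laplace
    parameterization, but the layer names/sizes here are illustrative.
    """
    def __init__(self, dim: int, n_params: int = 4):
        super().__init__()
        self.loc = nn.Linear(dim, n_params)      # predicted coordinates
        self.log_b = nn.Linear(dim, n_params)    # log scale (aleatoric)

    def forward(self, feats: torch.Tensor):
        return self.loc(feats), self.log_b(feats)

def laplace_nll(loc, log_b, target):
    # -log p(y | loc, b) for a Laplace distribution, up to a constant
    return (torch.abs(target - loc) * torch.exp(-log_b) + log_b).mean()
```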
Uncertainty estimates can stem from Bayesian neural network inference, e.g., via stochastic dropout-based forward passes (Cai et al., 2021), random network prediction and feature density estimation (Wang et al., 2022), or structural innovations such as conformal inference with mutual information calibration (Stutts et al., 2023).
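As a concrete illustration of the dropout-based route, the following sketch runs several stochastic forward passes and reads the sample variance as an uncertainty estimate; the helper name and sample count are assumptions:

```python
import torch

def mc_dropout_predict(model, x, n_samples: int = 20):
    """Sketch of stochastic dropout-based inference (cf. Cai et al., 2021).

    Assumes `model` contains nn.Dropout layers; keeping them active at
    test time yields samples from an approximate posterior predictive.
    """
    model.train()  # keeps dropout active (also flips BatchNorm; freeze it in practice)
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)  # predictive mean and variance
```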
2. Cross-Modality Fusion Strategies
UA-CMDet leverages cross-modal fusion at various network stages:
- Early/Feature Fusion: Modality-specific encoders extract features independently, which are then concatenated (along channels or at matching spatial locations) and processed jointly. Fusion may incorporate cross-modal self-attention or transformer blocks to model inter-modal dependencies (Sun et al., 2020, Shen et al., 2022). Residual or attention-based re-weighting using uncertainty maps is also common (Wang et al., 2022).
- Late/Decision Fusion: Detection pipelines for individual modalities are trained separately. Their outputs (bounding boxes/labels) are merged using an uncertainty-aware non-maximum suppression (NMS) or an adaptive voting scheme that prioritizes predictions according to their associated uncertainty (Zhang et al., 2023, Lou et al., 2023). Fusion weights are typically computed as inverse variances or via a gating network in a mixture-of-experts formulation (Lou et al., 2023); a minimal weighting sketch follows this list.
- Unified Transformers and Agglomeration: Recent work with vision transformers unifies the representation space so a single architecture seamlessly processes different modalities or even mixed inputs via patch-level or regional interleaving (Shen et al., 2022). Foundation model agglomeration accumulates representations from multiple pre-trained teachers (e.g., CLIP, DINOv2, Stable Diffusion) and distills their joint knowledge with per-teacher uncertainty weighting (Li et al., 20 Mar 2025).
- Self-Supervised and Weakly-Supervised Label Transfer: Labels are automatically transferred across modalities, for instance using 3D detections from depth projected into 2D RGB for YOLO student training (Mehboob et al., 5 Nov 2024).
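To make the inverse-variance weighting of the late-fusion bullet concrete, the sketch below fuses already-matched per-modality boxes for a single object; the matching step and the epsilon stabilizer are assumptions, not a specific paper's recipe:

```python
import numpy as np

def fuse_by_inverse_variance(boxes, variances):
    """Sketch of uncertainty-weighted late fusion for matched detections.

    `boxes` is an (M, 4) array of per-modality boxes for one object and
    `variances` the matching (M, 4) predicted variances.
    """
    w = 1.0 / (np.asarray(variances) + 1e-8)   # inverse-variance weights
    w = w / w.sum(axis=0, keepdims=True)       # normalize per coordinate
    return (w * np.asarray(boxes)).sum(axis=0) # fused box estimate
```

Low-variance (confident) modalities dominate each coordinate, while an unreliable sensor channel is automatically down-weighted rather than discarded.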
3. Loss Functions and Calibration
UA-CMDet architectures require loss functions that not only minimize prediction and localization error, but also explicitly account for uncertainty:
- Probabilistic Losses: Negative log-likelihood (NLL) or Kullback-Leibler divergence (KLD) losses are introduced for distributional outputs, e.g., for a Laplace-distributed coordinate, $\mathcal{L}_{\mathrm{NLL}} = \frac{|y - \mu|}{b} + \log 2b$.
- Uncertainty-Weighted Regression: Localization or classification losses are modulated by uncertainty, commonly of the attenuated form $\mathcal{L} = \frac{1}{2\sigma^2(x)}\|y - \hat{y}\|^2 + \frac{1}{2}\log \sigma^2(x)$, which down-weights high-variance samples while penalizing inflated variance.
- Conformal Inference and Quantile Regression: Predictive intervals are calibrated via conformal inference, with loss components adjusting interval sharpness adaptively using normalized mutual information; a plain split-conformal sketch follows this list.
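For the conformal bullet above, here is a plain split-conformal calibration sketch; it omits the mutual-information sharpness modulation of Stutts et al. (2023), and the helper name is hypothetical:

```python
import numpy as np

def split_conformal_interval(residuals_cal, alpha: float = 0.1):
    """Sketch of split conformal calibration for regression intervals.

    `residuals_cal` are |y - y_hat| nonconformity scores on a held-out
    calibration set; returns the half-width of a ~(1 - alpha) interval.
    """
    n = len(residuals_cal)
    # finite-sample-corrected quantile level
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(residuals_cal, q_level)

# Usage: the interval [y_hat - q, y_hat + q] covers y with ~(1 - alpha)
# probability on exchangeable data.
```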
Calibration of the model’s estimated uncertainty is routinely evaluated via reliability diagrams, expected calibration error (ECE), and downstream deferral options (e.g., "I don't know" predictions).
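A minimal sketch of the standard binned ECE estimator referenced here; the binning scheme and function name are conventional choices rather than a specific paper's implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10):
    """Sketch of the standard binned ECE estimator.

    `confidences` are predicted probabilities; `correct` is a 0/1 array
    marking whether each prediction was right.
    """
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight gap by bin population
    return ece
```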
4. Performance Evaluation and Robustness
UA-CMDet frameworks are empirically validated in challenging, cross-modal detection scenarios:
- Robustness to Noise/Adverse Conditions: Experiments on KITTI and DroneVehicle datasets show that multi-pipeline uncertainty-aware fusion outperforms single-modality systems when some sensor channels are affected by Gaussian noise, motion blur, or environmental occlusions (Sun et al., 2020, Zhang et al., 2023).
- Effectiveness of Uncertainty Weighting: Introducing uncertainty-based NMS or gating mechanisms yields substantial improvements in mean average precision (mAP) and recall, especially for rare categories, occluded objects, and low-illumination conditions (Lou et al., 2023).
- Calibration and Deferral: Correctly calibrated uncertainty, as quantified by ECE or deferral rate metrics, enables effective risk-sensitive decision control in domains with high operational risk, such as autonomous driving or medical diagnostics (Heo et al., 2018).
- Zero-Shot and Cross-Domain Transfer: Frameworks based on background-independent deviation metrics or feature-level compactness attain zero-shot transfer across modalities, realizing cross-domain anomaly detection in remote sensing and medical registration (Li et al., 2023, Liu et al., 30 Jul 2025).
5. Practical Implementations and Applications
UA-CMDet systems have been successfully applied in multiple domains:
- Autonomous Driving: Sensor fusion (LiDAR + RGB), robust to weather and adversarial attacks, supports accurate and reliable planning (Meyer et al., 2019, Lou et al., 2023, Stutts et al., 2023).
- Aerial and Remote Sensing: Drone-based cross-modal (RGB–infrared) vehicle detection with uncertainty-aware fusion and NMS for surveillance, traffic management, and disaster response (Sun et al., 2020).
- Medical Imaging: Cross-modal image registration and segmentation, with models such as VoxelMorph and PAU explicitly quantifying alignment or retrieval uncertainty, critical for treatment planning (Liu et al., 30 Jul 2025, Li et al., 2023).
- Vision-Language and Open-Vocabulary Classification: Vision-language detectors and feature matchers leverage uncertainty to improve zero-shot prediction and alignment in open-set or noisy retrieval conditions (Wang et al., 2022, Li et al., 20 Mar 2025).
- Industrial Inspection/Recognition in Data-Scarce Applications: Self-supervised cross-modality bootstrapping, where knowledge is transferred from 3D to 2D using a limited set of labeled samples (Mehboob et al., 5 Nov 2024).
6. Challenges, Limitations, and Future Research
Known challenges for UA-CMDet frameworks include:
- Heterogeneous Noise Profiles: Modality-specific uncertainty must be carefully calibrated and sometimes harmonized for joint interpretation (Sun et al., 2020, Li et al., 20 Mar 2025).
- Computational Overhead: Bayesian/dropout-based approaches or Monte Carlo sampling introduce runtime costs, which recent conformal inference and random network prediction proposals mitigate (Wang et al., 2022, Stutts et al., 2023).
- Complexity of Uncertainty Fusion: Learning to map, aggregate, or adaptively fuse uncertainty across modalities with distinct scales and reliability remains a central modeling and practical obstacle (Lou et al., 2023).
Future research directions include exploring non-Gaussian uncertainty models (Qin et al., 21 May 2025), deeper self-supervised and semi-supervised cross-modal pipelines (Mehboob et al., 5 Nov 2024), improved fusion methods using graph or transformer attention (Shen et al., 2022, Liu et al., 30 Jul 2025), and broader deployment in real-time safety-critical settings. Model-agnostic frameworks for uncertainty quantification, such as Uncertainty-o, offer means to standardize uncertainty evaluation across evolving large multimodal architectures (Zhang et al., 9 Jun 2025).
7. Theoretical and Methodological Foundations
Theoretical motifs underlying UA-CMDet design include:
- Input-Dependent Variational Inference: Direct modeling of input-dependent uncertainty distributions for attention mechanisms and outputs (Heo et al., 2018).
- Energy-Based Rankings: Reformulating detection as a conditional probability (or energy) ranking between normal and anomalous features, enabling robust cross-domain anomaly detection (Li et al., 2023).
- Prototype and Evidence-Based Uncertainty: Dempster-Shafer theory (DST)-based frameworks that relate feature-prototype evidence variation to Dirichlet uncertainty, supporting more semantically structured uncertainty estimation in retrieval (Li et al., 2023).
- Random Network Prediction for Feature Density: Using mismatches between learnable and fixed random network outputs as uncertainty surrogates (Wang et al., 2022); this idea is sketched after the list.
- Mutual Information and Conformal Calibration: Leveraging mutual information as a modulator of conformal intervals in multimodal 3D detection pipelines (Stutts et al., 2023).
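A minimal sketch of the random-network-prediction idea: a trainable predictor chases a frozen, randomly initialized target network, and the regression error serves as the uncertainty surrogate; the architecture sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RNDUncertainty(nn.Module):
    """Sketch of random-network-prediction uncertainty (cf. Wang et al., 2022).

    The predictor is trained to regress the frozen target's output on
    in-distribution features, so a large prediction error flags features
    that lie off the training distribution.
    """
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden))
        self.predictor = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        for p in self.target.parameters():
            p.requires_grad_(False)          # target stays fixed

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        err = (self.predictor(feats) - self.target(feats)).pow(2).mean(dim=-1)
        return err                           # high error = high uncertainty
```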
A plausible implication is that UA-CMDet will continue to co-evolve with advances in uncertainty quantification and modality-aware representation learning, with increasing integration into open-world, data-scarce, and dynamically adaptive machine perception systems.