Uncertainty-Aware Noisy-Or Fusion (UNO)
- UNO is a multimodal, uncertainty-aware fusion framework that integrates deep segmentation predictions using spatially adaptive calibration and a probabilistic noisy-or rule.
- It dynamically recalibrates per-modality softmax outputs with spatial temperature scaling based on both epistemic and aleatoric uncertainty to improve robustness under various degradations.
- In the reported evaluation, UNO raises mean IoU by up to 17.5 percentage points over prior methods, with the largest gains under challenging, out-of-distribution conditions.
Uncertainty-Aware Noisy-Or Fusion (UNO) is a multimodal late-fusion framework for semantic segmentation that combines independent deep networks (“experts”) for each input modality (e.g., RGB, depth), integrating both probabilistic uncertainty estimates and spatially adaptive calibration. UNO aims to achieve robustness to both familiar and previously unseen input degradations by dynamically tempering each expert’s confidence according to estimated epistemic and aleatoric uncertainties, then combining the recalibrated predictions via a probabilistically principled noisy-or rule. This approach yields significant gains over prior state-of-the-art models, especially under challenging or out-of-distribution corruptions (Tian et al., 2019).
1. Model Architecture and Inference Pipeline
UNO employs a late-fusion architecture in which an independent segmentation network is assigned to each sensor modality. During inference, for each modality $m$, the input $x_m$ is processed by a pre-trained deep network $f_m$ to produce a set of per-class logits at every pixel: $z_m = f_m(x_m)$. In standard approaches, per-class probabilities are obtained via the softmax function: $p_m(k) = \exp(z_{m,k}) / \sum_{k'} \exp(z_{m,k'})$. UNO interposes an uncertainty-aware recalibration of the softmax temperature, modulating the logits according to both global and spatial indicators of input quality. The resulting uncertainty-scaled probabilities from all modalities are finally fused per class by the noisy-or rule, which preserves high confidence even if only a single expert is confident.
2. Uncertainty Quantification
UNO computes three independent uncertainty metrics per pixel, averaging each spatially for an overall score. These are:
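- Predictive entropy: Quantifies total uncertainty under Monte Carlo dropout (MCDO) forward passes, $H[\bar{p}] = -\sum_k \bar{p}_k \log \bar{p}_k$, where $\bar{p} = \frac{1}{S}\sum_{s=1}^{S} p^{(s)}$ is the mean softmax output over $S$ dropout passes.
- Mutual information: Captures epistemic (model) uncertainty as the difference between predictive entropy and expected entropy over dropout samples, $I = H[\bar{p}] - \frac{1}{S}\sum_{s=1}^{S} H[p^{(s)}]$.
- Deterministic (single-pass) entropy: $H[p] = -\sum_k p_k \log p_k$, computed from one forward pass without dropout sampling.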
Each metric delivers a distinct detection sensitivity for out-of-distribution and degraded inputs.
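The two dropout-based metrics can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the array layout `(S, H, W, K)` (dropout passes, height, width, classes), and the toy values are assumptions made for the example.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along the class axis."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def mcdo_uncertainties(probs):
    """probs: (S, H, W, K) softmax outputs from S dropout passes.
    Returns per-pixel predictive entropy and mutual information, each (H, W)."""
    mean_p = probs.mean(axis=0)               # (H, W, K) mean prediction
    predictive = entropy(mean_p)              # total uncertainty
    expected = entropy(probs).mean(axis=0)    # mean entropy of individual passes
    return predictive, predictive - expected  # MI = H[mean] - E[H]

# Toy case: two dropout passes that disagree at one pixel with two classes.
probs = np.array([[[[0.9, 0.1]]], [[[0.1, 0.9]]]])   # shape (2, 1, 1, 2)
pred_entropy, mutual_info = mcdo_uncertainties(probs)

# Spatial averaging gives one image-level score per metric.
image_scores = (pred_entropy.mean(), mutual_info.mean())
```

When the passes disagree, as in the toy case, the mean prediction is near uniform and the mutual information is clearly positive; when every pass returns the same distribution, the mutual information is zero.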
3. Data-Dependent Spatial Temperature Scaling
UNO supplements global uncertainty with spatial temperature scaling via the TempNet auxiliary network. For each modality $m$, TempNet predicts a 2D temperature map conditioned on the input $x_m$: $T_m = \mathrm{TempNet}_m(x_m)$, with one positive value $T_{m,ij}$ per pixel $(i,j)$. The spatial temperature is applied to each logit before the softmax: $p_{m,ij}(k) = \exp(z_{m,ij,k}/T_{m,ij}) / \sum_{k'} \exp(z_{m,ij,k'}/T_{m,ij})$. TempNet is trained by minimizing negative log-likelihood over the per-pixel scaled softmax. This mechanism enables localized calibration, e.g., down-weighting fog-obscured distant regions, and captures region-specific uncertainty that global metrics may miss.
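Applying a per-pixel temperature map amounts to one broadcast division before the softmax. The sketch below assumes logits of shape `(H, W, K)` and a temperature map of shape `(H, W)`; the function names and numbers are illustrative, and the temperature map is supplied by hand rather than predicted by a network.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatially_scaled_softmax(logits, temperature):
    """logits: (H, W, K); temperature: (H, W), strictly positive."""
    return softmax(logits / temperature[..., None], axis=-1)

# Two pixels with identical logits but different temperatures.
logits = np.array([[[4.0, 0.0], [4.0, 0.0]]])   # shape (1, 2, 2)
temps = np.array([[1.0, 4.0]])                  # shape (1, 2)
probs = spatially_scaled_softmax(logits, temps)
```

The second pixel keeps the same predicted class but with lower confidence, which is the intended effect in regions the temperature map marks as unreliable.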
4. Uncertainty-Scaled Softmax and Deviation Ratios
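The softmax input for each modality is dynamically sharpened or flattened by a deviation ratio $\mathrm{DR}_m$, which aggregates the normalized difference between current and training-phase statistics of each uncertainty metric. For an uncertainty measure $u$, the ratio $\mathrm{DR}_u$ is computed from the deviation of the test-time value $u(x)$ from $\mu_u$, normalized by $\sigma_u$, where $\mu_u$ and $\sigma_u$ are the mean and standard deviation of $u$ on clean data. The overall scaling factor is set as $\mathrm{DR}_m = \min_u \mathrm{DR}_u$, a conservative strategy that discounts the expert’s prediction if any metric signals unpredictability. The per-class probabilities are thus: $p_{m,ij}(k) = \mathrm{softmax}(\mathrm{DR}_m \, z_{m,ij} / T_{m,ij})_k$, where $z_{m,ij}$ are the logits and $T_{m,ij}$ is the spatial temperature at pixel $(i,j)$.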
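The minimum-over-metrics aggregation can be illustrated as follows. The closed form of the per-metric ratio is not reproduced here, so `deviation_ratio` below is a hypothetical placeholder chosen only to be 1 on clean-looking inputs and to shrink as a metric rises above its clean-data mean; the statistics and test values are made-up numbers for illustration.

```python
import numpy as np

def deviation_ratio(u, mu, sigma):
    # HYPOTHETICAL form (an assumption, not the published formula):
    # equals 1 at or below the clean mean and shrinks as u exceeds it,
    # measured in units of the clean standard deviation.
    return 1.0 / (1.0 + max(0.0, (u - mu) / sigma))

def overall_scale(test_values, clean_stats):
    # Conservative aggregation: take the minimum ratio over all metrics.
    return min(deviation_ratio(test_values[name], mu, sd)
               for name, (mu, sd) in clean_stats.items())

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

clean_stats = {"entropy": (0.20, 0.05), "mutual_info": (0.02, 0.01)}  # (mean, std), illustrative
test_values = {"entropy": 0.22, "mutual_info": 0.06}                  # illustrative test-time scores

scale = overall_scale(test_values, clean_stats)
logits = np.array([4.0, 1.0, 0.0])
plain = softmax(logits)
scaled = softmax(scale * logits)   # flatter distribution when scale < 1
```

Here the entropy looks almost normal, but the mutual information is four standard deviations above its clean mean, so the minimum rule lets that single metric set the scale and flatten the expert's softmax.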
5. Noisy-Or Fusion Rule
The final multimodal fusion is accomplished by the noisy-or rule, a probabilistic model in which each modality acts as an independent “cause” for class $k$. The combined belief for class $k$ is computed as: $P(k) = 1 - \prod_m \bigl(1 - p_m(k)\bigr)$, where $p_m(k)$ is the uncertainty-scaled probability that modality $m$ assigns to class $k$. In this framework, high confidence from any single expert is preserved (i.e., if one $p_m(k)$ is close to 1, the fused probability is high for that class), and weak or noisy experts do not dominate the decision. No fusion parameters are trained; the method is zero-parameter at the fusion stage.
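The noisy-or rule is a one-line reduction over the modality axis. The sketch below assumes expert outputs stacked as `(M, H, W, K)`; the names `rgb` and `depth` and their values are illustrative.

```python
import numpy as np

def noisy_or(expert_probs):
    """expert_probs: (M, H, W, K) per-modality class probabilities.
    Returns fused scores (H, W, K): 1 - prod_m (1 - p_m)."""
    return 1.0 - np.prod(1.0 - expert_probs, axis=0)

rgb = np.array([[[0.34, 0.33, 0.33]]])     # near-uniform (flattened) expert
depth = np.array([[[0.05, 0.90, 0.05]]])   # confident expert
fused = noisy_or(np.stack([rgb, depth]))   # shape (1, 1, 3)
label = fused.argmax(axis=-1)              # per-pixel predicted class
```

The fused score for a class is never lower than the highest single-expert probability for that class, so the confident depth expert determines the label. The fused scores do not generally sum to 1 across classes; renormalize them if a proper distribution is needed downstream.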
6. Training Procedures
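The base segmentation networks are trained conventionally to minimize pixel-wise cross-entropy on clean data per modality. TempNet is subsequently trained with a negative log-likelihood loss using ground-truth labels and the pixel-wise temperature-scaled logits: $\mathcal{L}_{\mathrm{NLL}} = -\sum_{i,j} \log \mathrm{softmax}(z_{ij}/T_{ij})_{y_{ij}}$, where $z_{ij}$, $T_{ij}$, and $y_{ij}$ are the logits, predicted temperature, and ground-truth label at pixel $(i,j)$. Neither the uncertainty scaling nor the noisy-or fusion rule introduces any additional trainable parameters. The clean-data statistics of each uncertainty metric are estimated at training time and applied only during inference.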
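The training objective for the temperature map can be evaluated directly for fixed logits. The following sketch computes the mean per-pixel negative log-likelihood in NumPy; it is an illustration with hand-set temperatures and toy values, not a training loop, and the function name is an assumption.

```python
import numpy as np

def pixelwise_nll(logits, temperature, labels):
    """Mean NLL of softmax(logits / T) against integer labels.
    logits: (H, W, K); temperature: (H, W); labels: (H, W) ints."""
    scaled = logits / temperature[..., None]
    scaled = scaled - scaled.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = scaled - np.log(np.exp(scaled).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.mean()

logits = np.array([[[5.0, 0.0]]])   # confidently predicts class 0
labels = np.array([[1]])            # ground truth is class 1
nll_t1 = pixelwise_nll(logits, np.array([[1.0]]), labels)
nll_t5 = pixelwise_nll(logits, np.array([[5.0]]), labels)
```

For a confidently wrong pixel, raising the temperature lowers the loss, which is the signal that pushes a learned temperature map toward higher values in regions where the base network tends to be overconfident.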
7. Empirical Evaluation and Ablation Analysis
UNO was evaluated on AirSim-generated urban RGB-D datasets with two training conditions (fog level 0, fog level 50) and diverse test degradations including fog (level 100), snow, frost, motion blur, brightness shifts, masking, impulse noise, Gaussian noise, and shot noise. Performance was measured by mean Intersection-over-Union (mIoU).
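For reference, mIoU averages the per-class ratio of intersection to union between predicted and ground-truth label maps. The sketch below is a simplified single-image version; benchmark implementations usually accumulate intersections and unions over the whole test set before averaging, and the function name and toy arrays are assumptions.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in either map (single-image simplification)."""
    ious = []
    for k in range(num_classes):
        p, t = pred == k, target == k
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                      # class absent from both maps: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.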
Key comparative results are summarized below:
| Method | mIoU (avg. over all test conditions) | mIoU (in-distribution) | mIoU (degraded inputs) |
|---|---|---|---|
| RGB-D SSMA | ≈ 62.5% | >80% | Collapses under severe degradation |
| UNO | ≈ 78.4% | >80% | ≫ Baseline |
| UNO++ (with SSMA) | ≈ 80.0% | >80% | ≫ Baseline |
UNO improved mIoU by +15.9 pts over SSMA across all test conditions, and UNO++ by +17.5 pts (Tian et al., 2019). In-distribution performance remained comparable to other methods, but UNO degraded gracefully in conditions unseen at training, avoiding the catastrophic failures noted in learned fusion baselines under, for example, input blackout or severe noise.
Ablation studies indicated that single-pass (deterministic) entropy combined with spatial temperature scaling sufficed for robust uncertainty estimation; inclusion of MCDO-based metrics incurred computational overhead without commensurate performance gains. Choosing the minimum deviation ratio across uncertainty measures proved a conservative yet effective detector of input corruption.
8. Context, Significance, and Limitations
UNO’s principal advance is its ability to handle unanticipated, out-of-distribution input corruptions without fixed, explicit modeling of each degradation type. Dynamic uncertainty-driven recalibration, rather than hard learned gates or fixed fusion strategies, enables greater generalization to novel sensor failures or adversarial degradations. The spatial temperature maps provide region-adaptive modulation, critical for handling spatially structured corruption (e.g., distance-dependent fog). Noisy-or fusion preserves the influence of highly confident experts while suppressing unreliable ones, in contrast to averaging schemes, which dilute a confident expert’s signal, and multiplicative schemes, in which a single unreliable expert can suppress a correct one.
A plausible implication is that zero-parameter, uncertainty-aware fusion schemes may be broadly applicable for multimodal systems where training-time anticipation of every input perturbation is impractical. The method’s reliance on only pre-trained segmentation networks and a compact auxiliary TempNet ensures tractability and deployment efficiency.
Observed limitations include increased inference cost for Monte Carlo-based uncertainty estimation, though ablation finds deterministic entropy plus learned temperature sufficient for typical scenarios. Empirical results are demonstrated in a photo-realistic simulation environment; real-world generality depends on the similarity between simulated and physical sensor failure modes.
UNO establishes a rigorous, extensible probabilistic foundation for multimodal sensor fusion under uncertainty, emphasizing adaptivity and out-of-distribution robustness without fusion-stage re-training (Tian et al., 2019).