Quality-Aware Dynamic Fusion Network (QDFNet)
- QDFNet is a multimodal, dynamic fusion architecture that adaptively weights input modalities based on data-driven quality assessments.
- The design integrates specialized quality estimation modules with adaptive fusion techniques to mitigate noise, occlusion, and sensor degradation.
- Empirical validations in domains like RGB-D saliency, remote sensing, and emotion analysis highlight its superior robustness and performance under adverse conditions.
Quality-Aware Dynamic Fusion Network (QDFNet) is a class of multimodal, quality-adaptive fusion architectures designed to dynamically regulate the contribution of each modality during feature integration, based on data-driven assessments of modality reliability or quality. QDFNet addresses the inherent variability and unreliability often present in multimodal data (e.g., due to noise, occlusion, resolution mismatch, or sensor degradation), deploying explicit quality estimation modules and adaptive fusion mechanisms to achieve robust cross-modal reasoning. Architectures termed “QDFNet” appear in diverse application domains, including RGB-D/thermal salient object detection, remote sensing object detection (Optical-SAR), and emotion analysis from dynamic video/audio/text signals (Chen et al., 2020, Bao et al., 2024, Yu et al., 13 Mar 2025, Zhao et al., 27 Dec 2025).
1. Motivations and Principles
Multimodal fusion frequently suffers when one or more modalities exhibit low quality—be it due to sensor noise (depth blur, fog, occlusion), registration errors, or missing data. Traditional fusion architectures relying on naïve aggregation (e.g., concatenation or early/late fusion without modality trust assessment) are prone to performance collapse in these scenarios. QDFNet is designed to mitigate this by:
- Explicitly estimating data quality or reliability for each modality using dedicated subnetworks or token-based assessment modules.
- Dynamically allocating fusion weights at either global or fine-grained (pixel/spatial/temporal) levels, thus modulating the influence of unreliable modalities.
- Employing either soft gating (continuous reweighting) or hard region selection (masking, suppression) based on estimated quality maps.
- Integrating dynamic fusion with multiscale, hierarchical attention or spatial/temporal modeling to leverage complementary cues without over-relying on any modality (Chen et al., 2020, Bao et al., 2024, Zhao et al., 27 Dec 2025).
These mechanisms yield increased robustness and adaptive performance, confirmed across controlled noise/occlusion ablations and real-world benchmarks.
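As a concrete illustration of the soft-gating and hard-masking principles listed above, the following minimal PyTorch sketch fuses two modality feature streams under a learned per-pixel quality gate; the module name QualityGatedFusion, the layer sizes, and the hard_threshold switch are illustrative assumptions rather than code from the cited papers.

```python
# Minimal sketch of quality-gated fusion (illustrative; not the authors' code).
import torch
import torch.nn as nn


class QualityGatedFusion(nn.Module):
    """Soft gating (continuous reweighting) or hard masking of two modalities."""

    def __init__(self, channels, hard_threshold=None):
        super().__init__()
        # Lightweight quality estimator producing one map in [0, 1] per modality.
        self.quality_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 1),
            nn.Sigmoid(),
        )
        self.hard_threshold = hard_threshold  # None -> soft gating

    def forward(self, feat_a, feat_b):
        q = self.quality_head(torch.cat([feat_a, feat_b], dim=1))  # (B, 2, H, W)
        q_a, q_b = q[:, :1], q[:, 1:]
        if self.hard_threshold is not None:
            # Hard region selection: suppress locations deemed unreliable.
            q_a = (q_a > self.hard_threshold).float()
            q_b = (q_b > self.hard_threshold).float()
        # Quality-weighted aggregation instead of naive addition/concatenation.
        return q_a * feat_a + q_b * feat_b


if __name__ == "__main__":
    fuse = QualityGatedFusion(channels=64)
    a = torch.randn(2, 64, 32, 32)
    b = torch.randn(2, 64, 32, 32)
    print(fuse(a, b).shape)  # torch.Size([2, 64, 32, 32])
```

Setting hard_threshold switches the same module from continuous reweighting to hard region suppression.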
2. Architectural Instantiations
QDFNet is instantiated differently across task domains, but with common schematic stages:
| Domain | Modality Types | Quality Estimation | Adaptive Fusion Module | Reference |
|---|---|---|---|---|
| RGB-D SOD | RGB + Depth | DQA Subnet (pixelwise) | Feature-level ω-weighted fusion, MS decoder | (Chen et al., 2020) |
| VDT-SOD | RGB + Depth + Thermal | QA Region Subnets (mask) | Region-guided selective fusion, IIA, Edge Refinement | (Bao et al., 2024) |
| Optical-SAR Det. | Optical + SAR (missing-data) | DMQA (token-based) | OCNF (orthogonal, reliability-gated) | (Zhao et al., 27 Dec 2025) |
| Emotion Analysis | Video + Audio + Text | MLP score (temporal) | Softmax-weighted sum, temporal fusion | (Yu et al., 13 Mar 2025) |
Common sequence (a minimal skeleton follows the list):
- Feature Extraction: Per-modality deep backbones (ResNet, ViT, Swin, CNN, etc.).
- Quality Estimation: Subnet (CNN, MLP, or learnable tokens) infers reliability score(s) for each modality/region.
- Dynamic Fusion: Reliability scores gate or weight the per-modality features in subsequent fusion.
- Task-Specific Head: Detection, regression, or saliency prediction.
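A minimal skeleton of this four-stage sequence, assuming a global (per-modality) reliability score and tiny placeholder backbones and head, is sketched below; none of the component choices correspond to a specific published configuration.

```python
# Schematic QDFNet-style pipeline: feature extraction -> quality estimation ->
# dynamic fusion -> task head. All components are tiny placeholders.
import torch
import torch.nn as nn


class QDFNetSkeleton(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64, out_ch=1):
        super().__init__()
        # 1) Feature extraction: one backbone per modality (stand-ins for ResNet/ViT/etc.).
        self.backbone_a = nn.Sequential(nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU())
        self.backbone_b = nn.Sequential(nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU())
        # 2) Quality estimation: a global reliability score per modality.
        self.quality = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(2 * feat_ch, 2)
        )
        # 4) Task-specific head (dense prediction, e.g. saliency).
        self.head = nn.Conv2d(feat_ch, out_ch, 1)

    def forward(self, x_a, x_b):
        f_a, f_b = self.backbone_a(x_a), self.backbone_b(x_b)
        # 2) Reliability scores, normalized into per-modality fusion weights.
        w = self.quality(torch.cat([f_a, f_b], dim=1)).softmax(dim=-1)  # (B, 2)
        # 3) Dynamic fusion: reliability-weighted sum of modality features.
        fused = w[:, 0].view(-1, 1, 1, 1) * f_a + w[:, 1].view(-1, 1, 1, 1) * f_b
        return self.head(fused), w  # task prediction plus the fusion weights


if __name__ == "__main__":
    net = QDFNetSkeleton()
    rgb, depth = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
    pred, weights = net(rgb, depth)
    print(pred.shape, weights)  # torch.Size([1, 1, 64, 64]) and weights summing to 1
```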
3. Quality Estimation Mechanisms
3.1. Pixel/Region-level Quality Maps
In RGB-D/thermal SOD (Chen et al., 2020, Bao et al., 2024), a dedicated DQA subnet (and QA subnets in the triplet setting) generates a pixel-level quality map ω(x, y) ∈ [0, 1], interpreted as each spatial location’s “contribution value”. The map is estimated via an encoder–decoder (e.g., VGG-19 or ResNet-34) trained using weak supervision from pseudo-GT.
The RGB and depth side-output feature maps are then combined under this gate, with ω modulating the fusion spatially (an ω-weighted combination of the per-modality features, as summarized in the table above).
Pseudo-labels for QA are computed by comparing preliminary saliency predictions against GT, identifying “reliable” and “unreliable” regions for each modality (Bao et al., 2024).
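The following minimal sketch shows one way such pseudo-GT and the associated quality supervision can be written in PyTorch; the agreement tolerance tol and the function names are illustrative assumptions, and the papers’ exact labeling rules may differ.

```python
# Sketch of weak supervision for a pixel-wise quality (DQA/QA-style) subnet.
# A location is pseudo-labeled "reliable" for a modality when that modality's
# preliminary saliency prediction agrees with the ground truth there.
import torch
import torch.nn.functional as F


def quality_pseudo_labels(prelim_pred, gt, tol=0.3):
    """prelim_pred, gt: (B, 1, H, W) maps in [0, 1]. Returns binary pseudo-GT."""
    return ((prelim_pred - gt).abs() < tol).float()


def quality_supervision_loss(quality_map, prelim_pred, gt):
    """BCE between the estimated quality map omega and the agreement pseudo-GT."""
    with torch.no_grad():
        pseudo_gt = quality_pseudo_labels(prelim_pred, gt)
    return F.binary_cross_entropy(quality_map, pseudo_gt)
```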
3.2. Token-based Iterative Reliability (Optical-SAR)
For remote sensing with missing data (Zhao et al., 27 Dec 2025), the DMQA module employs learnable reference tokens per modality, which iteratively adjust to encode both feature alignment and expected magnitude:
- At each iteration, features are checked against token-guided magnitude and cosine-similarity (“directional consistency”).
- Reliability is constructed by combining the magnitude-consistency and directional-consistency terms at each spatial location, separately for each modality.
- After the final iteration, the resulting reliability map is propagated to the fusion block to regulate the channel-wise weighting (see the sketch below).
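The sketch below gives one plausible PyTorch reading of this scheme: a learnable reference token is compared against features via cosine similarity (directional consistency) and a norm ratio (magnitude consistency), and refined over a fixed number of iterations. The combination rule, the EMA-style token update, and all hyperparameters are assumptions for illustration, not the published DMQA formulation.

```python
# One plausible reading of token-guided iterative reliability (DMQA-like).
# The token update and the combination of the two consistency terms are
# assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenReliability(nn.Module):
    def __init__(self, channels, iters=3, momentum=0.9):
        super().__init__()
        self.token = nn.Parameter(torch.randn(channels))  # learnable reference token
        self.iters, self.momentum = iters, momentum

    def forward(self, feat):
        """feat: (B, C, H, W) -> reliability map of shape (B, 1, H, W) in [0, 1]."""
        token = self.token.view(1, -1, 1, 1)
        for _ in range(self.iters):
            # Directional consistency: cosine similarity between features and the token.
            direction = F.cosine_similarity(feat, token.expand_as(feat), dim=1)
            # Magnitude consistency: feature norm relative to the token norm.
            ratio = feat.norm(dim=1) / (token.norm() + 1e-6)
            magnitude = torch.exp(-(ratio - 1.0).abs())  # 1 when the norms match
            reliability = 0.5 * (direction + 1.0) * magnitude  # both factors in [0, 1]
            # Refine the reference token toward the reliability-weighted mean feature.
            w = reliability.unsqueeze(1)
            mean_feat = (w * feat).sum(dim=(0, 2, 3)) / (w.sum() + 1e-6)
            token = self.momentum * token + (1 - self.momentum) * mean_feat.view(1, -1, 1, 1)
        return reliability.unsqueeze(1)


if __name__ == "__main__":
    print(TokenReliability(channels=64)(torch.randn(2, 64, 16, 16)).shape)  # (2, 1, 16, 16)
```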
3.3. Temporal Quality Scores
In dynamic audio/video/text fusion (Yu et al., 13 Mar 2025), a lightweight MLP produces a quality score for each modality at every time step t.
A softmax over these scores yields per-modality fusion weights at each step, enabling per-frame adaptivity.
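A minimal sketch of this temporal scoring-and-weighting step, with illustrative feature widths and a shared projection that are not taken from the paper, could look as follows.

```python
# Sketch of per-time-step quality scoring and softmax-weighted temporal fusion
# for video/audio/text features. Feature widths and the shared projection are
# illustrative choices, not the published configuration.
import torch
import torch.nn as nn


class TemporalQualityFusion(nn.Module):
    def __init__(self, dims=(512, 256, 768), hidden=128):
        super().__init__()
        # One lightweight MLP scorer per modality: frame feature -> scalar score.
        self.scorers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1)) for d in dims]
        )
        # Project every modality to a shared width before the weighted sum.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])

    def forward(self, feats):
        """feats: list of (B, T, D_m) tensors, one per modality."""
        scores = torch.cat([s(f) for s, f in zip(self.scorers, feats)], dim=-1)  # (B, T, M)
        weights = scores.softmax(dim=-1)  # per-frame fusion weights over modalities
        stacked = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=-1)  # (B, T, H, M)
        return (stacked * weights.unsqueeze(2)).sum(dim=-1)  # (B, T, H)


if __name__ == "__main__":
    video, audio, text = (torch.randn(2, 10, d) for d in (512, 256, 768))
    print(TemporalQualityFusion()((video, audio, text)).shape)  # torch.Size([2, 10, 128])
```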
4. Dynamic Fusion Modules
The quality-guided fusion stage translates estimated reliability into adaptive aggregation:
- Feature-level Fusion (RGB-D SOD): the pixel-wise quality map ω reweights the depth features before they are aggregated with the RGB features, so that unreliable depth regions contribute little to the fused representation (Chen et al., 2020).
- Region-guided Selective Fusion (VDT SOD): at each scale, QA-derived region masks select which parts of the depth and thermal features participate in depth- and thermal-guided fusion with the visible stream, with subsequent intra- and inter-attention (IIA) modules for multistream feature interaction (Bao et al., 2024).
- OCNF (Orthogonal Constraint Normalization Fusion) (Optical-SAR):
Per-modality features are projected into orthogonal subspaces and then fused as a reliability-gated sum, in which channel-wise normalized reliability scores weight each modality's projected features (Zhao et al., 27 Dec 2025); a sketch follows below.
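The sketch referenced above illustrates an OCNF-like block under simplifying assumptions: 1×1 projections per modality, a soft orthogonality penalty between the projection matrices, and a softmax-normalized reliability gate; the published module may differ in all of these details.

```python
# Hedged sketch of an OCNF-like block: per-modality orthogonal projections plus
# a reliability-gated sum. The soft orthogonality penalty and gating layout are
# assumptions for illustration, not the published formulation.
import torch
import torch.nn as nn


class OrthogonalReliabilityFusion(nn.Module):
    def __init__(self, channels, num_modalities=2):
        super().__init__()
        # One 1x1 projection per modality into its own subspace.
        self.proj = nn.ModuleList(
            [nn.Conv2d(channels, channels, 1, bias=False) for _ in range(num_modalities)]
        )

    def orthogonality_penalty(self):
        # Soft constraint pushing different modalities' projections to be orthogonal.
        mats = [p.weight.flatten(1) for p in self.proj]  # each (C, C)
        return sum(
            (mats[i] @ mats[j].t()).pow(2).mean()
            for i in range(len(mats))
            for j in range(i + 1, len(mats))
        )

    def forward(self, feats, reliability):
        """feats: list of (B, C, H, W); reliability: list of (B, C, 1, 1) scores."""
        # Channel-wise normalization of reliability across modalities.
        gates = torch.stack(reliability, dim=0).softmax(dim=0)
        return sum(gates[m] * self.proj[m](f) for m, f in enumerate(feats))
```

The orthogonality_penalty term would be added to the training objective with a small weight, in the spirit of the auxiliary constraints discussed in the next section.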
5. Loss Functions and Training Paradigms
QDFNet frameworks typically employ multi-phase training with a composite objective (a sketch of such an objective follows the list):
- Task Loss: Cross-entropy for segmentation/saliency, regression (MSE) for continuous labels, detection losses (classification, localization, objectness) for remote sensing (Chen et al., 2020, Yu et al., 13 Mar 2025, Zhao et al., 27 Dec 2025).
- Quality Supervision: Binary cross-entropy on DQA/QA subnet quality maps using pseudo-GT to enforce accurate trust-region localization (Bao et al., 2024).
- Auxiliary Losses: Weak supervision on edge maps, attention regularization, or orthogonality constraints as in OCNF.
- Joint Fine-tuning: End-to-end optimization over the fusion pipeline, with frozen or partially adaptable feature extractors depending on stage.
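For concreteness, a minimal composite objective for a dense-prediction instantiation might be assembled as below; the BCE task term, the argument names, and the weights lambda_q and lambda_o are assumptions, and the regression and detection settings would substitute their own task losses.

```python
# Illustrative composite objective combining the loss terms listed above; the
# weights lambda_q and lambda_o are placeholders tuned per task in practice.
import torch.nn.functional as F


def qdfnet_total_loss(task_logits, task_gt, quality_map, quality_pseudo_gt,
                      ortho_penalty, lambda_q=1.0, lambda_o=0.1):
    task_loss = F.binary_cross_entropy_with_logits(task_logits, task_gt)   # task term
    quality_loss = F.binary_cross_entropy(quality_map, quality_pseudo_gt)  # quality supervision
    return task_loss + lambda_q * quality_loss + lambda_o * ortho_penalty  # + auxiliary term
```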
6. Experimental Validation and Empirical Findings
Empirical results consistently demonstrate that QDFNet architectures outperform static fusion and quality-unaware baselines, particularly under adverse or missing-modality conditions:
- Salient Object Detection: On benchmarks such as NJUDS, NLPR, STEREO (with degraded depth), QDFNet achieves higher S-measure, F-measure, and lower MAE than traditional bi-stream or simple-add fusion. Quality modulation prevents collapse under poor depth (Chen et al., 2020).
- VDT Saliency: On VDT-2048, QDFNet (QSF-Net) improves S-measure over prior methods, suffers a measurable S-measure drop when the QA subnet is ablated, and exhibits resilience in high-degradation scenarios (Bao et al., 2024).
- Optical-SAR Detection: QDFNet maintains the highest mAP@0.5 (e.g., 76.6 at MR=0.3) and mAP (38.3 at MR=0.3) under synthetic missing-modality conditions, and the OCNF and DMQA blocks exhibit additive improvements over either alone (Zhao et al., 27 Dec 2025).
- Emotion Estimation: On Hume-Vidmimic2, QDFNet secures notable Pearson correlation gains (0.3626 tri-modal vs. 0.25 for the best baseline) and shows a limited performance drop under occlusion or noise (5% for dynamic fusion vs. 20% for static fusion) (Yu et al., 13 Mar 2025).
Ablation studies for QA subnetworks, dynamic weighting, and attention modules consistently validate their necessity.
7. Domain-Specific Implementations and Comparison
QDFNet’s core principle—dynamically weighting modalities/features according to estimated quality—proves adaptable across problem settings:
- Remote Sensing (Optical-SAR): Introduction of token-based reliability and orthogonality-preserving fusion counters challenges in missing-modality, all-weather detection (Zhao et al., 27 Dec 2025).
- RGB-D/VDT Saliency: Fine-grained spatially-varying fusion via DQA/QA subnets produces robust maps in scenes with unreliable depth/thermal modalities (Chen et al., 2020, Bao et al., 2024).
- Emotion Recognition: Temporal quality inferencing and contrastive-aligned embedding support frame-resolved, multimodal interpretability (Yu et al., 13 Mar 2025).
Across the published evaluations surveyed above, QDFNet variants outperform non-quality-aware baselines and prior state-of-the-art methods, with the margin widening as the severity of multimodal degradation increases.
QDFNet thus constitutes a principled, empirically validated family of architectures for robust multimodal inference under input quality variability, unifying dynamic reliability estimation and adaptive fusion in end-to-end deep networks. Its variant modules (DQA, QA, DMQA, OCNF, MLP-based gating) offer a toolkit for cross-domain application, extending from saliency and emotion to sensor fusion and object detection in challenging, real-world conditions.