Multi-Level Global-Local Fusion (MGLF-Net)
- The paper introduces MGLF-Net, a multi-level architecture that synergistically fuses global (e.g., transformer-based) and local (CNN-based) features to enhance task-specific performance.
- It employs a dual-backbone design with per-level global-local fusion blocks using cross-attention, reaching an SRCC of up to 0.9039 on AIGC image quality assessment and a Dice score of 82.29% on echocardiogram segmentation.
- The method's flexible fusion paradigm is validated across AIGC quality assessment, echocardiogram segmentation, and face recognition, suggesting potential for lightweight and multi-modal adaptations.
The Multi-Level Global-Local Fusion Network (MGLF-Net) refers to a class of deep neural architectures that perform joint fusion of globally and locally extracted features at multiple semantic levels for tasks requiring nuanced, multi-granular representations. MGLF-Net has been introduced in several domains, including blind perceptual quality assessment of AI-generated content (AIGC) images (Meng et al., 23 Jul 2025), multi-view echocardiogram video segmentation (Zheng et al., 2023), and face recognition (Yu et al., 25 Nov 2024), reflecting its architectural flexibility and universal fusion principle. The canonical design incorporates dual-branch feature extraction (e.g., transformer and CNN paths), per-level global/local contextual fusion, and a joint aggregation head.
1. Core Architecture and Modules
The defining trait of MGLF-Net is parallel extraction of global and local feature representations at multiple network depths, followed by fusion using dedicated blocks and joint aggregation for the downstream task.
AIGC Image Quality Assessment (Meng et al., 23 Jul 2025):
- Dual Backbone: CLIP-B/32 transformer for global features and ResNet-50 for local features.
- Multi-Level Extraction: Features {G¹…G⁴} from CLIP at layers 3/6/9/12; {L¹…L⁴} from ResNet-50 stages 1–4 (transformed by trainable adapters).
- Global-Local Fusion (per level): At each level $\ell$, a GLF block receives a learnable query set $Q_0^\ell$ and refines it through (see the sketch after this list):
  - Cross-attention with global tokens: $Q_1^\ell = Q_0^\ell + \mathrm{CA}(Q_0^\ell, G^\ell)$
  - Cross-attention with local tokens: $Q_2^\ell = Q_1^\ell + \mathrm{CA}(Q_1^\ell, L^\ell)$
  - MLP-based nonlinearity: $Q^\ell = Q_2^\ell + \mathrm{MLP}(Q_2^\ell)$
- Joint Aggregation: Concatenate outputs from all levels, apply global average pooling, and pass the result through a regression head for the final prediction.
- Frozen Backbones: Only adapters, fusion, and regression head are trainable.
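The following PyTorch sketch illustrates one plausible realization of a per-level GLF block as described above: learnable queries refined by residual cross-attention over global tokens, then local tokens, then an MLP. The class name, normalization placement, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GLFBlock(nn.Module):
    """Per-level global-local fusion block (illustrative sketch).

    Learnable queries attend first to global (transformer) tokens, then to
    local (CNN) tokens, and are refined by an MLP. Residual connections and
    pre-norm are assumed; the exact layout in the paper may differ.
    """

    def __init__(self, dim: int = 768, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, g_tokens: torch.Tensor, l_tokens: torch.Tensor) -> torch.Tensor:
        # g_tokens: (B, N_g, dim) global tokens from the transformer branch
        # l_tokens: (B, N_l, dim) local tokens from the CNN branch (after adapters)
        B = g_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)                 # (B, M, dim)
        q = q + self.attn_global(self.norm1(q), g_tokens, g_tokens)[0]  # cross-attend to global
        q = q + self.attn_local(self.norm2(q), l_tokens, l_tokens)[0]   # cross-attend to local
        q = q + self.mlp(self.norm3(q))                                 # MLP refinement
        return q                                                        # (B, M, dim) fused queries


if __name__ == "__main__":
    blk = GLFBlock(dim=768, num_queries=8)
    g = torch.randn(2, 50, 768)   # e.g. CLIP-B/32 tokens at one level
    l = torch.randn(2, 49, 768)   # e.g. adapted ResNet-50 feature-map tokens
    print(blk(g, l).shape)        # torch.Size([2, 8, 768])
```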
Multi-view Echocardiogram Video Segmentation (Zheng et al., 2023):
- View-specific Encoders: Each view (PLVLA, LVSA, A4C) processes frames independently via DeeplabV3/ResNet-101.
- Multi-view Global Fusion (MGFM): View-wise non-local attention block captures cross-view context, especially synchrony among cardiac cycles.
- Multi-view Local Fusion (MLFM): Uses pseudo-segmentation masks and learned center-weighted spatiotemporal masks to direct attention on consistent anatomical regions before fusing via cross-view attention.
- Global-Local Integration: Sum global and local fused outputs for each view before decoding into segmentation masks.
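A minimal sketch of the cross-view fusion and integration steps, assuming MGFM behaves like multi-head attention from each view's tokens to the concatenation of all views' tokens; MLFM is stubbed out because its mask-guided attention is not reproduced here. Class and variable names are hypothetical.

```python
from typing import List

import torch
import torch.nn as nn

class MultiViewGlobalFusion(nn.Module):
    """Illustrative stand-in for MGFM: each view's spatiotemporal tokens attend
    to the tokens of all views, capturing cross-view context such as synchrony
    among cardiac cycles."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens: List[torch.Tensor]) -> List[torch.Tensor]:
        # view_tokens: one (B, N_v, dim) token sequence per view
        all_tokens = torch.cat(view_tokens, dim=1)     # (B, sum N_v, dim) multi-view context
        fused = []
        for f in view_tokens:
            out, _ = self.attn(self.norm(f), all_tokens, all_tokens)
            fused.append(f + out)                      # residual cross-view attention
        return fused


if __name__ == "__main__":
    views = [torch.randn(2, 64, 256) for _ in range(3)]   # PLVLA, LVSA, A4C tokens
    globally_fused = MultiViewGlobalFusion(dim=256)(views)
    # MLFM output stubbed with zeros; in the paper it comes from mask-guided
    # cross-view attention over consistent anatomical regions.
    locally_fused = [torch.zeros_like(f) for f in views]
    # Global-local integration: sum the two fused streams per view before decoding.
    integrated = [g + l for g, l in zip(globally_fused, locally_fused)]
    print(integrated[0].shape)                             # torch.Size([2, 64, 256])
```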
Face Recognition (Yu et al., 25 Nov 2024):
- Global Path: Standard GAP over backbone output and projection.
- Local Path: Multi-head, multi-scale local feature extractors use different kernel sizes and spatial/channel attention.
- Adaptive Feature Fusion: Local and global feature energies are computed as L2 norms, normalized, and used to weight an adaptive sum that yields the final embedding.
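A minimal sketch of the energy-based adaptive fusion, assuming the two L2-norm energies are softmax-normalized and used as convex weights over the global and local embeddings; the exact normalization in the paper may differ.

```python
import torch
import torch.nn.functional as F

def adaptive_fuse(global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
    """Weight global and local embeddings by their normalized L2-norm energies.

    global_feat, local_feat: (B, D) embeddings from the two paths.
    Softmax normalization of the two energies is an illustrative choice.
    """
    e_g = global_feat.norm(p=2, dim=1, keepdim=True)           # (B, 1) global energy
    e_l = local_feat.norm(p=2, dim=1, keepdim=True)            # (B, 1) local energy
    weights = F.softmax(torch.cat([e_g, e_l], dim=1), dim=1)   # (B, 2), sums to 1
    return weights[:, :1] * global_feat + weights[:, 1:] * local_feat


if __name__ == "__main__":
    g, l = torch.randn(4, 512), torch.randn(4, 512)
    print(adaptive_fuse(g, l).shape)   # torch.Size([4, 512])
```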
2. Mathematical Formulation
AIGC Image Quality Assessment (Meng et al., 23 Jul 2025)
Let $B$ denote the batch size, $C$ the feature dimension, $N_\ell$ the number of Transformer/CNN tokens per level $\ell$, and $M$ the number of learnable queries.
Per-level extraction:
- Transformer: $G^\ell \in \mathbb{R}^{B \times N_\ell \times C}$, taken from CLIP layers 3, 6, 9, and 12.
- CNN: $L^\ell = \mathrm{Adapter}_\ell(F^\ell_{\mathrm{CNN}}) \in \mathbb{R}^{B \times N_\ell \times C}$, from ResNet-50 stages 1–4.
GLF Block (level $\ell$): with learnable queries $Q_0^\ell \in \mathbb{R}^{M \times C}$,
$$Q_1^\ell = Q_0^\ell + \mathrm{CA}(Q_0^\ell, G^\ell), \qquad Q_2^\ell = Q_1^\ell + \mathrm{CA}(Q_1^\ell, L^\ell), \qquad Q^\ell = Q_2^\ell + \mathrm{MLP}(Q_2^\ell).$$
Joint aggregation: $\hat{y} = \mathrm{Head}\big(\mathrm{GAP}([Q^1; Q^2; Q^3; Q^4])\big)$, where $[\,\cdot\,;\cdot\,]$ denotes concatenation across levels and $\mathrm{Head}$ is the regression head.
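A short, self-contained sketch of this aggregation step under the definitions above: per-level fused queries are concatenated, average-pooled, and regressed to a scalar score. The single linear head is an assumption.

```python
import torch
import torch.nn as nn

# Illustrative joint aggregation: concatenate fused queries from all levels,
# global-average-pool over the query axis, then regress a scalar quality score.
dim, num_levels, num_queries, batch = 768, 4, 8, 2
head = nn.Linear(dim, 1)   # assumed regression head

per_level_queries = [torch.randn(batch, num_queries, dim) for _ in range(num_levels)]
queries = torch.cat(per_level_queries, dim=1)   # (B, num_levels * M, dim)
pooled = queries.mean(dim=1)                    # GAP over all fused queries -> (B, dim)
mos_pred = head(pooled)                         # (B, 1) predicted MOS
print(mos_pred.shape)                           # torch.Size([2, 1])
```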
Multi-view Video Segmentation (Zheng et al., 2023)
Input per view: $X_v \in \mathbb{R}^{T \times H \times W}$, for views $v \in \{\text{PLVLA}, \text{LVSA}, \text{A4C}\}$.
- Encoder: $F_v = E_v(X_v)$, computed independently per view (DeeplabV3 with ResNet-101).
- MGFM: view-wise non-local fusion on the concatenated features, $F_v^{G} = \mathrm{NL}\big(F_v, [F_1; F_2; F_3]\big)$.
- MLFM: a weighted local mask (pseudo-segmentation combined with a learned center-weighted spatiotemporal mask) restricts cross-view attention to consistent anatomical regions, producing $F_v^{L}$.
- Global-Local Integration: $\hat{F}_v = F_v^{G} + F_v^{L}$, decoded into the segmentation mask for view $v$.
- Loss function: $\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \alpha\,\mathcal{L}_{\mathrm{cycle}}$ (supervised segmentation loss plus dense cycle loss).
3. Training Objectives and Protocols
AIGC IQA (Meng et al., 23 Jul 2025)
- Loss: Mean squared error for MOS regression: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{B}\sum_{i=1}^{B}\big(\hat{y}_i - y_i\big)^2$, where $y_i$ is the ground-truth MOS.
- Optimization: AdamW, learning rate 1e-5, weight decay 1e-5, batch size 16, 30 epochs. Only the adapters, GLF blocks, and regression head are trained (both backbones frozen).
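A minimal sketch of this protocol: both backbones frozen, AdamW over the remaining parameters, and an MSE training step. The module attribute names (`clip_backbone`, `cnn_backbone`) are hypothetical.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.AdamW:
    """Freeze the CLIP and ResNet backbones and optimize only the adapters,
    GLF blocks, and regression head, matching the protocol above.
    Attribute names (clip_backbone, cnn_backbone) are hypothetical."""
    for module in (model.clip_backbone, model.cnn_backbone):
        for p in module.parameters():
            p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5, weight_decay=1e-5)

def train_step(model, optimizer, images, mos):
    """One training step with the MSE objective for MOS regression."""
    optimizer.zero_grad()
    pred = model(images).squeeze(-1)           # (B,) predicted quality scores
    loss = nn.functional.mse_loss(pred, mos)   # mean squared error over the batch
    loss.backward()
    optimizer.step()
    return loss.item()
```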
Echocardiogram Segmentation (Zheng et al., 2023)
- Supervised segmentation: Sparse annotation on 5 frames per video; binary cross-entropy.
- Dense cycle loss: Optimizes global features to enforce temporal cyclicity, leveraging cardiac rhythm across views.
- Total loss: Weighted sum, parameter α set to 1 in all experiments.
- Optimization: Adam, learning rate 3e-4; each batch contains 8 labeled frames and 1 unlabeled clip; cosine annealing over 100 epochs.
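A brief sketch of the combined objective and reported optimization settings; `cycle_term` is a placeholder for the dense cycle loss, whose computation is not reproduced here.

```python
import torch
import torch.nn.functional as F

ALPHA = 1.0  # weight of the dense cycle loss (set to 1 in all experiments)

def total_loss(pred_masks: torch.Tensor, gt_masks: torch.Tensor,
               cycle_term: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the sparse supervised segmentation loss (BCE on the
    5 annotated frames) and the dense cycle loss. `cycle_term` stands in
    for the paper's temporal-cyclicity objective on global features."""
    seg = F.binary_cross_entropy_with_logits(pred_masks, gt_masks)
    return seg + ALPHA * cycle_term

# Optimizer and schedule as reported: Adam at 3e-4 with cosine annealing over 100 epochs
# optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
```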
4. Empirical Results and Ablations
AIGC IQA (Meng et al., 23 Jul 2025)
| Dataset | SRCC (Qual.) | PLCC (Qual.) | SRCC (Corr.) | PLCC (Corr.) |
|---|---|---|---|---|
| AGIQA-1K | 0.8648 | 0.8874 | — | — |
| AGIQA-3K | 0.9039 | 0.9310 | 0.8410 | 0.8968 |
| AIGCIQA2023 | 0.8499 | 0.8664 | 0.7764 | 0.7649 |
Ablations show that both Transformer and CNN features are individually effective, but the highest accuracy is reached with full multi-level global-local fusion. Removing either global or local features causes a drop in both SRCC and PLCC metrics. Decreasing the number of levels also reduces accuracy.
Multi-view Echocardiogram (Zheng et al., 2023)
| Method | Dice (%) |
|---|---|
| DeeplabV3 (single-view) | 74.46 |
| U-Net (single-view) | 75.84 |
| CSS (semi-sup) | 78.83 |
| GL-Fusion (MGLF-Net) | 82.29 |
Per-view, GL-Fusion significantly improves over the best single-view models, most notably on PLVLA (+10.49%).
Ablation studies confirm that MGFM and MLFM each contribute substantial gains; both together with the dense cycle loss yield the best result (82.29%).
5. Comparative Analysis Across Domains
The global-local fusion paradigm is highly generalizable. In AIGC IQA, the dual-path, per-level fusion enables robust semantic-aware regression, outperforming single-level or single-path baselines. In medical video segmentation, coordinated MGFM and MLFM modules effectively leverage anatomical structure and temporal regularities. For face recognition, local/global feature quality assessment and adaptive weighting directly improve robustness to masked or deformed regions (Yu et al., 25 Nov 2024).
A key insight is that explicit architectural modeling of both local and global contexts—integrated at multiple depths—matches or outperforms bespoke attention-based or naive fusion networks across tasks. Both ablation and performance results indicate that neglecting either the local path (fine-grained structures) or global context (semantic coherence) degrades performance.
6. Limitations and Directions for Future Research
Current MGLF-Net designs presuppose access to high-capacity global and local feature extractors and sufficient computational resources, as all reviewed implementations use frozen, high-parameter backbones (e.g., ResNet, CLIP). The fusion blocks themselves add further trainable parameters.
A plausible implication is that future research may explore lightweight MGLF-Net variants for resource-constrained settings, or extend the paradigm to additional modalities and self-supervised tasks. In the medical domain, mechanisms for improved interpretability and quantification of uncertainty in multi-level fusions remain open areas. The general fusion templates of MGLF-Net could be adopted or adapted for domains where multi-scale and multi-source context synchronization is critical.
7. Summary and Significance
The Multi-Level Global-Local Fusion Network (MGLF-Net) constitutes a unified architectural principle for synergistically integrating global and local features at multiple abstraction levels, substantially enhancing representation learning in tasks requiring both global context understanding and local detail preservation. Empirical results in AIGC image quality assessment, cardiac image segmentation, and face recognition demonstrate its efficacy and flexibility, establishing a broadly applicable design backbone for future multi-source and multi-scale fusion networks (Meng et al., 23 Jul 2025, Zheng et al., 2023, Yu et al., 25 Nov 2024).