Multi-Level Global-Local Fusion (MGLF-Net)
- The paper introduces MGLF-Net, a multi-level architecture that synergistically fuses global (e.g., transformer-based) and local (CNN-based) features to enhance task-specific performance.
- It employs a dual-backbone design with per-level global-local fusion blocks using cross-attention, reaching an SRCC of up to 0.9039 on AIGC image quality assessment and a Dice score of 82.29% on echocardiogram segmentation.
- The method's flexible fusion paradigm is validated across AIGC quality assessment, echocardiogram segmentation, and face recognition, suggesting potential for lightweight and multi-modal adaptations.
The Multi-Level Global-Local Fusion Network (MGLF-Net) refers to a class of deep neural architectures that perform joint fusion of globally and locally extracted features at multiple semantic levels for tasks requiring nuanced, multi-granular representations. MGLF-Net has been introduced in several domains, including blind perceptual quality assessment of AI-generated content (AIGC) images (Meng et al., 23 Jul 2025), multi-view echocardiogram video segmentation (Zheng et al., 2023), and face recognition (Yu et al., 25 Nov 2024), reflecting its architectural flexibility and universal fusion principle. The canonical design incorporates dual-branch feature extraction (e.g., transformer and CNN paths), per-level global/local contextual fusion, and a joint aggregation head.
1. Core Architecture and Modules
The defining trait of MGLF-Net is parallel extraction of global and local feature representations at multiple network depths, followed by fusion using dedicated blocks and joint aggregation for the downstream task.
AIGC Image Quality Assessment (Meng et al., 23 Jul 2025):
- Dual Backbone: CLIP-B/32 transformer for global features and ResNet-50 for local features.
- Multi-Level Extraction: Features {G¹…G⁴} from CLIP at layers 3/6/9/12; {L¹…L⁴} from ResNet-50 stages 1–4 (transformed by trainable adapters).
- Global-Local Fusion (per level): At each level $\ell$, a GLF block receives a learnable query set $Q_0^\ell$ and refines it through (see the sketch after this list):
  - Cross-attention with global tokens: $Q_1^\ell = Q_0^\ell + \mathrm{CA}(Q_0^\ell, G^\ell)$
  - Cross-attention with local tokens: $Q_2^\ell = Q_1^\ell + \mathrm{CA}(Q_1^\ell, L^\ell)$
  - MLP-based nonlinearity: $Q^\ell = Q_2^\ell + \mathrm{MLP}(Q_2^\ell)$
- Joint Aggregation: Concatenate outputs from all levels, apply global average pooling, and pass the result through a regression head for the final prediction.
- Frozen Backbones: Only adapters, fusion, and regression head are trainable.
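The following PyTorch sketch illustrates one plausible realization of a per-level GLF block as described above: learnable queries refined by residual cross-attention over global tokens, then local tokens, then an MLP. The class name, normalization placement, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GLFBlock(nn.Module):
    """Per-level global-local fusion block (illustrative sketch).

    Learnable queries attend first to global (transformer) tokens, then to
    local (CNN) tokens, and are refined by an MLP. Residual connections and
    pre-norm are assumed; the exact layout in the paper may differ.
    """

    def __init__(self, dim: int = 768, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_local = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, g_tokens: torch.Tensor, l_tokens: torch.Tensor) -> torch.Tensor:
        # g_tokens: (B, N_g, dim) global tokens from the transformer branch
        # l_tokens: (B, N_l, dim) local tokens from the CNN branch (after adapters)
        B = g_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)                 # (B, M, dim)
        q = q + self.attn_global(self.norm1(q), g_tokens, g_tokens)[0]  # cross-attend to global
        q = q + self.attn_local(self.norm2(q), l_tokens, l_tokens)[0]   # cross-attend to local
        q = q + self.mlp(self.norm3(q))                                 # MLP refinement
        return q                                                        # (B, M, dim) fused queries


if __name__ == "__main__":
    blk = GLFBlock(dim=768, num_queries=8)
    g = torch.randn(2, 50, 768)   # e.g. CLIP-B/32 tokens at one level
    l = torch.randn(2, 49, 768)   # e.g. adapted ResNet-50 feature-map tokens
    print(blk(g, l).shape)        # torch.Size([2, 8, 768])
```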
Multi-view Echocardiogram Video Segmentation (Zheng et al., 2023):
- View-specific Encoders: Each view (PLVLA, LVSA, A4C) processes frames independently via DeeplabV3/ResNet-101.
- Multi-view Global Fusion (MGFM): View-wise non-local attention block captures cross-view context, especially synchrony among cardiac cycles.
- Multi-view Local Fusion (MLFM): Uses pseudo-segmentation masks and learned center-weighted spatiotemporal masks to direct attention on consistent anatomical regions before fusing via cross-view attention.
- Global-Local Integration: Sum global and local fused outputs for each view before decoding into segmentation masks.
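A minimal sketch of the cross-view fusion and integration steps, assuming MGFM behaves like multi-head attention from each view's tokens to the concatenation of all views' tokens; MLFM is stubbed out because its mask-guided attention is not reproduced here. Class and variable names are hypothetical.

```python
from typing import List

import torch
import torch.nn as nn

class MultiViewGlobalFusion(nn.Module):
    """Illustrative stand-in for MGFM: each view's spatiotemporal tokens attend
    to the tokens of all views, capturing cross-view context such as synchrony
    among cardiac cycles."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens: List[torch.Tensor]) -> List[torch.Tensor]:
        # view_tokens: one (B, N_v, dim) token sequence per view
        all_tokens = torch.cat(view_tokens, dim=1)     # (B, sum N_v, dim) multi-view context
        fused = []
        for f in view_tokens:
            out, _ = self.attn(self.norm(f), all_tokens, all_tokens)
            fused.append(f + out)                      # residual cross-view attention
        return fused


if __name__ == "__main__":
    views = [torch.randn(2, 64, 256) for _ in range(3)]   # PLVLA, LVSA, A4C tokens
    globally_fused = MultiViewGlobalFusion(dim=256)(views)
    # MLFM output stubbed with zeros; in the paper it comes from mask-guided
    # cross-view attention over consistent anatomical regions.
    locally_fused = [torch.zeros_like(f) for f in views]
    # Global-local integration: sum the two fused streams per view before decoding.
    integrated = [g + l for g, l in zip(globally_fused, locally_fused)]
    print(integrated[0].shape)                             # torch.Size([2, 64, 256])
```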
Face Recognition (Yu et al., 25 Nov 2024):
- Global Path: Standard GAP over backbone output and projection.
- Local Path: Multi-head, multi-scale local feature extractors use different kernel sizes and spatial/channel attention.
- Adaptive Feature Fusion: Local and global feature energies are computed as L2 norms, normalized, and used to weight an adaptive sum that yields the final embedding.
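A minimal sketch of the energy-based adaptive fusion, assuming the two L2-norm energies are softmax-normalized and used as convex weights over the global and local embeddings; the exact normalization in the paper may differ.

```python
import torch
import torch.nn.functional as F

def adaptive_fuse(global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
    """Weight global and local embeddings by their normalized L2-norm energies.

    global_feat, local_feat: (B, D) embeddings from the two paths.
    Softmax normalization of the two energies is an illustrative choice.
    """
    e_g = global_feat.norm(p=2, dim=1, keepdim=True)           # (B, 1) global energy
    e_l = local_feat.norm(p=2, dim=1, keepdim=True)            # (B, 1) local energy
    weights = F.softmax(torch.cat([e_g, e_l], dim=1), dim=1)   # (B, 2), sums to 1
    return weights[:, :1] * global_feat + weights[:, 1:] * local_feat


if __name__ == "__main__":
    g, l = torch.randn(4, 512), torch.randn(4, 512)
    print(adaptive_fuse(g, l).shape)   # torch.Size([4, 512])
```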
2. Mathematical Formulation
AIGC Image Quality Assessment (Meng et al., 23 Jul 2025)
Let $B$ denote the batch size, $C$ the feature dimension, $N_\ell$ the number of Transformer/CNN tokens per level $\ell$, and $M$ the number of learnable queries.
Per-level extraction:
- Transformer: $G^\ell \in \mathbb{R}^{B \times N_\ell \times C}$, taken from CLIP layers 3, 6, 9, and 12.
- CNN: $L^\ell = \mathrm{Adapter}_\ell(F^\ell_{\mathrm{CNN}}) \in \mathbb{R}^{B \times N_\ell \times C}$, from ResNet-50 stages 1–4.
GLF Block (level $\ell$): with learnable queries $Q_0^\ell \in \mathbb{R}^{M \times C}$,
$$Q_1^\ell = Q_0^\ell + \mathrm{CA}(Q_0^\ell, G^\ell), \qquad Q_2^\ell = Q_1^\ell + \mathrm{CA}(Q_1^\ell, L^\ell), \qquad Q^\ell = Q_2^\ell + \mathrm{MLP}(Q_2^\ell).$$
Joint aggregation: $\hat{y} = \mathrm{Head}\big(\mathrm{GAP}([Q^1; Q^2; Q^3; Q^4])\big)$, where $[\,\cdot\,;\cdot\,]$ denotes concatenation across levels and $\mathrm{Head}$ is the regression head.
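A short, self-contained sketch of this aggregation step under the definitions above: per-level fused queries are concatenated, average-pooled, and regressed to a scalar score. The single linear head is an assumption.

```python
import torch
import torch.nn as nn

# Illustrative joint aggregation: concatenate fused queries from all levels,
# global-average-pool over the query axis, then regress a scalar quality score.
dim, num_levels, num_queries, batch = 768, 4, 8, 2
head = nn.Linear(dim, 1)   # assumed regression head

per_level_queries = [torch.randn(batch, num_queries, dim) for _ in range(num_levels)]
queries = torch.cat(per_level_queries, dim=1)   # (B, num_levels * M, dim)
pooled = queries.mean(dim=1)                    # GAP over all fused queries -> (B, dim)
mos_pred = head(pooled)                         # (B, 1) predicted MOS
print(mos_pred.shape)                           # torch.Size([2, 1])
```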
Multi-view Video Segmentation (Zheng et al., 2023)
Input per view: $X_v \in \mathbb{R}^{T \times H \times W}$, for views $v \in \{\text{PLVLA}, \text{LVSA}, \text{A4C}\}$.
- Encoder: $F_v = E_v(X_v)$, computed independently per view (DeeplabV3 with ResNet-101).
- MGFM: view-wise non-local fusion on the concatenated features, $F_v^{G} = \mathrm{NL}\big(F_v, [F_1; F_2; F_3]\big)$.
- MLFM: a weighted local mask (pseudo-segmentation combined with a learned center-weighted spatiotemporal mask) restricts cross-view attention to consistent anatomical regions, producing $F_v^{L}$.
- Global-Local Integration: $\hat{F}_v = F_v^{G} + F_v^{L}$, decoded into the segmentation mask for view $v$.
- Loss function: $\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \alpha\,\mathcal{L}_{\mathrm{cycle}}$ (supervised segmentation loss plus dense cycle loss).
3. Training Objectives and Protocols
AIGC IQA (Meng et al., 23 Jul 2025)
- Loss: Mean squared error for MOS regression: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{B}\sum_{i=1}^{B}\big(\hat{y}_i - y_i\big)^2$, where $y_i$ is the ground-truth MOS.
- Optimization: AdamW, learning rate 1e-5, weight decay 1e-5, batch size 16, 30 epochs. Only the adapters, GLF blocks, and regression head are trained (both backbones frozen).
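A minimal sketch of this protocol: both backbones frozen, AdamW over the remaining parameters, and an MSE training step. The module attribute names (`clip_backbone`, `cnn_backbone`) are hypothetical.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.AdamW:
    """Freeze the CLIP and ResNet backbones and optimize only the adapters,
    GLF blocks, and regression head, matching the protocol above.
    Attribute names (clip_backbone, cnn_backbone) are hypothetical."""
    for module in (model.clip_backbone, model.cnn_backbone):
        for p in module.parameters():
            p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5, weight_decay=1e-5)

def train_step(model, optimizer, images, mos):
    """One training step with the MSE objective for MOS regression."""
    optimizer.zero_grad()
    pred = model(images).squeeze(-1)           # (B,) predicted quality scores
    loss = nn.functional.mse_loss(pred, mos)   # mean squared error over the batch
    loss.backward()
    optimizer.step()
    return loss.item()
```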
Echocardiogram Segmentation (Zheng et al., 2023)
- Supervised segmentation: Sparse annotation on 5 frames per video; binary cross-entropy.
- Dense cycle loss: Optimizes global features to enforce temporal cyclicity, leveraging cardiac rhythm across views.
- Total loss: Weighted sum, parameter α set to 1 in all experiments.
- Optimization: Adam, learning rate 3e-4; each batch contains 8 labeled frames and 1 unlabeled clip; cosine annealing over 100 epochs.
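A brief sketch of the combined objective and reported optimization settings; `cycle_term` is a placeholder for the dense cycle loss, whose computation is not reproduced here.

```python
import torch
import torch.nn.functional as F

ALPHA = 1.0  # weight of the dense cycle loss (set to 1 in all experiments)

def total_loss(pred_masks: torch.Tensor, gt_masks: torch.Tensor,
               cycle_term: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the sparse supervised segmentation loss (BCE on the
    5 annotated frames) and the dense cycle loss. `cycle_term` stands in
    for the paper's temporal-cyclicity objective on global features."""
    seg = F.binary_cross_entropy_with_logits(pred_masks, gt_masks)
    return seg + ALPHA * cycle_term

# Optimizer and schedule as reported: Adam at 3e-4 with cosine annealing over 100 epochs
# optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
```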
4. Empirical Results and Ablations
AIGC IQA (Meng et al., 23 Jul 2025)
| Dataset | SRCC (Qual.) | PLCC (Qual.) | SRCC (Corr.) | PLCC (Corr.) |
|---|---|---|---|---|
| AGIQA-1K | 0.8648 | 0.8874 | — | — |
| AGIQA-3K | 0.9039 | 0.9310 | 0.8410 | 0.8968 |
| AIGCIQA2023 | 0.8499 | 0.8664 | 0.7764 | 0.7649 |
Ablations show that both Transformer and CNN features are individually effective, but the highest accuracy is reached with full multi-level global-local fusion. Removing either global or local features causes a drop in both SRCC and PLCC metrics. Decreasing the number of levels also reduces accuracy.
Multi-view Echocardiogram (Zheng et al., 2023)
| Method | Dice (%) |
|---|---|
| DeeplabV3 (single-view) | 74.46 |
| U-Net (single-view) | 75.84 |
| CSS (semi-sup) | 78.83 |
| GL-Fusion (MGLF-Net) | 82.29 |
Per-view, GL-Fusion significantly improves over the best single-view models, most notably on PLVLA (+10.49%).
Ablation studies confirm that MGFM and MLFM each contribute substantial gains; both together with the dense cycle loss yield the best result (82.29%).
5. Comparative Analysis Across Domains
The global-local fusion paradigm is highly generalizable. In AIGC IQA, the dual-path, per-level fusion enables robust semantic-aware regression, outperforming single-level or single-path baselines. In medical video segmentation, coordinated MGFM and MLFM modules effectively leverage anatomical structure and temporal regularities. For face recognition, local/global feature quality assessment and adaptive weighting directly improve robustness to masked or deformed regions (Yu et al., 25 Nov 2024).
A key insight is that explicit architectural modeling of both local and global contexts—integrated at multiple depths—matches or outperforms bespoke attention-based or naive fusion networks across tasks. Both ablation and performance results indicate that neglecting either the local path (fine-grained structures) or global context (semantic coherence) degrades performance.
6. Limitations and Directions for Future Research
Current MGLF-Net designs presuppose access to high-capacity global and local feature extractors and sufficient computational resources, as all reviewed implementations use frozen, high-parameter backbones (e.g., ResNet, CLIP). The fusion blocks themselves add further trainable parameters.
A plausible implication is that future research may explore lightweight MGLF-Net variants for resource-constrained settings, or extend the paradigm to additional modalities and self-supervised tasks. In the medical domain, mechanisms for improved interpretability and quantification of uncertainty in multi-level fusions remain open areas. The general fusion templates of MGLF-Net could be adopted or adapted for domains where multi-scale and multi-source context synchronization is critical.
7. Summary and Significance
The Multi-Level Global-Local Fusion Network (MGLF-Net) constitutes a unified architectural principle for synergistically integrating global and local features at multiple abstraction levels, substantially enhancing representation learning in tasks requiring both global context understanding and local detail preservation. Empirical results in AIGC image quality assessment, cardiac image segmentation, and face recognition demonstrate its efficacy and flexibility, establishing a broadly applicable design backbone for future multi-source and multi-scale fusion networks (Meng et al., 23 Jul 2025, Zheng et al., 2023, Yu et al., 25 Nov 2024).