Local-Global Feature Fusion (LGFF)
- Local-Global Feature Fusion (LGFF) is a neural network paradigm that combines local fine-grained feature extraction with global context-aware representations.
- It employs parallel, serial, or hierarchical architectures using CNNs and Transformers to overcome feature incompleteness and resolution issues.
- LGFF improves model performance in tasks such as image segmentation, speech recognition, and EEG analysis by boosting robustness and generalization.
Local-Global Feature Fusion (LGFF) is a class of neural network design strategies that integrates localized, fine-grained feature extraction with holistic, context-aware representations. The paradigm addresses feature incompleteness and resolution issues typical of sequential CNN or Transformer pipelines by explicitly fusing representations at both local and global scales within dedicated modules or through hybrid architectures. In various domains—including computer vision, speech, medical imaging, remote sensing, and EEG analysis—LGFF improves discriminability, robustness, and generalization by ensuring that both network-wide context and segment-level details are captured and synergistically exploited.
1. Core Architectural Principles of LGFF
LGFF systematically combines local feature extraction—typically associated with convolutional operations (CNNs), fine-grained descriptors, or region-of-interest processing—with global feature encoders such as Transformers, global pooling, or graph-theoretic summarizations. Fusion occurs either serially (stacking local and global modules), in parallel (processing and then merging both streams), or hierarchically across multiple levels.
Typical LGFF module designs are exemplified by:
- Parallel or Branched Architectures: Separate local and global branches, e.g., as in BoNet+ for bone age assessment (parallel CNN+Transformer streams) (Lou et al., 20 Dec 2025), HiPerformer for segmentation (CNN and Transformer features fused per stage) (Tan et al., 24 Sep 2025), and many image quality or face recognition models.
- Within-block or Intra-block Fusion: E.g., ERes2Net implements local fusion within a residual block (across channel groups) and global fusion across blocks, both using a shared lightweight attention fusion core (Chen et al., 2023).
- Explicit Attention-based Fusion: Attentional Feature Fusion (AFF), feature-quality–driven attention weights (Yu et al., 2024), and cross-attention modules (Meng et al., 23 Jul 2025).
Fusion operations range from simple concatenation or summation to complex attention-weighted, query-guided, or non-linear gating mechanisms.
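The parallel/branched pattern above can be made concrete with a minimal sketch (PyTorch-style; all module names and sizes are illustrative rather than taken from any cited paper): a small convolutional branch preserves fine spatial detail, a patch-embedding plus Transformer-encoder branch captures global context, and the two feature maps are aligned and merged by concatenation with a 1×1 projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelLocalGlobalFusion(nn.Module):
    """Illustrative parallel LGFF block: CNN branch (local) + Transformer branch (global),
    merged by channel concatenation and a 1x1 projection. Sizes are arbitrary."""
    def __init__(self, in_ch: int = 3, dim: int = 64, patch: int = 8):
        super().__init__()
        # Local branch: stacked convolutions preserve fine spatial detail.
        self.local = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Global branch: patch embedding followed by a Transformer encoder layer.
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.patch = patch
        # Fusion: concatenate aligned feature maps and project back to `dim` channels.
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        local_feat = self.local(x)                              # (B, dim, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2)       # (B, N, dim)
        tokens = self.encoder(tokens)                           # global self-attention
        gh, gw = h // self.patch, w // self.patch
        global_feat = tokens.transpose(1, 2).reshape(b, -1, gh, gw)
        global_feat = F.interpolate(global_feat, size=(h, w), mode="bilinear",
                                    align_corners=False)        # upsample to local resolution
        return self.fuse(torch.cat([local_feat, global_feat], dim=1))

x = torch.randn(2, 3, 64, 64)
print(ParallelLocalGlobalFusion()(x).shape)  # torch.Size([2, 64, 64, 64])
```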
2. Mathematical Foundations and Module Mechanics
The technical realization of LGFF leverages channel-wise, spatial, or temporal operations for both streams:
| Fusion Aspect | Example Formulation/Module | Representative Papers |
|---|---|---|
| Local fusion (intra-module) | Attention-gated merging of adjacent channel-group outputs inside a residual block | (Chen et al., 2023) |
| Global fusion (cross-stage/scale) | Downsampling of earlier-stage features followed by attentional gating against later-stage features | (Chen et al., 2023) |
| Attention fusion | $\mathrm{AFF}(x,y) = M(x+y)\odot x + (1-M(x+y))\odot y$, with $M(\cdot)$ a learned sigmoid-gated attention map | (Chen et al., 2023) |
| Cross-attention (hierarchical fusion) | $\mathrm{Attn}(Q,K,V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$, with queries from one stream and keys/values from the other | (Meng et al., 23 Jul 2025) |
| Feature-quality–driven fusion | Convex combination with weights derived from per-feature energy/quality scores | (Yu et al., 2024) |
Local modules often use grouped/dilated convolutions, temporal aggregation, dense patch extraction, or graph-based descriptors. Global modules rely on transformer self-attention, multi-scale pooling, or long-range convolutional attention. Fusion is achieved via channel, pixel/voxel, or token space integration, gated by adaptive attention or data-driven coefficients.
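As a minimal sketch of the attention-gated fusion operators listed above (in the spirit of AFF, with an illustrative point-wise bottleneck gate rather than the exact module of any cited paper), a sigmoid gate computed from the sum of the two streams decides, per channel and position, how much of each stream to keep:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of attention-weighted fusion of two same-shaped feature maps.
    A gate M(x + y) in [0, 1] mixes the streams: z = M*x + (1 - M)*y."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Point-wise bottleneck producing the per-channel/per-position gate.
        self.gate = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        m = self.gate(x + y)          # attention weights, same shape as the inputs
        return m * x + (1.0 - m) * y  # convex combination of local and global streams

local_feat = torch.randn(2, 64, 32, 32)
global_feat = torch.randn(2, 64, 32, 32)
print(AttentionFusion(64)(local_feat, global_feat).shape)  # torch.Size([2, 64, 32, 32])
```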
3. LGFF Across Domains: Model Variants and Task Adaptations
Computer Vision (Image and Video)
- Fine-grained Classification: Patchwise LSTM for local features, with multi-scale LBP or graph-based features as a global summary; fusion via simple concatenation or FC layers improves performance, especially for texture and small-object discrimination (Bera et al., 2023, Huang, 2021).
- Image Quality Assessment and Synthesis Detection: Hierarchical query-based fusion of ResNet (local) and Transformer (global) backbones with joint-level cross-attention blocks (Meng et al., 23 Jul 2025), and attention-based multi-scale and local patch embeddings (Ju et al., 2022); a schematic cross-attention fusion sketch follows this list.
- Semantic and Medical Image Segmentation: Joint use of single-head (global) self-attention and multi-scale (local) dilated/depthwise convolutions, often paired with dynamic upsampling or adaptive residual MLP blocks for spatial precision and boundary accuracy (Zhao et al., 16 Sep 2025, Tan et al., 24 Sep 2025).
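The query-guided cross-attention pattern referenced above can be sketched as follows; the pairing of local queries with global keys/values, the dimensions, and the residual form are assumptions for illustration, not the exact design of any cited model.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch: local tokens query global tokens; the attended context is added back
    (residual) and normalized. Dimensions and the residual form are illustrative."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_tokens: torch.Tensor, global_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the local stream, keys/values from the global stream.
        ctx, _ = self.attn(query=local_tokens, key=global_tokens, value=global_tokens)
        return self.norm(local_tokens + ctx)

local_tokens = torch.randn(2, 196, 64)   # e.g. 14x14 patch features
global_tokens = torch.randn(2, 49, 64)   # e.g. coarser global summary tokens
print(CrossAttentionFusion()(local_tokens, global_tokens).shape)  # torch.Size([2, 196, 64])
```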
Speaker Verification and Speech Recognition
- Speaker Verification: ERes2Net with LGFF provides attentional fusion both inside each residual block (LFF) and across stage outputs (GFF), leading to substantial EER reduction (Chen et al., 2023).
- ASR: InterFormer employs parallel CNN and transformer streams per block, bidirectionally gating information (L2G, G2L), and a selective fusion module dynamically learns to allocate representational capacity (Lai et al., 2023).
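A rough sketch of such bidirectional local-global exchange over a shared token sequence is given below; the sigmoid gate forms and dimensions are assumptions and do not reproduce the InterFormer modules.

```python
import torch
import torch.nn as nn

class BidirectionalGatedBlock(nn.Module):
    """Sketch: a conv stream (local) and a self-attention stream (global) over the same
    1-D token sequence exchange information through learned sigmoid gates (L2G and G2L)."""
    def __init__(self, dim: int = 64, heads: int = 4, kernel: int = 3):
        super().__init__()
        self.local_conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate_l2g = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_g2l = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # (B, T, dim)
        local = self.local_conv(tokens.transpose(1, 2)).transpose(1, 2)
        glob, _ = self.global_attn(tokens, tokens, tokens)
        # Local-to-global: gate how much local detail is injected into the global stream.
        g = self.gate_l2g(torch.cat([glob, local], dim=-1))
        glob = glob + g * local
        # Global-to-local: gate how much global context is injected into the local stream.
        h = self.gate_g2l(torch.cat([local, glob], dim=-1))
        local = local + h * glob
        return local + glob  # simple merge of the two enriched streams

frames = torch.randn(2, 100, 64)  # e.g. 100 acoustic frames with 64-dim features
print(BidirectionalGatedBlock()(frames).shape)  # torch.Size([2, 100, 64])
```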
Sensor Fusion and Remote Perception
- LiDAR–Vision Fusion for Odometry: Local-to-global feature fusion with bi-directional structure alignment, involving pixel/point clustering and adaptive gating, aligns dense visual cues with sparse LiDAR geometry (Liu et al., 2024).
- Point Cloud Segmentation: Multi-branch local encoding and global context pooling, followed by deep feature–guided attention between streams for contour-discriminative representations (Chen et al., 12 Oct 2025).
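A toy sketch of the generic per-point local encoding plus pooled global context pattern (not the exact DVLO or DAGLFNet design; the gating form is an assumption) looks like this:

```python
import torch
import torch.nn as nn

class PointLocalGlobalFusion(nn.Module):
    """Sketch for point features (B, N, C): a per-point MLP (local), a max-pooled global
    context vector broadcast to every point, and a gated merge of the two."""
    def __init__(self, in_dim: int = 3, dim: int = 64):
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(inplace=True),
                                       nn.Linear(dim, dim))
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        local = self.local_mlp(pts)                        # (B, N, dim) per-point features
        glob = local.max(dim=1, keepdim=True).values       # (B, 1, dim) global context
        glob = glob.expand(-1, local.size(1), -1)          # broadcast back to every point
        pair = torch.cat([local, glob], dim=-1)
        g = self.gate(pair)                                # feature-guided gate per point
        return self.out(torch.cat([g * local, (1 - g) * glob], dim=-1))

cloud = torch.randn(2, 1024, 3)  # 1024 points with xyz coordinates
print(PointLocalGlobalFusion()(cloud).shape)  # torch.Size([2, 1024, 64])
```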
Biomedical Applications
- EEG Emotion Recognition: Local branch processes channel-wise entropy and connectivity features; global branch summarizes trial-level time/spectral/fractal signatures; fused via transformer attention and subject-adversarial regularization for cross-subject robustness (Zhou et al., 13 Jan 2026).
- Bone Age Assessment: Dual streams (Transformer for global, RFAConv for local) fused by channel concatenation, yielding a ∼10–13% MAE reduction over single-stream baselines (Lou et al., 20 Dec 2025).
4. Empirical Benefits and Quantitative Impact
Ablation studies across multiple architectures consistently demonstrate that integrating local and global streams outperforms either stream alone. Representative empirical outcomes include:
| Task/Domain | Baseline | +Local | +Global | LGFF (Full) | Metric | Reference |
|---|---|---|---|---|---|---|
| Speaker Verification | EER 1.51% (Res2Net) | 1.04% | 1.33% | 0.92% | EER (VoxCeleb1-O) | (Chen et al., 2023) |
| 2-Stream Bone Age | – | 4.22 mo | 4.38 mo | 3.81 mo | MAE (RSNA) | (Lou et al., 20 Dec 2025) |
| HiPerformer Segmentation | 82.83% | – | – | 83.93% | DSC (Synapse) | (Tan et al., 24 Sep 2025) |
| Gait Recognition | 88.8% (prior SOTA) | – | – | 91.8% | Rank-1 (CASIA-B) | (Lin et al., 2020) |
| EEG Emotion Recognition | 36.4% | 34.6% | – | 40.1% | Acc. (LOSO) | (Zhou et al., 13 Jan 2026) |
These gains reflect consistent improvements in fine-grained classification, segmentation accuracy (boundary/small objects), cross-domain/person generalization, robustness to noise/deformation, and discriminative representational power.
5. LGFF Module Design Taxonomy
Distinct LGFF implementations reflect varying task needs:
- Attention-driven Fusion: Utilized in ERes2Net (AFF) (Chen et al., 2023), face recognition (feature-quality attention) (Yu et al., 2024), image detection (multi-head attention fusion) (Ju et al., 2022), and AIGC IQA (query-based cross-attention) (Meng et al., 23 Jul 2025).
- Hierarchical/Multi-level Fusion: Multi-stage or multi-resolution integration, as used in RoadFormer with stacked local-mix-global blocks (Wang et al., 3 Jun 2025), MGLF-Net (Meng et al., 23 Jul 2025), or in gait/voxel-wise recognition (Lin et al., 2020).
- Patch/Region-based Fusion: Patch selection based on keypoints, ROI, or activation, followed by per-patch CNNs and fusion (Suh et al., 2020, Ju et al., 2022).
- Hybrid CNN/Transformer Frameworks: Serial or interleaved hybridization (ConvBlocks with TransBlocks as in RoadFormer (Wang et al., 3 Jun 2025); SHDCBlock combining attention and depthwise convolution (Zhao et al., 16 Sep 2025)).
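The serial hybrid pattern can be sketched as a block that applies a depthwise convolution for local mixing followed by self-attention for global mixing, stacked with residual connections; the block below is illustrative and does not reproduce the exact layers of RoadFormer or DyGLNet.

```python
import torch
import torch.nn as nn

class SerialHybridBlock(nn.Module):
    """Sketch of a serial hybrid block: depthwise conv (local mixing) followed by
    multi-head self-attention (global mixing), each with a residual connection."""
    def __init__(self, dim: int = 64, heads: int = 4, kernel: int = 3):
        super().__init__()
        self.dwconv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # (B, T, dim)
        # Local mixing: depthwise convolution along the token axis.
        local = self.dwconv(tokens.transpose(1, 2)).transpose(1, 2)
        tokens = self.norm1(tokens + local)
        # Global mixing: self-attention over all tokens.
        glob, _ = self.attn(tokens, tokens, tokens)
        return self.norm2(tokens + glob)

x = torch.randn(2, 256, 64)
stack = nn.Sequential(*[SerialHybridBlock() for _ in range(4)])  # interleaved local/global
print(stack(x).shape)  # torch.Size([2, 256, 64])
```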
6. Limitations and Open Research Questions
While LGFF has achieved considerable success, key challenges remain:
- Computational Overhead: Dual-stream or multi-branch architectures increase parameter count and computational load, potentially challenging real-time or edge deployments (Zhao et al., 16 Sep 2025, Ju et al., 2022).
- Feature Alignment: Heterogeneous modalities, such as vision-LiDAR or EEG channels, require non-trivial alignment and adaptation layers, with the risk of distributional mismatch.
- Learnable vs. Heuristic Fusion: Many fusion operators are heuristic (concatenation, summation), while attention-based modules can be difficult to train and prone to overfitting in low-data regimes.
- Interpretability: Understanding which features are emphasized at local vs. global levels, and the mechanisms by which fusion impacts discrimination, is underexplored.
Some models address alignment via bi-directional structure matching (clustering/projecting between image and point cloud) or by compensating for feature quality (Liu et al., 2024; Yu et al., 2024). A plausible implication is that future LGFF designs will increasingly incorporate differentiable, data-driven alignment or weighting mechanisms to enhance generalizability.
7. Representative Implementations and Task-specific Adaptations
Selective instantiations highlight LGFF flexibility:
- Speaker Verification (ERes2Net): AFF replaces naive addition/concat for both within-block and across-block fusion, achieving a 39% EER reduction (Chen et al., 2023).
- LiDAR–Camera Fusion (DVLO): Alternating local (pixel clustering) and global (pseudo-image fusion) structure alignment bi-directionally bridges sparse and dense modalities (Liu et al., 2024).
- Domain Adaptive ReID (LF²): Teacher-student backbone with a learnable Fusion Module that re-weights global features using local attention for robust unsupervised transfer (Ding et al., 2022); a generic re-weighting sketch follows this list.
- Shipping Label Inspection: YOLO-based region detection + FAST keypoints for salient patches, followed by late-stage fully connected fusion yields >3% accuracy gain over global-only (Suh et al., 2020).
- EEG Emotion Recognition: Dual-branch transformer attention over local (channel-wise, spectral-graph) and global (trial-level moment, spectral, multifractal) features yields ~40% LOSO accuracy (+4–6% over uni-modal baselines) (Zhou et al., 13 Jan 2026).
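As a generic illustration of re-weighting a global embedding with weights derived from local features (in the spirit of the learnable fusion module above; the exact form here is an assumption, not the LF² module):

```python
import torch
import torch.nn as nn

class LocalGuidedReweighting(nn.Module):
    """Sketch: channel weights derived from pooled local part features re-scale a
    global embedding, then both are concatenated into the final descriptor."""
    def __init__(self, dim: int = 256, parts: int = 4):
        super().__init__()
        # Map flattened local part features to per-channel weights for the global vector.
        self.weight_head = nn.Sequential(nn.Linear(parts * dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, dim); local_feats: (B, parts, dim), e.g. horizontal body stripes.
        w = self.weight_head(local_feats.flatten(1))   # (B, dim) attention over channels
        reweighted = w * global_feat                   # local-guided re-weighted global feature
        return torch.cat([reweighted, local_feats.mean(dim=1)], dim=-1)

g = torch.randn(8, 256)
l = torch.randn(8, 4, 256)
print(LocalGuidedReweighting()(g, l).shape)  # torch.Size([8, 512])
```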
Task-specific adaptations often address data geometry, modality heterogeneity, or semantic granularity, underscoring the generality of LGFF as a paradigm for producing discriminative representations.
References:
- ERes2Net and LGFF: (Chen et al., 2023)
- RoadFormer: (Wang et al., 3 Jun 2025)
- BoNet+: (Lou et al., 20 Dec 2025)
- DyGLNet: (Zhao et al., 16 Sep 2025)
- HiPerformer: (Tan et al., 24 Sep 2025)
- GLFF (AI-synth detection): (Ju et al., 2022)
- MGLF-Net (AIGC IQA): (Meng et al., 23 Jul 2025)
- Face Recognition LGAF: (Yu et al., 2024)
- Gait Recognition: (Lin et al., 2020)
- Deep LiDAR–Vision Fusion: (Liu et al., 2024)
- DAGLFNet (Point Cloud): (Chen et al., 12 Oct 2025)
- FGIC with Texture Fusion: (Bera et al., 2023), Texture CN fusion: (Huang, 2021)
- Shipping Label Inspection: (Suh et al., 2020)
- EEG Emotion Recognition: (Zhou et al., 13 Jan 2026)
- Unsupervised DA ReID: (Ding et al., 2022)
- InterFormer ASR: (Lai et al., 2023)