
Local-Global Feature Fusion (LGFF)

Updated 15 April 2026
  • Local-Global Feature Fusion (LGFF) is a neural network paradigm that combines local fine-grained feature extraction with global context-aware representations.
  • It employs parallel, serial, or hierarchical architectures using CNNs and Transformers to overcome feature incompleteness and resolution issues.
  • LGFF improves model performance in tasks such as image segmentation, speech recognition, and EEG analysis by boosting robustness and generalization.

Local-Global Feature Fusion (LGFF) is a class of neural network design strategies that integrates localized, fine-grained feature extraction with holistic, context-aware representations. The paradigm addresses feature incompleteness and resolution issues typical of sequential CNN or Transformer pipelines by explicitly fusing representations at both local and global scales within dedicated modules or through hybrid architectures. In various domains—including computer vision, speech, medical imaging, remote sensing, and EEG analysis—LGFF improves discriminability, robustness, and generalization by ensuring that both network-wide context and segment-level details are captured and synergistically exploited.

1. Core Architectural Principles of LGFF

LGFF systematically combines local feature extraction—typically associated with convolutional operations (CNNs), fine-grained descriptors, or region-of-interest processing—with global feature encoders such as Transformers, global pooling, or graph-theoretic summarizations. Fusion occurs either serially (stacking local and global modules), in parallel (processing and then merging both streams), or hierarchically across multiple levels.

Typical LGFF module designs pair a local extractor with a global encoder and merge their outputs in a dedicated fusion block. Fusion operations range from simple concatenation or summation to complex attention-weighted, query-guided, or non-linear gating mechanisms.
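For concreteness, the three basic fusion operators can be sketched in a few lines of NumPy. This is an illustrative sketch, not any cited paper's implementation; the shapes and the gating weight `w` are assumptions:

```python
import numpy as np

def fuse_concat(local, glob):
    # Channel-wise concatenation: output dim is the sum of the input dims.
    return np.concatenate([local, glob], axis=-1)

def fuse_sum(local, glob):
    # Element-wise summation: both streams must share the same shape.
    return local + glob

def fuse_gated(local, glob, w):
    # Non-linear gating: a sigmoid gate computed from both streams decides,
    # per channel, how much of each stream to keep (w: shape (2C, C)).
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([local, glob], axis=-1) @ w)))
    return gate * local + (1.0 - gate) * glob
```

With a zero gating weight the gate is uniformly 0.5 and the gated variant degenerates to plain averaging, which is why learned gating strictly generalizes summation-style fusion.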

2. Mathematical Foundations and Module Mechanics

The technical realization of LGFF leverages channel-wise, spatial, or temporal operations for both streams:

| Fusion Aspect | Example Formulation/Module | Representative Papers |
|---|---|---|
| Local fusion (intra-module) | $y_i = K_i\big((U_i+1)\odot x_i + (1-U_i)\odot y_{i-1}\big)$ | (Chen et al., 2023) |
| Global fusion (cross-stage/scale) | $F_j = \mathrm{AFF}(D(S_{j-1}), S_j)$ (downsample + attentional gating) | (Chen et al., 2023) |
| Attention fusion | $\mathrm{AFF}(A,B) = (U+1)\odot A + (1-U)\odot B$, with $U = \tanh(\mathrm{BN}(W_2\,\mathrm{SiLU}(\mathrm{BN}(W_1[A,B]))))$ | (Chen et al., 2023) |
| Cross-attention (hierarchical fusion) | $\mathbf{Q}_i^{(1)} = \mathrm{CrossAttn}(\mathbf{Q}_i, \mathbf{G}_i, \mathbf{G}_i) + \mathbf{Q}_i$, etc. | (Meng et al., 23 Jul 2025) |
| Feature-quality-driven fusion | $\kappa_i = \gamma^g_i \Psi_i + \gamma^l_i \Upsilon_i$, with weights derived from feature energy | (Yu et al., 2024) |
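The attentional feature fusion (AFF) operator can be sketched directly from its formula. The following NumPy rendering is illustrative, not the reference ERes2Net implementation; the affine-free batch normalization and the weight shapes are simplifying assumptions:

```python
import numpy as np

def _bn(x, eps=1e-5):
    # Batch normalization without learned affine parameters (illustrative).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def _silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def aff(a, b, w1, w2):
    # U = tanh(BN(W2 . SiLU(BN(W1 . [A, B]))))
    u = np.tanh(_bn(_silu(_bn(np.concatenate([a, b], axis=-1) @ w1)) @ w2))
    # AFF(A, B) = (U + 1) * A + (1 - U) * B   (element-wise)
    return (u + 1.0) * a + (1.0 - u) * b
```

A useful property follows directly from the formula: when $U = 0$ the operator reduces to plain addition $A + B$, so the gate effectively learns a correction around summation-based fusion.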

Local modules often use grouped/dilated convolutions, temporal aggregation, dense patch extraction, or graph-based descriptors. Global modules rely on transformer self-attention, multi-scale pooling, or long-range convolutional attention. Fusion is achieved via channel, pixel/voxel, or token space integration, gated by adaptive attention or data-driven coefficients.

3. LGFF Across Domains: Model Variants and Task Adaptations

Computer Vision (Image and Video)

  • Fine-grained Classification: Patchwise-LSTM for local features, with multi-scale LBP or graph-based features as the global summary; fusion via simple concatenation or fully connected layers is especially beneficial for texture and small-object discrimination (Bera et al., 2023, Huang, 2021).
  • Image Quality Assessment and Synthesis Detection: Hierarchical query-based fusion of ResNet (local) and Transformer (global) backbones with joint-level cross-attention blocks (Meng et al., 23 Jul 2025), attention-based multi-scale and local patch embeddings (Ju et al., 2022).
  • Semantic and Medical Image Segmentation: Joint use of single-head (global) self-attention and multi-scale (local) dilated/depthwise convolutions, often paired with dynamic upsampling or adaptive residual MLP blocks for spatial precision and boundary accuracy (Zhao et al., 16 Sep 2025, Tan et al., 24 Sep 2025).
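A parallel block of the kind used in these segmentation models can be sketched as a single-head self-attention stream plus a depthwise dilated convolution stream, fused by summation. This is a minimal NumPy illustration; the kernel size, dilation, and fusion-by-sum choices are assumptions rather than any cited model's exact design:

```python
import numpy as np

def self_attention(x):
    # Single-head global self-attention over tokens (queries=keys=values=x).
    logits = (x @ x.T) / np.sqrt(x.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)          # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return attn @ x

def dilated_conv1d(x, kernel, dilation=2):
    # Depthwise dilated 1-D convolution along the token axis ("same" padding).
    n, _ = x.shape
    k = kernel.shape[0]
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        out += kernel[i] * xp[i * dilation: i * dilation + n, :]
    return out

def local_global_block(x, kernel):
    # Parallel streams fused by summation (one common LGFF layout).
    return self_attention(x) + dilated_conv1d(x, kernel)
```

The global stream mixes every token with every other token, while the dilated local stream only mixes tokens within a fixed receptive field; summation lets both signals reach the next layer.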

Speaker Verification and Speech Recognition

  • Speaker Verification: ERes2Net with LGFF provides attentional fusion both inside each residual block (LFF) and across stage outputs (GFF), leading to substantial EER reduction (Chen et al., 2023).
  • ASR: InterFormer employs parallel CNN and transformer streams per block, bidirectionally gating information (L2G, G2L), and a selective fusion module dynamically learns to allocate representational capacity (Lai et al., 2023).
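The selective-fusion idea (learning per-position weights that allocate representational capacity between the two streams) can be sketched as a two-way softmax over learned stream scores. The scoring rule and `w_score` here are illustrative assumptions, not the actual InterFormer module:

```python
import numpy as np

def selective_fusion(local_feat, global_feat, w_score):
    # Score both streams from their concatenation (w_score: shape (2C, 2)),
    # then softmax the two scores so the weights sum to 1 per position.
    scores = np.concatenate([local_feat, global_feat], axis=-1) @ w_score
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Convex combination of the CNN (local) and Transformer (global) streams.
    return weights[..., 0:1] * local_feat + weights[..., 1:2] * global_feat
```

Because the weights are a softmax, the module can only reallocate capacity between streams, never amplify both; this is the main design difference from additive gating such as AFF.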

Sensor Fusion and Remote Perception

  • LiDAR–Vision Fusion for Odometry: Local-to-global feature fusion with bi-directional structure alignment, involving pixel/point clustering and adaptive gating, aligns dense visual cues with sparse LiDAR geometry (Liu et al., 2024).
  • Point Cloud Segmentation: Multi-branch local encoding and global context pooling, followed by deep feature–guided attention between streams for contour-discriminative representations (Chen et al., 12 Oct 2025).

Biomedical Applications

  • EEG Emotion Recognition: Local branch processes channel-wise entropy and connectivity features; global branch summarizes trial-level time/spectral/fractal signatures; fused via transformer attention and subject-adversarial regularization for cross-subject robustness (Zhou et al., 13 Jan 2026).
  • Bone Age Assessment: Dual streams (Transformer for global, RFAConv for local) fused by channel concatenation, yielding a ∼10–13% MAE reduction over single-stream baselines (Lou et al., 20 Dec 2025).

4. Empirical Benefits and Quantitative Impact

Ablation studies across multiple architectures consistently demonstrate that integrating local and global streams outperforms either stream alone. Representative empirical outcomes include:

| Task/Domain | Baseline | +Local | +Global | LGFF (Full) | Metric | Reference |
|---|---|---|---|---|---|---|
| Speaker Verification | 1.51% (Res2Net) | 1.04% | 1.33% | 0.92% | EER (VoxCeleb1-O) | (Chen et al., 2023) |
| 2-Stream Bone Age | 4.22 mo | 4.38 mo | – | 3.81 mo | MAE (RSNA) | (Lou et al., 20 Dec 2025) |
| HiPerformer Segmentation | 82.83% | – | – | 83.93% | DSC (Synapse) | (Tan et al., 24 Sep 2025) |
| Gait Recognition | 88.8% (prior SOTA) | – | – | 91.8% | Rank-1 (CASIA-B) | (Lin et al., 2020) |
| EEG Emotion Recognition | – | 36.4% | 34.6% | 40.1% | Acc. (LOSO) | (Zhou et al., 13 Jan 2026) |

These gains reflect consistent improvements in fine-grained classification, segmentation accuracy (boundary/small objects), cross-domain/person generalization, robustness to noise/deformation, and discriminative representational power.

5. LGFF Module Design Taxonomy

Distinct LGFF implementations reflect varying task needs, differing chiefly in fusion topology (serial, parallel, or hierarchical), fusion operator (concatenation, summation, or adaptive attention gating), and the space in which streams are merged (channel, pixel/voxel, or token).

6. Limitations and Open Research Questions

While LGFF has achieved considerable success, key challenges remain:

  • Computational Overhead: Dual-stream or multi-branch architectures increase parameter count and computational load, potentially challenging real-time or edge deployments (Zhao et al., 16 Sep 2025, Ju et al., 2022).
  • Feature Alignment: Heterogeneous modalities, such as vision-LiDAR or EEG channels, require non-trivial alignment and adaptation layers, with the risk of distributional mismatch.
  • Learnable vs. Heuristic Fusion: Many fusion operators are heuristic (concatenation, summation), and attention-based modules may suffer from trainability or overfitting in low-data regimes.
  • Interpretability: Understanding which features are emphasized at local vs. global levels, and the mechanisms by which fusion impacts discrimination, is underexplored.

Some models address alignment via bi-directional structure matching (clustering/projecting between image and point cloud) or by compensating for feature quality (Liu et al., 2024; Yu et al., 2024). A plausible implication is that future LGFF designs will increasingly incorporate differentiable, data-driven alignment or weighting mechanisms to enhance generalizability.
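Feature-quality compensation of the kind attributed to (Yu et al., 2024) can be illustrated with an energy-based convex weighting of the two streams. Defining "energy" as the mean squared activation is an assumption made purely for this sketch:

```python
import numpy as np

def energy_weighted_fusion(psi_global, upsilon_local, eps=1e-8):
    # "Energy" of each stream: mean squared activation per sample.
    e_g = (psi_global ** 2).mean(axis=-1, keepdims=True)
    e_l = (upsilon_local ** 2).mean(axis=-1, keepdims=True)
    # Normalize the energies into convex weights gamma_g + gamma_l ~= 1.
    gamma_g = e_g / (e_g + e_l + eps)
    gamma_l = e_l / (e_g + e_l + eps)
    # kappa = gamma_g * Psi + gamma_l * Upsilon
    return gamma_g * psi_global + gamma_l * upsilon_local
```

The weighting is data-driven but not learned: a low-energy (weakly activated) stream is automatically down-weighted per sample, which is one simple way to compensate for variable feature quality without extra parameters.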

7. Representative Implementations and Task-specific Adaptations

Selective instantiations highlight LGFF flexibility:

  • Speaker Verification (ERes2Net): AFF replaces naive addition/concat for both within-block and across-block fusion, achieving a 39% EER reduction (Chen et al., 2023).
  • LiDAR–Camera Fusion (DVLO): Alternating local (pixel clustering) and global (pseudo-image fusion) structure alignment bi-directionally bridges sparse and dense modalities (Liu et al., 2024).
  • Domain Adaptive ReID (LF²): Teacher-student backbone, with a learnable Fusion Module re-weighting global features using local attention for robust unsupervised transfer (Ding et al., 2022).
  • Shipping Label Inspection: YOLO-based region detection + FAST keypoints for salient patches, followed by late-stage fully connected fusion yields >3% accuracy gain over global-only (Suh et al., 2020).
  • EEG Emotion Recognition: Dual-branch transformer attention over local (channel-wise, spectral-graph) and global (trial-level moment, spectral, multifractal) features yields ~40% LOSO accuracy (+4–6% over uni-modal baselines) (Zhou et al., 13 Jan 2026).

Task-specific adaptations often address data geometry, modality heterogeneity, or semantic granularity, underscoring LGFF's breadth as a paradigm for generating discriminative representations.


References (17)
