Local-Global Fusion Adapters (LGFA)

Updated 24 April 2026

Local-Global Fusion Adapters (LGFA) are architectural modules that explicitly merge local and global features using dual-path processing and cross-attention, leading to enhanced representational capacity.
They employ mechanisms such as bidirectional gating and selective fusion to integrate convolutional and Transformer-based operations, thereby improving data efficiency and performance in tasks like ASR, segmentation, and 3D detection.
Empirical evaluations show that LGFA models outperform traditional fusion baselines with optimized parameter usage and reduced client variance, demonstrating their effectiveness across diverse applications.

Local-Global Fusion Adapters (LGFA) constitute a class of architectural modules that enable explicit, trainable integration of local and global features within deep neural network backbones. These adapters operate across diverse modalities and paradigms—automatic speech recognition, medical image segmentation, multimodal learning, speech emotion recognition, and 3D object detection—facilitating improved representational capacity, data efficiency, and downstream task performance. Core to LGFA is the design of mechanisms for joint interaction and fusion between local (e.g., convolutional, windowed, spatial, or client-personalized) and global (e.g., self-attention, semantic, cross-client, or scene-level) representations, typically realized through cross-attention, gating, or dual-adapter structures.

1. Core Architectural Principles

A unifying paradigm in LGFA is the parallel (or dual-path) processing of input data through modules specialized for local and global feature mining, followed by an explicit fusion stage. Representative instantiations include:

InterFormer (Lai et al., 2023): Employs a two-branch encoder—one convolutional (local), one Transformer-based (global)—with Bidirectional Feature Interaction Modules (BFIM) exchanging information and a Selective Fusion Module (SFM) merging outputs at each block.
U-DFA (Sajjad et al., 1 Oct 2025): Integrates frozen DINOv2-based ViT (global) and CNN-based Spatial Pattern Adapter (SPA, local) streams, inserted with LGFA modules at multiple encoder stages for two-way cross-attention.
FedDLP (Nguyen et al., 10 Mar 2025): Utilizes per-layer parallel LoRA adapters—local (personalized) and global (shared)—with knowledge-distillation-driven inter-adapter fusion in a federated setting.
LGFA for SER (Lu et al., 2023): Realizes local-to-global feature aggregation by nesting a Frame Transformer (local) within a Segment Transformer (global), with adapters aligning feature dimensions for downstream fusion.
LoGoNet (Li et al., 2023): Enforces local-grid and global-scene interaction in LiDAR-camera fusion via deformable cross-attention and feature dynamic aggregation.

The following table summarizes major LGFA variants:

Application	Local Encoder	Global Encoder	Fusion Mechanism
Speech recognition	Conv	Transformer (MHSA)	BFIM + SFM
Medical image segmentation	CNN SPA	ViT (frozen)	Dual MHCA cross-attention
Federated learning	LoRA (private)	LoRA (global)	Distillation + aggregation
Speech emotion recognition	Frame Transformer	Segment Transformer	Additive feature adapters
3D object detection	Local grid/PIE	Global voxel/image	Cross-attention + FDA

2. Mathematical and Algorithmic Formulation

LGFA modules are mathematically instantiated via customized attention, gating, and aggregation operators:

Bidirectional Feature Interaction (BFIM) (Lai et al., 2023):
- Local-to-Global gating:
$\widetilde G = \mathrm{PWConv}(G) \odot \sigma(L)$
- Global-to-Local gating:
$\widetilde L = \mathrm{PWConv}(L) \odot \sigma(G)$
- Channelwise dynamic ReLU modulation enhances adaptivity in the convolutional branch.
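The two gating equations can be illustrated with a minimal NumPy sketch. It assumes features stored as (time, channels) arrays, so that a pointwise convolution reduces to a per-frame linear map; the names `bfim_gating`, `pointwise_conv`, and all shapes are hypothetical, and the dynamic ReLU step is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_conv(x, weight, bias):
    # On a (time, channels) array, a pointwise (1x1) convolution is a
    # per-frame linear map across channels.
    return x @ weight + bias

def bfim_gating(local_feat, global_feat, w_l, b_l, w_g, b_g):
    """Return (gated_local, gated_global) for (time, channels) inputs."""
    # Local-to-global: the local branch gates the projected global branch.
    gated_global = pointwise_conv(global_feat, w_g, b_g) * sigmoid(local_feat)
    # Global-to-local: the global branch gates the projected local branch.
    gated_local = pointwise_conv(local_feat, w_l, b_l) * sigmoid(global_feat)
    return gated_local, gated_global

rng = np.random.default_rng(0)
T, C = 6, 4  # hypothetical sequence length and channel count
L = rng.standard_normal((T, C))
G = rng.standard_normal((T, C))
w_l, w_g = rng.standard_normal((C, C)), rng.standard_normal((C, C))
b_l, b_g = np.zeros(C), np.zeros(C)
L_tilde, G_tilde = bfim_gating(L, G, w_l, b_l, w_g, b_g)
```

Because the sigmoid lies in (0, 1), each gate can only attenuate the projected features of the other branch, never amplify them.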
Selective Fusion Module (SFM) (Lai et al., 2023):
- Concatenates and globally pools the two streams, applies channel excitation via MLP, then fuses features by channelwise softmax-weighted sum and squeeze-and-excitation post-processing.
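A rough sketch of the selective fusion step, under stated assumptions: a two-layer MLP without biases, average pooling over time, and a softmax taken across the two branches for each channel. The function name and weight shapes are hypothetical, and the squeeze-and-excitation post-processing is omitted.

```python
import numpy as np

def selective_fusion(local_feat, global_feat, w1, w2):
    """Fuse two (time, channels) streams with per-channel softmax weights."""
    T, C = local_feat.shape
    # Concatenate along channels, then average-pool over time.
    pooled = np.concatenate([local_feat, global_feat], axis=1).mean(axis=0)  # (2C,)
    # Channel excitation: a small two-layer MLP (biases omitted for brevity).
    hidden = np.maximum(pooled @ w1, 0.0)   # (H,)
    logits = (hidden @ w2).reshape(2, C)    # one row of logits per branch
    # Softmax across the two branches, independently for every channel.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    fused = weights[0] * local_feat + weights[1] * global_feat
    return fused, weights

rng = np.random.default_rng(1)
T, C, H = 5, 4, 3  # hypothetical sizes
local_feat = rng.standard_normal((T, C))
global_feat = rng.standard_normal((T, C))
w1 = rng.standard_normal((2 * C, H))
w2 = rng.standard_normal((H, 2 * C))
fused, weights = selective_fusion(local_feat, global_feat, w1, w2)
```

The weights for each channel sum to one, so every fused channel is a convex combination of the local and global streams.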
Dual Cross-Attention LGFA (Sajjad et al., 1 Oct 2025):
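- At each stage $i$, the global (DINOv2) stream first attends to the local (SPA) stream, and the updated global stream then conditions the local one:
$f^{i+1}_\mathrm{dino} = f^i_\mathrm{dino} + \mathrm{MHCA}( \mathrm{LN}( f^i_\mathrm{dino} ), \mathrm{LN}( f^i_\mathrm{spa} ), \mathrm{LN}( f^i_\mathrm{spa} ) )$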

$f^{i+1}_\mathrm{spa} = f^i_\mathrm{spa} + \mathrm{MHCA}( \mathrm{LN}( f^i_\mathrm{spa} ), \mathrm{LN}( f^{i+1}_\mathrm{dino} ), \mathrm{LN}( f^{i+1}_\mathrm{dino} ) )$
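The two residual updates can be sketched in NumPy as follows. This is a simplification under stated assumptions: single-head attention stands in for MHCA, layer normalization has no learned scale or shift, and both streams share one channel width; `lgfa_stage` and the parameter tuples are hypothetical names.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value, wq, wk, wv):
    # Single-head scaled dot-product attention (stand-in for MHCA).
    q, k, v = query @ wq, key @ wk, value @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def lgfa_stage(f_dino, f_spa, p_dino, p_spa):
    # Global stream queries the local stream.
    n_spa = layer_norm(f_spa)
    f_dino_next = f_dino + cross_attention(layer_norm(f_dino), n_spa, n_spa, *p_dino)
    # Local stream queries the already-updated global stream.
    n_dino = layer_norm(f_dino_next)
    f_spa_next = f_spa + cross_attention(layer_norm(f_spa), n_dino, n_dino, *p_spa)
    return f_dino_next, f_spa_next

rng = np.random.default_rng(2)
n_dino_tokens, n_spa_tokens, d = 8, 12, 16  # hypothetical token counts and width
f_dino = rng.standard_normal((n_dino_tokens, d))
f_spa = rng.standard_normal((n_spa_tokens, d))
p_dino = tuple(0.1 * rng.standard_normal((d, d)) for _ in range(3))
p_spa = tuple(0.1 * rng.standard_normal((d, d)) for _ in range(3))
f_dino_next, f_spa_next = lgfa_stage(f_dino, f_spa, p_dino, p_spa)
```

Each stream keeps its own token count, since the attention output always has one row per query token; only the content of the tokens is exchanged.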
Federated Dual Adapter (FedDLP) (Nguyen et al., 10 Mar 2025):
- Parameterized as LoRA adapters:
$h_l = z + A_l (B_l z), \quad h_g = z + A_g (B_g z)$
- Dual loss:
$\mathcal{L}_\text{local} = \mathcal{L}_{CE}(h_l) + \alpha D_{KL}(h_l \| h_g), \quad \mathcal{L}_\text{global} = \mathcal{L}_{CE}(h_g) + D_{KL}(h_g \| h_l)$
- The global adapter is aggregated across clients; the local adapter is private and pruned.
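A minimal sketch of the dual adapters and the two losses, with several assumptions flagged: a shared linear `head` (hypothetical) maps adapter outputs to class logits before the cross-entropy and KL terms, server aggregation is plain unweighted averaging of the global adapter matrices, and pruning of the local adapter is not shown.

```python
import numpy as np

def lora_adapter(z, A, B):
    # h = z + A (B z) for row vectors z of shape (batch, d);
    # A has shape (d, r) and B has shape (r, d).
    return z + (z @ B.T) @ A.T

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()

def kl_divergence(logits_p, logits_q):
    # D_KL(p || q) between softmax distributions, averaged over the batch.
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean()

def dual_losses(z, labels, local, glob, head, alpha):
    logits_l = lora_adapter(z, *local) @ head  # `head` is an assumed classifier
    logits_g = lora_adapter(z, *glob) @ head
    loss_local = cross_entropy(logits_l, labels) + alpha * kl_divergence(logits_l, logits_g)
    loss_global = cross_entropy(logits_g, labels) + kl_divergence(logits_g, logits_l)
    return loss_local, loss_global

def aggregate_global(client_adapters):
    # Assumed server step: unweighted mean of each client's global (A, B) pair.
    A = np.mean([a for a, _ in client_adapters], axis=0)
    B = np.mean([b for _, b in client_adapters], axis=0)
    return A, B

rng = np.random.default_rng(3)
batch, d, r, n_classes = 5, 8, 2, 3  # hypothetical sizes
z = rng.standard_normal((batch, d))
labels = rng.integers(0, n_classes, size=batch)
head = rng.standard_normal((d, n_classes))
local = (0.1 * rng.standard_normal((d, r)), 0.1 * rng.standard_normal((r, d)))
glob = (0.1 * rng.standard_normal((d, r)), 0.1 * rng.standard_normal((r, d)))
loss_local, loss_global = dual_losses(z, labels, local, glob, head, alpha=0.5)
```

Only the `glob` pair would leave the device in this sketch; the `local` pair stays private, which is the separation the dual-adapter design relies on.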
Local-to-Global Segment/Frame Aggregation (Lu et al., 2023):
- Frame encoding:
$x'_i = FC(x_i) + e^f_i$
- Segment aggregation:
$s''_j = FC(\mathrm{Vec}(s_j)) + FC(\mathrm{Vec}(\hat{x}^{s_j}))$
- No learned weighting scheme; fusion is via additive adapters.
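The additive fusion can be sketched as below. The shapes are hypothetical, and the encoded frames are used directly as a stand-in for the Frame Transformer output $\hat{x}^{s_j}$, which is an assumption made only to keep the example short.

```python
import numpy as np

def fc(x, weight, bias):
    return x @ weight + bias

def encode_frames(frames, w, b, pos_emb):
    # x'_i = FC(x_i) + e^f_i
    return fc(frames, w, b) + pos_emb

def aggregate_segment(segment, frame_out, w_seg, b_seg, w_frm, b_frm):
    # s''_j = FC(Vec(s_j)) + FC(Vec(x_hat^{s_j})): flatten each input, project
    # both to the same width, then add. No learned mixing weight is involved.
    return fc(segment.reshape(-1), w_seg, b_seg) + fc(frame_out.reshape(-1), w_frm, b_frm)

rng = np.random.default_rng(4)
n_frames, feat, d_frame, d_seg = 4, 5, 6, 8  # hypothetical sizes
segment = rng.standard_normal((n_frames, feat))  # raw frames of one segment
w_f, b_f = rng.standard_normal((feat, d_frame)), np.zeros(d_frame)
pos_emb = rng.standard_normal((n_frames, d_frame))
frame_out = encode_frames(segment, w_f, b_f, pos_emb)  # stand-in for the frame-level output
w_seg, b_seg = rng.standard_normal((n_frames * feat, d_seg)), np.zeros(d_seg)
w_frm, b_frm = rng.standard_normal((n_frames * d_frame, d_seg)), np.zeros(d_seg)
s_fused = aggregate_segment(segment, frame_out, w_seg, b_seg, w_frm, b_frm)
```

The two fully connected layers act as the adapters: their only job is to bring the flattened segment and the flattened frame-level features to one common dimension so they can be summed.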
LoGoNet (Li et al., 2023):
- Deformable cross-attention fuses grid-projected image and point features, with position encoding and transformer-based dynamic aggregation.

3. Empirical Evaluation and Ablation

In the reported experiments, LGFA modules outperform both serial and naïve feature-fusion baselines across modalities.

ASR (InterFormer) (Lai et al., 2023): Achieves 4.4%/4.9% CER on Aishell-1, surpassing Conformer (4.6%/5.1%) and Transformer (6.0%/6.7%). Ablation shows both BFIM and SFM contribute ~0.1–0.2 pt CER/WER reduction over vanilla baselines.
Medical Segmentation (U-DFA) (Sajjad et al., 1 Oct 2025): Inserting three cross-attention LGFAs yields best Dice/HD trade-off (DSC 82.25, HD 15.27 on Synapse dataset), with only 33% trainable parameters. Fewer adapters under-utilize spatial context; more cause overfitting (increased HD).
Federated (FedDLP) (Nguyen et al., 10 Mar 2025): Yields higher mean accuracy and lowest client-variance across vision datasets, with communication cost reduction (2–4× vs. FLoRA/FedDAT). Dual local-global design outperforms merged or single-adapter under heterogeneity.
SER (Lu et al., 2023): Outperforms ViT, TNT, and CNN+LSTM on IEMOCAP (WAR/UAR: 73.29/62.63) and CASIA (49.75/49.75). Ablation confirms nesting is superior to single-level or non-overlapping chunk-based models.
3D Detection (LoGoNet) (Li et al., 2023): On Waymo test set, reaches 81.02 mAPH (L2) and sets a benchmark by surpassing 80 APH (L2) on all three classes, outperforming BEVFusion and other competitors. Combining GoF, LoF, and FDA modules yields 2.69–4.63 pt gains over CenterPoint RCNN backbone.

4. Implementation and Computational Considerations

LGFA designs are parameter-efficient and modular but introduce additional computational overhead primarily from dual-path branches and cross-attention routines.

Parameterization: LGFA modules can be restricted to a handful of insertions (e.g., three in U-DFA), accounting for a minority of total parameters (e.g., 14M of 46M, roughly 30%).
Computation: Cross-attention cost is dominated by spatial token set size; careful stage placement and input resolution choice are essential for scalability (Sajjad et al., 1 Oct 2025).
Optimization: Most LGFA-based models freeze the main backbone (e.g., DINOv2), train only adapters and task-specific modules, and use standard optimizers such as AdamW or Adam.
Data Augmentation: Commonplace, with application-dependent schemes (e.g., SpecAugment in ASR, random flips/intensity jitter in imaging, speaker-wise partitioning in SER).
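The scaling of cross-attention cost with token count can be made concrete with a back-of-the-envelope estimate. This sketch counts only the score matrix and the attention-weighted sum, ignoring the linear projections. The patch size of 14 matches DINOv2; the channel width of 768 and the assumption that both streams carry the same number of tokens are illustrative choices.

```python
def num_tokens(image_size, patch_size):
    # Number of non-overlapping square patches in a square image.
    return (image_size // patch_size) ** 2

def cross_attention_macs(n_query, n_key, dim):
    # Multiply-accumulates for Q K^T plus the weighted sum over V.
    return 2 * n_query * n_key * dim

patch, dim = 14, 768  # dim is an assumed channel width
tokens_224 = num_tokens(224, patch)  # 16 * 16 = 256
tokens_448 = num_tokens(448, patch)  # 32 * 32 = 1024
macs_224 = cross_attention_macs(tokens_224, tokens_224, dim)
macs_448 = cross_attention_macs(tokens_448, tokens_448, dim)
print(tokens_224, tokens_448, macs_448 // macs_224)  # 256 1024 16
```

Doubling the input side length quadruples the token count and multiplies this attention term by 16, which is why stage placement and input resolution matter more than the number of adapter parameters.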

5. Mechanistic Insights and Utility

LGFA mechanisms enhance model expressivity through explicit local-global integration at multiple abstraction levels:

Exchange Mechanisms: Bidirectional information flow (BFIM, dual cross-attention) allows each pathway to guide the other's representation, improving edge detection (segmentation), context disambiguation (SER), or client generalization (FL).
Fusion Schemes: Soft selection modules (SFM), cross-attention, and dynamic weighting avoid rigid assignments, tailoring feature contributions per instance and channel.
Task Specialization: Early fusion (e.g., spatial into ViT) improves fine detail preservation; late fusion (global into local) resolves semantic ambiguities or complements incomplete context (Sajjad et al., 1 Oct 2025).
Adapter Placement: Empirical results consistently indicate that a small number of strategically placed LGFAs outperforms full fusion at every layer; over-provisioning leads to minor overfitting (increased HD or marginal gains) (Sajjad et al., 1 Oct 2025).

6. Domain-Specific Adaptations

The LGFA framework is adaptable and has been tailored to the idiosyncrasies of multiple domains:

Speech/Audio: Frame/segment transformer nesting for capturing multi-timescale correlations (Lu et al., 2023); dynamic gating and channelwise reweighting in ASR (Lai et al., 2023).
Medical Imaging: Cross-attention fusion to inject CNN-derived spatial priors into semantic transformer maps, with decoupling of trainable and frozen stages (Sajjad et al., 1 Oct 2025).
Federated Multimodal: Dual adapter design separating private adaptation (local) from public aggregation (global), with on-device pruning for efficiency (Nguyen et al., 10 Mar 2025).
3D Perception: Cross-modal fusion at both proposal-global and grid-local levels, position-aware encoding, dynamic aggregation across fused features (Li et al., 2023).

7. Limitations and Future Prospects

Although LGFA modules show empirical gains over the baselines above, their added complexity may reduce model interpretability and resource efficiency at large scale. Cross-attention and gating can become computational bottlenecks for high-resolution inputs or large token sets. Further, optimal adapter placement and fusion schedules are model- and data-dependent, requiring empirical tuning. In federated contexts, the dual-adapter design mitigates but does not eliminate challenges from extreme data heterogeneity.

A plausible future direction is the broader adoption of LGFA-style adapters in foundation backbone models, where fine-grained local-global fusion is critical but parameter or communication budgets are constrained. The framework's conceptual modularity and demonstrated cross-domain efficacy position it as a central motif in next-generation hybrid architectures.

References (5)

InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition (2023)

U-DFA: A Unified DINOv2-Unet with Dual Fusion Attention for Multi-Dataset Medical Segmentation (2025)

Federated Multimodal Learning with Dual Adapters and Selective Pruning for Communication and Computational Efficiency (2025)

Learning Local to Global Feature Aggregation for Speech Emotion Recognition (2023)

LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion (2023)
