AC-BiFPN: Augmented Convolutional Bi-directional Feature Pyramid Network
- The paper demonstrates that AC-BiFPN improves detection accuracy by merging multi-scale features using learnable fusion weights and enhanced convolutional modules.
- It introduces convolutional feature enhancement modules that combine dilated, deformable, and standard convolutions to enrich detail while preserving contextual information.
- BiFormer Attention dynamically focuses on salient regions during fusion, yielding significant precision improvements across small, medium, and large object detection.
The Augmented Convolutional Bi-directional Feature Pyramid Network (AC-BiFPN) is a specialized neural feature fusion architecture designed for multi-scale object detection, with a particular focus on extracting detailed and contextual information from complex images. AC-BiFPN builds upon the fundamental Bi-directional Feature Pyramid Network (BiFPN) by integrating advanced convolutional operators and attention mechanisms to address scale variation, noise, and occlusion challenges in domains such as medical image analysis and maritime surveillance.
1. Architectural Foundations
AC-BiFPN operates as an encoder backbone in hybrid and end-to-end pipelines, most notably in tasks requiring precise localization and semantic-rich representations. Its core function is to process input images at multiple resolutions and fuse the resulting feature maps in a bi-directional manner, supporting both top-down and bottom-up aggregation pathways. The architecture accepts images resized to various scales, extracts features through hierarchical and parallel convolutional modules, and fuses these maps, ultimately producing a deep multi-scale representation optimized for downstream tasks. Feature fusion employs learnable weights for each pathway, formalized as:
$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j}\, I_i$$

where $w_i$ are fusion weights, $I_i$ are incoming feature maps (possibly resized or attention-refined), and $\epsilon$ stabilizes the denominator. This scheme generalizes across levels of the feature pyramid and supports attention integration at critical fusion nodes.
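This fusion rule is simple to realize in code. The following PyTorch sketch illustrates fast normalized fusion under the formula above; the module name, tensor shapes, and the ReLU constraint on the weights are illustrative assumptions rather than details drawn from the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuses same-shaped feature maps with learnable non-negative weights."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        # One learnable scalar per incoming pathway (hypothetical initialization).
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs: list) -> torch.Tensor:
        w = F.relu(self.weights)          # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)      # epsilon stabilizes the denominator
        return sum(wi * x for wi, x in zip(w, inputs))

# Usage: fuse a resized top-down map with a lateral backbone map.
fuse = WeightedFusion(num_inputs=2)
p_td, p_lat = torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32)
out = fuse([p_td, p_lat])                # shape (1, 256, 32, 32)
```

Normalizing by the weight sum rather than a softmax keeps the fusion computationally cheap while still letting the network learn a per-pathway importance.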
2. Convolutional Feature Enhancement
A key innovation of AC-BiFPN is the integration of multi-branch convolutional feature enhancement modules (CFE), positioned directly after backbone extractions. The CFE module addresses limitations in shallow (lack of context) and deep (loss of fine detail) layers by constructing a parallel multi-pathway convolutional system. It leverages standard, dilated, and deformable convolutions with various kernel shapes to adaptively expand receptive fields and sample diverse semantic cues. Three illustrative branches are:
$$F_1 = \mathrm{Conv}_{3\times 3}(X), \qquad F_2 = \mathrm{DilConv}_{3\times 3}(X), \qquad F_3 = \mathrm{DefConv}_{3\times 3}(X)$$

Outputs are concatenated and fused residually with the input:

$$Y = X + \mathrm{Conv}_{1\times 1}\!\big([\,F_1;\, F_2;\, F_3\,]\big)$$
This multi-scale enrichment ensures preservation of critical details required for small object detection and compensates for context loss at lower levels.
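A minimal sketch of such a CFE block is shown below, assuming PyTorch with torchvision for the deformable branch; the branch widths, kernel sizes, and dilation rate are illustrative choices rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CFE(nn.Module):
    """Parallel standard/dilated/deformable branches with residual fusion."""

    def __init__(self, channels: int):
        super().__init__()
        self.std = nn.Conv2d(channels, channels, 3, padding=1)              # F1: standard conv
        self.dil = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)  # F2: dilated conv
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)          # offsets for F3
        self.deform = DeformConv2d(channels, channels, 3, padding=1)        # F3: deformable conv
        self.fuse = nn.Conv2d(3 * channels, channels, 1)                    # 1x1 fusion of concat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1, f2 = self.std(x), self.dil(x)
        f3 = self.deform(x, self.offset(x))
        y = self.fuse(torch.cat([f1, f2, f3], dim=1))
        return x + y  # residual fusion, as in Y = X + Conv1x1([F1; F2; F3])

# Usage on a 128-channel backbone feature map.
y = CFE(128)(torch.randn(1, 128, 40, 40))  # shape (1, 128, 40, 40)
```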
3. Attention-Based Feature Fusion
AC-BiFPN incorporates advanced attention strategies during feature pyramid fusion, specifically via BiFormer Attention (BA). BA implements a bi-level dynamic sparse attention mechanism, subdivided into three stages:
- Region Partitioning and Projection: The feature map is divided into $S \times S$ non-overlapping regions, each linearly projected to queries, keys, and values.
- Region-to-Region Attention: Mean-pooled region queries and keys construct an affinity graph, from which the top-$k$ connections are selected for each region.
- Token-to-Token Attention: Aggregated keys and values from relevant regions provide inputs for scaled dot-product attention, producing region-refined output embeddings.
The attention operation is formalized as:
$$O = \mathrm{Attention}\!\left(Q,\, K^g,\, V^g\right) + \mathrm{LCE}(V)$$

where $K^g$ and $V^g$ are the aggregated (gathered) keys and values, and $\mathrm{LCE}(\cdot)$ denotes local-context embedding. This mechanism enables adaptive focusing on salient image regions and enhances cross-scale context modeling, vital for precise detection under noise and occlusion.
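The three stages above can be condensed into a compact, single-head sketch. The following PyTorch code follows the spirit of bi-level routing attention; the region count, top-$k$ value, and the depthwise-convolution form of $\mathrm{LCE}$ are simplifying assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelRoutingAttention(nn.Module):
    """Single-head sketch: region routing followed by token-level attention."""

    def __init__(self, dim: int, num_regions: int = 4, topk: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.lce = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise LCE (assumption)
        self.s, self.k = num_regions, topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, W, C = x.shape
        s, k = self.s, self.k
        hr, wr = H // s, W // s  # region height / width
        # Stage 1: partition into s*s regions and project to queries, keys, values.
        q, kk, v = self.qkv(x).chunk(3, dim=-1)

        def to_regions(t):  # (B, H, W, C) -> (B, s*s, hr*wr, C)
            t = t.view(B, s, hr, s, wr, C).permute(0, 1, 3, 2, 4, 5)
            return t.reshape(B, s * s, hr * wr, C)

        q, kk, v = map(to_regions, (q, kk, v))
        # Stage 2: affinity graph on mean-pooled region queries/keys, keep top-k.
        affinity = q.mean(2) @ kk.mean(2).transpose(-1, -2)  # (B, s*s, s*s)
        idx = affinity.topk(k, dim=-1).indices               # (B, s*s, k)
        # Gather keys/values of each region's top-k partner regions (K^g, V^g).
        idx = idx[..., None, None].expand(-1, -1, -1, hr * wr, C)
        kg = torch.gather(kk[:, None].expand(-1, s * s, -1, -1, -1), 2, idx)
        vg = torch.gather(v[:, None].expand(-1, s * s, -1, -1, -1), 2, idx)
        kg, vg = kg.reshape(B, s * s, -1, C), vg.reshape(B, s * s, -1, C)
        # Stage 3: token-to-token scaled dot-product attention within routed regions.
        attn = F.softmax(q @ kg.transpose(-1, -2) / C ** 0.5, dim=-1)
        out = attn @ vg                                      # (B, s*s, hr*wr, C)

        def from_regions(t):  # inverse of to_regions
            t = t.reshape(B, s, s, hr, wr, C).permute(0, 1, 3, 2, 4, 5)
            return t.reshape(B, H, W, C)

        # Add the residual local-context embedding LCE(V).
        lce = self.lce(from_regions(v).permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return from_regions(out) + lce

# Usage on a 16x16 feature map with 64 channels (channels-last layout).
y = BiLevelRoutingAttention(dim=64)(torch.randn(2, 16, 16, 64))  # (2, 16, 16, 64)
```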
4. Performance Evaluation
Applied to diverse tasks, AC-BiFPN demonstrates marked improvements over conventional CNN-based models. On SSDD, the framework achieves an average precision (AP) of ~0.702, outperforming 25 state-of-the-art detectors. Notable gains are observed for small objects (AP$_S$ = 0.698), medium objects (AP$_M$ = 0.733), and large targets (AP$_L$ = 0.701). Ablation studies reveal that:
- CFE alone improves AP via enriched representation.
- BA alone delivers superior refinement of the fused multi-scale features.
- The combined AC-BiFPN yields gains of 19% in overall AP, 24% in AP$_S$, 11.7% in AP$_M$, and 31% in AP$_L$ compared to Faster R-CNN + FPN.
In medical imaging, integration with a Transformer decoder further enhances report generation performance, with BLEU-1 38.2, METEOR 17.0, ROUGE 31.0, CIDEr 45.8 on the RSNA Intracranial Hemorrhage Detection dataset. These metrics signify improved diagnostic accuracy and text coherence relative to non-pyramid CNN backbones (Bouslimi et al., 9 Oct 2025).
5. Applications
AC-BiFPN’s multi-scale fusion capacity and attention integration target domains with challenging visual landscapes. Key applications include:
| Domain | Task | Motivation |
|---|---|---|
| Maritime | Ship detection in SAR images | Robust detection under clutter and scale variation |
| Medical | Radiology report generation | Accurate small anomaly detection, report coherence |
| Environmental | Coastal monitoring | Adaptation to varying resolution and noise |
It supports real-time monitoring, search and rescue, illegal activity detection, clinical decision support, and educational platforms for trainees via automated feedback (Meng et al., 18 Jun 2025, Bouslimi et al., 9 Oct 2025).
6. Significance and Implications
The AC-BiFPN framework consolidates convolutional feature enhancement and dynamic attention fusion, yielding superior cross-scale modeling and adaptability compared to static pyramid designs and non-attentive methods. This suggests improvements in robustness, sensitivity, and scalability for both small and large target detection. Its performance in clinical and surveillance settings substantiates its potential to streamline workflows and reduce critical diagnostic delays.
A plausible implication is that further development of AC-BiFPN-inspired architectures may generalize to additional domains requiring granular feature extraction under adverse imaging conditions, as well as enhance multi-modal feature fusion mechanisms beyond current paradigms.