AC-BiFPN: Augmented Convolutional Bi-directional Feature Pyramid Network
- The paper demonstrates that AC-BiFPN improves detection accuracy by merging multi-scale features using learnable fusion weights and enhanced convolutional modules.
- It introduces convolutional feature enhancement modules that combine dilated, deformable, and standard convolutions to enrich detail while preserving contextual information.
- BiFormer Attention dynamically focuses on salient regions during fusion, yielding significant precision improvements across small, medium, and large object detection.
The Augmented Convolutional Bi-directional Feature Pyramid Network (AC-BiFPN) is a specialized neural feature fusion architecture designed for multi-scale object detection, with a particular focus on extracting detailed and contextual information from complex images. AC-BiFPN builds upon the fundamental Bi-directional Feature Pyramid Network (BiFPN) by integrating advanced convolutional operators and attention mechanisms to address scale variation, noise, and occlusion challenges in domains such as medical image analysis and maritime surveillance.
1. Architectural Foundations
AC-BiFPN operates as an encoder backbone in hybrid and end-to-end pipelines, most notably in tasks requiring precise localization and semantic-rich representations. Its core function is to process input images at multiple resolutions and fuse the resulting feature maps in a bi-directional manner, supporting both top-down and bottom-up aggregation pathways. The architecture accepts images resized to various scales, extracts features through hierarchical and parallel convolutional modules, and fuses these maps, ultimately producing a deep multi-scale representation optimized for downstream tasks. Feature fusion employs learnable weights for each pathway, formalized as:
$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j}\, I_i$$

where $w_i$ are fusion weights, $I_i$ are incoming feature maps (possibly resized or attention-refined), and $\epsilon$ stabilizes the denominator. This scheme generalizes across levels of the feature pyramid and supports attention integration at critical fusion nodes.
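This fusion rule is simple to realize in code. The following PyTorch sketch illustrates fast normalized fusion under the formula above; the module name, tensor shapes, and the ReLU constraint on the weights are illustrative assumptions rather than details drawn from the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuses same-shaped feature maps with learnable non-negative weights."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        # One learnable scalar per incoming pathway (hypothetical initialization).
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs: list) -> torch.Tensor:
        w = F.relu(self.weights)          # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)      # epsilon stabilizes the denominator
        return sum(wi * x for wi, x in zip(w, inputs))

# Usage: fuse a resized top-down map with a lateral backbone map.
fuse = WeightedFusion(num_inputs=2)
p_td, p_lat = torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32)
out = fuse([p_td, p_lat])                # shape (1, 256, 32, 32)
```

Normalizing by the weight sum rather than a softmax keeps the fusion computationally cheap while still letting the network learn a per-pathway importance.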
2. Convolutional Feature Enhancement
A key innovation of AC-BiFPN is the integration of multi-branch convolutional feature enhancement modules (CFE), positioned directly after backbone extractions. The CFE module addresses limitations in shallow (lack of context) and deep (loss of fine detail) layers by constructing a parallel multi-pathway convolutional system. It leverages standard, dilated, and deformable convolutions with various kernel shapes to adaptively expand receptive fields and sample diverse semantic cues. Three illustrative branches are:
$$F_1 = \mathrm{Conv}_{3\times 3}(X), \qquad F_2 = \mathrm{DilConv}_{3\times 3}(X), \qquad F_3 = \mathrm{DefConv}_{3\times 3}(X)$$

Outputs are concatenated and fused residually with the input:

$$Y = X + \mathrm{Conv}_{1\times 1}\!\big([\,F_1;\, F_2;\, F_3\,]\big)$$
This multi-scale enrichment ensures preservation of critical details required for small object detection and compensates for context loss at lower levels.
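A minimal sketch of such a CFE block is shown below, assuming PyTorch with torchvision for the deformable branch; the branch widths, kernel sizes, and dilation rate are illustrative choices rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CFE(nn.Module):
    """Parallel standard/dilated/deformable branches with residual fusion."""

    def __init__(self, channels: int):
        super().__init__()
        self.std = nn.Conv2d(channels, channels, 3, padding=1)              # F1: standard conv
        self.dil = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)  # F2: dilated conv
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)          # offsets for F3
        self.deform = DeformConv2d(channels, channels, 3, padding=1)        # F3: deformable conv
        self.fuse = nn.Conv2d(3 * channels, channels, 1)                    # 1x1 fusion of concat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1, f2 = self.std(x), self.dil(x)
        f3 = self.deform(x, self.offset(x))
        y = self.fuse(torch.cat([f1, f2, f3], dim=1))
        return x + y  # residual fusion, as in Y = X + Conv1x1([F1; F2; F3])

# Usage on a 128-channel backbone feature map.
y = CFE(128)(torch.randn(1, 128, 40, 40))  # shape (1, 128, 40, 40)
```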
3. Attention-Based Feature Fusion
AC-BiFPN incorporates advanced attention strategies during feature pyramid fusion, specifically via BiFormer Attention (BA). BA implements a bi-level dynamic sparse attention mechanism, subdivided into three stages:
- Region Partitioning and Projection: The feature map is divided into $S \times S$ non-overlapping regions, each linearly projected to queries, keys, and values.
- Region-to-Region Attention: Mean-pooled region queries and keys construct an affinity graph, from which the top-$k$ connections are selected for each region.
- Token-to-Token Attention: Aggregated keys and values from relevant regions provide inputs for scaled dot-product attention, producing region-refined output embeddings.
The attention operation is formalized as:
$$O = \mathrm{Attention}\!\left(Q,\, K^g,\, V^g\right) + \mathrm{LCE}(V)$$

where $K^g$ and $V^g$ are the aggregated (gathered) keys and values, and $\mathrm{LCE}(\cdot)$ denotes local-context embedding. This mechanism enables adaptive focusing on salient image regions and enhances cross-scale context modeling, vital for precise detection under noise and occlusion.
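The three stages above can be condensed into a compact, single-head sketch. The following PyTorch code follows the spirit of bi-level routing attention; the region count, top-$k$ value, and the depthwise-convolution form of $\mathrm{LCE}$ are simplifying assumptions, not the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelRoutingAttention(nn.Module):
    """Single-head sketch: region routing followed by token-level attention."""

    def __init__(self, dim: int, num_regions: int = 4, topk: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.lce = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise LCE (assumption)
        self.s, self.k = num_regions, topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, W, C = x.shape
        s, k = self.s, self.k
        hr, wr = H // s, W // s  # region height / width
        # Stage 1: partition into s*s regions and project to queries, keys, values.
        q, kk, v = self.qkv(x).chunk(3, dim=-1)

        def to_regions(t):  # (B, H, W, C) -> (B, s*s, hr*wr, C)
            t = t.view(B, s, hr, s, wr, C).permute(0, 1, 3, 2, 4, 5)
            return t.reshape(B, s * s, hr * wr, C)

        q, kk, v = map(to_regions, (q, kk, v))
        # Stage 2: affinity graph on mean-pooled region queries/keys, keep top-k.
        affinity = q.mean(2) @ kk.mean(2).transpose(-1, -2)  # (B, s*s, s*s)
        idx = affinity.topk(k, dim=-1).indices               # (B, s*s, k)
        # Gather keys/values of each region's top-k partner regions (K^g, V^g).
        idx = idx[..., None, None].expand(-1, -1, -1, hr * wr, C)
        kg = torch.gather(kk[:, None].expand(-1, s * s, -1, -1, -1), 2, idx)
        vg = torch.gather(v[:, None].expand(-1, s * s, -1, -1, -1), 2, idx)
        kg, vg = kg.reshape(B, s * s, -1, C), vg.reshape(B, s * s, -1, C)
        # Stage 3: token-to-token scaled dot-product attention within routed regions.
        attn = F.softmax(q @ kg.transpose(-1, -2) / C ** 0.5, dim=-1)
        out = attn @ vg                                      # (B, s*s, hr*wr, C)

        def from_regions(t):  # inverse of to_regions
            t = t.reshape(B, s, s, hr, wr, C).permute(0, 1, 3, 2, 4, 5)
            return t.reshape(B, H, W, C)

        # Add the residual local-context embedding LCE(V).
        lce = self.lce(from_regions(v).permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return from_regions(out) + lce

# Usage on a 16x16 feature map with 64 channels (channels-last layout).
y = BiLevelRoutingAttention(dim=64)(torch.randn(2, 16, 16, 64))  # (2, 16, 16, 64)
```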
4. Performance Evaluation
Applied to diverse tasks, AC-BiFPN demonstrates marked improvements over conventional CNN-based models. On SSDD, the framework achieves an average precision (AP) of ~0.702, outperforming 25 state-of-the-art detectors. Notable gains are observed for small objects (AP$_S$ = 0.698), medium objects (AP$_M$ = 0.733), and large targets (AP$_L$ = 0.701). Ablation studies reveal that:
- CFE alone improves AP via enriched representation.
- BA alone delivers superior refinement of the fused multi-scale features.
- The combined AC-BiFPN yields gains of 19% in overall AP, 24% in AP$_S$, 11.7% in AP$_M$, and 31% in AP$_L$ compared to Faster R-CNN + FPN.
In medical imaging, integration with a Transformer decoder further enhances report generation performance, with BLEU-1 38.2, METEOR 17.0, ROUGE 31.0, CIDEr 45.8 on the RSNA Intracranial Hemorrhage Detection dataset. These metrics signify improved diagnostic accuracy and text coherence relative to non-pyramid CNN backbones (Bouslimi et al., 9 Oct 2025).
5. Applications
AC-BiFPN’s multi-scale fusion capacity and attention integration target domains with challenging visual landscapes. Key applications include:
| Domain | Task | Motivation |
|---|---|---|
| Maritime | Ship detection in SAR images | Robust detection under clutter and scale variation |
| Medical | Radiology report generation | Accurate small anomaly detection, report coherence |
| Environmental | Coastal monitoring | Adaptation to varying resolution and noise |
It supports real-time monitoring, search and rescue, illegal activity detection, clinical decision support, and educational platforms for trainees via automated feedback (Meng et al., 18 Jun 2025, Bouslimi et al., 9 Oct 2025).
6. Significance and Implications
The AC-BiFPN framework consolidates convolutional feature enhancement and dynamic attention fusion, yielding superior cross-scale modeling and adaptability compared to static pyramid designs and non-attentive methods. This suggests improvements in robustness, sensitivity, and scalability for both small and large target detection. Its performance in clinical and surveillance settings substantiates its potential to streamline workflows and reduce critical diagnostic delays.
A plausible implication is that further development of AC-BiFPN-inspired architectures may generalize to additional domains requiring granular feature extraction under adverse imaging conditions, as well as enhance multi-modal feature fusion mechanisms beyond current paradigms.