BiFPN: Adaptive Multi-Scale Feature Fusion
- BiFPN is a neural module that fuses multi-scale features bidirectionally, enhancing representation for detection and segmentation tasks.
- It employs iterative top-down and bottom-up passes with learnable weighted fusion, reducing parameters and computational costs.
- BiFPN has been adapted in various domains, from object detection to audio event localization and multi-sensor fusion applications.
A Bidirectional Feature Pyramid Network (BiFPN) is a neural architecture module designed for efficient and adaptive multi-scale feature fusion in deep networks, particularly in detection and segmentation pipelines. BiFPN extends standard feature pyramid approaches by introducing iterative, learnable, bidirectional cross-scale connections, enabling precise and computationally efficient aggregation of information from different spatial resolutions. The concept is formally introduced in "EfficientDet: Scalable and Efficient Object Detection" (Tan et al., 2019) and has been widely adopted and extended in recent object, audio, and multi-modal detection research.
1. Topology and Fusion Principles
BiFPN operates on a set of feature maps at multiple spatial scales, typically denoted $\{P_l\}$ (for example $P_3$ through $P_7$ in EfficientDet), where each $P_l$ is a feature map of a particular resolution derived from a backbone network (e.g., EfficientNet, ResNet, GhostNet). The canonical BiFPN layer is structured as two sequential passes:
- Top-Down Pass: Propagates semantically strong features from low-resolution (deeper, coarser) levels to higher-resolution (shallower) levels. At each level $l$, the upsampled output from level $l+1$ is fused with the original $P_l$ using weighted combinations.
- Bottom-Up Pass: Aggregates spatial detail from fine scales up to coarser levels, fusing the top-down output, the lateral backbone features, and lower-level bottom-up outcomes. Downsampling is performed as necessary.
Each fusion node receives two or more inputs; nodes with a single input edge are omitted for efficiency. An extra skip connection links each level's input directly to its output node at the same level, and the whole layer can be stacked repeatedly (Tan et al., 2019, Tang et al., 2024).
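The two-pass dataflow described above can be sketched on toy 1-D "feature maps" (plain lists of floats). This is only an illustration of the wiring, under several assumptions that are not from the source: fusion is an unweighted mean rather than a learned weighting, no convolution follows each fusion, and the helper names (`upsample2x`, `downsample2x`, `mean_fuse`, `bifpn_layer`) are hypothetical.

```python
def upsample2x(x):
    """Nearest-neighbour upsampling: repeat every element twice."""
    return [v for v in x for _ in range(2)]


def downsample2x(x):
    """Average adjacent pairs (a simple stand-in for strided pooling)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]


def mean_fuse(*inputs):
    """Placeholder fusion: element-wise mean of equally sized inputs."""
    return [sum(vals) / len(vals) for vals in zip(*inputs)]


def bifpn_layer(feats):
    """One BiFPN-style layer; feats[0] is the finest level, feats[-1] the coarsest.

    Each level is assumed to be half the length of the level before it.
    """
    n = len(feats)

    # Top-down pass. The coarsest level would have only one input,
    # so it gets no intermediate node and is passed through unchanged.
    td = list(feats)
    for l in range(n - 2, -1, -1):
        td[l] = mean_fuse(feats[l], upsample2x(td[l + 1]))

    # Bottom-up pass. The finest output is the top-down node itself.
    out = [td[0]]
    for l in range(1, n):
        below = downsample2x(out[l - 1])
        if l < n - 1:
            # Middle levels: input skip + top-down node + bottom-up path.
            out.append(mean_fuse(feats[l], td[l], below))
        else:
            # Coarsest level: input + bottom-up path only.
            out.append(mean_fuse(feats[l], below))
    return out


if __name__ == "__main__":
    levels = [[0.0] * 8, [1.0] * 4, [2.0] * 2]
    for i, level in enumerate(bifpn_layer(levels)):
        print(f"level {i}: {level}")
```

Because the output list has the same shapes as the input list, the function can be applied repeatedly, which is how stacked BiFPN layers would be composed in this toy setting.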
2. Learnable Weighted Feature Fusion
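BiFPN introduces a normalized, learnable scalar weighting for each incoming feature at a fusion node. The fusion at a node with inputs $I_1, \dots, I_n$ is computed as:
$$O = \sum_{i=1}^{n} \frac{\mathrm{ReLU}(w_i)}{\epsilon + \sum_{j=1}^{n} \mathrm{ReLU}(w_j)} \, I_i$$
- $w_i$ are unconstrained learnable parameters.
- ReLU ensures $\mathrm{ReLU}(w_i) \ge 0$ for non-negative fusion weights.
- $\epsilon$ (e.g., $10^{-4}$) prevents division by zero and aids numerical stability.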
- In the original design, normalization can alternatively use softmax, but ReLU-normalization achieves comparable accuracy with up to 30% lower GPU latency (Tan et al., 2019, Tang et al., 2024).
This mechanism lets the network dynamically prioritize relevant scales for each spatial region and task instance.
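As a minimal sketch of the weighting scheme in this section, the snippet below scales each input by its ReLU-clipped weight divided by a small constant plus the sum of all clipped weights, and contrasts that with the softmax alternative. Inputs are flat lists of floats standing in for feature maps, and the function names and the default `eps=1e-4` are illustrative assumptions rather than a reference implementation.

```python
import math


def _weighted_sum(inputs, weights):
    """Element-wise weighted sum of equally sized feature vectors."""
    return [
        sum(w * x[k] for w, x in zip(weights, inputs))
        for k in range(len(inputs[0]))
    ]


def fast_normalized_fusion(inputs, raw_weights, eps=1e-4):
    """ReLU-normalized fusion: weight_i = relu(w_i) / (eps + sum_j relu(w_j))."""
    clipped = [max(0.0, w) for w in raw_weights]
    denom = eps + sum(clipped)
    weights = [w / denom for w in clipped]
    return _weighted_sum(inputs, weights), weights


def softmax_fusion(inputs, raw_weights):
    """Softmax-normalized fusion, shown only for comparison."""
    shift = max(raw_weights)  # subtract the max for numerical stability
    exps = [math.exp(w - shift) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    return _weighted_sum(inputs, weights), weights


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]
    raw = [1.0, -2.0]  # the negative weight is clipped to zero by ReLU
    print(fast_normalized_fusion(feats, raw))
    print(softmax_fusion(feats, raw))
```

Note that in the ReLU variant a negative raw weight switches its input off entirely, and the weights sum to slightly less than one because of the stabilizing constant, whereas softmax always assigns every input a strictly positive share.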
3. Comparison with Related Feature Fusion Designs
BiFPN generalizes earlier pyramid fusion networks:
| Feature Fusion Network | Top-Down | Bottom-Up | Learnable Weights | Repeatable | Convolution Type | Node Pruning/Same-Level Skips |
|---|---|---|---|---|---|---|
| FPN | Yes | No | No | No | 3x3 regular | No |
| PANet | Yes | Yes | No | No | 3x3 regular | No |
| NAS-FPN | Yes | Yes | No | Yes | Architecture search-derived | Varies |
| BiFPN | Yes | Yes | Yes | Yes | Depthwise-separable, variant | Yes |
- BiFPN’s repeated, bidirectional blocks and pruning of single-input nodes set it apart, yielding both improved computational efficiency and feature expressivity (Tan et al., 2019, Tang et al., 2024, Meng et al., 2022).
- EfficientDet’s version uses only depthwise-separable convolutions at all fusion nodes; some extensions incorporate GhostConv, channel-shuffle, or projected convolutions to further reduce parameters (Li et al., 2023, Xu et al., 2022).
- Attention-enhanced BiFPN variants introduce content-adaptive fusion at the node level or region-level sparse attention (see Section 6) (Meng et al., 18 Jun 2025).
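To make the convolution-type column concrete, the following sketch counts the weights in a regular 3x3 convolution versus a depthwise-separable one (a per-channel 3x3 depthwise step followed by a 1x1 pointwise step). Biases and normalization parameters are ignored, and the channel width of 64 is only an illustrative choice, not a figure taken from the cited papers.

```python
def regular_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    width = 64  # illustrative fusion-node channel width
    regular = regular_conv_params(width, width)
    separable = depthwise_separable_params(width, width)
    print(f"regular 3x3:         {regular} weights")
    print(f"depthwise-separable: {separable} weights")
    print(f"reduction factor:    {regular / separable:.2f}x")
```

The gap widens as the channel width grows, since the regular convolution scales with the product of kernel area and both channel counts while the separable version adds the two terms instead of multiplying them.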
4. Empirical Performance and Complexity
In published comparisons, adding BiFPN improves on the baseline neck or model across several benchmarks and domains. Representative results include:
| Backbone + Neck | AP/mAP | Params | FLOPs | Notable Datasets |
|---|---|---|---|---|
| EfficientNet-B3 + FPN | 40.3 | 21M | 75B | COCO |
| EfficientNet-B3 + BiFPN (no weights) | 43.9 | ~18.5M | ~50B | COCO |
| EfficientNet-B3 + BiFPN (weights) | 44.4 | ~18.5M | ~50B | COCO |
| YOLOv5n + PANet | 67.6% | 1.77M | 4.2B | Fire Detection |
| YOLOv5n + Light-BiFPN | 68.6% | 1.25M | 3.3B | Fire Detection |
| DETR baseline | 39.9 | --- | --- | COCO |
| DETR++ (with BiFPN) | 41.8 | +few% | 410% overhead | COCO, RICO |
- Across detection, segmentation, and sound event localization tasks, adding BiFPN yields an improvement of 1–4 points in AP or mAP at a fraction of the parameter/FLOP cost of conventional FPN/PANet (Tan et al., 2019, Tang et al., 2024, Xu et al., 2022, Meng et al., 2022, Healy et al., 2023).
- On small-object detection benchmarks, incorporation of higher-resolution inputs (e.g., adding a P2 level) via BiFPN yields large relative mAP and recall gains (Ibrahim et al., 2 Apr 2025, Chen et al., 28 Jul 2025).
- Depthwise-separable convolution, GhostConv, and fusion node pruning are critical for achieving these efficiency benefits (Tan et al., 2019, Li et al., 2023).
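The relative savings implied by two row pairs of the comparison table in this section can be worked out with a few lines of arithmetic. The numbers are copied from that table (the BiFPN parameter and FLOP entries are approximate there, so the percentages are approximate too); the dictionary keys and helper names are hypothetical.

```python
# (metric, params in millions, FLOPs in billions) for baseline and BiFPN rows.
COMPARISONS = {
    "EfficientNet-B3: FPN -> weighted BiFPN": ((40.3, 21.0, 75.0), (44.4, 18.5, 50.0)),
    "YOLOv5n: PANet -> Light-BiFPN": ((67.6, 1.77, 4.2), (68.6, 1.25, 3.3)),
}


def relative_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


def summarize(baseline, bifpn):
    """Metric gain in points plus relative parameter and FLOP changes."""
    (m0, p0, f0), (m1, p1, f1) = baseline, bifpn
    return {
        "metric_gain_points": round(m1 - m0, 2),
        "params_change_pct": round(relative_change(p0, p1), 1),
        "flops_change_pct": round(relative_change(f0, f1), 1),
    }


if __name__ == "__main__":
    for name, (baseline, bifpn) in COMPARISONS.items():
        print(name, summarize(baseline, bifpn))
```

In both pairs the accuracy metric rises while parameters and FLOPs fall, which is the pattern the bullets above describe.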
5. Application Domains and Customizations
Since its original formulation, BiFPN has been adapted for diverse modalities and detection contexts:
- Standard Object Detection: EfficientDet (Tan et al., 2019), YOLO variants (Tang et al., 2024, Ibrahim et al., 2 Apr 2025, Chen et al., 28 Jul 2025), vehicle and traffic sign detection (Li et al., 2023, Ibrahim et al., 2 Apr 2025), remote sensing ship detection (Meng et al., 18 Jun 2025).
- Segmentation: ESeg uses an extended (P3–P9) BiFPN for context aggregation without atrous convolutions, outperforming DeepLabV3+ in both speed and accuracy (Meng et al., 2022).
- Audio/Sound Event Detection: A three-scale BiFPN integrates time-frequency pyramid features in SELD, yielding up to a 43% reduction in DOA regression error and a 7.5% mAP improvement (Healy et al., 2023).
- Diffusion-Based Sensor Fusion: A hierarchical mini-BiFPN (cMini-BiFPN) structures latent multi-sensor diffusion with strong robustness and efficiency (Le et al., 2024).
Common customizations include:
- Channel unification via 1×1 convolutions before fusion.
- Attention augmentation such as integrating BiFormer region routing or SimAM (Meng et al., 18 Jun 2025, Tang et al., 2024).
- Reduced-depth BiFPNs for latency-constrained or lightweight applications (e.g., single block or light variant for YOLOv5n) (Xu et al., 2022, Li et al., 2023).
- Skip-connections and pruning of single-input or dead-end nodes, which empirically improves both accuracy and latency (Tan et al., 2019, Tang et al., 2024).
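Channel unification, the first customization listed above, amounts to applying the same linear map over channels at every spatial position. The sketch below shows this for a feature map stored as an H x W grid of channel vectors; the layout, the `conv1x1` name, and the example weights are assumptions made for illustration.

```python
def conv1x1(feature, weight, bias=None):
    """Apply a 1x1 convolution.

    feature: H x W grid whose cells are C_in-length channel vectors.
    weight:  C_out x C_in matrix shared by every spatial position.
    """
    c_out = len(weight)
    bias = bias if bias is not None else [0.0] * c_out
    return [
        [
            [
                sum(w * x for w, x in zip(weight[o], pixel)) + bias[o]
                for o in range(c_out)
            ]
            for pixel in row
        ]
        for row in feature
    ]


if __name__ == "__main__":
    # Two pyramid levels with mismatched widths (3 and 4 channels).
    fine = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]   # 1 x 2 grid, 3 channels
    coarse = [[[1.0, 0.0, 1.0, 0.0]]]             # 1 x 1 grid, 4 channels

    # Hypothetical projection weights mapping both levels to 2 channels.
    w_fine = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
    w_coarse = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

    print(conv1x1(fine, w_fine))
    print(conv1x1(coarse, w_coarse))
```

After projection both levels carry the same number of channels, so they can be resized and summed element-wise at a fusion node.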
6. Variants: Attention and Enhanced Fusion
Several studies extend the BiFPN concept with explicit nonlocal attention, content-adaptive fusion, or additional context modules:
- AFBiFPN adds BiFormer region-level and token-level sparse attention at fusion nodes, yielding significant AP improvements in SAR ship detection, particularly on small and medium object subsets (Meng et al., 18 Jun 2025).
- CFE + BiFPN enhances local feature diversity before fusion using a multi-branch convolutional preprocess, followed by BiFPN and attention for scale-aware context (Meng et al., 18 Jun 2025).
- SimAM and Shuffle Attention Mechanisms are stacked with BiFPN in road-crack detection to further boost spatial discriminability and channel selectivity (Tang et al., 2024).
- Diffusion-conditioned BiFPN (cMini-BiFPN) combines multi-resolution latent denoising with BiFPN-style fusion for robust sensor fusion (Le et al., 2024).
These modifications consistently demonstrate that BiFPN’s learnable fusion weights interact favorably with content-adaptive attention, providing complementary local/global context modeling and further boosting detection and recognition metrics (Tang et al., 2024, Meng et al., 18 Jun 2025).
7. Implementation and Practical Considerations
Best practices for BiFPN integration, according to published benchmarks, include:
- Depthwise-separable convolution at fusion points to minimize parameters and operations (Tan et al., 2019).
- Channel unification via 1×1 convolution prior to fusion at each scale level (especially when scales have mismatched feature widths) (Ibrahim et al., 2 Apr 2025, Chen et al., 28 Jul 2025).
- Limiting the number of BiFPN layers for real-time or memory-constrained inference: 1–3 layers are typically effective, while further stacking yields diminishing returns and increases memory and FLOPs (Chen et al., 28 Jul 2025).
- ReLU-based fusion normalization is more resource-efficient than softmax for BiFPN weight normalization, with nearly identical empirical performance (Tan et al., 2019).
- Ablating BiFPN in combination with the chosen lightweight backbone (GhostNet, CSPDarkNet, etc.) and attention modules, since BiFPN on its own does not always yield a net gain (Ruiqiang, 2022).
- Explicit preservation of high-resolution pyramid levels (e.g., P2 at 160×160, or even P3–P9 in segmentation) is central for challenging small-object or dense pixelwise tasks (Meng et al., 2022, Ibrahim et al., 2 Apr 2025, Chen et al., 28 Jul 2025).
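The cost of keeping a high-resolution level can be estimated from the usual pyramid convention that level $P_l$ has stride $2^l$ relative to the input. The sketch below assumes a 640×640 input, which is consistent with the 160×160 figure quoted for P2 but is not stated in the source; the helper names are hypothetical.

```python
def pyramid_sizes(input_size, levels):
    """Map each level l to its spatial side length, assuming stride 2**l."""
    return {f"P{l}": input_size // (2 ** l) for l in levels}


def total_positions(sizes):
    """Total number of spatial positions across all (square) levels."""
    return sum(side * side for side in sizes.values())


if __name__ == "__main__":
    without_p2 = pyramid_sizes(640, range(3, 6))  # P3..P5
    with_p2 = pyramid_sizes(640, range(2, 6))     # P2..P5

    print("P3-P5:", without_p2, "positions:", total_positions(without_p2))
    print("P2-P5:", with_p2, "positions:", total_positions(with_p2))
```

Under these assumptions the P2 level alone contributes roughly three times as many spatial positions as P3 through P5 combined, which is why preserving it helps small objects but makes depthwise-separable convolutions and a shallow BiFPN stack more important.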
References
- "EfficientDet: Scalable and Efficient Object Detection" (Tan et al., 2019)
- "Enhancing Road Crack Detection Accuracy with BsS-YOLO: Optimizing Feature Fusion and Attention Mechanisms" (Tang et al., 2024)
- "Light-YOLOv5: A Lightweight Algorithm for Improved YOLOv5 in Complex Fire Scenarios" (Xu et al., 2022)
- "DETR++: Taming Your Multi-Scale Detection Transformer" (Zhang et al., 2022)
- "Revisiting Multi-Scale Feature Fusion for Semantic Segmentation" (Meng et al., 2022)
- "Enhancing Traffic Sign Recognition On The Performance Based On Yolov8" (Ibrahim et al., 2 Apr 2025)
- "Fast vehicle detection algorithm based on lightweight YOLO7-tiny" (Li et al., 2023)
- "An Improved YOLOv8 Approach for Small Target Detection of Rice Spikelet Flowering in Field Environments" (Chen et al., 28 Jul 2025)
- "Feature Aggregation in Joint Sound Classification and Localization Neural Networks" (Healy et al., 2023)
- "YOLOv5s-GTB: light-weighted and improved YOLOv5s for bridge crack detection" (Ruiqiang, 2022)
- "DifFUSER: Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation" (Le et al., 2024)
- "Convolutional Feature Enhancement and Attention Fusion BiFPN for Ship Detection in SAR Images" (Meng et al., 18 Jun 2025)