2000 character limit reached

PRB-FPN-Net: Parallel Residual Bi-Fusion FPN

Updated 11 November 2025

The paper introduces parallel, bidirectional residual fusion modules that preserve scale-specific details, enhancing the detection of tiny objects.
It integrates CORE and BFM modules into three independent fusion pyramids, ensuring efficient gradient flow and precise multi-scale feature aggregation.
Performance benchmarks show improved small object AP and robust results across detection tasks and related domains like medical image registration.

The Parallel Residual Bi-Fusion Feature Pyramid Network (PRB-FPN-Net) is a convolutional neural network architecture designed to enhance single-shot object detection—particularly for tiny and small objects—by orchestrating parallel multi-scale feature fusion through residual and bidirectional (top-down and bottom-up) connections. This approach replaces the canonical single-path FPN with multiple parallel pyramids, each incorporating both bottom-up and top-down fusion aided by residual shortcuts. Deployed atop various advanced multiscale backbones such as ELAN, MSP, and CSP, PRB-FPN-Net systematically preserves scale-specific information, facilitating robust detection performance with efficient computation and effective gradient flow (Huang et al., 7 Nov 2025, Chen et al., 2020, Chen et al., 2019, Zhang et al., 8 May 2025).

1. Architectural Foundation and Motivation

PRB-FPN-Net addresses key challenges in object detection FPNs, including the loss of precise localization in deep top-down pathways and insufficient aggregation of fine spatial details, especially detrimental for tiny object detection. Conventional FPNs aggregate semantically strong features in a top-down manner but are susceptible to position shifts introduced by pooling and upsampling operations. As a result, adding depth to vanilla FPNs yields diminishing or degraded accuracy on small objects (Chen et al., 2019).

PRB-FPN-Net overcomes these limitations by:

Parallelizing feature pyramids: Three Bi-Fusion pyramids are maintained, each with independent fusion streams, which preserves scale-specific feature streams and prevents inter-scale feature dilution.
Bidirectional fusion: Each pyramid employs both bottom-up (CORE) and top-down (BFM) fusion modules, propagating low-level spatial cues upward and high-level semantics downward.
Residual design: Residual shortcuts within each fusion module facilitate gradient propagation and stabilize deeper stacks, allowing more aggressive multi-level fusion.

This design yields enhanced localization capabilities for small and complex objects at a modest computational and parameter overhead relative to single-path FPNs.

2. Core Modules: Parallel Residual Fusion and Bi-Fusion Layers

The core building block of PRB-FPN-Net is the Parallel Residual Module, used in both CORE (bottom-up) and BFM (top-down) fusion paths. Each module processes its input via two convolutional branches (commonly 3×3 convolutions), fuses the results via channel-wise concatenation followed by a 1×1 convolution, and adds the original input as a residual:

$F_1(X) = \mathrm{Conv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(X))), \quad F_2(X) = \mathrm{Conv}_{3 \times 3}(\mathrm{BN}(\mathrm{ReLU}(F_1(X))))$

$Y = \mathrm{Conv}_{1 \times 1}([F_1(X) \Vert F_2(X)]) + X$

where $[\,\cdot\,\Vert\,\cdot\,]$ denotes channel-wise concatenation. The module can be extended by varying the depth of the branches, but the fusion-plus-residual pattern is invariant (Huang et al., 7 Nov 2025).

Bi-Fusion Feature Pyramid Construction

At each of the four pyramid levels (k=1..4), for each of the three pyramid streams (indexed by $j$ ):

Bottom-up CORE pathway:

$\mathrm{CORE}^j_k = \psi_{\mathrm{bot}}(X_{4-k}, X_{3-k}, \mathrm{CORE}^j_{k-1}); \quad \mathrm{CORE}^j_0=0$

Top-down BFM (Bi-Fusion Module) pathway:

$\mathrm{BFM}^j_k = \phi_{\mathrm{td}}(\mathrm{CORE}^j_k, \mathrm{Upsample}(\mathrm{BFM}^j_{k+1}));\quad \mathrm{BFM}^j_5=0$

Both $\psi_{\mathrm{bot}}$ and $\phi_{\mathrm{td}}$ are instantiated as parallel residual modules. This bidirectional, residual structure allows the pyramid to maintain and propagate both local details and semantic context across scales.

3. Integration with Advanced Multi-Scale Backbones

PRB-FPN-Net's design is backbone-agnostic, accepting multi-scale feature outputs from various modern architectures:

CSPNet: Provides four feature maps with channels $\{256,512,1024,2048\}$ ; each corresponding to a residual stage.
MSPNet: Employs mixed-stage partial connections, yielding four scales with channel counts $\{320,640,1280,2560\}$ .
ELAN: Utilizes efficient layer aggregation to merge multiple shallow convolutional paths per stage, forming four scale levels.

The four backbone feature maps ( $\{X_1,\ldots,X_4\}$ ) are fed in parallel to the three Bi-Fusion modules. Each module independently processes these inputs through its own CORE and BFM paths, ensuring robust and non-interfering multi-scale feature extraction and fusion (Huang et al., 7 Nov 2025).

4. Forward Propagation, Head Structure, and Training Workflow

The typical inference pipeline comprises:

Feature Extraction: Input of resolution 640×640 processed through the backbone to emit four feature maps.
Parallel Bi-Fusion Pyramids:
- For each of the three pyramids:
  - Sequentially build up CORE modules per level, fusing input features and previous COREs.
  - Apply the BFM path (downward) per level, upsampling and fusing with CORE outputs.
Prediction Heads: For each scale, concatenate BFM and CORE outputs across all streams to form lead and auxiliary features, each passed to dedicated detection heads for classification, bounding box regression, and objectness scoring (via small 3×3 convolutional stacks).
Postprocessing: Standard anchor-based decoding, non-maximum suppression, and result rendering.

Pseudo-code summary:

X1, X2, X3, X4 = Backbone(image)
for j in {1,2,3}:
    CORE[j][1] = PR_Module(concat(X3, X2, zeros))
    for k in 2..4:
        bottom_inputs = concat(X_{4 - k}, X_{3 - k}, CORE[j][k-1])
        CORE[j][k] = PR_Module(bottom_inputs)
    BFM[j][4] = PR_Module(CORE[j][4])
    for k in 3..1:
        td_inputs = concat(CORE[j][k], Upsample(BFM[j][k+1]))
        BFM[j][k] = PR_Module(td_inputs)
for k in 1..4:
    LeadFusion[k] = concat(BFM[1][k], BFM[2][k], BFM[3][k])
    AuxFusion[k] = concat(CORE[1][k], CORE[2][k], CORE[3][k])
    cls_k, reg_k, obj_k = Head(LeadFusion[k], AuxFusion[k])
return detections

All concatenation steps use a trailing 1×1 convolution plus BN and ReLU. Training uses SGD with momentum (0.9), weight decay (0.0005), cosine learning rate decay over 300 epochs, batch size 32, and stabilization via frozen BatchNorm after three epochs (Huang et al., 7 Nov 2025).

5. Empirical Results and Performance Characteristics

PRB-FPN-Net demonstrates superior accuracy and efficiency on large-scale detection benchmarks. Performance metrics include:

Variant	Params (M)	GFLOPs	AP_s (COCO)
Proposed-CSP	58.8	153.6	31.1%
Proposed-ELAN	96.4	252.8	31.1%
Proposed-MSP	101.1	368.1	35.4%

On COCO2017 val, the cross-modal system integrating PRB-FPN-Net as the vision component achieves a 52.6% average precision (AP), outperforming YOLO-World significantly and maintaining roughly half the parameter count of transformer-based methods such as GLIP (Huang et al., 7 Nov 2025).

Ablation studies indicate:

Maintaining three parallel fusion streams enhances the preservation and merging of high-resolution features vital for tiny object detection.
Residual shortcuts ensure stable training and avoid over-smoothing, even in deep pyramids.
Small-object AP (AP_s) consistently increased: For example, PRB-FPN with Re-CORE and BFM achieves AP_s up to 19.0 (YOLOv3 backbone), outperforming both baseline FPNs and single-path fusion strategies (Chen et al., 2019, Chen et al., 2020).

PRB-FPN-Net distinguishes itself from both standard FPNs and classic bidirectional architectures by:

Parallel scale-specific fusion: Unlike single-path bidirectional FPNs, PRB-FPN-Net executes N parallel fusion pyramids (typically N=3), decoupling feature streams for each detection scale and aggregating their outputs only at the head stage. This prevents "competition" or mutual suppression of small- and large-object cues.
Integration with residual fusion blocks: Each feature fusion block injects a shortcut connection to its input, ensuring gradient flow and enabling deeper and wider pyramids without training instability (Chen et al., 2019, Chen et al., 2020).
Efficient multi-scale context maintenance: By fusing and purifying features at each pyramid stage using residual bottleneck sub-blocks, PRB-FPN-Net achieves robust context aggregation with moderate overhead.

The method is agnostic to the precise backbone, supporting a spectrum of architectures, including those specially designed for lightweight, high-resolution contexts.

7. Broader Applications and Impact

Beyond its deployment as the vision backbone in cross-modal tiny object detectors (Huang et al., 7 Nov 2025), PRB-FPN-Net's principles have been adapted to related domains:

Medical image registration: As the core of FF-PNet (Zhang et al., 8 May 2025), PRB-FPN-Net demonstrates the value of parallel residual bi-fusion strategies for efficiently capturing multi-scale anatomical features and nonrigid deformations, achieving highest Dice Similarity Coefficient (DSC) scores on LPBA and OASIS benchmarks, while maintaining low parameter count and computational complexity.

The overall impact is a set of architectural patterns for robust feature pyramid construction, emphasizing parallelization, bidirectional fusion, and residual connections as foundational elements for precise localization in dense prediction tasks. These properties make PRB-FPN-Net particularly suited for scenarios with dense, small, or irregular targets, as well as domains constrained by compute or memory budgets.

PRB-FPN-Net represents a unified architectural family in both detection and registration, where parallel, residual, and bidirectional information flows are coordinated for enhanced multi-scale feature aggregation. Its systematic improvements in robustness, scalability, and accuracy set new empirical benchmarks in both object detection and volumetric image registration (Huang et al., 7 Nov 2025, Chen et al., 2020, Chen et al., 2019, Zhang et al., 8 May 2025).