
RevBiFPN: Reversible Multi-Scale Vision Backbone

Updated 5 April 2026
  • RevBiFPN is a reversible bidirectional feature pyramid network that fuses multi-scale features and can be trained without storing intermediate activations, because they are recomputed exactly during backpropagation.
  • It is built on the RevSilo module, which uses additive coupling operations so that activation memory stays constant regardless of network depth.
  • Empirical evaluations on ImageNet-1K and MS COCO show competitive accuracy with substantially lower training memory than non-reversible baselines.

RevBiFPN is a fully reversible bidirectional feature pyramid network architecture designed to drastically reduce training-time memory for multi-scale vision backbones. It addresses the memory demands of spatially sensitive tasks—where bidirectional multi-scale feature fusion is essential—by enabling lossless backpropagation without storing intermediate activations. Fundamental to RevBiFPN is the RevSilo module, which provides the first reversible mechanism for multi-scale feature fusion and allows memory usage to remain constant with network depth. Empirical evaluation demonstrates that RevBiFPN delivers competitive or superior accuracy compared to state-of-the-art baselines, at a fraction of their activation memory costs (Chiley et al., 2022).

1. Architectural Foundations

At the core of RevBiFPN lies the RevSilo, a fully reversible bidirectional multi-scale fusion module. A RevBiFPN backbone is constructed by serially stacking d RevSilos, interleaved with reversible residual blocks at each resolution scale. The canonical dataflow is:

  • SpaceToDepth Stem: An invertible downsampling that reduces input spatial resolution by 4× and increases channel width;
  • Initial Feature Maps: The stem produces N = 4 feature maps (h_0, h_1, h_2, h_3) at different resolutions;
  • Stacked RevSilos: Each RevSilo fuses information across all scales in a reversible manner, producing new N-scale outputs;
  • Task-Specific Head: The resulting four-scale feature pyramid is consumed by the downstream prediction head.
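
The invertibility of the stem can be illustrated with a minimal NumPy sketch of a space-to-depth rearrangement and its inverse. The NCHW layout, the function names, and the channel ordering are assumptions made for illustration; they are not taken from the RevBiFPN implementation. A block size of 4 corresponds to the 4× reduction in spatial resolution described above.

```python
import numpy as np

def space_to_depth(x, block=4):
    """Move each block x block spatial patch into the channel axis (NCHW)."""
    n, c, h, w = x.shape
    assert h % block == 0 and w % block == 0, "spatial dims must divide by block"
    x = x.reshape(n, c, h // block, block, w // block, block)
    x = x.transpose(0, 1, 3, 5, 2, 4)  # -> (n, c, block, block, h/block, w/block)
    return x.reshape(n, c * block * block, h // block, w // block)

def depth_to_space(y, block=4):
    """Exact inverse of space_to_depth: no information is lost or created."""
    n, cbb, hb, wb = y.shape
    c = cbb // (block * block)
    y = y.reshape(n, c, block, block, hb, wb)
    y = y.transpose(0, 1, 4, 2, 5, 3)  # -> (n, c, h/block, block, w/block, block)
    return y.reshape(n, c, hb * block, wb * block)

# Usage: a 3-channel 32x32 input becomes a 48-channel 8x8 tensor and back.
image = np.arange(1 * 3 * 32 * 32, dtype=np.float32).reshape(1, 3, 32, 32)
stem_out = space_to_depth(image)
restored = depth_to_space(stem_out)
```

Because the operation only permutes elements, the round trip is bit-exact, which is what makes it usable as the first stage of a reversible backbone.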

The forward data flow can be illustrated as:

input image → SpaceToDepth stem → (h_0, h_1, h_2, h_3) → RevSilo_1 → … → RevSilo_d → task-specific head

Within each RevSilo, information is fused downwards (coarse-to-fine) in the first half and upwards (fine-to-coarse) in the second, using reversible residual coupling.

2. Reversible Multi-Scale Fusion Mechanism

RevSilo achieves reversibility by formulating every fusion step as a sequence of additive coupling operations, each with an exact, closed-form inverse. For N = 4 scales, the forward and backward passes are defined as:

Forward:

[Equation lost in extraction: forward additive-coupling updates for the N = 4 scales]

Backward (Recomputation):

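[Equation lost in extraction: inverse updates that recover the inputs from the outputs]

During training, only the N outputs of a RevSilo, together with the network parameters, need to be available to recover its inputs. All intermediate activations are re-materialized on demand in the backward pass.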
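
The idea of additive coupling across scales can be sketched in NumPy as follows. This is a generic illustration, not the exact RevSilo equations: the resampling operators (`up2`, `down2`), the scalar weights, and the `tanh` transform are all hypothetical stand-ins for the learned layers. The one property it is meant to show is that each step adds a function of *other* streams to a single stream, so the step can be undone by subtracting the same quantity.

```python
import numpy as np

def down2(x):
    """2x average pooling (stand-in for a learned downsampling transform)."""
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def up2(x):
    """2x nearest-neighbour upsampling (stand-in for a learned upsampling)."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

def silo_forward(hs, w_up, w_down):
    """hs[0] is the finest scale; each later entry has half the resolution."""
    ys = [h.copy() for h in hs]
    n = len(ys)
    for i in range(n - 2, -1, -1):   # coarse-to-fine half
        ys[i] = ys[i] + np.tanh(w_up[i] * up2(ys[i + 1]))
    for i in range(1, n):            # fine-to-coarse half
        ys[i] = ys[i] + np.tanh(w_down[i - 1] * down2(ys[i - 1]))
    return ys

def silo_inverse(ys, w_up, w_down):
    """Undo the forward steps in reverse order, subtracting instead of adding."""
    hs = [y.copy() for y in ys]
    n = len(hs)
    for i in range(n - 1, 0, -1):    # undo fine-to-coarse half
        hs[i] = hs[i] - np.tanh(w_down[i - 1] * down2(hs[i - 1]))
    for i in range(0, n - 1):        # undo coarse-to-fine half
        hs[i] = hs[i] - np.tanh(w_up[i] * up2(hs[i + 1]))
    return hs

# Usage with N = 4 scales (16x16 down to 2x2) and arbitrary weights.
rng = np.random.default_rng(0)
inputs = [rng.standard_normal((1, 2, 16 >> i, 16 >> i)) for i in range(4)]
w_up, w_down = [0.2, 0.7, -0.4], [0.5, -0.3, 0.8]
outputs = silo_forward(inputs, w_up, w_down)
recovered = silo_inverse(outputs, w_up, w_down)
```

In this sketch the inputs are recovered from the outputs up to floating-point rounding, so nothing computed inside `silo_forward` needs to be kept for the backward pass.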

3. Computational and Memory Complexity

RevBiFPN reduces peak activation memory from O(d) (for d stacked fusion modules) to O(1) with respect to depth. Denoting d as the number of RevSilo modules, N as the number of scales, C as the MACs (multiply-accumulate operations) per module, and A as the activation memory per module (each covering all N scales), the following expressions describe cost and memory:

  • Non-reversible BiFPN:

$$\text{MACs} \approx d \cdot C, \qquad \text{activation memory} \approx d \cdot A$$

  • RevBiFPN:

$$\text{MACs} \approx r \cdot d \cdot C, \qquad \text{activation memory} \approx A$$

where r > 1 is the recomputation factor.

Original activation memory is

$$M_{\text{orig}} = O(d \cdot A)$$

while memory under reversibility is

$$M_{\text{rev}} = O(A)$$

This enables scaling up network depth and input resolution with negligible impact on memory consumption.
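
The scaling behaviour described in this section can be made concrete with a small back-of-the-envelope calculation. The function names and all numeric inputs below are hypothetical and chosen only to show the shape of the trade-off; they are not measurements from the paper.

```python
def peak_activation_memory(depth, per_module_gb, reversible):
    """Peak activation memory (GB) for `depth` stacked fusion modules.

    Non-reversible: every module's activations are kept for the backward pass.
    Reversible: activations are recomputed, so only one module's worth is live.
    """
    return per_module_gb if reversible else depth * per_module_gb

def training_macs(depth, macs_per_module, recompute_factor=1.0):
    """Total MACs; recompute_factor > 1 models the extra recomputation work."""
    return recompute_factor * depth * macs_per_module

# Hypothetical numbers: 0.5 GB and 2.0 GMACs per module.
for depth in (2, 4, 8, 16):
    baseline = peak_activation_memory(depth, 0.5, reversible=False)
    reversible = peak_activation_memory(depth, 0.5, reversible=True)
    print(f"depth={depth:2d}  non-reversible={baseline:4.1f} GB  reversible={reversible:.1f} GB")
```

Memory for the non-reversible stack grows linearly with depth, while the reversible stack stays flat; the price is the `recompute_factor` multiplier on compute.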

4. Empirical Evaluation and Benchmarks

Extensive experiments were conducted across image classification (ImageNet-1K), detection (MS COCO), and instance segmentation benchmarks.

(a) ImageNet-1K Classification

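| Model           | Params (M) | MACs (B) | Top-1 (%) | Train mem (GB/sample) |
|-----------------|------------|----------|-----------|-----------------------|
| RevBiFPN-S4     | 48.7       | 10.6     | 83.0      | 0.23                  |
| EfficientNet-B5 | 30.0       | 9.9      | 83.6      | 1.44                  |
| RevBiFPN-S6     | 142.3      | 38.1     | 84.2      | 0.25                  |
| EfficientNet-B7 | 66.0       | 37.0     | 84.3      | 5.05                  |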

RevBiFPN-S6 comes within 0.1 points of EfficientNet-B7's Top-1 accuracy (84.2% vs. 84.3%) at comparable computational cost (38.1B vs. 37.0B MACs) while using approximately 20× less GPU memory per sample (0.25 GB vs. 5.05 GB).

(b) MS COCO Object Detection (Faster R-CNN)

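| Backbone     | MACs (B) | Train mem (GB) | AP   |
|--------------|----------|----------------|------|
| RevBiFPN-S3  | 181      | 1.31           | 38.7 |
| HRNetV2p-W18 | 196      | 3.13           | 36.2 |
| RevBiFPN-S5  | 329      | 2.75           | 41.3 |
| HRNetV2p-W32 | 299      | 4.31           | 39.6 |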

RevBiFPN-S3 outperforms HRNetV2p-W18 by 2.5 AP using less than half the memory; RevBiFPN-S5 outperforms HRNetV2p-W32 by 1.7 AP using approximately 36% less memory.

(c) MS COCO Instance Segmentation (Mask R-CNN)

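| Backbone     | MACs (B) | Train mem (GB) | Mask AP | BBox AP |
|--------------|----------|----------------|---------|---------|
| RevBiFPN-S2  | 210      | 1.06           | 33.7    | 37.1    |
| HRNetV2p-W18 | 249      | 3.33           | 33.8    | 37.1    |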

RevBiFPN-S2 comes within 0.1 Mask AP of HRNetV2p-W18 (33.7 vs. 33.8) and matches its BBox AP (37.1) while using approximately 3× less memory.
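
The memory comparisons quoted in this section follow directly from the table entries. A short sketch of the arithmetic, using only the training-memory values listed above (the helper names are illustrative):

```python
def memory_ratio(reversible_gb, baseline_gb):
    """How many times less memory the reversible model uses."""
    return baseline_gb / reversible_gb

def memory_saving(reversible_gb, baseline_gb):
    """Fraction of the baseline's memory that is saved."""
    return 1.0 - reversible_gb / baseline_gb

# (reversible, baseline) training memory in GB, copied from the tables above.
comparisons = {
    "ImageNet: RevBiFPN-S6 vs EfficientNet-B7": (0.25, 5.05),
    "Detection: RevBiFPN-S3 vs HRNetV2p-W18": (1.31, 3.13),
    "Detection: RevBiFPN-S5 vs HRNetV2p-W32": (2.75, 4.31),
    "Segmentation: RevBiFPN-S2 vs HRNetV2p-W18": (1.06, 3.33),
}

for name, (rev, base) in comparisons.items():
    print(f"{name}: {memory_ratio(rev, base):.2f}x less, "
          f"{100 * memory_saving(rev, base):.0f}% saved")
```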

5. Practical Trade-offs and Limitations

The fully reversible design of RevBiFPN introduces computational trade-offs and practical considerations:

  • Computation Overhead: Reversible recomputation adds […]–[…] extra operations in practice (theoretically […]), though this overhead decreases at larger model scales (e.g., S6: […] slowdown).
  • Finite-Precision Drift: Negligible; forward-backward equivalence is maintained to full floating-point accuracy.
  • Energy Utilization: The increase in FLOPs due to recomputation raises energy consumption, but the ability to accommodate larger batch sizes and resolutions may reduce overall training time and improve hardware utilization.
  • Hardware Constraints: Sufficient on-chip memory is necessary to hold per-scale activations; parameters may be streamed from off-chip memory as needed.
  • Architectural Flexibility: The requirement for additive or affine coupling imposes constraints, though it was found to be sufficiently flexible for multi-scale fusion.
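
The finite-precision point can be checked on a toy model. The sketch below is not the RevBiFPN network: it stacks generic two-stream additive couplings with a fixed, hypothetical weight of 0.1 and a `tanh` transform, runs them forward and then inverts them, and reports the largest reconstruction error. It is intended only as a way to measure drift for a given dtype and depth.

```python
import numpy as np

def roundtrip_error(dtype, depth=20, size=1024, weight=0.1, seed=0):
    """Max abs error after `depth` additive couplings and their inverses."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(size).astype(dtype)
    x2 = rng.standard_normal(size).astype(dtype)
    w = dtype(weight)  # hypothetical fixed weight standing in for learned layers

    y1, y2 = x1.copy(), x2.copy()
    for _ in range(depth):      # forward: each stream is updated from the other
        y1 = y1 + np.tanh(w * y2)
        y2 = y2 + np.tanh(w * y1)
    for _ in range(depth):      # inverse: same steps, reversed and subtracted
        y2 = y2 - np.tanh(w * y1)
        y1 = y1 - np.tanh(w * y2)

    return float(max(np.abs(y1 - x1).max(), np.abs(y2 - x2).max()))

err32 = roundtrip_error(np.float32)
err64 = roundtrip_error(np.float64)
print(f"float32 round-trip error: {err32:.3e}")
print(f"float64 round-trip error: {err64:.3e}")
```

In exact arithmetic both errors would be zero; in practice the reconstruction is subject to rounding, so the measured values indicate how much drift to expect at a given precision and depth in this toy setting.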

6. Significance and Context

RevBiFPN resolves a longstanding bottleneck in high-resolution, multi-scale computer vision backbones by removing depth-dependent memory constraints. By implementing the first invertible multi-scale fusion module, RevBiFPN enables efficient scaling of both resolution and backbone depth on standard hardware while maintaining or improving task accuracy compared to non-reversible variants such as EfficientNet and HRNetV2p. This design admits broader multi-scale architectures while providing practical memory and hardware benefits for large-scale vision tasks (Chiley et al., 2022).
