RevBiFPN: Reversible Multi-Scale Vision Backbone

Updated 5 April 2026

RevBiFPN is a reversible bidirectional feature pyramid network that fuses multi-scale features while supporting lossless backpropagation without storing intermediate activations.
Its RevSilo module uses additive coupling operations so that activation memory stays constant regardless of network depth.
Empirical evaluations on ImageNet-1K and MS COCO show competitive accuracy with significantly lower memory usage than non-reversible baselines.

RevBiFPN is a fully reversible bidirectional feature pyramid network architecture designed to drastically reduce training-time memory for multi-scale vision backbones. It addresses the memory demands of spatially sensitive tasks—where bidirectional multi-scale feature fusion is essential—by enabling lossless backpropagation without storing intermediate activations. Fundamental to RevBiFPN is the RevSilo module, which provides the first reversible mechanism for multi-scale feature fusion and allows memory usage to remain constant with network depth. Empirical evaluation demonstrates that RevBiFPN delivers competitive or superior accuracy compared to state-of-the-art baselines, at a fraction of their activation memory costs (Chiley et al., 2022).

1. Architectural Foundations

At the core of RevBiFPN lies the RevSilo, a fully reversible bidirectional multi-scale fusion module. A RevBiFPN backbone is constructed by serially stacking $d$ RevSilos, interleaved with reversible residual blocks at each resolution scale. The canonical dataflow is:

SpaceToDepth Stem: An invertible downsampling that reduces input spatial resolution by $4\times$ and increases channel width;
Initial Feature Maps: The stem produces $N=4$ feature maps $(h_0, h_1, h_2, h_3)$ at different resolutions;
Stacked RevSilos: Each RevSilo fuses information across all scales in a reversible manner, producing new $N$-scale outputs;
Task-Specific Head: The resulting four-scale feature pyramid is consumed by the downstream prediction head.

The forward data flow can be illustrated as:

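$$x \;\to\; \text{SpaceToDepth stem} \;\to\; (h_0, h_1, h_2, h_3) \;\to\; \text{RevSilo}_1 \;\to\; \cdots \;\to\; \text{RevSilo}_d \;\to\; \text{task-specific head}$$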

Within each RevSilo, information is fused downwards (coarse-to-fine) in the first half and upwards (fine-to-coarse) in the second, using reversible residual coupling.
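The invertible stem in the dataflow above can be illustrated with a plain space-to-depth rearrangement. The sketch below is a minimal illustration on nested Python lists, assuming a block size of 4 to match the stated $4\times$ reduction; the function names `space_to_depth` and `depth_to_space` are chosen here for illustration and are not taken from the RevBiFPN code. It shows only that the rearrangement loses no information, since the inverse restores the input exactly.

```python
def space_to_depth(image, block=4):
    """Rearrange an H x W x C nested list into (H/block) x (W/block) x (C*block*block)."""
    height, width = len(image), len(image[0])
    assert height % block == 0 and width % block == 0, "H and W must be divisible by block"
    out = []
    for i in range(0, height, block):
        row = []
        for j in range(0, width, block):
            # Stack every pixel of the block x block patch along the channel axis.
            channels = []
            for di in range(block):
                for dj in range(block):
                    channels.extend(image[i + di][j + dj])
            row.append(channels)
        out.append(row)
    return out


def depth_to_space(features, block=4):
    """Exact inverse of space_to_depth: scatter channels back to pixel positions."""
    out_h, out_w = len(features), len(features[0])
    channels_per_pixel = len(features[0][0]) // (block * block)
    image = [[None] * (out_w * block) for _ in range(out_h * block)]
    for i in range(out_h):
        for j in range(out_w):
            channels = features[i][j]
            for di in range(block):
                for dj in range(block):
                    start = (di * block + dj) * channels_per_pixel
                    pixel = channels[start:start + channels_per_pixel]
                    image[i * block + di][j * block + dj] = pixel
    return image


# An 8 x 8 image with 3 channels whose values encode the pixel position.
image = [[[float(r), float(col), float(r * 8 + col)] for col in range(8)] for r in range(8)]
features = space_to_depth(image, block=4)
restored = depth_to_space(features, block=4)

print(len(features), len(features[0]), len(features[0][0]))  # 2 2 48
print(restored == image)  # True
```

Spatial resolution drops by the block size in each dimension while the channel count grows by the square of the block size, so the total number of values is unchanged.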

2. Reversible Multi-Scale Fusion Mechanism

RevSilo achieves reversibility by formulating every fusion step as a sequence of additive coupling operations—each with an exact, closed-form inverse. For $N=4$ scales, the forward and backward passes are defined as:

Forward:

[…]

Backward (Recomputation):

[…]

During training, only the $N$ outputs of each RevSilo and the network parameters are cached. All intermediate activations are re-materialized on demand in the backward pass.
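To make the idea of additive coupling across scales concrete, here is a minimal sketch in plain Python. It is not the RevSilo definition from the paper: the 1-D signals, the `upsample`/`downsample` helpers, the `tanh` stand-in for a learned block, and the exact update order are all assumptions made for illustration. What it does show is the property the section relies on: each step adds a function of streams that are left unchanged in that step, so every step can be undone by subtraction, and a stack of modules can be inverted from its outputs alone.

```python
import math


def upsample(x):
    """Nearest-neighbour 2x upsampling of a 1-D signal."""
    return [v for v in x for _ in range(2)]


def downsample(x):
    """2x average pooling of a 1-D signal."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def transform(x, weight):
    """Stand-in for a learned block; it does not need to be invertible itself."""
    return [math.tanh(weight * v) for v in x]


def silo_forward(streams, weight):
    h = [list(s) for s in streams]  # h[0] is the finest scale
    n = len(h)
    # First half, coarse-to-fine: h[i + 1] is read but not modified in this step.
    for i in range(n - 2, -1, -1):
        update = transform(upsample(h[i + 1]), weight)
        h[i] = [a + b for a, b in zip(h[i], update)]
    # Second half, fine-to-coarse: h[i - 1] is read but not modified in this step.
    for i in range(1, n):
        update = transform(downsample(h[i - 1]), weight)
        h[i] = [a + b for a, b in zip(h[i], update)]
    return h


def silo_inverse(outputs, weight):
    h = [list(s) for s in outputs]
    n = len(h)
    # Undo the second half in reverse order by subtracting the same updates.
    for i in range(n - 1, 0, -1):
        update = transform(downsample(h[i - 1]), weight)
        h[i] = [a - b for a, b in zip(h[i], update)]
    # Undo the first half in reverse order.
    for i in range(0, n - 1):
        update = transform(upsample(h[i + 1]), weight)
        h[i] = [a - b for a, b in zip(h[i], update)]
    return h


# Four scales with lengths 8, 4, 2, 1 and three stacked modules (hypothetical values).
streams = [[0.1 * k for k in range(8)], [1.0, -2.0, 0.5, 3.0], [0.25, -0.75], [2.0]]
weights = [0.3, -0.7, 1.1]

outputs = streams
for w in weights:
    outputs = silo_forward(outputs, w)

# Only `outputs` is kept; the inputs are rebuilt by inverting modules last to first.
recovered = outputs
for w in reversed(weights):
    recovered = silo_inverse(recovered, w)

max_err = max(abs(a - b) for s, r in zip(streams, recovered) for a, b in zip(s, r))
print("max reconstruction error:", max_err)
```

In a training loop the same inversion would run module by module during the backward pass, so the activations of only one module need to be live at a time.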

3. Computational and Memory Complexity

RevBiFPN reduces peak activation memory from $O(d)$ (for $d$ stacked fusion modules) to $O(1)$ with respect to depth. Denoting $d$ as the number of RevSilo modules, $N$ as the number of scales, $C$ as the MACs (multiply-accumulate operations) per module, and $M$ as activation memory per module, the following expressions describe cost and memory:

Non-reversible BiFPN:

$\text{MACs} = O(d \cdot C)$

RevBiFPN:

$\text{MACs} = O(\rho \cdot d \cdot C)$

where $\rho$ is the recomputation factor.

Original activation memory is

$O(d \cdot M)$

while memory under reversibility is

$O(M)$

This enables scaling up network depth and input resolution with negligible impact on memory consumption.
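The scaling behaviour described in this section can be tabulated with a small linear cost model. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's accounting: the function `training_cost`, the per-module figures, and the recomputation factor of 1.25 are all hypothetical values chosen to make the arithmetic easy to follow.

```python
def training_cost(depth, macs_per_module, mem_per_module, reversible, recompute_factor=1.0):
    """Return (total MACs, peak activation memory) under a simple linear cost model."""
    if reversible:
        # Activations are rebuilt module by module, so memory does not grow with depth;
        # the extra MACs pay for the recomputation.
        return recompute_factor * depth * macs_per_module, mem_per_module
    # Every module's activations are stored until the backward pass.
    return depth * macs_per_module, depth * mem_per_module


# Hypothetical per-module cost: 10.0 units of MACs and 0.5 units of activation memory.
for depth in (2, 4, 8):
    baseline = training_cost(depth, 10.0, 0.5, reversible=False)
    rev = training_cost(depth, 10.0, 0.5, reversible=True, recompute_factor=1.25)
    print(f"d={depth}: non-reversible MACs/mem = {baseline}, reversible MACs/mem = {rev}")
```

Under this model the reversible memory column stays flat as depth doubles, while the MAC column grows by the same constant factor at every depth.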

4. Empirical Evaluation and Benchmarks

Extensive experiments were conducted across image classification (ImageNet-1K), detection (MS COCO), and instance segmentation benchmarks.

(a) ImageNet-1K Classification

Model	Params (M)	MACs (B)	Top-1 (%)	Train-mem (GB/sample)
RevBiFPN-S4	48.7	10.6	83.0	0.23
EfficientNet-B5	30.0	9.9	83.6	1.44
RevBiFPN-S6	142.3	38.1	84.2	0.25
EfficientNet-B7	66.0	37.0	84.3	5.05

RevBiFPN-S6 comes within 0.1 points of EfficientNet-B7's Top-1 accuracy (84.2% vs. 84.3%) at comparable computational cost but uses roughly 20× less GPU memory per sample.

(b) MS COCO Object Detection (Faster R-CNN)

Backbone	MACs (B)	Train-Mem (GB)	AP
RevBiFPN-S3	181	1.31	38.7
HRNetV2p-W18	196	3.13	36.2
RevBiFPN-S5	329	2.75	41.3
HRNetV2p-W32	299	4.31	39.6

RevBiFPN-S3 outperforms HRNetV2p-W18 by 2.5 AP using less than half the memory; RevBiFPN-S5 outperforms HRNetV2p-W32 by 1.7 AP using approximately 36% less memory.

(c) MS COCO Instance Segmentation (Mask R-CNN)

Backbone	MACs (B)	Train-Mem (GB)	Mask AP	BBox AP
RevBiFPN-S2	210	1.06	33.7	37.1
HRNetV2p-W18	249	3.33	33.8	37.1

RevBiFPN-S2 nearly matches HRNetV2p-W18 in Mask AP (33.7 vs. 33.8) and equals it in BBox AP (37.1) while using approximately 3× less memory.
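The memory comparisons in this section can be rechecked directly from the table entries. The short script below copies the training-memory values from the three tables and computes the ratio and the percentage saving for each pair; the helper names `memory_ratio` and `percent_saving` are introduced here for illustration only. Because the table values are rounded, the results may differ slightly from figures computed on unrounded measurements.

```python
def memory_ratio(baseline_gb, reversible_gb):
    """How many times more memory the baseline uses."""
    return baseline_gb / reversible_gb


def percent_saving(baseline_gb, reversible_gb):
    """Memory saved by the reversible model, as a percentage of the baseline."""
    return 100.0 * (1.0 - reversible_gb / baseline_gb)


# (baseline memory, RevBiFPN memory) copied from the tables in this section.
pairs = {
    "ImageNet: EfficientNet-B5 vs RevBiFPN-S4": (1.44, 0.23),
    "ImageNet: EfficientNet-B7 vs RevBiFPN-S6": (5.05, 0.25),
    "Detection: HRNetV2p-W18 vs RevBiFPN-S3": (3.13, 1.31),
    "Detection: HRNetV2p-W32 vs RevBiFPN-S5": (4.31, 2.75),
    "Segmentation: HRNetV2p-W18 vs RevBiFPN-S2": (3.33, 1.06),
}

for name, (baseline, rev) in pairs.items():
    ratio = memory_ratio(baseline, rev)
    saving = percent_saving(baseline, rev)
    print(f"{name}: {ratio:.1f}x less memory ({saving:.0f}% saving)")
```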

5. Practical Trade-offs and Limitations

The fully reversible design of RevBiFPN introduces computational trade-offs and practical considerations:

Computation Overhead: Reversible recomputation adds […]–[…] extra operations in practice (theoretically […]), though this overhead decreases at larger model scales (e.g., S6: […] slowdown).
Finite-Precision Drift: Negligible; forward-backward equivalence is maintained to full floating-point accuracy.
Energy Utilization: The increase in FLOPs from recomputation raises energy consumption, but the ability to accommodate larger batch sizes and resolutions may reduce overall training time and improve hardware utilization.
Hardware Constraints: Sufficient on-chip memory is necessary to hold per-scale activations; parameters may be streamed from off-chip as needed.
Architectural Flexibility: The requirement for additive or affine coupling imposes constraints, though these were found to be sufficiently flexible for multi-scale fusion.

6. Significance and Context

RevBiFPN resolves a longstanding bottleneck in high-resolution, multi-scale computer vision backbones by removing depth-dependent memory constraints. By implementing the first invertible multi-scale fusion module, RevBiFPN enables efficient scaling of both resolution and backbone depth on standard hardware while maintaining or improving task accuracy compared to non-reversible variants such as EfficientNet and HRNetV2p. This design admits broader multi-scale architectures while providing practical memory and hardware benefits for large-scale vision tasks (Chiley et al., 2022).

References

Chiley et al. (2022). RevBiFPN: The Fully Reversible Bidirectional Feature Pyramid Network.
