RevBiFPN: Reversible Multi-Scale Vision Backbone
- RevBiFPN is a reversible bidirectional feature pyramid network that fuses multi-scale features and recomputes activations exactly during backpropagation instead of storing them.
- It uses the RevSilo module, built from additive coupling operations, to keep activation memory constant regardless of network depth.
- Empirical evaluations on ImageNet-1K and MS COCO show competitive accuracy with significantly reduced memory usage compared to traditional non-reversible methods.
RevBiFPN is a fully reversible bidirectional feature pyramid network architecture designed to drastically reduce training-time memory for multi-scale vision backbones. It addresses the memory demands of spatially sensitive tasks—where bidirectional multi-scale feature fusion is essential—by enabling lossless backpropagation without storing intermediate activations. Fundamental to RevBiFPN is the RevSilo module, which provides the first reversible mechanism for multi-scale feature fusion and allows memory usage to remain constant with network depth. Empirical evaluation demonstrates that RevBiFPN delivers competitive or superior accuracy compared to state-of-the-art baselines, at a fraction of their activation memory costs (Chiley et al., 2022).
1. Architectural Foundations
At the core of RevBiFPN lies the RevSilo, a fully reversible bidirectional multi-scale fusion module. A RevBiFPN backbone is constructed by serially stacking RevSilos, interleaved with reversible residual blocks at each resolution scale. The canonical dataflow is:
- SpaceToDepth Stem: An invertible downsampling that reduces the input's spatial resolution by a fixed block factor and increases its channel width;
- Initial Feature Maps: The stem produces feature maps at $N$ different resolutions;
- Stacked RevSilos: Each RevSilo fuses information across all scales in a reversible manner, producing new $N$-scale outputs;
- Task-Specific Head: The resulting four-scale feature pyramid is consumed by the downstream prediction head.
The forward data flow can be illustrated as:
Input → SpaceToDepth stem → RevSilo → reversible residual blocks → RevSilo → ⋯ → task-specific head
Within each RevSilo, information is fused downwards (coarse-to-fine) in the first half and upwards (fine-to-coarse) in the second, using reversible residual coupling.
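The invertibility of the SpaceToDepth stem can be illustrated with a minimal pure-Python sketch. It assumes a single-channel image stored as nested lists and a hypothetical block size of 2; the function names `space_to_depth` and `depth_to_space` are illustrative and are not taken from the RevBiFPN code base.

```python
def space_to_depth(image, block):
    """Rearrange an H x W image into block*block channels of size (H/block) x (W/block)."""
    h, w = len(image), len(image[0])
    assert h % block == 0 and w % block == 0, "spatial dims must be divisible by block"
    channels = []
    for dy in range(block):
        for dx in range(block):
            # Channel (dy, dx) collects one pixel from every block x block patch.
            channels.append([[image[y * block + dy][x * block + dx]
                              for x in range(w // block)]
                             for y in range(h // block)])
    return channels


def depth_to_space(channels, block):
    """Exact inverse of space_to_depth: scatter each channel back to its pixel offsets."""
    hb, wb = len(channels[0]), len(channels[0][0])
    image = [[None] * (wb * block) for _ in range(hb * block)]
    for c, chan in enumerate(channels):
        dy, dx = divmod(c, block)
        for y in range(hb):
            for x in range(wb):
                image[y * block + dy][x * block + dx] = chan[y][x]
    return image


# Hypothetical 4 x 4 input whose pixel value encodes its position.
image = [[4 * y + x for x in range(4)] for y in range(4)]
channels = space_to_depth(image, block=2)
restored = depth_to_space(channels, block=2)
print(len(channels), restored == image)
```

No pixel is discarded: resolution drops by the block factor in each dimension while the channel count grows by its square, which is why the operation can be undone exactly.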
2. Reversible Multi-Scale Fusion Mechanism
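RevSilo achieves reversibility by formulating every fusion step as a sequence of additive coupling operations, each with an exact, closed-form inverse. For $N$ scales, let $z_1, \dots, z_N$ denote the current per-scale features. A coupling step updates one scale $i$ from a set $\mathcal{S}_i$ of other scales ($i \notin \mathcal{S}_i$) through resample-and-transform functions $F_{j \to i}$:
Forward:
$$z_i \leftarrow z_i + \sum_{j \in \mathcal{S}_i} F_{j \to i}(z_j)$$
Backward (Recomputation):
$$z_i \leftarrow z_i - \sum_{j \in \mathcal{S}_i} F_{j \to i}(z_j)$$
Because a step leaves its source features $z_j$ unchanged, subtracting the same terms recovers the step's input, and a whole RevSilo is inverted by undoing its steps in reverse order. During training, only the final outputs of the reversible stack and the network parameters are cached. All intermediate activations are rematerialized on demand in the backward pass.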
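The additive-coupling idea can be sketched in a few lines of Python. This is a toy illustration, not the RevSilo implementation: integers stand in for per-scale feature maps, `toy_transform` is a hypothetical stand-in for a learned resample-and-convolve function, and the ordering of the two sweeps in `steps` is chosen only for illustration.

```python
def coupling_forward(features, steps):
    """Apply additive coupling steps in order and return the new per-scale features."""
    z = list(features)
    for target, sources, transform in steps:
        # The updated scale must not feed its own update, or the step is not invertible.
        assert target not in sources
        z[target] = z[target] + sum(transform(z[s]) for s in sources)
    return z


def coupling_inverse(outputs, steps):
    """Undo the steps in reverse order by subtracting the same terms."""
    z = list(outputs)
    for target, sources, transform in reversed(steps):
        z[target] = z[target] - sum(transform(z[s]) for s in sources)
    return z


def toy_transform(v):
    # Hypothetical transform; note that it does not need to be invertible itself.
    return v * v + 1


# Hypothetical features at three scales.
x = [5, -2, 7]

steps = [
    (1, [0], toy_transform),     # first sweep: scale 0 feeds scale 1
    (2, [0, 1], toy_transform),  # first sweep: scales 0 and 1 feed scale 2
    (1, [2], toy_transform),     # second sweep: scale 2 feeds scale 1
    (0, [1, 2], toy_transform),  # second sweep: scales 1 and 2 feed scale 0
]

y = coupling_forward(x, steps)
x_recovered = coupling_inverse(y, steps)
print(x_recovered == x)
```

The inputs are recovered from the outputs alone, so nothing computed between `x` and `y` has to be kept in memory for the backward pass.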
3. Computational and Memory Complexity
RevBiFPN reduces peak activation memory from $O(D)$ (for $D$ stacked fusion modules) to $O(1)$ with respect to depth. Denoting $D$ as the number of RevSilo modules, $N$ as the number of scales, $C$ as the MACs (multiply-accumulate operations) per module, and $M$ as activation memory per module, the following expressions describe cost and memory:
- Non-reversible BiFPN:
$$\text{MACs}_{\text{total}} = D \cdot C$$
- RevBiFPN:
$$\text{MACs}_{\text{total}} = \rho \cdot D \cdot C$$
where $\rho \geq 1$ is the recomputation factor.
Original activation memory is
$$\text{Mem}_{\text{act}} = D \cdot M = O(D)$$
while memory under reversibility is
$$\text{Mem}_{\text{act}} = M = O(1) \text{ in } D$$
This enables scaling up network depth and input resolution with negligible impact on memory consumption.
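The depth scaling described in this section can be made concrete with a deliberately simplified cost model. It assumes that a non-reversible stack stores one module's worth of activations per module, while a reversible stack stores a single module's worth; the function names, the per-module figures, and the recomputation factor used in the example are all hypothetical.

```python
def activation_memory(num_modules, mem_per_module, reversible):
    """Peak activation memory under the simplified model described above."""
    if reversible:
        # Activations are recomputed in the backward pass, so depth does not matter.
        return mem_per_module
    return num_modules * mem_per_module


def training_compute(num_modules, macs_per_module, recompute_factor=1.0):
    """Total MACs; recompute_factor >= 1 models the extra work of recomputation."""
    return recompute_factor * num_modules * macs_per_module


# Hypothetical numbers: 0.5 GB of activations per module.
for depth in (2, 4, 8):
    stored = activation_memory(depth, 0.5, reversible=False)
    recomputed = activation_memory(depth, 0.5, reversible=True)
    print(depth, stored, recomputed)
```

In this model the non-reversible column grows linearly with depth while the reversible column stays flat, at the price of a compute multiplier greater than one.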
4. Empirical Evaluation and Benchmarks
Extensive experiments were conducted across image classification (ImageNet-1K), detection (MS COCO), and instance segmentation benchmarks.
(a) ImageNet-1K Classification
| Model | Params (M) | MACs (B) | Top-1 (%) | Train-Mem (GB/sample) |
|---|---|---|---|---|
| RevBiFPN-S4 | 48.7 | 10.6 | 83.0 | 0.23 |
| EfficientNet-B5 | 30.0 | 9.9 | 83.6 | 1.44 |
| RevBiFPN-S6 | 142.3 | 38.1 | 84.2 | 0.25 |
| EfficientNet-B7 | 66.0 | 37.0 | 84.3 | 5.05 |
RevBiFPN-S6 comes within 0.1 points of EfficientNet-B7's Top-1 accuracy (84.2% vs. 84.3%) at comparable computational cost (38.1B vs. 37.0B MACs) while using approximately 19.8× less training memory per sample.
(b) MS COCO Object Detection (Faster R-CNN)
| Backbone | MACs (B) | Train-Mem (GB) | AP |
|---|---|---|---|
| RevBiFPN-S3 | 181 | 1.31 | 38.7 |
| HRNetV2p-W18 | 196 | 3.13 | 36.2 |
| RevBiFPN-S5 | 329 | 2.75 | 41.3 |
| HRNetV2p-W32 | 299 | 4.31 | 39.6 |
RevBiFPN-S3 outperforms HRNetV2p-W18 by 2.5 AP using less than half the memory; RevBiFPN-S5 outperforms HRNetV2p-W32 by 1.7 AP using approximately 36% less memory.
(c) MS COCO Instance Segmentation (Mask R-CNN)
| Backbone | MACs (B) | Train-Mem (GB) | Mask AP | BBox AP |
|---|---|---|---|---|
| RevBiFPN-S2 | 210 | 1.06 | 33.7 | 37.1 |
| HRNetV2p-W18 | 249 | 3.33 | 33.8 | 37.1 |
RevBiFPN-S2 matches HRNetV2p-W18 in BBox AP (37.1) and comes within 0.1 Mask AP (33.7 vs. 33.8) while using approximately 3× less memory.
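The ratios quoted for the three benchmarks can be rechecked from the table entries with a short Python script. All inputs are copied from the tables above. Because those entries are rounded, the ImageNet ratio comes out slightly above the 19.8× quoted in the text, presumably (an assumption) because the quoted figure was computed from unrounded measurements.

```python
def memory_ratio(baseline_gb, revbifpn_gb):
    """How many times more training memory the baseline uses."""
    return baseline_gb / revbifpn_gb


def memory_saving(baseline_gb, revbifpn_gb):
    """Fraction of the baseline's training memory that is saved."""
    return 1.0 - revbifpn_gb / baseline_gb


# ImageNet-1K: EfficientNet-B7 vs. RevBiFPN-S6 (GB per sample).
imagenet_ratio = round(memory_ratio(5.05, 0.25), 1)

# MS COCO detection: HRNetV2p-W18 vs. RevBiFPN-S3, HRNetV2p-W32 vs. RevBiFPN-S5.
det_ratio_small = round(memory_ratio(3.13, 1.31), 2)
det_saving_large = round(memory_saving(4.31, 2.75), 3)
ap_gain_small = round(38.7 - 36.2, 1)
ap_gain_large = round(41.3 - 39.6, 1)

# MS COCO instance segmentation: HRNetV2p-W18 vs. RevBiFPN-S2.
seg_ratio = round(memory_ratio(3.33, 1.06), 2)

print(imagenet_ratio, det_ratio_small, det_saving_large,
      ap_gain_small, ap_gain_large, seg_ratio)
```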
5. Practical Trade-offs and Limitations
The fully reversible design of RevBiFPN introduces computational trade-offs and practical considerations:
- Computation Overhead: Reversible recomputation adds extra operations, because intermediate activations are recomputed during the backward pass; this overhead decreases at larger model scales (e.g., S6).
- Finite-Precision Drift: Negligible; forward-backward equivalence is maintained up to floating-point rounding error.
- Energy Utilization: The increase in FLOPs due to recomputation raises energy consumption, but the ability to accommodate larger batch sizes and resolutions may reduce overall training time and improve hardware utilization.
- Hardware Constraints: Sufficient on-chip memory is necessary to hold per-scale activations; parameters may be streamed from off-chip as needed.
- Architectural Flexibility: The requirement for additive or affine coupling imposes constraints, though these were found to be sufficiently flexible for multi-scale fusion.
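The finite-precision item in the list above can be illustrated with a small Python experiment on a single additive coupling. The helper names `couple` and `uncouple` and the input values are hypothetical; the point is only that the inverse is exact in integer arithmetic, while in floating point the recovered value can differ from the original in its last bits.

```python
def couple(x, f):
    """Forward additive coupling: add the transform output f to x."""
    return x + f


def uncouple(y, f):
    """Inverse additive coupling: subtract the same transform output."""
    return y - f


# Floating point: the round trip is accurate only up to rounding error.
x, f = 0.1, 0.2
recovered = uncouple(couple(x, f), f)
drift = abs(recovered - x)
print(drift)

# Integers: the round trip is exact.
print(uncouple(couple(3, 7), 7) == 3)
```

The drift per coupling is on the order of machine epsilon, which is consistent with describing it as negligible for training.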
6. Significance and Context
RevBiFPN resolves a longstanding bottleneck in high-resolution, multi-scale computer vision backbones by removing depth-dependent memory constraints. By implementing the first invertible multi-scale fusion module, RevBiFPN enables efficient scaling of both resolution and backbone depth on standard hardware while maintaining or improving task accuracy compared to non-reversible variants such as EfficientNet and HRNetV2p. This design admits broader multi-scale architectures while providing practical memory and hardware benefits for large-scale vision tasks (Chiley et al., 2022).