BiFormer: Vision Transformer with BRA

Updated 9 March 2026

BiFormer is a vision transformer architecture that innovates with bi-level routing attention to dynamically prune irrelevant regions and focus computation.
It achieves state-of-the-art performance across classification, detection, segmentation, and medical imaging by balancing global context with computational efficiency.
Its two-stage routing mechanism combines coarse region-level selection with fine token-to-token attention enhanced by local depth-wise convolution.

BiFormer is a vision transformer architecture designed around bi-level routing attention (BRA), which introduces dynamic, content-aware, query-adaptive sparse attention for visual tasks. The BiFormer framework, which originated in "BiFormer: Vision Transformer with Bi-Level Routing Attention" (Zhu et al., 2023), is engineered to deliver high long-range modeling capacity at tractable computational cost, achieving state-of-the-art results across classification, detection, segmentation, and medical imaging. Its defining principle is a two-stage routing mechanism that prunes irrelevant keys at a coarse spatial level and applies dense attention only to the most useful subregions, resulting in a flexible, efficient allocation of computation.

1. Bi-Level Routing Attention: Core Mechanism and Formulation

Bi-Level Routing Attention (BRA) is a dynamic sparse attention mechanism that addresses the prohibitive $\mathcal O(N^2)$ cost and memory of dense visual self-attention. The BRA process proceeds as follows:

Feature Partitioning: Input features $X \in \mathbb{R}^{H \times W \times C}$ are split into $S \times S$ non-overlapping square regions ("patches"), each containing $M = HW/S^2$ tokens. This yields $P = S^2$ distinct regions.
Linear Projection and Per-Token Embedding: Compute tokenwise $Q = XW^Q$ , $K = XW^K$ , $V = XW^V$ .
Region-Level Aggregation: Each region $i$ is summarized by mean-pooling the token-level embeddings, forming region queries $Q^r_i$ and keys $K^r_i$.
Region-to-Region Routing: Construct an affinity matrix $A^r = Q^r (K^r)^\top \in \mathbb{R}^{P \times P}$. For each region, select the top-$k$ most relevant regions according to $A^r$.
Gather Sparse Token Sets: For each region $i$, gather key/value tokens from its top-$k$ routed regions, forming $K^g_i, V^g_i \in \mathbb{R}^{kM \times C}$.
Token-to-Token Attention: For each query token within region $i$, compute softmax attention over the gathered keys. The normalized attention weights are used to aggregate the values:

$O_i = \mathrm{softmax}\left( Q_i (K^g_i)^\top / \sqrt{C} \right) V^g_i + \mathrm{LCE}(V_i)$

where $Q_i$ and $V_i$ are the queries and values of the tokens in region $i$, and $\mathrm{LCE}(\cdot)$ denotes a local depth-wise convolution to maintain local inductive bias.

The final output is reconstructed by unpatchifying and concatenating the per-region outputs.
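The steps above can be sketched in plain Python. This is a minimal, unoptimized illustration and not the authors' implementation: it uses nested lists instead of tensors, assumes that $H$ and $W$ are divisible by $S$, scales the logits by $\sqrt{C}$ as in standard attention, and omits the depth-wise convolution term. All function names (`bi_level_routing_attention`, `region_index`, and the small helpers) are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    """Element-wise mean of a list of vectors (region-level pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def region_index(row, col, H, W, S):
    """Region id of pixel (row, col) on an H x W grid cut into S x S regions.
    Assumes H and W are divisible by S."""
    return (row // (H // S)) * S + (col // (W // S))

def bi_level_routing_attention(Q, K, V, H, W, S, topk):
    """Q, K, V: lists of N = H*W token vectors in row-major order.
    Returns (outputs, routes); the local depth-wise conv term is omitted."""
    C = len(Q[0])
    P = S * S
    # 1. Feature partitioning: token indices grouped by region.
    members = [[] for _ in range(P)]
    for t in range(H * W):
        members[region_index(t // W, t % W, H, W, S)].append(t)
    # 2. Region-level aggregation by mean pooling.
    Qr = [mean_vec([Q[t] for t in members[i]]) for i in range(P)]
    Kr = [mean_vec([K[t] for t in members[i]]) for i in range(P)]
    out = [None] * (H * W)
    routes = []
    for i in range(P):
        # 3. Region-to-region routing: one affinity row, then top-k indices.
        affinity = [dot(Qr[i], Kr[j]) for j in range(P)]
        routed = sorted(range(P), key=lambda j: affinity[j], reverse=True)[:topk]
        routes.append(routed)
        # 4. Gather key/value tokens from the routed regions.
        gathered = [t for j in routed for t in members[j]]
        # 5. Token-to-token attention over the gathered tokens only.
        for t in members[i]:
            weights = softmax([dot(Q[t], K[g]) / math.sqrt(C) for g in gathered])
            out[t] = [sum(w * V[g][c] for w, g in zip(weights, gathered))
                      for c in range(len(V[0]))]
    return out, routes
```

Setting `topk` equal to the number of regions makes every query attend to all tokens, so the sketch then reduces to ordinary dense attention; smaller values of `topk` shrink the gathered set to `topk * M` tokens per query.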

This machinery creates a content-adaptive sparsity pattern on the fly for each input image, so that each query attends only to regions deemed semantically related (Zhu et al., 2023, Yang et al., 2023, Zhou et al., 2024, Hou et al., 25 Feb 2025).

2. Architectural Instantiations and Integrations

The BiFormer block, embedding BRA, serves as a drop-in replacement for traditional full self-attention modules. BRA is typically paired with:

Depth-wise convolution at the block head for positional encoding.
2-layer MLP with expansion ratio (commonly 3 or 4) for tokenwise feedforward computation.
Residual connections and normalization as in standard transformer encoders.
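The way these components fit together can be sketched as a composition of residual branches. The sketch below is schematic and makes assumptions that are not stated above: it uses a pre-norm ordering (normalization before attention and before the MLP), and it takes the depth-wise convolution, attention, and MLP as caller-supplied functions rather than implementing them. The names `biformer_block`, `add`, and `layer_norm` are hypothetical.

```python
import math

def add(a, b):
    """Element-wise sum of two token sequences (lists of vectors)."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]

def layer_norm(tokens, eps=1e-5):
    """Per-token normalization to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for t in tokens:
        mu = sum(t) / len(t)
        var = sum((v - mu) ** 2 for v in t) / len(t)
        out.append([(v - mu) / math.sqrt(var + eps) for v in t])
    return out

def biformer_block(x, pos_conv, norm1, attention, norm2, mlp):
    """One block as three residual branches (pre-norm ordering is an assumption).
    pos_conv, attention and mlp are callables mapping tokens to tokens."""
    x = add(x, pos_conv(x))          # depth-wise conv at the block head (positional encoding)
    x = add(x, attention(norm1(x)))  # bi-level routing attention branch
    x = add(x, mlp(norm2(x)))        # 2-layer MLP branch
    return x
```

Passing the attention as a function keeps the block agnostic to how the routing is implemented, which mirrors the drop-in-replacement role described above.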

BiFormer is employed in various backbone designs:

Four-stage hierarchical transformers with patch embedding, BRA-based transformer blocks, and patch-merging operations to successively reduce spatial resolution and increase feature dimensionality (Zhu et al., 2023, Hou et al., 25 Feb 2025, Zhou et al., 2024).
Hybrid CNN-transformer encoders for U-Net architectures or detector backbones (Cai et al., 2024, Zhou et al., 2024, Yang et al., 2023).
Aggregation within multi-branch networks (e.g., Efficient Aggregation Network, EAGNet) in YOLOv7x, replacing central convolutional branches with BiFormer-based pipelines (Hou et al., 25 Feb 2025).
Specialized integration for task-specific frameworks, such as frame interpolation (Park et al., 2023) and anatomical landmark detection (Zhou et al., 2024).
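For the four-stage hierarchical design, the feature-map shape at each stage follows from the patch embedding and patch-merging steps. The helper below is a small sketch under assumed defaults that are common for hierarchical vision transformers but are not stated above: a stride-4 patch embedding, 2x spatial downsampling with channel doubling at each merge, and 64 base channels. The function name `stage_shapes` and its parameters are hypothetical.

```python
def stage_shapes(height, width, base_channels=64, num_stages=4, stem_stride=4):
    """Return (H, W, C) of the feature map entering each stage.
    Assumes a stride-`stem_stride` patch embedding, then 2x patch merging
    with channel doubling between stages (illustrative defaults)."""
    shapes = []
    h, w, c = height // stem_stride, width // stem_stride, base_channels
    for _ in range(num_stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2  # patch merging between stages
    return shapes

shapes = stage_shapes(224, 224)
```

Under these assumptions a 224 x 224 input gives 56 x 56, 28 x 28, 14 x 14 and 7 x 7 feature maps, so the region grid used by the routing has to be chosen to divide each stage's resolution.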

3. Computational Complexity and Efficiency

BRA reduces dense attention's complexity from $\mathcal O(N^2)$ to approximately $\mathcal O(P^2 + kN^2/P)$, where $N = HW$ is the number of tokens, $P$ is the number of spatial regions and $k$ is the number of routed regions per query (channel-dimension factors omitted):

Region-level adjacency: $\mathcal O(P^2)$.
Top-k selection: $\mathcal O(P^2 \log k)$.
Sparse fine attention: $\mathcal O(N \cdot kM) = \mathcal O(kN^2/P)$.
Local enhancement: $\mathcal O(N)$ (via DWConv).
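A back-of-the-envelope count makes the saving concrete. The sketch below counts multiply-accumulate operations for the attention products only, using $N = HW$ tokens, $P = S^2$ regions, $M = N/P$ tokens per region and $k$ routed regions. It ignores the linear projections, the top-k selection and the depth-wise convolution, so the numbers are rough estimates rather than measured costs. The function names and the example configuration are hypothetical.

```python
def dense_attention_macs(N, C):
    """Multiply-accumulates for Q K^T plus the attention-weighted sum of V."""
    return 2 * N * N * C

def bra_macs(H, W, C, S, k):
    """Approximate multiply-accumulates for BRA (projections, top-k, DWConv ignored).
    Assumes H*W is divisible by S*S."""
    N = H * W          # number of tokens
    P = S * S          # number of regions
    M = N // P         # tokens per region
    routing = P * P * C                  # region-level affinity matrix
    sparse_attention = 2 * N * (k * M) * C  # each query sees k*M gathered tokens
    return {"routing": routing,
            "sparse_attention": sparse_attention,
            "total": routing + sparse_attention}

# Illustrative configuration: 56 x 56 feature map, 64 channels, 7 x 7 regions, k = 4.
dense = dense_attention_macs(56 * 56, 64)
sparse = bra_macs(56, 56, 64, S=7, k=4)
ratio = dense / sparse["total"]
```

For this configuration the routing term is negligible and the token-level attention is cheaper than dense attention by a factor of $P/k = 49/4$, so the total comes out roughly 12 times smaller.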

The cost reduction relative to dense attention is significant in practice. In YOLOv7, adding a BRA block lowers the frame rate only marginally (e.g., from 140 fps to about 120 fps on a 2080Ti), while in transformers the GPU implementation remains efficient because the reduced-size attention is still computed with dense matrix multiplications (Zhu et al., 2023, Yang et al., 2023, Hou et al., 25 Feb 2025, Zhou et al., 2024).

4. Empirical Performance and Ablation Analyses

BiFormer and BRA-augmented architectures have established best-in-class results across diverse tasks:

Image Classification (ImageNet-1K): BiFormer-S matches or exceeds Swin-T/B (e.g., 83.8% top-1 for BiFormer-S at 4.5G FLOPs).
Object Detection (COCO): BiFormer-S achieves 45.9 mAP (RetinaNet 1x), outperforming DAT-T and Swin-T (Zhu et al., 2023).
Semantic Segmentation (ADE20K): BiFormer-S and BiFormer-B attain 49.8 and 51.0 mIoU on UperNet, surpassing static sparse baselines (Zhu et al., 2023).
Student Behavior Detection: YOLOv7-BRA yields mAP@0.5 of 87.1% (+2.2 points over best YOLOv7 w/o BRA), particularly improving localization under occlusion (Yang et al., 2023).
Defect Detection: On power equipment, adding BiFormer increases mAP@0.5 from 89.2% to 92.4%; with further improvements, achieves 93.5% and 97.1% precision (Hou et al., 25 Feb 2025).
Medical Imaging: In anatomical landmark detection, BRA improves mean radial error (MRE) and success detection rates (SDR) against both transformer and CNN baselines (Zhou et al., 2024). In segmentation (e.g., on FH-PS-AoP, HC18), BRAU-Net achieves higher Dice and lower HD95/ASD than prior transformer and convolutional methods (Cai et al., 2024).

Ablation studies demonstrate that increasing $k$ (the number of routed regions in top-k routing) beyond moderate values yields diminishing or negative returns, which suggests that the sparsity itself acts as an implicit regularizer. The local context term (DWConv) is crucial for preserving fine details (Zhu et al., 2023, Zhou et al., 2024).

5. Application Domains and Adaptations

BiFormer and BRA modules are now prevalent in a variety of visual domains:

General Vision Transformers: Classification, detection, and segmentation on natural images (Zhu et al., 2023).
Object Detection Pipelines: Enhanced YOLOv7/v7x backbones via integration post-feature extractor, or via multi-branch sparse attention aggregation (Yang et al., 2023, Hou et al., 25 Feb 2025).
Medical Image Analysis: U-Net variants (e.g., BRAU-Net for segmentation (Cai et al., 2024); HYATT-Net for landmark detection (Zhou et al., 2024)) replace convolutions with stacks of BiFormer blocks, leveraging multi-scale, query-adaptive context for precise mask delineation or point localization.
Video Frame Interpolation: Bilateral variants of BiFormer support hierarchical motion estimation, combining global coarse transformer-based flow with fine blockwise refinement (Park et al., 2023).

Empirical results consistently highlight improved suppression of background, enhancement of true-object features, and robustness in cluttered or occluded settings across detection and segmentation tasks.

6. Comparative Analysis and Significance

BRA distinguishes itself from static sparse patterns (e.g., windowed, axial, or dilated attention) by generating content- and query-aware sparsity. Unlike approaches with fixed local structure (e.g., Swin, Twins), each query region in BiFormer dynamically selects attended regions globally per batch/image, enabling capture of semantically related but spatially distant features (Zhu et al., 2023, Zhou et al., 2024). Visualization of attention/Grad-CAM maps confirms that BRA focuses on task-relevant structures while ignoring distractors.
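The difference between a static pattern and content-based routing can be shown on a toy region grid. In the sketch below, a fixed 3 x 3 window stands in for a static local pattern, and a top-k selection over inner products of region descriptors stands in for routing. Both functions and the toy descriptors are hypothetical illustrations, not code from any of the cited works.

```python
def window_neighbors(i, S):
    """Static pattern: the 3x3 neighbourhood of region i on an S x S grid.
    The result depends only on position, never on image content."""
    r, c = divmod(i, S)
    return sorted(rr * S + cc
                  for rr in range(max(0, r - 1), min(S, r + 2))
                  for cc in range(max(0, c - 1), min(S, c + 2)))

def routed_neighbors(i, region_q, region_k, topk):
    """Dynamic pattern: the topk regions with the highest affinity to region i."""
    scores = [sum(a * b for a, b in zip(region_q[i], kj)) for kj in region_k]
    best = sorted(range(len(region_k)), key=lambda j: scores[j], reverse=True)[:topk]
    return sorted(best)

# Toy 3 x 3 grid: opposite corners (regions 0 and 8) share a descriptor.
S = 3
feats = [[0.0, 1.0] for _ in range(S * S)]
feats[0] = [1.0, 0.0]
feats[8] = [1.0, 0.0]

static_set = window_neighbors(0, S)
routed_set = routed_neighbors(0, feats, feats, topk=2)
```

Here the static window around region 0 contains only its immediate neighbours, whereas the routed set pairs region 0 with the distant but similar region 8, which is the behaviour the comparison above attributes to content-aware sparsity.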

In controlled ablation against SimAM, SE, CBAM, and static windowed or cross-shaped patterns, BRA demonstrates superior accuracy, robustness, and resource efficiency in object-centric and medical tasks (Hou et al., 25 Feb 2025, Zhou et al., 2024, Cai et al., 2024). The learnable, adaptive routing is a key mechanism for this improvement.

7. Variants, Extensions, and Future Directions

BiFormer blocks have been adapted for multiple domains:

Multilevel Hierarchies: Use in four-stage pyramids, with patch-merging for multi-scale feature aggregation.
Hybrid Architectures: Plugged into CNN-transformer hybrids (e.g., HYATT-Net, BRAU-Net) or aggregation cells within YOLOv7x (Zhou et al., 2024, Cai et al., 2024, Hou et al., 25 Feb 2025).
Task-Specific Branches: Tweaked for bilateral learning (motion), medical image analysis, or as part of fusion strategies in multi-modal settings (Park et al., 2023, Yang et al., 2023).
Complementary Modules: For segmentation/medical applications, paired with multipath dilated convolution and inverted bottleneck patch expanding modules to improve upsampling and feature fusion (Cai et al., 2024, Zhou et al., 2024).

The empirical evidence suggests that the bi-level routing attention paradigm is now a leading method for balancing global context and computational tractability, with broad impact across computer vision. Ongoing and future work explores broader domain adaptation, scaling, and deeper integration with specialized neural architectures.