
BiFormer: Vision Transformer with BRA

Updated 9 March 2026
  • BiFormer is a vision transformer architecture that innovates with bi-level routing attention to dynamically prune irrelevant regions and focus computation.
  • It achieves state-of-the-art performance across classification, detection, segmentation, and medical imaging by balancing global context with computational efficiency.
  • Its two-stage routing mechanism combines coarse region-level selection with fine token-to-token attention enhanced by local depth-wise convolution.

BiFormer is a vision transformer architecture designed around bi-level routing attention (BRA), which introduces dynamic, content-aware, query-adaptive sparse attention for visual tasks. The BiFormer framework, which originated in "BiFormer: Vision Transformer with Bi-Level Routing Attention" (Zhu et al., 2023), is engineered to deliver high long-range modeling capacity at tractable computational cost, achieving state-of-the-art results across classification, detection, segmentation, and medical imaging. Its defining principle is a two-stage routing mechanism that prunes irrelevant keys at a coarse spatial level and applies dense attention only to the most useful subregions, resulting in a flexible, efficient allocation of computation.

1. Bi-Level Routing Attention: Core Mechanism and Formulation

Bi-Level Routing Attention (BRA) is a dynamic sparse attention mechanism that addresses the prohibitive $\mathcal{O}(N^2)$ compute and memory cost of dense visual self-attention. The BRA process proceeds as follows:

  1. Feature Partitioning: Input features $X \in \mathbb{R}^{H \times W \times C}$ are split into $S \times S$ non-overlapping square regions ("patches"), each containing $M = HW/S^2$ tokens. This yields $P = S^2$ distinct regions.
  2. Linear Projection and Per-Token Embedding: Compute tokenwise $Q = XW^Q$, $K = XW^K$, $V = XW^V$.
  3. Region-Level Aggregation: Each region $i$ is summarized by mean-pooling the token-level embeddings, forming region queries $Q^r_i$ and keys $K^r_j$.
  4. Region-to-Region Routing: Construct an affinity matrix $A^r = Q^r (K^r)^\top \in \mathbb{R}^{P \times P}$. For each region, select the top-$k$ most relevant regions according to $A^r_{i,:}$.
  5. Gather Sparse Token Sets: For each region $i$, gather key/value tokens from its top-$k$ routed regions, forming $\widetilde K_i, \widetilde V_i$.
  6. Token-to-Token Attention: For each query token within region $i$, compute softmax attention over the gathered keys. The normalized attention weights are used to aggregate the values:

$$O_i = \mathrm{softmax}\left(Q_i \widetilde K_i^\top / \sqrt{d}\right) \widetilde V_i + \mathrm{DWConv}(V_i)$$

where $\mathrm{DWConv}$ denotes a local depth-wise convolution to maintain local inductive bias.

The final output is reconstructed by unpatchifying and concatenating the per-region outputs.
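
The six steps above can be sketched in NumPy as follows. This is a minimal single-head illustration, not the authors' implementation: the function names (`bra`, `to_regions`, `from_regions`), the random projection weights, and the tensor layout are assumptions made for the example, and the local DWConv term is left out for brevity, so the function returns only the routed softmax-attention part of the output.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def to_regions(x, S):
    """(H, W, C) -> (P, M, C) with P = S*S regions of M = H*W / S**2 tokens."""
    H, W, C = x.shape
    h, w = H // S, W // S
    x = x.reshape(S, h, S, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(S * S, h * w, C)

def from_regions(x, S, H, W):
    """Inverse of to_regions: (P, M, C) -> (H, W, C)."""
    h, w = H // S, W // S
    C = x.shape[-1]
    x = x.reshape(S, S, h, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def bra(x, Wq, Wk, Wv, S, k):
    """Hypothetical single-head bi-level routing attention (DWConv term omitted)."""
    H, W, _ = x.shape
    d = Wq.shape[1]
    xr = to_regions(x, S)                         # step 1: (P, M, C)
    P, M, _ = xr.shape
    Q, K, V = xr @ Wq, xr @ Wk, xr @ Wv           # step 2: (P, M, d) each
    Qr, Kr = Q.mean(axis=1), K.mean(axis=1)       # step 3: region means, (P, d)
    A = Qr @ Kr.T                                 # step 4: region affinity, (P, P)
    idx = np.argsort(-A, axis=1)[:, :k]           # top-k routed regions, (P, k)
    Kg = K[idx].reshape(P, k * M, d)              # step 5: gathered keys
    Vg = V[idx].reshape(P, k * M, d)              #         gathered values
    attn = softmax(Q @ Kg.transpose(0, 2, 1) / np.sqrt(d))  # step 6: (P, M, k*M)
    out = attn @ Vg                               # (P, M, d)
    return from_regions(out, S, H, W), idx

# Usage with assumed toy sizes: an 8x8 feature map, 4x4 regions, 2 routed regions.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
Wq, Wk, Wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out, routed = bra(x, Wq, Wk, Wv, S=4, k=2)
print(out.shape, routed.shape)  # (8, 8, 16) (16, 2)
```

A useful sanity check on such a sketch is that setting `k` equal to the number of regions makes every query see all tokens, so the result should coincide with ordinary dense attention.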

This machinery dynamically creates a content-adaptive sparsity pattern at inference, ensuring that each query only attends to regions deemed semantically related on a per-image, per-batch basis (Zhu et al., 2023, Yang et al., 2023, Zhou et al., 2024, Hou et al., 25 Feb 2025).

2. Architectural Instantiations and Integrations

The BiFormer block, embedding BRA, serves as a drop-in replacement for traditional full self-attention modules. BRA is typically paired with:

  • Depth-wise convolution at the block head for positional encoding.
  • A 2-layer MLP with an expansion ratio (commonly $\times 3$ or $\times 4$) for tokenwise feedforward computation.
  • Residual connections and normalization as in standard transformer encoders.
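
The way these components compose can be sketched as below. This is an illustrative NumPy layout under stated assumptions rather than the reference implementation: the pre-normalization ordering, the 3x3 kernel size, the GELU activation, the parameter-free layer norm, and all function names are assumptions, and the attention module is passed in as a callable so that any BRA implementation can fill the slot.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Channel-wise normalization without learned scale/shift (simplification)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dwconv3x3(x, kernel):
    """Depth-wise 3x3 convolution with zero padding.

    x: (H, W, C); kernel: (3, 3, C), one 3x3 filter per channel.
    """
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] * kernel[i, j]
    return out

def gelu(x):
    """Tanh approximation of GELU (assumed activation)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, W2):
    """2-layer tokenwise MLP; W1 expands C -> e*C, W2 projects back to C."""
    return gelu(x @ W1) @ W2

def biformer_block(x, pos_kernel, attention, W1, W2):
    """Hypothetical block: DWConv head, attention slot, MLP, each with a residual."""
    x = x + dwconv3x3(x, pos_kernel)        # depth-wise conv as positional encoding
    x = x + attention(layer_norm(x))        # BRA would go here
    x = x + mlp(layer_norm(x), W1, W2)      # feedforward with expansion ratio e
    return x

# Usage with assumed sizes: C = 8 channels, expansion ratio 3, identity attention.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
kernel = rng.standard_normal((3, 3, 8)) * 0.1
W1 = rng.standard_normal((8, 24)) * 0.1
W2 = rng.standard_normal((24, 8)) * 0.1
y = biformer_block(x, kernel, lambda t: t, W1, W2)
print(y.shape)  # (4, 4, 8)
```

Passing the attention as a callable keeps the sketch independent of any particular BRA implementation; the block itself only fixes where the residual additions and normalizations sit.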

BiFormer is employed in various backbone designs.

3. Computational Complexity and Efficiency

BRA reduces dense attention's complexity from $\mathcal{O}(N^2 d)$ to approximately $\mathcal{O}(P^2 d + N k M d)$, where $P$ is the number of spatial regions, $M = N/P$ is the number of tokens per region, and $k \ll P$ is the number of routed regions per query, so each query attends only to the $kM$ tokens gathered from its routed regions:

  • Region-level adjacency: $\mathcal{O}(P^2 d)$.
  • Top-k selection: $\mathcal{O}(P^2 \log P)$.
  • Sparse fine attention: $\mathcal{O}(N k M d)$.
  • Local enhancement: $\mathcal{O}(N d)$ (via DWConv).
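
To make the terms concrete, the following sketch counts multiply-accumulate operations for each component. It is a rough back-of-the-envelope estimate, not a profiler: constant factors are dropped, the function names and the example sizes are assumptions, and the fine-attention term is counted by scoring every one of the N query tokens against the k·M keys gathered from its k routed regions (with M = N/P tokens per region, following the partitioning in Section 1).

```python
def dense_attention_cost(N, d):
    """Pairwise token scores for dense attention: N queries x N keys x d channels."""
    return N * N * d

def bra_cost(N, P, k, d):
    """Per-component operation counts for bi-level routing attention (constants dropped)."""
    M = N // P                                   # tokens per region
    log_p = (P - 1).bit_length()                 # ceil(log2(P)) for P > 1
    parts = {
        "routing": P * P * d,                    # region-to-region affinity
        "topk": P * P * log_p,                   # top-k selection per region row
        "fine": N * k * M * d,                   # each query vs. k*M gathered keys
        "local": N * d,                          # depth-wise convolution
    }
    parts["total"] = sum(parts.values())
    return parts

# Assumed example: a 56x56 feature map (N = 3136) split into 7x7 regions (P = 49),
# with k = 4 routed regions and d = 32 channels.
N, P, k, d = 56 * 56, 49, 4, 32
dense = dense_attention_cost(N, d)
sparse = bra_cost(N, P, k, d)
print(dense, sparse["total"], round(dense / sparse["total"], 1))
```

With these counts, setting `k = P` makes the fine-attention term equal to the dense cost, which matches the intuition that routing to every region recovers full attention.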

Relative to dense attention, the empirical cost reduction is significant. In YOLOv7, adding a BRA block lowers the frame rate only marginally (e.g., from 140 fps to approximately 120 fps on a 2080Ti), while in transformers the GPU implementation remains efficient by leveraging dense matrix multiplications for the reduced-size attention (Zhu et al., 2023, Yang et al., 2023, Hou et al., 25 Feb 2025, Zhou et al., 2024).

4. Empirical Performance and Ablation Analyses

BiFormer and BRA-augmented architectures have established best-in-class results across diverse tasks:

  • Image Classification (ImageNet-1K): BiFormer-S matches or exceeds Swin-T/B (e.g., 83.8% top-1 for BiFormer-S at 4.5G FLOPs).
  • Object Detection (COCO): BiFormer-S achieves 45.9 mAP (RetinaNet 1x), outperforming DAT-T and Swin-T (Zhu et al., 2023).
  • Semantic Segmentation (ADE20K): BiFormer-S and BiFormer-B attain 49.8 and 51.0 mIoU on UperNet, surpassing static sparse baselines (Zhu et al., 2023).
  • Student Behavior Detection: YOLOv7-BRA yields mAP@0.5 of 87.1% (+2.2 points over best YOLOv7 w/o BRA), particularly improving localization under occlusion (Yang et al., 2023).
  • Defect Detection: On power equipment, adding BiFormer increases mAP@0.5 from 89.2% to 92.4%; with further improvements, achieves 93.5% and 97.1% precision (Hou et al., 25 Feb 2025).
  • Medical Imaging: In anatomical landmark detection, BRA improves mean radial error (MRE) and success detection rates (SDR) against both transformer and CNN baselines (Zhou et al., 2024). In segmentation (e.g., on FH-PS-AoP, HC18), BRAU-Net achieves higher Dice and lower HD95/ASD than prior transformer and convolutional methods (Cai et al., 2024).

Ablation studies demonstrate that increasing $k$ (the number of routed regions in top-$k$ routing) beyond moderate values yields diminishing or negative returns, which suggests that the sparsity itself acts as implicit regularization. The local context term (DWConv) is crucial for preserving fine details (Zhu et al., 2023, Zhou et al., 2024).

5. Application Domains and Adaptations

BiFormer and BRA modules are now prevalent in a variety of visual domains:

  • General Vision Transformers: Classification, detection, and segmentation on natural images (Zhu et al., 2023).
  • Object Detection Pipelines: Enhanced YOLOv7/v7x backbones via integration post-feature extractor, or via multi-branch sparse attention aggregation (Yang et al., 2023, Hou et al., 25 Feb 2025).
  • Medical Image Analysis: U-Net variants (e.g., BRAU-Net for segmentation (Cai et al., 2024); HYATT-Net for landmark detection (Zhou et al., 2024)) replace convolutions with stacks of BiFormer blocks, leveraging multi-scale, query-adaptive context for precise mask delineation or point localization.
  • Video Frame Interpolation: Bilateral variants of BiFormer support hierarchical motion estimation, combining global coarse transformer-based flow with fine blockwise refinement (Park et al., 2023).

Empirical results consistently highlight improved suppression of background, enhancement of true-object features, and robustness in cluttered or occluded settings across detection and segmentation tasks.

6. Comparative Analysis and Significance

BRA distinguishes itself from static sparse patterns (e.g., windowed, axial, or dilated attention) by generating content- and query-aware sparsity. Unlike approaches with fixed local structure (e.g., Swin, Twins), each query region in BiFormer dynamically selects attended regions globally per batch/image, enabling capture of semantically related but spatially distant features (Zhu et al., 2023, Zhou et al., 2024). Visualization of attention/Grad-CAM maps confirms that BRA focuses on task-relevant structures while ignoring distractors.

In controlled ablation against SimAM, SE, CBAM, and static windowed or cross-shaped patterns, BRA demonstrates superior accuracy, robustness, and resource efficiency in object-centric and medical tasks (Hou et al., 25 Feb 2025, Zhou et al., 2024, Cai et al., 2024). The learnable, adaptive routing is a key mechanism for this improvement.

7. Variants, Extensions, and Future Directions

BiFormer blocks have been adapted for multiple domains.

The empirical evidence suggests that the bi-level routing attention paradigm is now a leading method for balancing global context and computational tractability, with broad impact across computer vision. Ongoing and future work explores broader domain adaptation, scaling, and deeper integration with specialized neural architectures.
