Bi-level Routing Attention in Neural Networks

Updated 19 April 2026

Bi-level Routing Attention is a dynamic sparse mechanism that employs a two-stage process—region-level routing and token-level attention—to efficiently model long-range dependencies.
It reduces the quadratic cost of traditional attention by adaptively restricting computation to a content-aware subset of tokens, achieving subquadratic complexity.
Empirical results in vision transformers like BiFormer and in MoE frameworks such as SMILE demonstrate enhanced accuracy and throughput compared to conventional models.

Bi-level Routing Attention (BRA) denotes a class of dynamic sparse attention mechanisms that combine coarse-grained region-level routing with fine-grained token-level attention to efficiently capture long-range dependencies in neural network architectures. Bi-level routing attention enables query-adaptive sparsity, substantially reducing computational and memory complexity relative to global or local attention patterns, by restricting each query’s receptive field to a content-aware, variable subset of tokens. This general strategy has been instantiated in both vision transformers—such as BiFormer (Zhu et al., 2023) and DeBiFormer (Long et al., 2024)—and distributed mixture-of-experts frameworks—such as SMILE (He et al., 2022)—for effective allocation of computational resources and communication.

1. Mathematical Formulation and Algorithmic Structure

In transformer-based models, bi-level routing attention operates on feature maps $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ are the spatial dimensions and $C$ is the channel dimension. The attention procedure consists of two hierarchical steps:

1.1 Region-level Routing

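Partition $X$ into $S \times S$ non-overlapping regions, each containing $n_r = N/S^2$ tokens ($N = H \cdot W$), and obtain queries, keys, and values $Q, K, V$ by linear projections of the region tokens.
Compute region-level summaries by averaging over the token set $\Omega_i$ of each region $i$: $Q^R_i = \frac{1}{n_r} \sum_{j \in \Omega_i} Q^r_{i,j}$, $K^R_i = \frac{1}{n_r} \sum_{j \in \Omega_i} K^r_{i,j}$, where $Q^r_{i,j}$ and $K^r_{i,j}$ denote the query and key of token $j$ in region $i$, producing $Q^R, K^R \in \mathbb{R}^{S^2 \times C}$.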
Form the affinity matrix $A^R = Q^R (K^R)^\top \in \mathbb{R}^{S^2 \times S^2}$ .
For each region $i$, select the indices $I_i$ of the top-$k$ most relevant regions (by $A^R_{i,:}$), capturing semantic relationships adaptively.

1.2 Token-level Attention

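For each query token in region $i$, pool all tokens from the top-$k$ routed regions (index set $I_i$): $K^g_i = \operatorname{gather}(K, I_i),\ V^g_i = \operatorname{gather}(V, I_i)$.
Compute token-to-token attention between all queries $Q_i$ in region $i$ and the gathered keys $K^g_i$: $A_i = \operatorname{softmax}\left(Q_i (K^g_i)^\top / \sqrt{C}\right)$, $O_i = A_i V^g_i$.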
Optionally, incorporate a local context enhancement (LCE) convolution to enrich the feature contextualization.

This two-stage hierarchical routing confines attention computation to sparse yet content-adaptive token subsets, avoiding the quadratic $O(N^2)$ cost of dense self-attention.
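The two stages above can be sketched in plain Python. This is a minimal, unbatched illustration under stated assumptions, not the BiFormer implementation, which relies on batched matrix multiplications and gathers. The helper names `grid_regions` and `bi_level_routing_attention` are hypothetical, queries, keys, and values are passed in already projected, scores are scaled by $\sqrt{C}$ as in standard scaled dot-product attention, and the optional LCE term is omitted.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vec(vectors):
    # Average a list of equal-length vectors component-wise.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grid_regions(h, w, s):
    """Split an h x w token grid (row-major ids) into s x s regions."""
    if h % s or w % s:
        raise ValueError("s must divide both h and w")
    rh, rw = h // s, w // s
    regions = [[] for _ in range(s * s)]
    for y in range(h):
        for x in range(w):
            regions[(y // rh) * s + (x // rw)].append(y * w + x)
    return regions


def bi_level_routing_attention(q, k, v, regions, topk):
    """q, k, v: lists of N vectors of length C; regions[i] lists the token ids of region i."""
    c = len(q[0])
    # Region-level routing: mean-pooled summaries, then region-to-region affinities.
    q_r = [mean_vec([q[j] for j in omega]) for omega in regions]
    k_r = [mean_vec([k[j] for j in omega]) for omega in regions]
    out = [None] * len(q)
    for i, omega in enumerate(regions):
        affinity = [dot(q_r[i], kr) for kr in k_r]
        routed = sorted(range(len(regions)), key=lambda r: affinity[r], reverse=True)[:topk]
        # Token-level attention over keys/values gathered from the routed regions.
        gathered = [j for r in routed for j in regions[r]]
        for t in omega:
            weights = softmax([dot(q[t], k[j]) / math.sqrt(c) for j in gathered])
            out[t] = [sum(wt * v[j][d] for wt, j in zip(weights, gathered)) for d in range(c)]
    return out


# Usage: a 4 x 4 map with C = 3, split into 2 x 2 regions of 4 tokens each.
random.seed(0)
H, W, C, S = 4, 4, 3, 2
q, k, v = ([[random.gauss(0, 1) for _ in range(C)] for _ in range(H * W)] for _ in range(3))
regions = grid_regions(H, W, S)
sparse_out = bi_level_routing_attention(q, k, v, regions, topk=2)  # 8 keys per query
full_out = bi_level_routing_attention(q, k, v, regions, topk=S * S)  # all 16 keys
```

With `topk = S * S` every region is routed to, so the result coincides with dense attention; smaller values of `topk` trade context for cost.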

2. Theoretical Complexity and Efficiency

The critical advantage of BRA is its reduction of computational and memory costs compared with global and local attention mechanisms. For input size $N = H \cdot W$ and regioning parameters $S, k$:

Global MHSA: $O(N^2 C)$
Window-based: $O(N M^2 C)$ ($M \times M$ window size)
BRA: $O(S^4 C + k N^2 C / S^2)$

Optimally tuning $S$ yields a subquadratic overall complexity, typically $O(N^{4/3})$. Each query effectively attends to $k \cdot n_r = kN/S^2$ tokens, and the hyperparameters $(S, k)$ mediate the accuracy-computation trade-off. Memory complexity is correspondingly reduced to $O(S^4 + kN^2/S^2)$, enabling application to high-resolution inputs and dense prediction tasks (Zhu et al., 2023).
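To make the trade-off concrete, the following sketch evaluates a simplified cost model and the grid size that minimises it. It assumes the attention cost of BRA is proportional to $S^4 + kN^2/S^2$ (region affinities plus token-level scores, ignoring the channel factor and the linear projections); setting the derivative with respect to $S$ to zero gives $S^* = (kN^2/2)^{1/6}$, at which point the cost grows as $N^{4/3}$. The function names are illustrative.

```python
def bra_cost(n, s, k):
    """Attention-map entries for BRA: region affinities plus token-level scores."""
    return s ** 4 + k * n ** 2 / s ** 2


def dense_cost(n):
    """Attention-map entries for dense self-attention."""
    return n ** 2


def optimal_grid(n, k):
    """Real-valued minimiser of bra_cost in s: (k * n^2 / 2) ** (1/6)."""
    return (k * n ** 2 / 2) ** (1 / 6)


n, k = 56 * 56, 4            # a 56 x 56 feature map with top-4 routing
s_star = optimal_grid(n, k)  # about 16.4
for s in (7, 8, 14):         # grid sizes that divide 56
    print(s, bra_cost(n, s, k) / dense_cost(n))
```

In this model, $S = 7$ with $k = 4$ keeps roughly 8% of the entries of a dense attention map. The real-valued optimum is only a guide, since $S$ must also divide the feature-map size.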

In distributed routing (e.g., SMILE (He et al., 2022)), the bi-level mechanism splits routing across inter-node and intra-node collectives, each with drastically lower communication overhead than a full $mn$-way All2All. The per-forward communication launch cost is reduced from $O(mn)$ to $O(m + n)$ for $m$ intra-node and $n$ inter-node groups, and the per-iteration All2All time is observed to drop by a factor of about $4.4$ (382 ms to 86 ms).
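As a rough illustration of the launch-cost argument, the sketch below counts peers per device under a simplified model in which an All2All over $p$ participants costs each device $p$ point-to-point launches. Both this cost model and the function names are assumptions made for illustration, not SMILE's implementation.

```python
def flat_all2all_launches(nodes, gpus_per_node):
    """One All2All across every device: each device exchanges with all nodes * gpus_per_node peers."""
    return nodes * gpus_per_node


def bilevel_all2all_launches(nodes, gpus_per_node):
    """An inter-node All2All over `nodes` peers, then an intra-node one over `gpus_per_node` peers."""
    return nodes + gpus_per_node


# 16 nodes with 8 GPUs each, i.e. a 128-GPU cluster.
flat = flat_all2all_launches(16, 8)        # 128
bilevel = bilevel_all2all_launches(16, 8)  # 24
```

Under this model the gap widens as the cluster grows, because the flat count scales with the product of the two group sizes and the bi-level count with their sum.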

3. Implementation in Architectures

3.1 Vision Transformers

BRA is implemented in BiFormer as a sequence of dense matrix operations and index-based gather routines, requiring only standard batched matrix multiplications (“bmm”) on GPU hardware. The method is incorporated into U-Net-like architectures (e.g., BRAU-Net (Cai et al., 2023)), hierarchical ImageNet-scale transformers (BiFormer (Zhu et al., 2023)), and deformable variants (DBRA in DeBiFormer (Long et al., 2024)). DeBiFormer further extends the classical BRA design by:

Introducing agent queries via learned deformable offsets.
Performing agent-to-region and region-to-agent attentions, using top-$k$ region selection for each agent, effectively fusing content-adaptive routing with spatially flexible deformable sampling.

3.2 Mixture-of-Experts (MoE)

SMILE employs BRA at the communication-routing level: tokens are first dispatched to optimal nodes (inter-node router), then to specific experts within a node (intra-node router), each using a local softmax-projected routing. This hierarchical partitioning exploits hardware topology—fast intra-node (NVSwitch), slow inter-node (EFA)—and achieves 2.5× throughput improvement over Switch Transformer baselines, with maintained convergence and load-balance performance (He et al., 2022).
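A minimal sketch of the two-stage dispatch, assuming top-1 selection at each level and a gate equal to the product of the two softmax probabilities. Both choices, the weight layout, and the name `bilevel_route` are assumptions made for illustration; the description above does not specify them.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(rows, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def bilevel_route(token, node_router, expert_routers):
    """Pick a node, then an expert inside that node.

    node_router: one weight row per node.
    expert_routers[n]: one weight row per expert hosted on node n.
    Returns (node index, local expert index, gate value).
    """
    # Inter-node stage: softmax over nodes, keep the most probable one.
    node_probs = softmax(matvec(node_router, token))
    node = max(range(len(node_probs)), key=node_probs.__getitem__)
    # Intra-node stage: softmax over the experts of the chosen node only.
    expert_probs = softmax(matvec(expert_routers[node], token))
    expert = max(range(len(expert_probs)), key=expert_probs.__getitem__)
    return node, expert, node_probs[node] * expert_probs[expert]


# Usage: 2 nodes with 2 experts each, and a 2-dimensional token.
node_router = [[1.0, 0.0], [0.0, 1.0]]
expert_routers = [
    [[1.0, 0.0], [-1.0, 0.0]],
    [[0.0, 1.0], [0.0, -1.0]],
]
node, expert, gate = bilevel_route([0.2, 2.0], node_router, expert_routers)
```

Because the second softmax only spans the experts of the selected node, the intra-node decision needs no information from other nodes, which is what allows the two collectives to be issued separately.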

4. Empirical Results and Comparative Evaluation

4.1 Vision Tasks

BiFormer (BRA) (Zhu et al., 2023):

Achieves 81.4–84.3% top-1 ImageNet-1K accuracy (T/S/B variants), outperforming Swin, CSWin, and DAT in comparable complexity regimes. Improves mAP on COCO detection and instance segmentation, and shows 0.5–2.0 mIoU gain on ADE20K segmentation.

DeBiFormer (DBRA) (Long et al., 2024):

Matches or surpasses BiFormer on ImageNet-1K (up to 84.4% top-1), yields +0.3–0.7 mIoU on ADE20K, and notably increases large-object AP in detection.

Model	Params	FLOPs	ImageNet-1K Top-1	ADE20K mIoU
BiFormer-T	13M	2.2G	81.4%	-
BiFormer-S	26M	4.5G	83.8%	48.9
BiFormer-B	57M	9.8G	84.3%	49.9
DeBiFormer-T	21M	2.6G	81.9%	-
DeBiFormer-B	77M	11.8G	84.4%	>49.9

Ablation studies: Query-adaptive top-$k$ routing outperforms hand-crafted sparsity patterns (windows, stripes, deformations), with accuracy sensitive to both $k$ and $S$.

4.2 MoE Routing

SMILE (BRA) (He et al., 2022):

With 3.7B params, 16 P4d nodes (128 experts), throughput increases from 8,112 (Switch) to 20,011 samples/sec (SMILE), with equivalent perplexity and loss curves. Scaling remains near-linear up to 128 GPUs, with communication fraction of iteration time cut from 71% to 59%. Larger models (13B, 48B) sustain >1.7× speedup.

Model	All2All time	Total time
Switch Transformer	382 ms (71%)	535 ms
SMILE	86 ms (59%)	146 ms
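The communication fractions and the All2All speedup follow directly from the timings in the table; a quick check, using only the four figures listed there:

```python
# (All2All time, total iteration time) in milliseconds, copied from the table.
timings_ms = {
    "Switch Transformer": (382, 535),
    "SMILE": (86, 146),
}

for name, (all2all, total) in timings_ms.items():
    print(f"{name}: {all2all / total:.0%} of iteration time in All2All")

all2all_speedup = timings_ms["Switch Transformer"][0] / timings_ms["SMILE"][0]
print(f"All2All time reduced by {all2all_speedup:.1f}x")
```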

5. Extensions and Limitations

The deformable variant (DBRA) addresses semantic misalignment and inflexibility in hard grid-based region routing by leveraging learned offsets (“agent queries”) and adding an agent-to-token broadcast step. This augments interpretability (evidenced by focused attention maps in Grad-CAM) and enhances performance, especially on dense and spatially complex segmentation tasks (Long et al., 2024).

Limitations include:

Throughput overhead versus highly optimized local (windowed) attention stemming from extra gather operations and kernel launches; BiFormer lags by ∼30–40% but remains substantially faster than recursive quad-tree schemes.
Hyperparameter sensitivity: an extremely small $k$ prunes context; an overly large $k$ loses the sparsity benefit. The region grid size $S$ must respect semantic structure and divide the input size.
BRA can misalign if the region partitioning fails to capture object boundaries; DBRA alleviates but does not eradicate this issue.

6. Contexts Beyond Vision: Distributed and Expert-Parallel Models

BRA principles are also exploited for communication efficiency in expert-parallel models. SMILE demonstrates that bi-level routing can substantially mitigate bandwidth bottlenecks by leveraging network topology, reducing All2All communication to a composition of smaller collectives, and maintaining model convergence and quality. This broadens the relevance of bi-level routing attention to large-scale distributed deep learning and hierarchical mixture-of-experts frameworks (He et al., 2022).

7. Research Impact and Future Directions

Bi-level routing attention constitutes a central advance in the development of adaptive sparse mechanisms for both dense prediction and large-scale distributed architectures. Its combination of region-aware semantic sparsity and implementation simplicity (standard batched matrix ops, efficient gather routines) underpins practical transformer backbones achieving state-of-the-art trade-offs.

Subsequent work, such as DeBiFormer, reflects a trend toward integrating content-adaptive routing with spatially flexible sampling (deformable attention), aiming to increase robustness and interpretability. In distributed learning, bi-level routing is likely to remain critical for balancing load, exploiting hardware hierarchies, and supporting massive scaling in expert-based architectures.

A plausible implication is that ongoing research will focus on further optimizing the routing criterion (beyond top-$k$ and affinity), integrating learnable routing schedules, and extending bi-level strategies to general sequence and graph models. Balancing computation, memory, scalability, and interpretability in hierarchical attention modules remains an active area of research.