
BiFormer: Vision Transformer with Bi-Level Routing Attention (2303.08810v1)

Published 15 Mar 2023 in cs.CV

Abstract: As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.

BiFormer: Vision Transformer with Bi-Level Routing Attention

The paper, "BiFormer: Vision Transformer with Bi-Level Routing Attention," presents an innovative approach to enhance the computational efficiency and performance of vision transformers. This work addresses the inherent computational challenges of vision transformers, where the dense attention mechanism incurs significant computational and memory overhead by processing pairwise token interactions globally.

Key Contributions

The central contribution of this research is the introduction of Bi-Level Routing Attention (BRA), a dynamic and query-aware sparse attention mechanism. Unlike existing sparse attention methods that utilize static or query-agnostic sparsity patterns, BRA employs a novel approach based on bi-level routing that adapts the allocation of resources in response to content.

  1. Bi-Level Routing Attention (BRA):
    • BRA first filters out irrelevant key-value pairs at a coarse region level using a region-to-region routing mechanism: a region-level affinity graph is pruned so that each region retains only its top-k most relevant regions for attention.
    • In the token-to-token attention phase, the method gathers the key-value pairs from these routed regions, so the remaining computation reduces to dense, GPU-friendly matrix multiplications.
    • By dynamically focusing each query on semantically relevant tokens, this approach achieves computational efficiency while maintaining high model performance (a minimal sketch of both levels appears after this list).
  2. BiFormer Architecture:
    • BiFormer utilizes BRA as a fundamental building block. It is structured in a hierarchical pyramid akin to recent state-of-the-art transformers, making it adaptable for varied applications including image classification, object detection, and semantic segmentation.
    • The hierarchical design incorporates overlapped patch embeddings, convolutional positional encoding, and an optimized number of partitioned regions and routed keys to balance computational load and performance.
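To make the two routing levels and the block layout concrete, here is a minimal, single-head PyTorch sketch. It is an illustration under simplifying assumptions, not the authors' implementation (see the linked repository for that): Q, K, and V share one unprojected tensor; multi-head attention and the paper's local-context enhancement on V are omitted; and the names bi_level_routing_attention, BiFormerBlockSketch, num_regions (S in the paper), and top_k (k in the paper) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def bi_level_routing_attention(x, num_regions=7, top_k=4):
    """Single-head BRA sketch on a (B, H, W, C) feature map.

    Assumes H and W are divisible by num_regions; num_regions and top_k
    play the roles of S and k in the paper. Q, K, and V share one
    unprojected tensor here; the real block uses learned projections,
    multiple heads, and a local-context-enhancement branch on V.
    """
    B, H, W, C = x.shape
    S = num_regions
    rh, rw = H // S, W // S                  # region size in tokens
    n_reg, n_tok = S * S, rh * rw

    # Partition the map into S*S regions of n_tok tokens each.
    t = (x.reshape(B, S, rh, S, rw, C)
          .permute(0, 1, 3, 2, 4, 5)
          .reshape(B, n_reg, n_tok, C))
    q = k = v = t

    # Level 1: region-to-region routing on mean-pooled descriptors.
    q_r, k_r = q.mean(dim=2), k.mean(dim=2)          # (B, n_reg, C)
    affinity = q_r @ k_r.transpose(-2, -1)           # (B, n_reg, n_reg)
    topk_idx = affinity.topk(top_k, dim=-1).indices  # (B, n_reg, top_k)

    # Level 2: gather K/V from the routed regions, then dense attention.
    batch_idx = torch.arange(B, device=x.device)[:, None, None]
    k_g = k[batch_idx, topk_idx].reshape(B, n_reg, top_k * n_tok, C)
    v_g = v[batch_idx, topk_idx].reshape(B, n_reg, top_k * n_tok, C)
    attn = F.softmax(q @ k_g.transpose(-2, -1) * C ** -0.5, dim=-1)
    out = attn @ v_g                                 # (B, n_reg, n_tok, C)

    # Undo the region partition back to (B, H, W, C).
    return (out.reshape(B, S, S, rh, rw, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(B, H, W, C))


class BiFormerBlockSketch(nn.Module):
    """Hypothetical block layout: depthwise-conv positional encoding,
    then BRA and an MLP, each behind a pre-norm residual connection."""

    def __init__(self, dim, num_regions=7, top_k=4, mlp_ratio=3):
        super().__init__()
        self.pos_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))
        self.num_regions, self.top_k = num_regions, top_k

    def forward(self, x):                            # x: (B, H, W, C)
        # Convolutional positional encoding (channels-first for Conv2d).
        x = x + self.pos_conv(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        x = x + bi_level_routing_attention(self.norm1(x),
                                           self.num_regions, self.top_k)
        return x + self.mlp(self.norm2(x))
```

With the defaults, a 28x28 feature map with 64 channels, e.g. BiFormerBlockSketch(64)(torch.randn(2, 28, 28, 64)), is split into 49 regions of 16 tokens, and each query token attends to only 4 x 16 = 64 routed key-value pairs instead of all 784.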

Empirical Evaluation

The empirical results demonstrate that BiFormer, leveraging BRA, significantly outperforms existing methods across multiple vision tasks. Key findings include:

  • Image Classification: BiFormer achieves 83.8% top-1 accuracy on ImageNet-1K at a moderate computational cost of 4.5G FLOPs, outperforming competitive models such as QuadTree and WaveViT.
  • Object Detection and Instance Segmentation: On COCO 2017, BiFormer achieves higher mAP scores than state-of-the-art backbones such as Swin and DAT, with particularly strong gains on small objects.
  • Semantic Segmentation: On ADE20K, BiFormer outperforms similarly sized models such as CSWin and WaveViT, showcasing the effectiveness of BRA in dense prediction tasks.

Implications and Future Directions

The introduction of BRA offers a substantial improvement in computational efficiency while preserving high predictive accuracy in vision transformers. This innovation could lead to more effective deployment of transformer architectures in real-time and resource-constrained environments. Moreover, the concept of dynamic routing at a coarse-to-fine granularity opens new avenues for further exploration and refinement of sparse attention mechanisms.

Future directions may involve adapting BRA to domains beyond vision, fusing the routing and gathering steps into optimized GPU kernels to reduce overhead, and exploring hybrid architectures that combine BRA with other efficient transformer variants.

In summary, the paper presents a significant advancement in the design of efficient vision transformers, providing a robust framework for enhancing attention mechanisms, and setting a precedent for future innovations in sparse, dynamic attention models.

Authors (5)
  1. Lei Zhu (280 papers)
  2. Xinjiang Wang (32 papers)
  3. Zhanghan Ke (12 papers)
  4. Wayne Zhang (42 papers)
  5. Rynson Lau (9 papers)
Citations (362)