Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer (2108.09193v3)

Published 20 Aug 2021 in cs.CL

Abstract: The Transformer has achieved great success in NLP. However, the quadratic complexity of its self-attention mechanism makes it inefficient at handling long sequences. Many existing works attempt to accelerate Transformers by computing sparse self-attention instead of dense attention, usually attending to tokens at fixed positions or to randomly selected tokens. However, manually selected or random tokens may be uninformative for context modeling. In this paper, we propose Smart Bird, an efficient and effective Transformer with learnable sparse attention. In Smart Bird, we first compute a sketched attention matrix with a single-head low-dimensional Transformer, which aims to find potentially important interactions between tokens. We then sample token pairs based on probability scores derived from the sketched attention matrix to generate different sparse attention index matrices for different attention heads. Finally, we select token embeddings according to the index matrices to form the input of the sparse attention networks. Extensive experiments on six benchmark datasets for different tasks validate the efficiency and effectiveness of Smart Bird in text modeling.
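
To make the three-stage pipeline in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation. It assumes plain low-dimensional Q/K projections stand in for the paper's single-head sketching Transformer, and the sketch width, per-head pair budget, and exact sampling distribution are illustrative choices rather than the paper's settings.

```python
# Minimal sketch (not the authors' code) of a Smart Bird-style pipeline:
# (1) cheap low-dimensional attention sketch, (2) per-head sampling of token
# pairs from the sketch, (3) sparse attention restricted to the sampled pairs.
import torch
import torch.nn.functional as F


def sketched_attention_scores(x, wq_s, wk_s):
    """Stage 1: single-head, low-dimensional attention sketch over the full sequence."""
    q = x @ wq_s                                          # (batch, seq, sketch_dim)
    k = x @ wk_s                                          # (batch, seq, sketch_dim)
    return q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5   # (batch, seq, seq)


def sample_index_matrices(scores, num_heads, pairs_per_head):
    """Stage 2: sample (query, key) pairs per head with probabilities from the sketch."""
    b, n, _ = scores.shape
    probs = F.softmax(scores.reshape(b, n * n), dim=-1)
    idx = torch.multinomial(probs, num_heads * pairs_per_head, replacement=True)
    idx = idx.reshape(b, num_heads, pairs_per_head)
    return idx // n, idx % n                              # query rows, key columns


def sparse_attention(x, rows, cols, wq, wk, wv):
    """Stage 3: attend only over the sampled pairs, separately for each head."""
    b, n, _ = x.shape
    h = rows.shape[1]
    q = torch.einsum('bnd,hde->bhne', x, wq)
    k = torch.einsum('bnd,hde->bhne', x, wk)
    v = torch.einsum('bnd,hde->bhne', x, wv)
    out = torch.zeros(b, h, n, v.shape[-1])
    for bi in range(b):
        for hi in range(h):
            r, c = rows[bi, hi], cols[bi, hi]
            logits = (q[bi, hi, r] * k[bi, hi, c]).sum(-1) / q.shape[-1] ** 0.5
            for qi in r.unique():                         # softmax per query over its sampled keys
                mask = r == qi
                w = F.softmax(logits[mask], dim=0)
                out[bi, hi, qi] = (w[:, None] * v[bi, hi, c[mask]]).sum(0)
    return out                                            # (batch, heads, seq, head_dim)


if __name__ == "__main__":
    torch.manual_seed(0)
    b, n, d, sketch_dim, heads, head_dim, pairs = 2, 32, 64, 8, 4, 16, 128
    x = torch.randn(b, n, d)
    wq_s, wk_s = torch.randn(d, sketch_dim), torch.randn(d, sketch_dim)
    wq = torch.randn(heads, d, head_dim)
    wk = torch.randn(heads, d, head_dim)
    wv = torch.randn(heads, d, head_dim)
    scores = sketched_attention_scores(x, wq_s, wk_s)
    rows, cols = sample_index_matrices(scores, heads, pairs)
    out = sparse_attention(x, rows, cols, wq, wk, wv)
    print(out.shape)                                      # torch.Size([2, 4, 32, 16])
```

Because each head only scores its sampled pairs, the per-head attention cost grows with the pair budget rather than quadratically with sequence length, which is the efficiency gain the abstract describes.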

Authors (7)
  1. Chuhan Wu (86 papers)
  2. Fangzhao Wu (81 papers)
  3. Tao Qi (43 papers)
  4. Binxing Jiao (18 papers)
  5. Daxin Jiang (138 papers)
  6. Yongfeng Huang (110 papers)
  7. Xing Xie (220 papers)
Citations (3)