Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference (2306.14393v1)

Published 26 Jun 2023 in cs.CL

Abstract: Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length. In this work, we propose a constraint-aware and ranking-distilled token pruning method, ToP, which selectively removes unnecessary tokens as the input sequence passes through layers, allowing the model to improve online inference speed while preserving accuracy. ToP overcomes the limitation of inaccurate token importance ranking in the conventional self-attention mechanism through a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to early layers of pruned models. ToP then introduces a coarse-to-fine pruning approach that automatically selects the optimal subset of transformer layers and optimizes token pruning decisions within these layers through improved $L_0$ regularization. Extensive experiments on the GLUE benchmark and SQuAD tasks demonstrate that ToP outperforms state-of-the-art token pruning and model compression methods in both accuracy and speedup. ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU.
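To make the token-pruning idea concrete, below is a minimal, illustrative sketch of the kind of attention-based top-k token pruning the paper builds on (and whose inaccurate importance ranking ToP improves via ranking distillation). This is not the authors' implementation; the function name `prune_tokens`, the `keep_ratio` parameter, and the importance heuristic (attention received, averaged over heads and queries) are assumptions for illustration only.

```python
import torch

def prune_tokens(hidden_states, attention_probs, keep_ratio=0.5):
    """Keep the top-k tokens ranked by attention-derived importance.

    hidden_states:   (batch, seq_len, hidden)
    attention_probs: (batch, num_heads, seq_len, seq_len)

    NOTE: illustrative sketch only; ToP instead learns pruning decisions
    with ranking distillation and L0 regularization.
    """
    # Importance of each token = attention it receives, averaged over heads and queries.
    importance = attention_probs.mean(dim=1).mean(dim=1)  # (batch, seq_len)

    # Number of tokens to keep at this layer.
    k = max(1, int(importance.size(1) * keep_ratio))

    # Select top-k tokens and restore their original order.
    kept_idx = importance.topk(k, dim=1).indices.sort(dim=1).values  # (batch, k)

    # Gather the surviving token representations.
    gather_idx = kept_idx.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
    return hidden_states.gather(1, gather_idx), kept_idx

# Toy usage with random tensors (BERT-base-like sizes).
batch, heads, seq, hid = 2, 12, 16, 768
h = torch.randn(batch, seq, hid)
attn = torch.softmax(torch.randn(batch, heads, seq, seq), dim=-1)
pruned, kept = prune_tokens(h, attn, keep_ratio=0.25)
print(pruned.shape, kept.shape)  # torch.Size([2, 4, 768]) torch.Size([2, 4])
```

In ToP, the per-layer keep decisions are not a fixed ratio as sketched here: they are learned under an overall cost constraint, and the importance signal in early layers is supervised by token rankings distilled from the final layer of the unpruned teacher.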

Authors (12)
  1. Junyan Li (17 papers)
  2. Li Lyna Zhang (20 papers)
  3. Jiahang Xu (14 papers)
  4. Yujing Wang (53 papers)
  5. Shaoguang Yan (1 paper)
  6. Yunqing Xia (2 papers)
  7. Yuqing Yang (83 papers)
  8. Ting Cao (100 papers)
  9. Hao Sun (383 papers)
  10. Weiwei Deng (29 papers)
  11. Qi Zhang (785 papers)
  12. Mao Yang (62 papers)
Citations (7)