
ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference (2312.11882v2)

Published 19 Dec 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Early Exiting is one of the most popular methods to achieve efficient inference. Current early exiting methods adopt the (weighted) sum of the cross entropy losses of all internal classifiers during training, requiring all of these classifiers to predict every instance correctly. However, during inference, as long as one internal classifier predicts an instance correctly, inference can be accelerated without losing accuracy. Thus, there is a notable gap between training and inference. We propose ConsistentEE, an early exiting method that is consistent between training and inference. ConsistentEE formulates the early exiting process as a reinforcement learning problem. A policy network is added to decide whether an instance should exit or continue. The training objective of ConsistentEE only requires each instance to be predicted correctly by one internal classifier. Additionally, we introduce the concept of the Memorized Layer to measure the hardness of an instance. We incorporate the memorized layer into the reward function design, which allows "easy" instances to focus more on acceleration and "hard" instances to focus more on accuracy. Experimental results show that our method outperforms other baselines on various natural language understanding and generation tasks.
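The abstract describes the key mechanics: a per-layer policy network that decides exit-or-continue, and a hardness-guided reward built from the memorized layer. The snippet below is a minimal PyTorch sketch of that idea, not the authors' implementation; the class names (`ExitPolicy`), the specific reward weighting, and the `threshold` used at inference are illustrative assumptions based only on the abstract's high-level description.

```python
import torch
import torch.nn as nn


class ExitPolicy(nn.Module):
    """Per-layer policy head: probability that an instance should exit here.

    Assumes a pooled hidden state of shape [batch, hidden_size] per layer.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Returns exit probabilities of shape [batch].
        return torch.sigmoid(self.scorer(hidden)).squeeze(-1)


def hardness_guided_reward(correct: torch.Tensor,
                           layer_idx: int,
                           num_layers: int,
                           memorized_layer: torch.Tensor) -> torch.Tensor:
    """Sketch of a reward that trades off accuracy vs. acceleration.

    `memorized_layer` is the layer at which an instance is first memorized;
    larger values indicate harder instances. The exact weighting below is an
    assumption: hard instances weight the accuracy term more, easy instances
    weight the speed (early-exit) term more, as described in the abstract.
    """
    hardness = memorized_layer.float() / num_layers        # in [0, 1]
    acc_term = correct.float()                             # 1 if the internal classifier is right
    speed_term = 1.0 - layer_idx / num_layers              # earlier exit => larger reward
    return hardness * acc_term + (1.0 - hardness) * speed_term


@torch.no_grad()
def early_exit_inference(hidden_states, classifiers, policies, threshold=0.5):
    """Exit at the first layer whose policy probability exceeds `threshold`.

    `hidden_states`, `classifiers`, and `policies` are per-layer lists; the
    last layer always exits so every instance receives a prediction.
    """
    for layer_idx, (h, clf, pol) in enumerate(zip(hidden_states, classifiers, policies)):
        if pol(h).item() > threshold or layer_idx == len(hidden_states) - 1:
            return clf(h).argmax(dim=-1), layer_idx
```

In this sketch the policy heads would be trained with a policy-gradient objective so that, per instance, only the chosen exit layer's classifier needs to be correct, which is the training/inference consistency the paper emphasizes.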

Authors (5)
  1. Ziqian Zeng (32 papers)
  2. Yihuai Hong (6 papers)
  3. Hongliang Dai (13 papers)
  4. Huiping Zhuang (43 papers)
  5. Cen Chen (81 papers)
Citations (5)