BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models (2401.12522v2)

Published 23 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning for the capability in semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm our method surpasses state-of-the-art acceleration techniques.

Introduction

Recent work on LLMs has placed a marked emphasis on improving inference efficiency. Despite their powerful generative capabilities, LLMs suffer from high inference latency, particularly in resource-constrained environments. Bi-directional Tuning for lossless Acceleration (BiTA) addresses this by expediting LLM inference through semi-autoregressive generation paired with a verification scheme that keeps outputs faithful to autoregressive (AR) generation.

Acceleration Techniques

LLM acceleration techniques range from model compression and architecture simplification to more intricate algorithmic modifications. A significant strand within these techniques involves efficient decoding methods that aim for speed without conceding output quality. Among these, semi-autoregressive (SAR) decoding has emerged as a promising paradigm for reducing the number of forward passes during inference. SAR decoding diverges from conventional AR generation by producing several output tokens in parallel per step, but SAR models often suffer quality degradation compared to their AR counterparts.
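To make the contrast concrete, below is a minimal sketch (not code from the paper) comparing the number of forward passes used by AR decoding and by block-wise SAR decoding; `toy_next_tokens` is a hypothetical stand-in for a real model forward call.

```python
# Toy comparison of autoregressive (AR) vs. semi-autoregressive (SAR) decoding.
# AR emits one token per forward pass; SAR proposes a block of k tokens per pass.

def toy_next_tokens(prefix, k=1):
    """Hypothetical model call: deterministically 'predicts' the next k token ids."""
    return [(sum(prefix) + i + 1) % 50000 for i in range(k)]

def ar_decode(prompt, num_tokens):
    """One forward pass per generated token -> num_tokens passes in total."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < num_tokens:
        seq += toy_next_tokens(seq, k=1)
        passes += 1
    return seq, passes

def sar_decode(prompt, num_tokens, block=4):
    """One forward pass per block of tokens -> roughly num_tokens / block passes."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < num_tokens:
        seq += toy_next_tokens(seq, k=block)
        passes += 1
    return seq[:len(prompt) + num_tokens], passes

if __name__ == "__main__":
    _, ar_passes = ar_decode([1, 2, 3], 16)
    _, sar_passes = sar_decode([1, 2, 3], 16, block=4)
    print(f"AR passes: {ar_passes}, SAR passes: {sar_passes}")  # 16 vs. 4
```

The fewer passes come at a cost: tokens later in a block are predicted without conditioning on the earlier ones, which is the source of the quality gap that BiTA's verification step is designed to eliminate.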

Methodology

BiTA introduces a dual-component system to address these challenges. First, a parameter-efficient tuning method inspired by prompt tuning equips the model for SAR generation, realized as learnable prefix and suffix embeddings attached to the token sequence. Second, an efficient tree-based decoding mechanism generates and verifies draft candidates in parallel, without requiring additional validation passes or external assistant models.
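The verification side can be illustrated with a simplified, hedged sketch of the draft-then-verify loop under greedy decoding. Here `greedy_argmax` stands in for the base model's greedy next-token choice and `draft_k_tokens` for the SAR drafter; both are hypothetical helpers, and BiTA's actual implementation folds drafting and tree-based verification into a single batched forward pass.

```python
# Sketch of lossless draft-then-verify under greedy decoding (toy stand-ins,
# not BiTA's implementation): accept the longest draft prefix that matches
# what plain greedy AR decoding would have produced, then emit one corrected token.

def greedy_argmax(prefix):
    """Toy base model: deterministic greedy next token for `prefix`."""
    return (sum(prefix) * 31 + len(prefix)) % 50000

def draft_k_tokens(prefix, k):
    """Toy SAR drafter: proposes k candidate future tokens (imperfectly)."""
    draft, out = list(prefix), []
    for i in range(k):
        tok = greedy_argmax(draft) if i % 2 == 0 else 0  # deliberately wrong sometimes
        out.append(tok)
        draft.append(tok)
    return out

def generate(prompt, num_tokens, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < num_tokens:
        draft = draft_k_tokens(seq, k)
        accepted = []
        for tok in draft:
            # Keep the draft token only if it equals the base model's greedy choice.
            if tok == greedy_argmax(seq + accepted):
                accepted.append(tok)
            else:
                break
        # Always gain at least one token: the base model's own greedy continuation.
        accepted.append(greedy_argmax(seq + accepted))
        seq += accepted
    return seq[:len(prompt) + num_tokens]

if __name__ == "__main__":
    print(generate([1, 2, 3], 12))
```

Because every accepted token is checked against the base model's own greedy choice, the final sequence is identical to what ordinary greedy AR decoding would produce; the speedup comes from accepting several tokens per verification pass, and in BiTA the verification of multiple candidate branches is batched via tree-based attention rather than looped as above.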

The proposed method achieves a substantial speedup, as demonstrated with the LLaMA-2-70B-Chat model, which sees a 2.7× acceleration on the MT-Bench benchmark. This is attained with a negligible increase in trainable parameters, as few as 0.01% additional parameters (for a 70B-parameter model, roughly 7 million), underscoring the efficiency of the approach.

Experimental Results

BiTA's impact was measured across a spectrum of LLMs and tasks, showing a consistent speedup ranging from 2.1× to 3.3×. The benefits were particularly pronounced in larger models, possibly because richer embedding contexts improve draft prediction. The paper also reports a clear improvement over state-of-the-art speculative decoding techniques, supporting BiTA's potential in practical deployment scenarios. Furthermore, through a series of ablation studies, the researchers examined how various prompting designs and configurations affect speedup, establishing the effectiveness of the bi-directional tuning and efficient tree-based decoding strategies.

Conclusion

BiTA's methodology holds considerable potential for applying LLMs in real-time and resource-constrained scenarios, offering a compelling acceleration solution without compromising the quality or integrity of model outputs. This work contributes to the ongoing effort to improve LLM efficiency and extends the practical utility of these models across a wide range of domains and applications.

Authors (7)
  1. Feng Lin
  2. Hanling Yi
  3. Hongbin Li
  4. Yifan Yang
  5. Xiaotian Yu
  6. Guangming Lu
  7. Rong Xiao