Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration (2404.12022v1)

Published 18 Apr 2024 in cs.CL

Abstract: LLMs have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly evident with autoregressive decoding, which generates only one token per forward pass and therefore does not fully exploit the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely \textit{hidden transfer}, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the \textit{pseudo} hidden states of the future tokens to be generated; the pseudo hidden states then pass through the subsequent transformer layers, assimilating more semantic information and achieving higher predictive accuracy for the future tokens. In addition, we use a novel tree attention mechanism to generate and verify multiple candidate output sequences simultaneously, which ensures lossless generation and further improves the efficiency of our method. Experiments demonstrate the effectiveness of our method, and extensive analytic experiments support our motivation. In terms of acceleration metrics, we outperform all single-model acceleration techniques, including Medusa and Self-Speculative Decoding.
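
Since the abstract only sketches the mechanism, the following is a minimal, hypothetical PyTorch illustration of the hidden-transfer drafting step. The toy model, the use of simple linear layers as the transfer module, the layer index at which the transfer happens, and all dimensions are assumptions made here for illustration, not the authors' implementation; the tree-attention verification stage is only indicated in a comment.

```python
# Minimal sketch of the "hidden transfer" idea on a toy decoder-only transformer.
# At an intermediate layer, the hidden state of the last context token is mapped
# by small learned linear heads (an assumption; the paper's exact transfer module
# may differ) into "pseudo" hidden states for K future positions. The pseudo
# states then pass through the remaining layers together with the context, so the
# model drafts K extra tokens in a single forward pass.
import torch
import torch.nn as nn


def causal_mask(sz: int, device) -> torch.Tensor:
    # Float mask with -inf above the diagonal (standard causal attention mask).
    return torch.triu(torch.full((sz, sz), float("-inf"), device=device), diagonal=1)


class ToyHiddenTransferLM(nn.Module):
    def __init__(self, vocab=1000, d=64, n_heads=4, n_layers=4,
                 k_future=2, transfer_layer=1):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d, n_heads, dim_feedforward=4 * d,
                                       batch_first=True)
            for _ in range(n_layers)
        ])
        # One linear "transfer" head per future offset (hypothetical design).
        self.transfer = nn.ModuleList([nn.Linear(d, d) for _ in range(k_future)])
        self.lm_head = nn.Linear(d, vocab)
        self.transfer_layer = transfer_layer

    def forward(self, input_ids: torch.Tensor):
        b, t = input_ids.shape
        h = self.embed(input_ids)
        pseudo = None
        for i, layer in enumerate(self.layers):
            if pseudo is None:
                h = layer(h, src_mask=causal_mask(t, h.device))
                if i == self.transfer_layer:
                    # Hidden transfer: map the last context token's intermediate
                    # state to K pseudo hidden states for the future positions.
                    last = h[:, -1]                                    # (b, d)
                    pseudo = torch.stack([m(last) for m in self.transfer], dim=1)
            else:
                # Run context and pseudo states jointly through the remaining
                # layers so the pseudo states absorb more semantic information.
                joint = layer(torch.cat([h, pseudo], dim=1),
                              src_mask=causal_mask(t + pseudo.size(1), h.device))
                h, pseudo = joint[:, :t], joint[:, t:]
        # Logits for the normal next token plus draft logits for K future tokens.
        return self.lm_head(h[:, -1]), self.lm_head(pseudo)


# Usage: draft K+1 tokens in one pass; a verification step (the paper uses tree
# attention over several candidate sequences) would then accept only the drafts
# that match standard autoregressive decoding, keeping generation lossless.
model = ToyHiddenTransferLM()
ids = torch.randint(0, 1000, (1, 16))
next_logits, future_logits = model(ids)
print(next_logits.shape, future_logits.shape)  # [1, 1000] and [1, 2, 1000]
```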

References (40)
  1. Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding. arXiv preprint arXiv:2310.05424.
  2. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  3. Medusa: Simple framework for accelerating LLM generation with multiple decoding heads. https://github.com/FasterDecoding/Medusa.
  4. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.
  5. LoRAShear: Efficient large language model structured pruning and knowledge recovery. arXiv preprint arXiv:2310.18356.
  6. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
  7. Jang Hyun Cho and Bharath Hariharan. 2019. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802.
  8. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  9. Depth-adaptive transformer. arXiv preprint arXiv:1910.10073.
  10. Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR.
  11. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
  12. Compresso: Structured pruning with collaborative prompting learns compact large language models. arXiv preprint arXiv:2310.05015.
  13. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
  14. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  15. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. The Journal of Machine Learning Research, 22(1):10882–11005.
  16. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301.
  17. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713.
  18. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR.
  19. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846.
  20. LLM-QAT: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888.
  21. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR.
  22. SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781.
  23. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
  24. Improving language understanding by generative pre-training.
  25. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  26. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
  27. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  28. Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
  29. Benjamin Spector and Chris Re. 2023. Accelerating llm inference with staged speculative decoding. arXiv preprint arXiv:2308.04623.
  30. LLaMA: Open and efficient foundation language models.
  31. Llama 2: Open foundation and fine-tuned chat models.
  32. Attention is all you need. Advances in neural information processing systems, 30.
  33. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE.
  34. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45.
  35. Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694.
  36. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183.
  37. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
  38. Draft & verify: Lossless large language model acceleration via self-speculative decoding. arXiv preprint arXiv:2309.08168.
  39. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  40. Improving neural network quantization without retraining using outlier channel splitting. In International conference on machine learning, pages 7543–7552. PMLR.
Authors (8)
  1. Pengfei Wu
  2. Jiahao Liu
  3. Zhuocheng Gong
  4. Qifan Wang
  5. Jinpeng Li
  6. Jingang Wang
  7. Xunliang Cai
  8. Dongyan Zhao
Citations (1)