Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding (2402.16844v3)

Published 26 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines LLMs of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small LLM (SLM), which then generates the response more efficiently. We investigate the combination of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families and only require fine-tuning of the SLM. Experiments with various benchmarks show substantial speedups of up to $4\times$, with minor performance penalties of $1-2\%$ for translation and summarization tasks compared to the LLM.

LLM-to-SLM: Enhancing Autoregressive Decoding Efficiency with Hybrid LLMs

Introduction to LLM-to-SLM

In the domain of Natural Language Generation (NLG), deploying LLMs efficiently has been a significant challenge, primarily due to their substantial computational demands and the sequential nature of autoregressive decoding. The paper addresses this problem with a hybrid approach termed LLM-to-SLM (Large Language Model to Small Language Model). This approach capitalizes on the strengths of both large and small models, leveraging the high-quality representations of an LLM to condition a more computationally efficient SLM that performs the autoregressive generation. The core innovation is a single encoding pass with the LLM that guides the generation process of the SLM, striking a balance between high performance and low computational overhead.

Methodology

The paper introduces a framework in which a pretrained LLM encodes the input prompt into a comprehensive representation. This representation then conditions an SLM, which generates the output sequence. The computational burden is reduced by limiting the heavy LLM to a single encoding pass over the prompt and delegating every autoregressive decoding step to the more efficient SLM.

Key elements of this methodology include:

  • Hybrid Model Architecture: The integration of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families, requiring only fine-tuning of the SLM.
  • Efficiency Gains: Empirical results demonstrate substantial efficiency improvements, achieving speedups of up to 4 times, with only a minor performance decrease in comparison to using an LLM alone.
  • Implementation Details: LLM-to-SLM uses a simple MLP projector to map the prompt representation from the LLM's embedding space to that of the SLM, which conditions the SLM's autoregressive generation; a minimal code sketch of this pipeline follows the list.
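
The sketch below is a minimal, self-contained illustration of this pipeline, not the authors' implementation: the module names, dimensions, and the concatenation-based conditioning are illustrative assumptions, and a real system would use pretrained checkpoints, causal masking, and KV caching. It shows the essential flow of one parallel LLM pass over the prompt, an MLP projection into the SLM's embedding space, and an autoregressive loop that runs only the SLM.

```python
import torch
import torch.nn as nn

VOCAB, D_LLM, D_SLM, MAX_NEW_TOKENS = 1000, 64, 32, 20

class FrozenLLMEncoder(nn.Module):
    """Stand-in for a large pretrained encoder; kept frozen at all times."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_LLM)
        self.layer = nn.TransformerEncoderLayer(D_LLM, nhead=4, batch_first=True)

    def forward(self, prompt_ids):
        with torch.no_grad():  # the LLM is never fine-tuned in LLM-to-SLM
            return self.layer(self.embed(prompt_ids))

class SmallDecoder(nn.Module):
    """Stand-in for the SLM; causal masking and KV caching omitted for brevity."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_SLM)
        self.layer = nn.TransformerEncoderLayer(D_SLM, nhead=4, batch_first=True)
        self.head = nn.Linear(D_SLM, VOCAB)

    def forward(self, token_ids, prompt_feats):
        x = torch.cat([prompt_feats, self.embed(token_ids)], dim=1)  # condition on projected prompt
        return self.head(self.layer(x))[:, -1]  # next-token logits

llm, slm = FrozenLLMEncoder(), SmallDecoder()
projector = nn.Sequential(  # simple MLP mapping LLM features to the SLM embedding space
    nn.Linear(D_LLM, D_SLM), nn.GELU(), nn.Linear(D_SLM, D_SLM)
)

prompt_ids = torch.randint(0, VOCAB, (1, 8))     # dummy prompt tokens
prompt_feats = projector(llm(prompt_ids))        # single parallel LLM pass, done once
generated = torch.zeros(1, 1, dtype=torch.long)  # arbitrary start token
for _ in range(MAX_NEW_TOKENS):                  # autoregressive loop runs only the SLM
    logits = slm(generated, prompt_feats)
    next_id = logits.argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_id], dim=1)
print(generated.shape)  # torch.Size([1, 21])
```

Because the LLM stays frozen, gradients during fine-tuning flow only through the projector and the SLM, which is what keeps adaptation cheap relative to training or fine-tuning the large model itself.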

Empirical Evaluation

The paper's empirical evaluation spans several benchmarks, including machine translation, summarization, and instruction tuning, across different languages and datasets. The results show that the method maintains performance close to that of the LLM while substantially increasing computational efficiency. Notably, the LLM-to-SLM configuration achieves speedups of 4.2x for translation and 3.0x for summarization, with only a 1 to 2 percent drop in the corresponding performance metrics.
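
As a rough sanity check on where such speedups come from, the snippet below works through a back-of-the-envelope cost model (an assumption for illustration, not the paper's analysis): the hybrid pays for one parallel LLM pass over the prompt plus many cheap SLM decoding steps, while the baseline pays for a full LLM forward pass at every decoding step. All timings are made-up illustrative numbers.

```python
# Hypothetical cost model for the LLM-to-SLM speedup; timings are illustrative
# assumptions, not measurements from the paper.
def speedup(t_llm_step: float, t_slm_step: float, t_prompt_pass: float, num_tokens: int) -> float:
    baseline = num_tokens * t_llm_step                 # LLM runs every decoding step
    hybrid = t_prompt_pass + num_tokens * t_slm_step   # one LLM prompt pass + SLM decoding
    return baseline / hybrid

# Example: an SLM step 5x cheaper than an LLM step and a prompt pass costing
# ten LLM decode steps lands in roughly the 4x regime reported in the paper.
print(round(speedup(t_llm_step=1.0, t_slm_step=0.2, t_prompt_pass=10.0, num_tokens=200), 1))  # 4.0
```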

Theoretical and Practical Implications

The approach underscores a pivotal shift towards more computationally efficient deployment of LLMs, particularly in scenarios where latency and computational resources are limiting factors. Theoretically, it makes a compelling case for distributing work among models of varying sizes, a principle that could extend beyond LLMs to other domains within AI. Practically, the method opens up new possibilities for deploying advanced NLG applications on edge devices, where computational resources are scarce.

Future Directions

The paper outlines several areas for future development, including exploring the potential of decoder-only LLMs within this framework, investigating the dynamic invocation of LLMs for further efficiency gains, and extending the approach to models with billions of parameters to understand scalability implications fully. These directions not only promise to refine the LLM-to-SLM approach but also contribute to the broader research landscape on efficient AI model deployment.

Conclusion

This paper introduces LLM-to-SLM, a method that addresses the computational inefficiency of autoregressive decoding in LLMs. By leveraging the high-quality encodings of an LLM to guide the generation process of an SLM, it achieves significant improvements in speed and efficiency without substantially compromising performance. As this research area evolves, LLM-to-SLM stands as a significant step towards more sustainable and practical deployment of LLMs in real-world scenarios.

Authors (6)
  1. Benjamin Bergner
  2. Andrii Skliar
  3. Amelie Royer
  4. Tijmen Blankevoort
  5. Yuki Asano
  6. Babak Ehteshami Bejnordi