
Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks (2403.13112v3)

Published 19 Mar 2024 in cs.CL

Abstract: Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger, more generalized decoder-only models such as GPT-4. We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks where multiple outputs are required for a single shared input. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency by avoiding duplicate input encoding and increasing the operational intensity (ratio of arithmetic operations to memory accesses) of the decoding process by sharing the input key-value cache. We achieve a computation reduction that roughly scales with the number of subtasks, gaining up to 4.6x speed-up over state-of-the-art models for dialogue state tracking, summarization, and question-answering tasks, with comparable or better performance.


Summary

  • The paper presents Prompt-in-Decoder (PiD), a novel decode-in-parallel strategy that reduces redundant input encoding by shifting prompt placement to the decoder.
  • The paper demonstrates significant computational gains, achieving up to a 4.6x speed-up in inference efficiency while lowering memory access in critical attention mechanisms.
  • The paper validates PiD’s adaptability across various NLP tasks, including dialogue tracking, summarization, and question-answering, while maintaining competitive performance.

Efficient Multi-Prompt Decoding with Prompt-in-Decoder (PiD)

Transformers have redefined the landscape of NLP with their ability to handle a wide array of tasks. However, their significant computational costs are often a stumbling block, especially in deployment scenarios where resources are constrained or efficiency is paramount. Encoder-decoder models, while versatile, exacerbate this issue when tasked with generating multiple outputs from a single shared input, as is common in specialized domains. To address this, we introduce a decode-in-parallel strategy, Prompt-in-Decoder (PiD), designed to significantly improve the efficiency of encoder-decoder models on such tasks.

PiD revisits the conventional practice of embedding prompts in the encoder (Prompt-in-Encoder, or PiE), which requires re-encoding the input for each prompt and thereby inflates computational costs. Instead, PiD places the prompts in the decoder, so the input is encoded once and reused across all decoding subtasks, which share its key-value cache. This shift not only removes the redundant encoding overhead but also reduces memory access during decoding, leading to notable computational savings.
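
To make the mechanism concrete, below is a minimal sketch of the encode-once, decode-in-parallel pattern using a T5-style model from Hugging Face Transformers. This is not the authors' implementation: the document and prompt strings are placeholders, and the encoder states are simply broadcast as a view across the prompt batch, whereas the paper's full speed-ups also rely on sharing the input key-value cache inside the attention computation during incremental decoding.

```python
# Minimal sketch of the PiD pattern: encode the shared input once, then run
# several decoder prompts over the same encoder states in a single batch.
# NOT the authors' code; document/prompt strings are placeholders.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

document = "Patient reports mild headache and dizziness since Monday ..."  # shared input
prompts = ["symptom:", "onset:", "severity:"]                              # one subtask each
n = len(prompts)

# 1) Encode the shared input a single time.
enc = tok(document, return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.get_encoder()(**enc)

# 2) Broadcast the encoder states across the prompt batch. expand() returns a
#    view, so every decoder sequence reads the same underlying input states.
encoder_outputs.last_hidden_state = encoder_outputs.last_hidden_state.expand(n, -1, -1)
enc_mask = enc["attention_mask"].expand(n, -1)

# 3) Place the prompts in the decoder and run all subtasks in one forward pass.
dec = tok(prompts, return_tensors="pt", padding=True, add_special_tokens=False)
with torch.no_grad():
    out = model(
        encoder_outputs=encoder_outputs,
        attention_mask=enc_mask,
        decoder_input_ids=dec["input_ids"],
        decoder_attention_mask=dec["attention_mask"],
    )

# Next-token logits at each prompt's last real (non-pad) position.
last = dec["attention_mask"].sum(dim=1) - 1
next_token_logits = out.logits[torch.arange(n), last, :]
print(next_token_logits.shape)  # (num_prompts, vocab_size)
```

From here each subtask continues decoding as an ordinary batched sequence; the point of the sketch is only that the expensive input is encoded once and its states are reused by every prompt.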

Operational Efficiency of PiD

Through a comprehensive performance analysis, PiD demonstrates superior operational efficiency over the PiE configuration across the components of the encoder-decoder framework. For instance, PiD drastically reduces the encoder's memory access and arithmetic operations by encoding the input once, irrespective of the number of prompts, whereas PiE scales both with the number of prompts. Similarly, while PiD and PiE exhibit comparable operational intensity in the decoder's self-attention, PiD markedly reduces memory access in the decoder's cross-attention by sharing the encoded input's key-value states across prompts. This reduction in memory access enables PiD to cut computation roughly in proportion to the number of subtasks and to achieve up to a 4.6x speed-up in inference.
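
As a rough, back-of-the-envelope illustration of that cross-attention saving (the sizes below are hypothetical, not measurements from the paper), compare the encoder key/value bytes a decoder must read per incremental decoding step when the input states are duplicated per prompt versus shared:

```python
# Back-of-the-envelope cross-attention memory traffic per decoding step.
# All sizes are hypothetical, chosen only to show how the ratio scales.
def kv_bytes_per_step(num_prompts, input_len, d_model, bytes_per_elem=2, shared_kv=False):
    """Approximate bytes of encoder K/V states read per decode step (per layer)."""
    one_copy = 2 * input_len * d_model * bytes_per_elem   # K and V for one encoded input
    copies = 1 if shared_kv else num_prompts              # PiD keeps one copy; PiE keeps one per prompt
    return copies * one_copy

m, L, d = 10, 1024, 768   # hypothetical: 10 subtasks, 1024-token input, d_model = 768
pie = kv_bytes_per_step(m, L, d, shared_kv=False)
pid = kv_bytes_per_step(m, L, d, shared_kv=True)
print(f"PiE ~{pie / 2**20:.0f} MiB/step vs PiD ~{pid / 2**20:.0f} MiB/step ({pie / pid:.0f}x less)")
```

Because the arithmetic per decoding step is roughly the same in both configurations, reading fewer bytes raises the operational intensity (operations per byte of memory traffic) of cross-attention, which is the roofline-style argument behind PiD's speed-up.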

Extensive Evaluations and Implications

Our evaluation of PiD spans diverse NLP tasks, from dialogue state tracking to abstractive summarization and question answering, showcasing its adaptability and effectiveness. PiD strikes a strong balance between computational efficiency and task quality, achieving comparable or better performance than state-of-the-art models at considerably lower computational cost. The gains are especially pronounced in real-world scenarios where multiple queries are made over the same document, since PiD's encode-once, decode-in-parallel strategy eliminates the redundant computation.
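
For intuition on what "decomposable" means here, the snippet below sketches how a single dialogue-state-tracking request over one shared dialogue can be split into per-slot subtasks; the slot names and prompt wording are hypothetical, not the paper's exact format.

```python
# Hypothetical decomposition of dialogue state tracking into per-slot subtasks.
# Slot names and prompt wording are illustrative, not the paper's exact prompts.
dialogue = (
    "User: I'm looking for a cheap hotel in the north with free parking. "
    "System: Sure, do you also need internet? User: Yes, please."
)
slots = ["hotel-pricerange", "hotel-area", "hotel-parking", "hotel-internet"]
prompts = [f"{slot}:" for slot in slots]

# The dialogue is encoded once; each prompt becomes one decoder sequence, so the
# work saved relative to re-encoding grows with the number of slots queried.
```

The same encode-once pattern applies whenever several outputs are requested over one shared input, for example multiple questions asked of a single document.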

Future Directions

While PiD introduces a significant leap towards computational efficiency in handling multi-query tasks, it opens avenues for further exploration. Its adaptability to decoder-only models, integration with model compression techniques, and extension to a broader range of tasks with structured outputs present exciting research prospects. Furthermore, exploring automated subtasking strategies can potentially extend PiD's applicability and enhance its performance.

In conclusion, PiD emerges as a promising approach that redefines efficiency in the computational landscape of NLP, enabling faster and more cost-effective deployment of transformer-based models in resource-constrained scenarios. By mitigating the computational burdens of traditional encoder-decoder models, PiD paves the way for more scalable and accessible NLP solutions across a myriad of applications.
