Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks (2403.13112v3)
Abstract: Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger, more general decoder-only models such as GPT-4. We introduce a new configuration for encoder-decoder models that improves efficiency on structured-output and decomposable tasks, where multiple outputs are required for a single shared input. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the outputs in parallel, boosting both training and inference efficiency by avoiding duplicate input encoding and by increasing the operational intensity (the ratio of arithmetic operations to memory accesses) of the decoding process through a shared input key-value cache. We achieve a computation reduction that scales roughly with the number of subtasks, gaining up to a 4.6x speed-up over state-of-the-art models on dialogue state tracking, summarization, and question-answering tasks, with comparable or better performance.
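To make the encode-once, decode-in-parallel idea concrete, below is a minimal sketch using Hugging Face T5. The checkpoint name, the shared dialogue input, and the subtask prompts are illustrative placeholders, and the sketch only mirrors the high-level structure described in the abstract (a single encoder pass reused across batched decoder-side prompts); it is not the authors' implementation and does not reproduce the kernel-level key-value cache sharing that yields the memory-bandwidth savings.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

# Illustrative checkpoint, shared input, and subtask prompts (placeholders).
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()
tokenizer = AutoTokenizer.from_pretrained("t5-small")

shared_input = "dialogue: user: I need a cheap hotel in the north and a taxi at 7pm."
subtask_prompts = ["hotel price range:", "hotel area:", "taxi leave at:"]
k = len(subtask_prompts)

# 1) Encode the shared input exactly once.
enc = tokenizer(shared_input, return_tensors="pt")
with torch.no_grad():
    enc_out = model.get_encoder()(**enc)

# 2) Broadcast the single encoder pass across the K subtask prompts.
#    expand() creates a view, so the encoder states are not copied per subtask.
hidden = enc_out.last_hidden_state.expand(k, -1, -1)
enc_mask = enc["attention_mask"].expand(k, -1)
shared_encoder_outputs = BaseModelOutput(last_hidden_state=hidden)

# 3) Put each subtask prompt on the decoder side and batch them together.
dec = tokenizer(subtask_prompts, return_tensors="pt", padding=True, add_special_tokens=False)
start = torch.full((k, 1), model.config.decoder_start_token_id)
decoder_input_ids = torch.cat([start, dec["input_ids"]], dim=1)
decoder_attention_mask = torch.cat([torch.ones(k, 1, dtype=torch.long), dec["attention_mask"]], dim=1)

# 4) One batched decoder step over all subtasks; a full greedy loop would
#    repeat this with a growing decoder prefix and a reused KV cache.
with torch.no_grad():
    out = model(
        encoder_outputs=shared_encoder_outputs,
        attention_mask=enc_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
    )

# Next-token prediction at each prompt's last real (non-pad) position.
last = decoder_attention_mask.sum(dim=1) - 1
next_ids = out.logits[torch.arange(k), last].argmax(dim=-1)
print(tokenizer.batch_decode(next_ids.unsqueeze(-1)))
```

Note that with standard attention kernels, broadcasting the encoder states still reads them once per batch row during cross-attention; realizing the full operational-intensity gain requires shared-prefix decoding kernels of the kind the paper builds on (e.g., cascade inference or Hydragen-style batched prefix attention).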