
Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks (2403.13112v3)

Published 19 Mar 2024 in cs.CL

Abstract: Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger, more generalized decoder-only models such as GPT-4. We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks where multiple outputs are required for a single shared input. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency by avoiding duplicate input encoding and increasing the operational intensity (ratio of arithmetic operations to memory accesses) of the decoding process by sharing the input key-value cache. We achieve a computation reduction that roughly scales with the number of subtasks, gaining up to 4.6x speed-up over state-of-the-art models for dialogue state tracking, summarization, and question-answering tasks, with comparable or better performance.


Summary

  • The paper presents Prompt-in-Decoder (PiD), a novel decode-in-parallel strategy that reduces redundant input encoding by shifting prompt placement to the decoder.
  • The paper demonstrates significant computational gains, achieving up to a 4.6x speed-up in inference efficiency while lowering memory access in critical attention mechanisms.
  • The paper validates PiD’s adaptability across various NLP tasks, including dialogue tracking, summarization, and question-answering, while maintaining competitive performance.

Efficient Multi-Prompt Decoding with Prompt-in-Decoder (PiD)

Transformers have redefined the landscape of NLP with their ability to handle a wide array of tasks. However, their significant computational costs are often a stumbling block, especially in deployment scenarios where resources are constrained or efficiency is paramount. Encoder-decoder models, while versatile, exacerbate this issue when tasked with generating multiple outputs from a single shared input, as is common in specialized domains. To address this, we introduce a decode-in-parallel strategy, Prompt-in-Decoder (PiD), designed to significantly improve the efficiency of encoder-decoder models on such tasks.

PiD revisits the conventional practice of embedding prompts in the encoder (Prompt-in-Encoder, or PiE), which requires re-encoding the input for each prompt and thereby inflates computational costs. Instead, PiD places the prompts in the decoder, so the input is encoded once and reused across all decoding subtasks, which share its key-value cache. This shift not only removes the redundant encoding overhead but also reduces memory access during decoding, leading to notable computational savings.
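
To make the mechanism concrete, below is a minimal sketch of the encode-once, decode-in-parallel pattern using a T5-style model from Hugging Face Transformers. This is not the authors' implementation: the document and prompt strings are placeholders, and the encoder states are simply broadcast as a view across the prompt batch, whereas the paper's full speed-ups also rely on sharing the input key-value cache inside the attention computation during incremental decoding.

```python
# Minimal sketch of the PiD pattern: encode the shared input once, then run
# several decoder prompts over the same encoder states in a single batch.
# NOT the authors' code; document/prompt strings are placeholders.
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

document = "Patient reports mild headache and dizziness since Monday ..."  # shared input
prompts = ["symptom:", "onset:", "severity:"]                              # one subtask each
n = len(prompts)

# 1) Encode the shared input a single time.
enc = tok(document, return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.get_encoder()(**enc)

# 2) Broadcast the encoder states across the prompt batch. expand() returns a
#    view, so every decoder sequence reads the same underlying input states.
encoder_outputs.last_hidden_state = encoder_outputs.last_hidden_state.expand(n, -1, -1)
enc_mask = enc["attention_mask"].expand(n, -1)

# 3) Place the prompts in the decoder and run all subtasks in one forward pass.
dec = tok(prompts, return_tensors="pt", padding=True, add_special_tokens=False)
with torch.no_grad():
    out = model(
        encoder_outputs=encoder_outputs,
        attention_mask=enc_mask,
        decoder_input_ids=dec["input_ids"],
        decoder_attention_mask=dec["attention_mask"],
    )

# Next-token logits at each prompt's last real (non-pad) position.
last = dec["attention_mask"].sum(dim=1) - 1
next_token_logits = out.logits[torch.arange(n), last, :]
print(next_token_logits.shape)  # (num_prompts, vocab_size)
```

From here each subtask continues decoding as an ordinary batched sequence; the point of the sketch is only that the expensive input is encoded once and its states are reused by every prompt.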

Operational Efficiency of PiD

Through a comprehensive performance analysis, PiD demonstrates superior operational efficiency over the PiE configuration across the components of the encoder-decoder framework. For instance, PiD drastically reduces the encoder's memory access and arithmetic operations by encoding the input once, irrespective of the number of prompts, whereas PiE scales both with the number of prompts. Similarly, while PiD and PiE exhibit comparable operational intensity in the decoder's self-attention, PiD markedly reduces memory access in the decoder's cross-attention by sharing the encoded input's key-value states across prompts. This reduction in memory access enables PiD to cut computation roughly in proportion to the number of subtasks and to achieve up to a 4.6x speed-up in inference.
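
As a rough, back-of-the-envelope illustration of that cross-attention saving (the sizes below are hypothetical, not measurements from the paper), compare the encoder key/value bytes a decoder must read per incremental decoding step when the input states are duplicated per prompt versus shared:

```python
# Back-of-the-envelope cross-attention memory traffic per decoding step.
# All sizes are hypothetical, chosen only to show how the ratio scales.
def kv_bytes_per_step(num_prompts, input_len, d_model, bytes_per_elem=2, shared_kv=False):
    """Approximate bytes of encoder K/V states read per decode step (per layer)."""
    one_copy = 2 * input_len * d_model * bytes_per_elem   # K and V for one encoded input
    copies = 1 if shared_kv else num_prompts              # PiD keeps one copy; PiE keeps one per prompt
    return copies * one_copy

m, L, d = 10, 1024, 768   # hypothetical: 10 subtasks, 1024-token input, d_model = 768
pie = kv_bytes_per_step(m, L, d, shared_kv=False)
pid = kv_bytes_per_step(m, L, d, shared_kv=True)
print(f"PiE ~{pie / 2**20:.0f} MiB/step vs PiD ~{pid / 2**20:.0f} MiB/step ({pie / pid:.0f}x less)")
```

Because the arithmetic per decoding step is roughly the same in both configurations, reading fewer bytes raises the operational intensity (operations per byte of memory traffic) of cross-attention, which is the roofline-style argument behind PiD's speed-up.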

Extensive Evaluations and Implications

Our evaluation of PiD spans diverse NLP tasks, from dialogue state tracking to abstractive summarization and question answering, showcasing its adaptability and effectiveness. PiD strikes a strong balance between computational efficiency and task quality, achieving comparable or better performance than state-of-the-art models at considerably lower computational cost. The gains are especially pronounced in real-world scenarios where multiple queries are made over the same document, since PiD's encode-once, decode-in-parallel strategy eliminates the redundant computation.
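
For intuition on what "decomposable" means here, the snippet below sketches how a single dialogue-state-tracking request over one shared dialogue can be split into per-slot subtasks; the slot names and prompt wording are hypothetical, not the paper's exact format.

```python
# Hypothetical decomposition of dialogue state tracking into per-slot subtasks.
# Slot names and prompt wording are illustrative, not the paper's exact prompts.
dialogue = (
    "User: I'm looking for a cheap hotel in the north with free parking. "
    "System: Sure, do you also need internet? User: Yes, please."
)
slots = ["hotel-pricerange", "hotel-area", "hotel-parking", "hotel-internet"]
prompts = [f"{slot}:" for slot in slots]

# The dialogue is encoded once; each prompt becomes one decoder sequence, so the
# work saved relative to re-encoding grows with the number of slots queried.
```

The same encode-once pattern applies whenever several outputs are requested over one shared input, for example multiple questions asked of a single document.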

Future Directions

While PiD introduces a significant leap towards computational efficiency in handling multi-query tasks, it opens avenues for further exploration. Its adaptability to decoder-only models, integration with model compression techniques, and extension to a broader range of tasks with structured outputs present exciting research prospects. Furthermore, exploring automated subtasking strategies can potentially extend PiD's applicability and enhance its performance.

In conclusion, PiD emerges as a promising approach that redefines efficiency in the computational landscape of NLP, enabling faster and more cost-effective deployment of transformer-based models in resource-constrained scenarios. By mitigating the computational burdens of traditional encoder-decoder models, PiD paves the way for more scalable and accessible NLP solutions across a myriad of applications.
