Learning to Compress Prompts with Gist Tokens: An Overview
The paper "Learning to Compress Prompts with Gist Tokens" by Jesse Mu, Xiang Lisa Li, and Noah Goodman addresses the efficiency challenges in using LLMs (LMs) through prompt compression. The primary innovation, dubbed "gisting," compresses prompts into smaller, reusable sets of "gist" tokens, which can be cached. This approach offers computational advantages without substantial performance loss, achieving compression rates of up to 26x on both decoder-only (LLaMA-7B) and encoder-decoder (FLAN-T5-XXL) LMs.
Introduction and Motivation
Traditional prompting occupies valuable input context window space and is inefficient because the same prompt must be re-encoded on every call. Alternative approaches such as finetuning and distillation avoid this overhead but require retraining for each new task, which is impractical. The paper proposes gisting to address these inefficiencies: LMs are trained to compress prompts into gist tokens, which saves substantial computation while preserving multitask capabilities.
Methodology
The core contribution of the paper is gisting, achieved by modifying the Transformer attention masks during instruction finetuning. The model learns to predict gist prefixes zero-shot from given prompts, eliminating the need to retrain for each task. Gist tokens are added to the model's vocabulary and inserted after the prompt, allowing arbitrary prompts to be condensed and reused. The modified attention mask prevents tokens that follow the gist tokens from attending to tokens before them, forcing the prompt's information to flow through the gist tokens' activations and thereby enforcing compression. Notably, this adaptation incurs no additional cost over standard instruction finetuning.
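The masking idea can be illustrated with a small sketch. The helper below is a minimal reconstruction under stated assumptions, not the authors' code: it builds a causal mask for a decoder-only LM in which positions after the gist span cannot see positions before it, so the only route to the prompt is through the gist tokens. The function name and the gist token id are illustrative.

```python
import torch

def make_gist_mask(token_ids, gist_token_id):
    """Minimal sketch of a gist attention mask for a decoder-only LM.

    Returns a (seq_len, seq_len) boolean mask where True means
    "query position i may attend to key position j". Tokens after the
    gist span may not attend to tokens before it, so the prompt is
    reachable only through the gist tokens.
    """
    seq_len = token_ids.shape[-1]
    # Standard lower-triangular causal mask.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    gist_positions = (token_ids == gist_token_id).nonzero().flatten()
    if gist_positions.numel() == 0:
        return causal  # no gist tokens: plain causal attention

    first, last = int(gist_positions[0]), int(gist_positions[-1])
    positions = torch.arange(seq_len)
    after_gist = positions > last    # query positions past the gist span
    before_gist = positions < first  # key positions before the gist span
    blocked = after_gist.unsqueeze(1) & before_gist.unsqueeze(0)

    return causal & ~blocked

# Toy sequence: [prompt, prompt, GIST, input, output]
ids = torch.tensor([11, 12, 50000, 21, 22])
print(make_gist_mask(ids, gist_token_id=50000).int())
```

In the toy printout, the gist token still attends to the prompt, while the input and output positions attend only to the gist token and onward. For an encoder-decoder model such as FLAN-T5, the paper applies the same idea to the encoder's bidirectional mask and to the decoder's cross-attention, so that the decoder likewise reaches the instruction only through the gist tokens.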
Experimental Evaluation
Experiments conducted on LLaMA-7B and FLAN-T5-XXL demonstrate the efficacy of gisting:
- Prompt Compression: Gisting achieves up to 26x prompt compression with marginal losses in output quality, as measured by ROUGE-L and ChatGPT-based evaluation.
- Computational Efficiency: The approach yields reductions of up to 40% in FLOPs and up to 4.2% in wall-clock latency, demonstrating practical computational benefits.
- Human and AI-Assisted Evaluations: The paper combines automatic metrics with ChatGPT-based and human evaluation and finds that gist models produce competitive outputs even on unseen instructions, confirming that the compressed representations generalize.
Efficiency Gains
Gisting reduces resource usage during model inference in two main ways:
- Lower Memory Usage: Caching a handful of gist tokens instead of the full prompt shrinks the key/value cache, making the approach viable on systems with limited resources.
- Enhanced Prompt Caching: The compression allows prompts to be stored and reused more efficiently. Compared to traditional prompt caching, gist caching cuts storage requirements by roughly an order of magnitude (up to 26x), making it practical to keep many prompts cached at once; see the sketch after this list.
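To make the storage argument concrete, here is a back-of-the-envelope estimate rather than a figure from the paper: the shape constants match LLaMA-7B's public configuration, and the 26-token prompt versus a single gist token is an illustrative pairing chosen to mirror the reported ~26x compression.

```python
# Rough estimate of key/value-cache storage for prompt caching vs. gist caching.

def kv_cache_bytes(num_tokens, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_value=2):  # fp16 activations
    # Keys and values are each (n_layers, num_tokens, n_heads * head_dim).
    return 2 * n_layers * num_tokens * n_heads * head_dim * bytes_per_value

full_prompt = kv_cache_bytes(num_tokens=26)  # cache the whole prompt
gist_only = kv_cache_bytes(num_tokens=1)     # cache a single gist token

print(f"full prompt cache: {full_prompt / 1e6:.1f} MB")   # ~13.6 MB
print(f"gist cache:        {gist_only / 1e6:.2f} MB")     # ~0.52 MB
print(f"compression:       {full_prompt / gist_only:.0f}x")
```

The absolute numbers depend on the model and precision, but the ratio tracks the token-level compression: caching one gist token in place of a 26-token prompt cuts the cached activations by 26x.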
Context Distillation Perspective
The authors offer an alternative perspective by framing gisting as context distillation amortized over a distribution of tasks. Unlike prior work that distills one task at a time via gradient descent, gisting learns to predict the gist tokens directly from the instruction, enabling generalization across many tasks; this is similar in spirit to methods like HyperTuning but focuses on zero-shot generalization from instructions.
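In rough notation (ours, not the paper's exact formulation), the contrast can be written as follows, where $p_{\mathrm{LM}}$ is the original model, $t$ a task prompt, $x$ an input, and $G$ the learned gist compressor:

```latex
% Per-task context distillation: fit separate parameters \theta_t for each task t
p_{\theta_t}(y \mid x) \;\approx\; p_{\mathrm{LM}}(y \mid t, x) \qquad \text{for one fixed } t

% Gisting: a single model amortizes compression over a task distribution T
p_{\mathrm{LM}}\!\big(y \mid G(t), x\big) \;\approx\; p_{\mathrm{LM}}(y \mid t, x) \qquad \text{for } t \sim T
```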
Limitations and Future Directions
While the current work shows promising results, some limitations and potential future research directions include:
- Edge Cases and Nuance Loss: Certain nuances in specific prompts may be lost during compression, as seen in some failure cases involving precise definitions or repetitive behavior.
- Prompt Length Variation: Gisting currently handles prompt lengths well within normal usage limits, but future work could extend to extremely long prompts typical in few-shot learning contexts.
- Model Freeze and External Gist Generators: Investigating methods to generate gist tokens using external models without the need for finetuning large LMs could provide more flexibility.
Conclusion
Gisting presents an efficient, cost-effective method for compressing prompts in LMs, balancing computational efficiency against performance. It opens a path toward large-scale deployment of LMs with reduced computational overhead, making it feasible to serve prompt-heavy applications with fewer resources. Future research can build on these foundations to refine and broaden prompt compression techniques.
The findings significantly contribute to improving the practical utility and scalability of LMs in real-world applications, demonstrating a compelling approach to optimizing prompt-based interactions.