LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models

Published 30 Aug 2023 in cs.CL and cs.AI | (2308.16137v7)

Abstract: Today's LLMs typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encoding scientific articles, code repositories, or long dialogues. Through theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis further reveals that commonly used techniques like truncating the attention window or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7x decoding speed up and 7.5x memory saving over the original model. Our codes are released at https://github.com/Glaciohound/LM-Infinite.


Authors (7)

Citations (26)


Summary

The paper demonstrates that LM-Infinite significantly improves zero-shot length generalization by mitigating attention logit explosions and runaway attention entropy.
The method combines a Λ-shaped attention mask, which keeps only the initial and most recent tokens in view, with a distance ceiling that caps relative distances at the maximum seen in training.
Empirical evaluations show a 2.7x speedup in decoding and a 7.5x reduction in memory usage, with perplexity maintained on inputs up to 200M tokens.

Zero-Shot Extreme Length Generalization for LLMs

This paper addresses the failure of LLMs on inputs longer than those seen during training. Because the cost of Transformer attention grows quadratically with sequence length, LLMs are typically trained on short text segments (<4K tokens), and their performance degrades sharply on longer inputs. This limits their use in long-context tasks such as encoding scientific articles, code repositories, or long dialogues. The paper proposes LM-Infinite, a method that improves zero-shot length generalization without any parameter updates.

Key Contributions

Theoretical Insights: The paper identifies three primary factors contributing to LLMs' failure in generalizing over longer inputs:
- Unseen Distances: The attention logits explode for distances not encountered during training.
- Attention Span: As the number of attended tokens grows, the entropy of the attention distribution grows without bound.
- Initial Tokens Importance: The first few tokens carry distinctive features that attention outputs depend on, so they cannot simply be dropped.

Each of these factors can push the model's internal features outside the distribution seen during training, which degrades performance.
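One standard way to see the attention-span factor is that a softmax over n bounded logits cannot stay concentrated: if every logit lies in [-B, B], the entropy is at least ln(n) - 2B. The following is a minimal sketch of that bound, not code from the paper; the function names and the value of `BOUND` are assumptions chosen only for illustration.

```python
import math
import random


def attention_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_lower_bound(n, bound):
    """Lower bound on softmax entropy when all n logits lie in [-bound, bound].

    Each probability is at most exp(2 * bound) / n, so the entropy is at
    least ln(n) - 2 * bound.
    """
    return math.log(n) - 2.0 * bound


random.seed(0)
BOUND = 1.0  # assumed logit range, for illustration only
for n in (2**8, 2**11, 2**14):
    logits = [random.uniform(-BOUND, BOUND) for _ in range(n)]
    # Prints n, the measured entropy, and the lower bound for comparison.
    print(n, round(attention_entropy(logits), 3),
          round(entropy_lower_bound(n, BOUND), 3))
```

Because the bound grows like ln(n), a model trained with at most a few thousand attended tokens meets entropy values at longer lengths that it never saw in training, unless the number of attended tokens is limited.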

LM-Infinite Methodology:
- Λ-shaped Attention Mask: Each token attends only to the initial tokens and the most recent tokens. This limits runaway entropy while preserving the crucial early-sequence information.
- Distance Ceiling: Capping relative distances at the maximum seen during training prevents logit explosions.
- Optionally, middle tokens can be reintroduced based on top-k attention logits, enhancing downstream performance.
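The three components above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the released implementation: the names `n_start`, `window`, `max_distance`, and `k` are hypothetical, and the exact window convention (here, a query sees at most `window` recent tokens including itself) is an assumption.

```python
def lambda_mask(seq_len, n_start, window):
    """Build a Λ-shaped causal mask.

    mask[i][j] is True when query i may attend to key j: j must not be in
    the future (j <= i) and must be either one of the first `n_start`
    tokens or among the `window` most recent tokens (i - j < window).
    """
    return [
        [j <= i and (j < n_start or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def clipped_distance(i, j, max_distance):
    """Relative distance i - j, capped at `max_distance` (the distance ceiling)."""
    return min(i - j, max_distance)


def topk_middle(logits_row, i, n_start, window, k):
    """Optionally pick k middle keys for query i by largest attention logit.

    The middle region is every key that the Λ mask would otherwise hide:
    n_start <= j <= i - window. Returns the chosen indices in ascending order.
    """
    middle = range(n_start, i - window + 1)
    best = sorted(middle, key=lambda j: logits_row[j], reverse=True)[:k]
    return sorted(best)


# Usage: a 10-token sequence with 2 start tokens and a window of 3.
mask = lambda_mask(10, n_start=2, window=3)
visible = [j for j, ok in enumerate(mask[9]) if ok]
print(visible)
```

For the last query in this example, the visible keys are the two start tokens plus the three most recent positions, which gives the mask its Λ shape when drawn as a matrix.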

Empirical Evidence

LM-Infinite is evaluated on perplexity, downstream tasks, and efficiency:

Perplexity Evaluation: Models equipped with LM-Infinite maintain perplexity at input lengths up to 200M tokens, significantly outperforming baseline LLMs and even certain models trained explicitly on longer sequences.
Downstream Task Performance: On Passkey Retrieval and Qasper question answering in the zero-shot setting, LM-Infinite improves over the original models, showing that the benefit extends beyond perplexity.
Efficiency Gains: LM-Infinite delivers a notable 2.7x speedup in decoding and a 7.5x memory reduction over baseline models, facilitating resource-efficient implementations.
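A rough count of query-key pairs suggests where such savings can come from: under a Λ-shaped mask each query sees a bounded number of keys, so total attention work grows linearly rather than quadratically with length. The sketch below is a back-of-the-envelope count only; it does not reproduce the reported 2.7x and 7.5x figures, which depend on the model, hardware, and implementation. The parameter names `n_start` and `window` are assumptions.

```python
def attended_keys(i, n_start, window):
    """Number of keys visible to query i (0-indexed) under a Λ-shaped mask.

    The query sees the first `n_start` tokens plus the `window` most recent
    tokens (including itself), but never more than the i + 1 tokens that
    exist so far.
    """
    return min(i + 1, n_start + window)


def lambda_pairs(seq_len, n_start, window):
    """Total query-key pairs scored under the Λ-shaped mask."""
    return sum(attended_keys(i, n_start, window) for i in range(seq_len))


def causal_pairs(seq_len):
    """Total query-key pairs scored under a full causal mask."""
    return seq_len * (seq_len + 1) // 2


# Usage: compare the two counts at a few lengths, with illustrative settings.
for n in (4_096, 32_768, 262_144):
    ratio = causal_pairs(n) / lambda_pairs(n, n_start=10, window=4_096)
    print(n, round(ratio, 2))
```

The same bound applies to memory if a key-value cache keeps only the attendable keys: at most `n_start + window` entries per layer, regardless of input length.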

Implications and Future Work

The research enables LLMs to handle much longer contexts at inference time without additional training, which broadens the range of applications they can serve. It also suggests that the length limits of current LLMs can be relaxed with minimal computational overhead.

Future work may explore adaptive mechanisms for selectively reintroducing middle tokens and further tuning of the distance ceiling. Investigating LM-Infinite's application during fine-tuning or across different Transformer architectures might offer additional insights. Efforts focused on proprietary LLMs could extend this work’s applicability across diverse AI systems.

By relying on what the pretrained model already encodes rather than retraining it, LM-Infinite contributes both to the theoretical understanding of length generalization failure and to the practical deployment of LLMs in tasks that require long-context processing.
