An Analysis of the Focused Transformer: Context Scaling Through Contrastive Training
The paper presents the Focused Transformer (FoT), a technique for extending the effective context length of large language models (LLMs) by addressing the distraction issue that arises in multi-document settings: as the number of documents in the context grows, the model has increasing difficulty distinguishing relevant from irrelevant information. The proposed method uses a contrastive-learning-inspired training procedure to improve the structure of the (key, value) space, extending the usable context length of transformer models without altering their architecture.
Methodology
FoT introduces memory attention layers that use k-nearest-neighbor (kNN) lookup to access additional (key, value) pairs during inference. This lets the model retrieve relevant information from a large memory, effectively extending its usable context length. The memory is integrated differently from previous approaches, eschewing gating mechanisms in favor of a simpler, and potentially more effective, single attention over local and retrieved entries; a minimal sketch of this mechanism follows.
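The snippet below is a minimal, single-head PyTorch sketch of such a layer, assuming a flat tensor memory and exact (brute-force) kNN over it. The class and buffer names (MemoryAttention, memory_keys, memory_values) are illustrative rather than taken from the paper's code, and causal masking and multi-head details are omitted for brevity.

```python
import torch
import torch.nn.functional as F


class MemoryAttention(torch.nn.Module):
    """Single-head attention over local keys plus top-k keys retrieved from memory."""

    def __init__(self, dim: int, top_k: int = 16):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.out_proj = torch.nn.Linear(dim, dim)
        self.top_k = top_k
        # Flat memory of (key, value) pairs produced while processing earlier chunks.
        self.register_buffer("memory_keys", torch.empty(0, dim))
        self.register_buffer("memory_values", torch.empty(0, dim))

    @torch.no_grad()
    def write_memory(self, keys: torch.Tensor, values: torch.Tensor) -> None:
        self.memory_keys = torch.cat([self.memory_keys, keys], dim=0)
        self.memory_values = torch.cat([self.memory_values, values], dim=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, dim), the local context of one example; masking omitted.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scale = q.shape[-1] ** -0.5
        local_scores = (q @ k.t()) * scale                     # (seq_len, seq_len)

        if self.memory_keys.shape[0] > 0:
            # Exact kNN: score every memory key against every query, keep the top_k.
            k_top = min(self.top_k, self.memory_keys.shape[0])
            sims = (q @ self.memory_keys.t()) * scale          # (seq_len, mem_size)
            mem_scores, idx = sims.topk(k_top, dim=-1)         # (seq_len, k_top)
            mem_v = self.memory_values[idx]                    # (seq_len, k_top, dim)
            # One softmax over local and retrieved keys; no gating network.
            attn = F.softmax(torch.cat([local_scores, mem_scores], dim=-1), dim=-1)
            local_attn, mem_attn = attn[:, : k.shape[0]], attn[:, k.shape[0]:]
            out = local_attn @ v + (mem_attn.unsqueeze(-1) * mem_v).sum(dim=1)
        else:
            out = F.softmax(local_scores, dim=-1) @ v

        # Store this chunk's (key, value) pairs for retrieval by later chunks.
        self.write_memory(k.detach(), v.detach())
        return self.out_proj(out)
```

In a full model only a subset of layers would be memory attention layers, and a production system would replace the brute-force lookup with an approximate index.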
The crossbatch training procedure is FoT's central innovation. During training, the memory attention layers are exposed to keys and values from the pertinent context as well as from unrelated documents in the batch (negatives), which teaches the model to differentiate relevant from irrelevant keys. Because this exposure happens inside ordinary, differentiable attention, gradients shape the query, key, and value representations directly during training; a sketch of the idea follows.
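A hedged sketch of this idea, under the assumption that each training example attends to its own previous window plus the previous windows of a few other batch elements; the function and tensor names are illustrative, and the actual procedure operates per memory layer and per head:

```python
import torch
import torch.nn.functional as F


def crossbatch_attention(q, k_prev, v_prev, num_negatives: int = 2):
    """
    q:      (batch, seq, dim) queries from each sequence's current window
    k_prev: (batch, seq, dim) keys from each sequence's previous window
    v_prev: (batch, seq, dim) values from each sequence's previous window
    """
    batch, seq, dim = q.shape
    outputs = []
    for i in range(batch):
        # Relevant context: the same sequence's previous window.
        keys, values = [k_prev[i]], [v_prev[i]]
        # Negatives: previous windows of other sequences in the batch.
        others = [j for j in torch.randperm(batch).tolist() if j != i][:num_negatives]
        for j in others:
            keys.append(k_prev[j])
            values.append(v_prev[j])
        k_all = torch.cat(keys, dim=0)                 # ((1 + negatives) * seq, dim)
        v_all = torch.cat(values, dim=0)
        # Plain differentiable attention over relevant and irrelevant keys: the
        # language-modeling loss rewards attending to the former and ignoring
        # the latter, shaping the (key, value, query) space.
        scores = (q[i] @ k_all.t()) * dim ** -0.5
        outputs.append(F.softmax(scores, dim=-1) @ v_all)
    return torch.stack(outputs)                        # (batch, seq, dim)
```

The loop over the batch is written for clarity; a vectorized version would gather negatives with batched indexing.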
Results and Discussion
The authors demonstrate the efficacy of FoT by producing LongLLaMA models, fine-tuned versions of OpenLLaMA checkpoints. These models show significant gains on tasks requiring extended contexts, handling up to 256k tokens on the passkey retrieval task. In few-shot learning on datasets such as TREC and WebQS, LongLLaMA models improve markedly when more demonstration examples are packed into the extended context.
The paper also highlights that FoT can adapt existing models to longer contexts without any architectural modifications. This distinguishes it from approaches that change the model structure: FoT leverages the capabilities already present in the base model and extends them through an efficient fine-tuning stage.
Theoretical Implications
FoT addresses a critical challenge in scaling transformers to long contexts, namely the distraction issue. By borrowing elements of contrastive learning, the model develops a more structured key space that is better suited to long-context tasks. This not only aligns with the existing literature on contrastive representation learning but also extends its application to the attention layers of transformer architectures, opening avenues for future research on adaptable long-context models.
Future Directions
The scalability of FoT suggests integration with approximate kNN search methods for further efficiency; a small illustration appears below. Exploring combinations of FoT with other long-context techniques, such as positional interpolation, could yield additional gains. The crossbatch procedure itself could be refined with more advanced contrastive learning techniques to structure the memory key space more effectively.
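As a small illustration of the approximate-search direction, the snippet below swaps a brute-force memory lookup for a FAISS IVF index. The flat key array and the chosen index parameters (256 clusters, nprobe = 8) are assumptions for illustration, not settings from the paper.

```python
import numpy as np
import faiss

dim, n_keys, top_k = 128, 100_000, 16
keys = np.random.randn(n_keys, dim).astype("float32")      # stored memory keys
queries = np.random.randn(4, dim).astype("float32")        # current queries

# Exact inner-product search: what a brute-force memory lookup computes.
exact = faiss.IndexFlatIP(dim)
exact.add(keys)

# Approximate search: an IVF index clusters the keys and probes a few clusters.
quantizer = faiss.IndexFlatIP(dim)
approx = faiss.IndexIVFFlat(quantizer, dim, 256, faiss.METRIC_INNER_PRODUCT)
approx.train(keys)
approx.add(keys)
approx.nprobe = 8                                           # clusters searched per query

_, exact_ids = exact.search(queries, top_k)
_, approx_ids = approx.search(queries, top_k)
# approx_ids approximates exact_ids at a fraction of the search cost for large memories.
```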
In conclusion, the Focused Transformer offers a compelling approach to extending the context length of LLMs through contrastive-inspired training. Its simplicity in implementation and effectiveness in extending context without architectural changes make it a promising addition to the toolbox for scaling transformer models in multi-document environments.