Enhancing Information Utilization in LLMs with Information-Intensive Training
Introduction
The challenges of managing long-context information in LLMs have been a notable focus of recent research. Conventionally trained models often use information at the beginning or end of a context well but miss critical details in the middle, a phenomenon referred to as the "lost-in-the-middle" issue. Our paper introduces INformation-INtensive (IN2) training to address this challenge, enhancing the model's ability to locate and integrate information at any position within long inputs.
Methodology
Data Construction for In2 Training
IN2 training leverages a synthesized long-context question-answer dataset tailored to improve the model's use of information spread across long inputs. The dataset targets two capabilities:
- Fine-grained information awareness: training pairs whose questions target details within a single short segment of a long context (see the construction sketch after this list).
- Integration and reasoning across multiple segments: queries that require synthesizing information from two or more segments, strengthening the model's ability to integrate and reason over dispersed textual information.
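To make the fine-grained case concrete, here is a minimal construction sketch in Python. The segment size, context length, and the `generate_qa` helper (standing in for a strong LLM prompted to write a question-answer pair about a segment) are illustrative assumptions, not the paper's exact recipe.

```python
import random

SEGMENT_WORDS = 128      # segment granularity; the exact size is an assumption
TARGET_SEGMENTS = 64     # controls the overall context length

def split_into_segments(text, size=SEGMENT_WORDS):
    """Chop raw text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_fine_grained_example(corpus, generate_qa):
    """Build one long-context QA pair whose answer lives in a single segment.

    generate_qa(segment) -> (question, answer) is a hypothetical helper,
    e.g. a call to a capable LLM asked to write a QA pair about the segment.
    """
    segments = [s for text in corpus for s in split_into_segments(text)]
    needle = random.choice(segments)          # the segment holding the answer
    question, answer = generate_qa(needle)

    # Pad the context with random distractor segments and place the needle
    # at a uniformly random position, so answers never cluster at the
    # beginning or end of the context.
    pool = [s for s in segments if s is not needle]
    distractors = random.sample(pool, TARGET_SEGMENTS - 1)
    position = random.randrange(TARGET_SEGMENTS)
    context_segments = distractors[:position] + [needle] + distractors[position:]

    return {
        "context": "\n\n".join(context_segments),
        "question": question,
        "answer": answer,
    }
```

The multi-segment variant follows the same pattern, except that two or more needle segments are scattered at random positions and the helper is prompted to write a question that can only be answered by combining them.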
Technical Training Details
Our experimental setup fine-tunes a Mistral-7B model on the synthesized dataset. The resulting model, FILM-7B (FILl-in-the-Middle), is trained to draw on information from any position within its long context window.
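For concreteness, a minimal supervised fine-tuning sketch using Hugging Face transformers is shown below. The prompt template, hyperparameters, and 32K truncation length are illustrative assumptions rather than the exact FILM-7B recipe, and `raw_dataset` is assumed to be a `datasets.Dataset` produced by the construction step above.

```python
# Minimal supervised fine-tuning sketch with Hugging Face transformers.
# Hyperparameters and the prompt template are illustrative assumptions.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

def to_features(example):
    """Concatenate context, question, and answer into one training sequence."""
    text = (f"{example['context']}\n\nQuestion: {example['question']}\n"
            f"Answer: {example['answer']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=32768)

# raw_dataset: a datasets.Dataset of dicts from the construction step above.
train_dataset = raw_dataset.map(to_features, remove_columns=raw_dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="film-7b-sketch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=train_dataset,
    # Causal LM collation: labels are the input ids, shifted inside the model.
    # For simplicity the loss covers the full sequence; masking the context
    # tokens so only the answer is supervised is a common refinement.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```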
Performance Evaluations
Probing Tasks
We designed a set of probing tasks spanning three context styles (document, code, structured data) and three information retrieval patterns (forward, backward, bi-directional) to rigorously evaluate FILM-7B across diverse scenarios. The results show a significant reduction in the "lost-in-the-middle" issue, with FILM-7B retrieving information robustly across all positions in the context window.
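The following sketch shows the general shape of such a position-wise probe. The prompt template and the `answer_fn` wrapper around the model under test are assumptions; the paper's actual probing prompts differ.

```python
import random

def make_probe(num_facts, needle_depth):
    """Build one retrieval probe: distractor facts with a target fact
    inserted at a relative depth in [0, 1]."""
    facts = [f"The code for item {i} is {random.randint(1000, 9999)}."
             for i in range(num_facts)]
    # "0000" cannot collide with any 4-digit distractor code above.
    needle = "The code for item TARGET is 0000."
    facts.insert(int(needle_depth * num_facts), needle)
    return " ".join(facts), "What is the code for item TARGET?", "0000"

def probe_positions(answer_fn, depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=50):
    """Estimate retrieval accuracy at each relative needle position.

    answer_fn(context, question) -> str wraps the model under test.
    A flat accuracy curve across depths suggests the lost-in-the-middle
    issue has been mitigated; a U-shaped curve suggests it persists.
    """
    results = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            context, question, gold = make_probe(200, depth)
            if gold in answer_fn(context, question):
                hits += 1
        results[depth] = hits / trials
    return results
```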
Application to Real-world Tasks
Highlighted results include improvements on long-context benchmarks such as NarrativeQA, where FILM-7B improves the F1 score from 23.5 to 26.9. These results show that the model's gains extend beyond controlled probing conditions to broader, practical text-handling applications.
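For reference, NarrativeQA is conventionally scored with token-level F1 between predicted and reference answers. A standard SQuAD-style implementation (not code from the paper, and omitting the usual punctuation and article normalization for brevity) looks like this:

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```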
Implications for Short-context Tasks
Importantly, our approach maintains comparable performance on short-context tasks, ensuring that gains in long-context handling do not come at the expense of the model's capabilities in shorter scenarios.
Theoretical and Practical Implications
The adoption of FILM-7B could have far-reaching implications, particularly in fields requiring detailed analysis of lengthy documents, such as legal review, summarization of long academic articles, and interpretation of extensive technical manuals. Theoretically, this research advances our understanding of deep learning systems, stressing the importance of tailored training data in overcoming specific weaknesses of neural network training.
Future Research Directions
Future work could further optimize the training data and extend the model's proficiency to more complex or less structured long-context data types. Additional studies might also refine model architectures to process extremely long texts more efficiently, without proportionate increases in computational cost.
Conclusion
This paper marks a step forward in enhancing LLMs' proficiency with long texts. By targeting the specific problem of middle-of-context information loss with a purpose-built training approach, we can significantly broaden the practical applications and effectiveness of LLMs across a range of disciplines and uses.