Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data: A Technical Overview
The paper "From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data" by Xiong et al., investigates an innovative approach to ameliorate the information retrieval and reasoning capabilities of LLMs in the context of long textual inputs. The paper is notable for its focus on synthetic data for finetuning, which circumvents some of the issues inherent in factual and naturally occurring datasets.
Introduction and Background
LLMs, despite their impressive language understanding abilities, exhibit limitations when asked to retrieve and process information across extended context windows. Prior research has identified phenomena such as the "lost-in-the-middle" effect, where retrieval accuracy drops for information embedded in the middle of a long context. This paper addresses these limitations by introducing a carefully designed synthetic finetuning dataset of numerical key-value retrieval tasks, with the aim of improving model performance in long-context settings.
Methodology
Synthetic Datasets
The authors propose two varieties of synthetic retrieval tasks:
- Simple Dictionary Key-Value Retrieval: This task involves a straightforward retrieval operation where the model is prompted to identify the value corresponding to a given key within a list of dictionaries.
- Multi-Subkey Dictionary Key-Value Retrieval: A harder variant intended for models that already handle the simple task well. Here, each dictionary key is a tuple of subkeys, so the model must match the full tuple rather than a single key (a rough sketch of both tasks follows this list).
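To make the task structure concrete, here is a minimal Python sketch of how such examples could be generated. The function names, key ranges, dictionary sizes, and prompt wording are illustrative assumptions rather than the paper's exact setup:

```python
import random

def simple_kv_example(num_dicts=10, keys_per_dict=4, key_range=10_000):
    """Build one simple dictionary key-value retrieval example.

    Returns the prompt, the gold key, and the gold value.
    """
    dicts = [
        {random.randrange(key_range): random.randrange(key_range)
         for _ in range(keys_per_dict)}
        for _ in range(num_dicts)
    ]
    gold_dict = random.choice(dicts)
    gold_key = random.choice(list(gold_dict.keys()))
    prompt = (
        "Do a task using the list of dictionaries below.\n"
        f"Dictionaries: {dicts}\n"
        f"What is the value of key {gold_key}?"
    )
    return prompt, gold_key, gold_dict[gold_key]

def multi_subkey_example(num_dicts=10, num_subkeys=3, key_range=10_000):
    """Build one multi-subkey example: each key is a tuple of subkeys,
    so the model must match the whole tuple rather than a single integer."""
    dicts = [
        {tuple(random.randrange(key_range) for _ in range(num_subkeys)):
         random.randrange(key_range)}
        for _ in range(num_dicts)
    ]
    gold_dict = random.choice(dicts)
    gold_key = next(iter(gold_dict))
    prompt = (
        "Do a task using the list of dictionaries below.\n"
        f"Dictionaries: {dicts}\n"
        f"What is the value of key {gold_key}?"
    )
    return prompt, gold_key, gold_dict[gold_key]
```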
Additionally, the paper introduces the concept of an answer template to streamline the learning process, allowing models to focus on the retrieval task rather than the format of the response.
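As a rough illustration (the paper's exact template wording is not reproduced here), the template can be thought of as a fixed answer format that appears both as an instruction in the prompt and as the skeleton of the training target, so the model's updates concentrate on locating the key rather than on phrasing. The helper below is hypothetical:

```python
# Hypothetical template wording; the paper's exact phrasing may differ.
ANSWER_TEMPLATE = "The value associated with the key {key} is {value}."

def build_training_pair(prompt, gold_key, gold_value):
    """Append the format instruction to the prompt and build the
    templated completion used as the finetuning target."""
    templated_prompt = (
        prompt
        + "\nAnswer in the following format: "
        + ANSWER_TEMPLATE.format(key=gold_key, value="<value>")
    )
    target = ANSWER_TEMPLATE.format(key=gold_key, value=gold_value)
    return templated_prompt, target
```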
Experiments and Results
Finetuning Procedure
The paper tests the impact of this finetuning approach on two models: GPT-3.5 Turbo and Mistral 7B. Each model is finetuned on the synthetic dataset, with and without the answer template, and the results are compared against finetuning on real datasets such as MDQA.
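As a sketch of how such examples might be packaged for supervised finetuning, assuming a generic chat-style JSONL format rather than the authors' exact pipeline or hyperparameters (composition with the earlier hypothetical helpers is shown in the comments):

```python
import json

def write_finetuning_file(pairs, path="synthetic_kv_finetune.jsonl"):
    """Serialize (prompt, target) pairs as chat-style JSONL, one
    training example per line, a format commonly expected by
    instruction-finetuning pipelines."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, target in pairs:
            record = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": target},
                ]
            }
            f.write(json.dumps(record) + "\n")

# Combining the earlier sketches (hypothetical helper names):
# pairs = [build_training_pair(*simple_kv_example()) for _ in range(1000)]
# write_finetuning_file(pairs)
```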
Evaluation Metrics
The evaluation focuses on the models' performance on the Multi-Document Question Answering (MDQA) and Flexible Length Question Answering (FLenQA) benchmarks, with particular attention to their ability to retrieve information located at different positions within a long context (beginning, middle, or end).
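A minimal sketch of that positional breakdown, assuming each test item records the gold document's position and whether the model's answer was judged correct (this helper is illustrative, not part of either benchmark's official tooling):

```python
from collections import defaultdict

def accuracy_by_position(results):
    """Compute retrieval accuracy grouped by the position of the gold
    document in the context, making any 'lost-in-the-middle' dip visible.

    `results` is an iterable of (gold_position, is_correct) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # position -> [num_correct, num_total]
    for position, is_correct in results:
        counts[position][0] += int(is_correct)
        counts[position][1] += 1
    return {pos: correct / total
            for pos, (correct, total) in sorted(counts.items())}
```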
Key Findings
- Performance on MDQA: Finetuning on synthetic data notably flattens the U-shaped performance curve typically observed, improving retrieval accuracy across positions in the context. For instance, the finetuned GPT-3.5 Turbo showed an improvement in retrieval accuracy on the MDQA benchmark when the target information was positioned in the middle of the context.
- Long-Context Reasoning (FLenQA): The models demonstrated enhanced reasoning capabilities. For example, finetuning GPT-3.5 Turbo using the answer template improved its accuracy in tasks involving longer contexts, as evidenced by superior performance on the FLenQA dataset.
- General Capabilities: The general performance of the models on benchmarks like MMLU, TriviaQA, and NQ-Open remained virtually unaffected or slightly improved, suggesting that the finetuning approach does not degrade the models' overall capabilities.
- Baseline Comparisons: The synthetic finetuning approach outperformed other long-context augmentation baselines such as MultidocQA, IN2, and Needle-in-a-haystack on MDQA and FLenQA benchmarks, without the adverse impact on general-purpose benchmarks that these baseline datasets tend to induce.
Implications and Future Work
The paper underscores the efficacy of synthetic datasets in enhancing specific capabilities of LLMs without encouraging hallucination, a risk that arises when finetuning on real, fact-laden data. Because the synthetic key-value pairs carry no factual content, the approach is also robust to outdated information, a critical limitation in rapidly evolving knowledge domains.
Future directions may explore integrating this synthetic data as a component of larger instruction finetuning datasets, evaluating models on complex tasks with both relevant and irrelevant distractors, and extending the methodology to even broader context windows.
Conclusion
The paper by Xiong et al. provides compelling evidence that finetuning LLMs on synthetic numerical key-value retrieval tasks markedly enhances their information retrieval and reasoning capabilities in long-context scenarios while preserving their general capabilities, presenting a promising avenue for developing LLMs better suited to complex, long-context real-world tasks. The implications of this work suggest new pathways for improving LLM performance while mitigating challenges related to outdated data and hallucination.