Self-Improvement in Long-Context Reasoning with LLMs
The paper "LLMs Can Self-Improve in Long-context Reasoning" presents an approach to enhancing the reasoning capabilities of LLMs over long contexts. The authors introduce a method called Blosom, designed to enable self-improvement in LLMs without heavy reliance on human-generated annotations or on stronger external models such as GPT-4.
Overview of Blosom
The proposed Blosom method rests on the premise that LLMs can use their existing capabilities to produce and refine supervision signals over extended contexts. The core procedure is straightforward: for a given query, multiple outputs are sampled, scored via Minimum Bayes Risk (MBR), and then used for supervised fine-tuning or preference optimization. This sidesteps dependence on external data annotation by leveraging self-generated outputs to steer model improvement.
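The MBR scoring step above can be sketched in a few lines. The idea is to score each sampled output by its average similarity to every other sample, so that the answer most consistent with the pool ranks highest. This is a minimal illustration, not the paper's implementation: `token_overlap` is a simple Jaccard stand-in for whatever similarity metric the authors actually use.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens.

    A deliberately simple stand-in for the similarity metric used
    in the paper (which may be embedding- or ROUGE-based).
    """
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def mbr_scores(outputs: list[str]) -> list[float]:
    """Score each sampled output by its mean similarity to the others.

    Under MBR selection, the output that agrees most with the rest of
    the sample pool is treated as the most reliable supervision signal.
    """
    scores = []
    for i, out in enumerate(outputs):
        sims = [token_overlap(out, other)
                for j, other in enumerate(outputs) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores
```

In this toy setup, two samples that agree on "42" outscore a lone dissenting "7", which is exactly the consistency-as-correctness heuristic MBR exploits.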
Methodological Approach
Blosom first samples multiple candidate outputs for each query-context pair. The candidates are then evaluated with an MBR-based scoring scheme, which favors outputs that are most consistent with the majority of the other samples. The authors argue that this scoring mechanism effectively filters out hallucinated or inconsistent responses, allowing accurate supervision signals to be distilled from the model's own outputs. Fine-tuning then proceeds either through direct supervised training on high-scoring outputs or through preference optimization that contrasts high-scoring outputs with low-scoring ones.
Experimental Results
Comprehensive experiments were conducted on several state-of-the-art LLMs, including variants of the Qwen-2.5 and Llama-3.1 families. With Blosom, the models showed consistent gains on long-context reasoning tasks, notably a 4.2-point improvement for Llama-3.1-8B-Instruct. Moreover, Qwen-2.5-14B-Instruct fine-tuned with Blosom surpassed its larger 32B counterpart, illustrating the efficiency gains self-improvement frameworks can deliver. The reported results span diverse datasets, supporting Blosom's generalization capabilities.
Implications and Future Directions
The implications of this research are significant, pointing toward self-improving methodologies that operate beyond the confines of human labeling. In particular, the paper lays the groundwork for more sophisticated self-supervision strategies that exploit LLMs' intrinsic reasoning capabilities. Future research could optimize the scoring functions, push the boundaries of MBR application, and investigate how such methods scale to models with larger parameter counts and longer context windows.
From a practical standpoint, deploying self-improving methods like Blosom for long-context reasoning can markedly reduce the resources needed for training state-of-the-art models, further enhancing their usability, scalability, and efficiency in real-world applications such as multi-document analysis, repository-level coding assistance, and autonomous agent development.
Concluding Remarks
This paper opens an important dialogue in the community about the self-reliant progress of LLMs. Fostering self-improvement in AI systems is a pivotal step toward more autonomous, efficient, and capable artificial agents. By tapping a model's existing capacity for self-refinement, the authors chart a promising direction for how we approach AI enhancement.