Large Language Models Can Self-Improve (2210.11610v2)

Published 20 Oct 2022 in cs.CL

Abstract: LLMs have achieved excellent performances in various tasks. However, fine-tuning an LLM requires extensive supervision. Human, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement.

LLMs Can Self-Improve: A Comprehensive Synthesis

The paper "Large Language Models Can Self-Improve" presents a methodology for enhancing the reasoning capabilities of pre-trained LLMs without relying on ground-truth labels. The authors propose a self-improvement framework in which a pre-trained LLM is fine-tuned on its own high-confidence, self-generated rationales derived from unlabeled questions.

Core Methodological Contributions

The authors detail a self-improvement process in which the LLM generates high-confidence rationale-augmented answers for unlabeled questions using Chain-of-Thought (CoT) prompting and self-consistency. These self-generated answers then serve as target outputs for fine-tuning the LLM, thereby enhancing its reasoning ability. The methodology proceeds as follows (illustrative code sketches follow the list):

  1. Chain-of-Thought Prompting: Few-shot CoT exemplars prompt the LLM to generate multiple reasoning paths for each unlabeled question, with each path culminating in a final answer.
  2. Self-Consistency Mechanism: By sampling diverse reasoning paths with a temperature setting greater than zero, the LLM evaluates multiple outputs, ultimately selecting the most consistent answer through a majority voting scheme.
  3. Mixed Formats for Fine-Tuning: The training samples are prepared in mixed formats to prevent overfitting to specific reasoning styles; four distinct formats are used, covering examples with rationale explanations and examples with direct answers.
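
A minimal Python sketch of the sampling-and-voting step (items 1 and 2) is given below. Here `sample_cot_paths` is a hypothetical decoding helper, and the sample count and confidence threshold are illustrative choices rather than values taken from the paper.

```python
from collections import Counter

def self_consistency_filter(question, sample_cot_paths, num_samples=32,
                            temperature=0.7, confidence_threshold=0.5):
    """Sample multiple CoT reasoning paths and keep the question only if the
    majority-voted answer is sufficiently high-confidence.

    `sample_cot_paths` is a hypothetical callable:
        sample_cot_paths(question, n, temperature) -> list of (rationale, answer)
    """
    paths = sample_cot_paths(question, num_samples, temperature)

    # Majority vote over the final answers, ignoring the rationales.
    votes = Counter(answer for _, answer in paths)
    best_answer, count = votes.most_common(1)[0]
    confidence = count / len(paths)

    if confidence < confidence_threshold:
        return None  # answer is too uncertain; discard this question

    # Keep every sampled path whose answer agrees with the majority vote;
    # these rationale-augmented solutions become fine-tuning targets.
    kept = [(rationale, answer) for rationale, answer in paths
            if answer == best_answer]
    return {"question": question, "answer": best_answer,
            "confidence": confidence, "solutions": kept}
```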

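A hedged sketch of how one filtered item could then be expanded into mixed-format training examples (item 3): the two templates shown are illustrative, not the paper's exact four prompt formats, and `item` is assumed to be the dictionary returned by the function above.

```python
def build_training_examples(item):
    """Expand one self-consistency-filtered item into mixed-format
    fine-tuning examples (illustrative templates, not the paper's exact ones)."""
    examples = []
    for rationale, answer in item["solutions"]:
        # Format A: CoT style, the target keeps the full rationale.
        examples.append({
            "input": f"{item['question']}\nA: Let's think step by step.",
            "target": f"{rationale} The answer is {answer}.",
        })
        # Format B: direct answering, the target is the answer alone.
        examples.append({
            "input": f"{item['question']}\nA:",
            "target": f"The answer is {answer}.",
        })
    return examples
```

Mixing rationale-augmented and direct-answer targets in this way is what the authors argue prevents the fine-tuned model from overfitting to a single prompting style.
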
This approach yields significant empirical improvements for the evaluated 540-billion-parameter LLM across multiple reasoning benchmarks.

Empirical Evaluation

The model's performance improved markedly after self-improvement, with strong results on the GSM8K, DROP, OpenBookQA, and ANLI-A3 datasets: accuracy rose from 74.4% to 82.1% on GSM8K and from 90.0% to 94.4% on OpenBookQA, without any reliance on ground-truth labels. Out-of-domain generalization also improved notably, as seen on datasets such as AQuA and StrategyQA.

Theoretical Implications and Practical Applications

The capacity for self-improvement without labeled data suggests that LLMs can refine their reasoning ability similarly to human metacognition, where individuals engage in self-reflection to improve cognitive skills. The practical implications of this research are substantial, offering pathways to reduce data annotation costs and enabling more scalable and autonomous machine learning systems. This is particularly relevant in real-world applications where labeled data is sparse or expensive to procure.

Speculative Outlook and Future Directions

The advancement of self-improving LLMs hints at a future where AI systems could autonomously refine themselves and adapt to new tasks without human supervision. Future research may explore combining self-improvement techniques with existing supervised learning processes to push the boundaries of LLM capabilities even further. Understanding the limitations of self-generated data will also be important for ensuring the reliability of such models.

Conclusion

This paper constitutes a significant stride towards realizing autonomous improvement processes in LLMs. By demonstrating that LLMs can self-enhance their reasoning capabilities without annotated datasets, this research opens new vistas for efficient and scalable AI development. As AI systems gravitate towards greater self-reliance, the methodological insights from this paper provide a foundational framework for future explorations into unsupervised model refinement.

In conclusion, the paper offers valuable contributions to the domain of AI and machine learning, with both theoretical and practical advancements that underscore the evolving capabilities of LLMs.

Authors (7)
  1. Jiaxin Huang (48 papers)
  2. Shixiang Shane Gu (34 papers)
  3. Le Hou (36 papers)
  4. Yuexin Wu (23 papers)
  5. Xuezhi Wang (64 papers)
  6. Hongkun Yu (17 papers)
  7. Jiawei Han (263 papers)
Citations (452)