Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335v3)

Published 2 Jan 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing LLMs. In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents. Codes are available at https://github.com/uclaml/SPIN.

The paper "Self-Play Fine-Tuning Converts Weak LLMs to Strong LLMs" presents a novel approach for fine-tuning LLMs that eschews additional human-annotated data or external preference feedback. The proposed methodology, termed Self-Play fIne-tuNing (SPINSPIN), introduces a self-play mechanism whereby an LLM iteratively refines its performance by engaging with self-generated data derived from previous iterations.

Key Methodological Insights

  1. Supervised Fine-Tuning Without Additional Data: The method starts from a supervised fine-tuned LLM and improves it without any human-annotated data beyond the initial SFT dataset. This is motivated by the limitations and cost of acquiring the vast amounts of high-quality training data traditionally required for LLMs.
  2. Self-Play Mechanism: The core innovation is a self-play dynamic in which the LLM refines itself by playing against previous instantiations of itself. At each iteration, the opponent (the previous model) generates synthetic responses, and the main player (the current model) learns to distinguish these self-generated responses from the human-annotated ones. This adversarial process progressively pulls the model toward the human response distribution; a minimal sketch of the resulting training objective appears after this list.
  3. Iterative Refinement: The procedure is iterative, so the model builds on improvements from prior rounds. The sequence of increasingly strong opponents acts as a progressively harder 'curriculum', fostering the development of more nuanced capabilities over time.
  4. Convergence to Optimal Policy: The authors prove that the global optimum of the training objective is attained only when the LLM's response distribution matches the target (human) data distribution, establishing that SPIN aligns the model with the desired behavior at convergence.
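
The training objective takes the form of a DPO-style pairwise logistic loss in which the "rejected" response is the opponent model's own generation. The sketch below is a minimal, illustrative PyTorch implementation of that loss operating on precomputed sequence log-probabilities; the `spin_loss` helper and its variable names are ours rather than the authors' released code, and `lam` stands in for the regularization parameter λ.

```python
# Minimal sketch (not the authors' released implementation) of the SPIN
# pairwise objective. Inputs are per-sequence log-probabilities of a
# human-annotated response y and a self-generated response y' under the
# current model (theta) and the frozen previous iterate (theta_t).
import torch
import torch.nn.functional as F

def spin_loss(logp_real, logp_synth, ref_logp_real, ref_logp_synth, lam=1.0):
    """Logistic pairwise loss preferring the human response over the self-generated one."""
    real_margin = logp_real - ref_logp_real      # log p_theta(y|x)  - log p_theta_t(y|x)
    synth_margin = logp_synth - ref_logp_synth   # log p_theta(y'|x) - log p_theta_t(y'|x)
    logits = lam * (real_margin - synth_margin)
    return F.softplus(-logits).mean()            # log(1 + exp(-x)), i.e. -log sigmoid(x)

# Toy usage with a batch of 4 sequence log-probabilities.
logp_real = torch.randn(4, requires_grad=True)
logp_synth = torch.randn(4, requires_grad=True)
ref_logp_real, ref_logp_synth = torch.randn(4), torch.randn(4)
spin_loss(logp_real, logp_synth, ref_logp_real, ref_logp_synth).backward()
```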

Empirical Evaluation and Results

  • The paper evaluates SPIN on recognized benchmarks, including the HuggingFace Open LLM Leaderboard and MT-Bench. Notable improvements are reported across evaluation metrics, particularly on GSM8k and TruthfulQA, where gains exceed 10%.
  • The empirical studies show that SPIN can surpass models trained with conventional preference-optimization techniques, even those supplemented with preference data from GPT-4. The iterative nature of SPIN yields continued improvement over successive rounds until convergence.

Comparisons and Relations to Existing Methods

  • The proposed SPIN method contrasts with Direct Preference Optimization (DPO) by eliminating the dependency on additional preference data. Unlike traditional Reinforcement Learning approaches such as RLHF (Reinforcement Learning from Human Feedback), SPIN relies only on the LLM's own generations and the original SFT data, reducing operational overhead.
  • The mechanism resembles the objective of Generative Adversarial Networks (GANs), where an adversarial game drives the generator toward the data distribution. The distinguishing feature here is that both the discriminator (main player) and the generator (opponent) are instances of the same LLM from different iterations; a toy illustration of this iterated self-play follows this list.
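
To make the GAN analogy concrete, the toy example below runs the self-play iteration on a categorical "response" distribution standing in for an LLM policy. It is an illustrative sketch under simplifying assumptions (single-token responses, no prompts), not the paper's implementation; with λ = 1 the learned distribution should drift toward the target data distribution across iterations, consistent with the convergence result summarized above.

```python
# Toy self-play loop on a categorical distribution (illustrative sketch only).
# The opponent is the frozen previous iterate; the main player is the same
# model being updated against the opponent's samples.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
target = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])    # "human" data distribution
theta = torch.zeros(len(target), requires_grad=True)   # model logits, uniform start
lam, inner_steps, batch = 1.0, 300, 1024

for it in range(3):                                    # self-play iterations
    opponent = theta.detach().clone()                  # frozen previous iterate theta_t
    opt = torch.optim.Adam([theta], lr=0.05)
    for _ in range(inner_steps):
        real = torch.multinomial(target, batch, replacement=True)
        synth = torch.multinomial(F.softmax(opponent, dim=0), batch, replacement=True)
        logp, logp_opp = F.log_softmax(theta, dim=0), F.log_softmax(opponent, dim=0)
        margin = (logp[real] - logp_opp[real]) - (logp[synth] - logp_opp[synth])
        loss = F.softplus(-lam * margin).mean()        # logistic pairwise loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    probs = F.softmax(theta, dim=0).detach().tolist()
    print(f"iteration {it}: {[round(p, 3) for p in probs]}")
```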

Limitations and Future Directions

  • The current technique hinges on a fixed target data distribution, implying a performance ceiling at the level of the human demonstration data. Future work could explore dynamic target distributions or strategies that go beyond the human reference standard toward super-human performance.
  • The paper also notes the computational cost of generating synthetic data in the adversarial self-play setting, suggesting avenues for research into more data-efficient techniques.

This paper makes a significant contribution to autonomous LLM improvement, offering a viable alternative to resource-intensive fine-tuning pipelines that depend on extra human-annotated or preference data, and setting a precedent for subsequent research in self-supervised and self-improving language model training.

Authors (5)
  1. Zixiang Chen (28 papers)
  2. Yihe Deng (16 papers)
  3. Huizhuo Yuan (16 papers)
  4. Kaixuan Ji (11 papers)
  5. Quanquan Gu (198 papers)