Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Abstract: Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing LLMs. In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum of our method's training objective is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets, including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents. Code is available at https://github.com/uclaml/SPIN.
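The self-play mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: in SPIN the main player is the current LLM trained to distinguish human responses from the previous iteration's generations via a DPO-style logistic objective, with the previous-iteration model frozen as both opponent and reference. Here we replace the LLM with a hypothetical single-logit "policy" over two candidate responses (one human, one self-generated) so the objective and the iteration dynamics are visible; `spin_loss`, `toy_spin_iteration`, and the parameter `beta` are illustrative names, not from the paper's code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spin_loss(logp_real_cur, logp_real_ref, logp_gen_cur, logp_gen_ref, beta=1.0):
    """DPO-style logistic loss used in SPIN's self-play step: push the
    current model to assign higher likelihood (relative to the frozen
    previous-iteration reference) to the human-annotated response than
    to the response the reference model generated itself."""
    margin = beta * ((logp_real_cur - logp_real_ref)
                     - (logp_gen_cur - logp_gen_ref))
    return -math.log(sigmoid(margin))


def toy_spin_iteration(theta_ref, steps=200, lr=0.5, beta=1.0):
    """One self-play iteration on a toy problem: the 'model' is a single
    logit theta with p(human response) = sigmoid(theta) and
    p(self-generated response) = 1 - sigmoid(theta). The opponent is the
    frozen previous-iteration model theta_ref; we minimize the loss by
    finite-difference gradient descent."""
    lp_real_ref = math.log(sigmoid(theta_ref))
    lp_gen_ref = math.log(1.0 - sigmoid(theta_ref))

    def loss(th):
        return spin_loss(math.log(sigmoid(th)), lp_real_ref,
                         math.log(1.0 - sigmoid(th)), lp_gen_ref, beta)

    theta, eps = theta_ref, 1e-5
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2.0 * eps)
        theta -= lr * grad
    return theta


# Starting from an indifferent model (theta = 0, i.e. 50/50 between the
# human response and its own generation), one SPIN iteration shifts
# probability mass toward the human-annotated response.
theta_next = toy_spin_iteration(theta_ref=0.0)
print(sigmoid(0.0), "->", sigmoid(theta_next))
```

Iterating this procedure, with each iteration's output frozen as the next opponent, mirrors how SPIN progressively sharpens the model toward the human data distribution, at which point the loss gradient vanishes and self-play converges.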