Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335v3)

Published 2 Jan 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Harnessing the power of human-annotated data through Supervised Fine-Tuning (SFT) is pivotal for advancing LLMs. In this paper, we delve into the prospect of growing a strong LLM out of a weak one without the need for acquiring additional human-annotated data. We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN), which starts from a supervised fine-tuned model. At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. More specifically, the LLM generates its own training data from its previous iterations, refining its policy by discerning these self-generated responses from those obtained from human-annotated data. Our method progressively elevates the LLM from a nascent model to a formidable one, unlocking the full potential of human-annotated demonstration data for SFT. Theoretically, we prove that the global optimum to the training objective function of our method is achieved only when the LLM policy aligns with the target data distribution. Empirically, we evaluate our method on several benchmark datasets including the HuggingFace Open LLM Leaderboard, MT-Bench, and datasets from Big-Bench. Our results show that SPIN can significantly improve the LLM's performance across a variety of benchmarks and even outperform models trained through direct preference optimization (DPO) supplemented with extra GPT-4 preference data. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents. Codes are available at https://github.com/uclaml/SPIN.

The paper "Self-Play Fine-Tuning Converts Weak LLMs to Strong LLMs" presents a novel approach for fine-tuning LLMs that eschews additional human-annotated data or external preference feedback. The proposed methodology, termed Self-Play fIne-tuNing (SPINSPIN), introduces a self-play mechanism whereby an LLM iteratively refines its performance by engaging with self-generated data derived from previous iterations.

Key Methodological Insights

  1. Supervised Fine-Tuning Without Additional Data: The method starts from a supervised fine-tuned LLM and improves it without any human-annotated data beyond the initial SFT dataset. This is motivated by the limitations and cost of acquiring the vast amounts of high-quality training data traditionally required for LLMs.
  2. Self-Play Mechanism: The core innovation is a self-play dynamic in which the LLM refines itself by playing against previous instantiations of itself. At each iteration, the opponent (the previous model) generates synthetic responses, and the main player (the current model) learns to distinguish these self-generated responses from the human-annotated ones. This adversarial process progressively pulls the model toward the human response distribution; a minimal sketch of the resulting training objective appears after this list.
  3. Iterative Refinement: The procedure is iterative, so the model builds on improvements from prior rounds. The sequence of increasingly strong opponents acts as a progressively harder 'curriculum', fostering the development of more nuanced capabilities over time.
  4. Convergence to Optimal Policy: The authors prove that the global optimum of the training objective is attained only when the LLM's response distribution matches the target (human) data distribution, establishing that SPIN aligns the model with the desired behavior at convergence.
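
The training objective takes the form of a DPO-style pairwise logistic loss in which the "rejected" response is the opponent model's own generation. The sketch below is a minimal, illustrative PyTorch implementation of that loss operating on precomputed sequence log-probabilities; the `spin_loss` helper and its variable names are ours rather than the authors' released code, and `lam` stands in for the regularization parameter λ.

```python
# Minimal sketch (not the authors' released implementation) of the SPIN
# pairwise objective. Inputs are per-sequence log-probabilities of a
# human-annotated response y and a self-generated response y' under the
# current model (theta) and the frozen previous iterate (theta_t).
import torch
import torch.nn.functional as F

def spin_loss(logp_real, logp_synth, ref_logp_real, ref_logp_synth, lam=1.0):
    """Logistic pairwise loss preferring the human response over the self-generated one."""
    real_margin = logp_real - ref_logp_real      # log p_theta(y|x)  - log p_theta_t(y|x)
    synth_margin = logp_synth - ref_logp_synth   # log p_theta(y'|x) - log p_theta_t(y'|x)
    logits = lam * (real_margin - synth_margin)
    return F.softplus(-logits).mean()            # log(1 + exp(-x)), i.e. -log sigmoid(x)

# Toy usage with a batch of 4 sequence log-probabilities.
logp_real = torch.randn(4, requires_grad=True)
logp_synth = torch.randn(4, requires_grad=True)
ref_logp_real, ref_logp_synth = torch.randn(4), torch.randn(4)
spin_loss(logp_real, logp_synth, ref_logp_real, ref_logp_synth).backward()
```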

Empirical Evaluation and Results

  • The paper evaluates SPIN on recognized benchmarks, including the HuggingFace Open LLM Leaderboard and MT-Bench. Notable improvements are reported across evaluation metrics, particularly on GSM8k and TruthfulQA, where gains exceed 10%.
  • The empirical studies show that SPIN can surpass models trained with conventional preference-optimization techniques, even those supplemented with preference data from GPT-4. The iterative nature of SPIN yields continued improvement over successive rounds until convergence.

Comparisons and Relations to Existing Methods

  • The proposed SPIN method contrasts with Direct Preference Optimization (DPO) by eliminating the dependency on additional preference data. Unlike traditional Reinforcement Learning approaches such as RLHF (Reinforcement Learning from Human Feedback), SPIN relies only on the LLM's own generations and the original SFT data, reducing operational overhead.
  • The mechanism resembles the objective of Generative Adversarial Networks (GANs), where an adversarial game drives the generator toward the data distribution. The distinguishing feature here is that both the discriminator (main player) and the generator (opponent) are instances of the same LLM from different iterations; a toy illustration of this iterated self-play follows this list.
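
To make the GAN analogy concrete, the toy example below runs the self-play iteration on a categorical "response" distribution standing in for an LLM policy. It is an illustrative sketch under simplifying assumptions (single-token responses, no prompts), not the paper's implementation; with λ = 1 the learned distribution should drift toward the target data distribution across iterations, consistent with the convergence result summarized above.

```python
# Toy self-play loop on a categorical distribution (illustrative sketch only).
# The opponent is the frozen previous iterate; the main player is the same
# model being updated against the opponent's samples.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
target = torch.tensor([0.5, 0.2, 0.15, 0.1, 0.05])    # "human" data distribution
theta = torch.zeros(len(target), requires_grad=True)   # model logits, uniform start
lam, inner_steps, batch = 1.0, 300, 1024

for it in range(3):                                    # self-play iterations
    opponent = theta.detach().clone()                  # frozen previous iterate theta_t
    opt = torch.optim.Adam([theta], lr=0.05)
    for _ in range(inner_steps):
        real = torch.multinomial(target, batch, replacement=True)
        synth = torch.multinomial(F.softmax(opponent, dim=0), batch, replacement=True)
        logp, logp_opp = F.log_softmax(theta, dim=0), F.log_softmax(opponent, dim=0)
        margin = (logp[real] - logp_opp[real]) - (logp[synth] - logp_opp[synth])
        loss = F.softplus(-lam * margin).mean()        # logistic pairwise loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    probs = F.softmax(theta, dim=0).detach().tolist()
    print(f"iteration {it}: {[round(p, 3) for p in probs]}")
```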

Limitations and Future Directions

  • The current technique hinges on a fixed target data distribution, implying a performance ceiling at the level of the human demonstration data. Future work could explore dynamic target distributions or strategies that go beyond the human reference standard toward super-human performance.
  • The paper also notes the computational cost of generating synthetic data in the adversarial self-play setting, suggesting avenues for research into more data-efficient techniques.

This paper makes a significant contribution to autonomous LLM improvement, offering a viable alternative to resource-intensive fine-tuning pipelines that depend on extra human-annotated or preference data, and setting a precedent for subsequent research in self-supervised and self-improving language model training.

Authors (5)
  1. Zixiang Chen (28 papers)
  2. Yihe Deng (16 papers)
  3. Huizhuo Yuan (16 papers)
  4. Kaixuan Ji (11 papers)
  5. Quanquan Gu (198 papers)