Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play (2411.00062v3)
Abstract: Current reinforcement learning (RL) frameworks for large language model (LLM) post-training typically assume a fixed prompt distribution, which is sub-optimal and bottlenecks scalability. Prior works have explored prompt evolution, but are often limited to the supervised fine-tuning stage, where prompts are sampled and evolved uniformly, without learning signals. This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with regret-based signals for two players: (i) a creator, who strategically samples and creates new informative prompts, and (ii) a solver, who learns to produce preferred responses. eva is the first method that allows LLMs to adaptively create training prompts in both offline and online RL post-training. The design is simple, easy to use, yet remarkably effective: eva sets a new SOTA on challenging benchmarks without any extra human prompts, e.g. it boosts the win rate of gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% for DPO and from 52.6% to 62.4% for RLOO, surpassing claude-3-opus and catching up to gemini-1.5-pro, both of which are orders of magnitude larger. Extensive experiments show eva can create effective RL curricula and is robust across ablations. We believe adaptively evolving prompts are key to designing the next-generation RL post-training scheme.
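To make the creator/solver interplay concrete, here is a minimal Python sketch of one eva-style round, assuming access to a solver sampler, a reward model, and an LLM-based prompt rewriter. The function names (`solver_sample`, `reward_model`, `evolve_prompt`) and the max-minus-min regret proxy are illustrative assumptions for exposition, not the authors' exact implementation.

```python
def regret_proxy(rewards):
    """Proxy for solver regret on a prompt: the gap between the best and
    worst sampled responses' rewards. (The paper uses an advantage-style
    signal of this flavor; this exact form is an assumption.)"""
    return max(rewards) - min(rewards)


def eva_iteration(prompts, solver_sample, reward_model, evolve_prompt,
                  n_responses=4, n_evolve=2, buffer_size=512):
    """One hypothetical creator/solver round (a sketch, not the authors' code).

    solver_sample(prompt, n)        -> list of n candidate responses
    reward_model(prompt, response)  -> scalar reward
    evolve_prompt(prompt)           -> a new, related prompt (e.g. an LLM
                                       rewrite of the seed prompt)
    """
    # Creator side: score each prompt by how informative it is for the
    # current solver, using the regret proxy over sampled responses.
    scored = []
    for p in prompts:
        responses = solver_sample(p, n_responses)
        rewards = [reward_model(p, r) for r in responses]
        scored.append((regret_proxy(rewards), p, responses, rewards))

    # Keep the most informative prompts and evolve new variants from them.
    scored.sort(key=lambda item: item[0], reverse=True)
    informative = [p for _, p, _, _ in scored[:buffer_size]]
    evolved = [evolve_prompt(p) for p in informative for _ in range(n_evolve)]

    # Solver side: build preference pairs (best vs. worst response) that a
    # preference-optimization step (e.g. DPO) or an RL step could train on.
    preference_pairs = []
    for _, p, responses, rewards in scored:
        best = responses[rewards.index(max(rewards))]
        worst = responses[rewards.index(min(rewards))]
        preference_pairs.append((p, best, worst))

    # The next round trains the solver on preference_pairs and mixes the
    # evolved prompts back into the prompt pool.
    return informative + evolved, preference_pairs
```

The design intuition, as described in the abstract, is that the regret-style signal concentrates prompt creation where the solver's responses vary most, i.e. where there is the most room left to learn, instead of sampling and evolving prompts uniformly.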
- Loss of plasticity in continual deep reinforcement learning. In Conference on Lifelong Learning Agents, pages 620–636. PMLR, 2023.
- Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.
- Learning to give checkable answers with prover-verifier games. arXiv preprint arXiv:2108.12099, 2021.
- On the difficulty of warm-starting neural network training. arXiv preprint arXiv:1910.08475, 2019.
- Alfredo Banos. On pseudo-games. The Annals of Mathematical Statistics, 39(6):1932–1945, 1968.
- Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
- Refining Minimax Regret for Unsupervised Environment Design. arXiv preprint arXiv:2402.12284, 2024.
- Seth Chaiklin et al. The zone of proximal development in Vygotsky’s analysis of learning and instruction. Vygotsky’s educational theory in cultural context, 1(2):39–64, 2003.
- Self-Improving Robust Preference Optimization. arXiv preprint arXiv:2406.01660, 2024.
- Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023.
- Provably sample efficient rlhf via active preference optimization. arXiv preprint arXiv:2402.10500, 2024.
- Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.
- Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in neural information processing systems, 33:13049–13061, 2020.
- Maintaining plasticity in deep continual learning. arXiv preprint arXiv:2306.13812, 2023.
- Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
- RLHF Workflow: From Reward Modeling to Online RLHF, 2024.
- Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
- Ky Fan. Minimax theorems. Proceedings of the National Academy of Sciences, 39(1):42–47, 1953.
- Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data. arXiv preprint arXiv:2404.01413, 2024.
- Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
- Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023.
- Contrastive prefence learning: Learning from human feedback without rl. arXiv preprint arXiv:2310.13639, 2023.
- Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2024.
- Large language models can self-improve. arXiv preprint arXiv:2210.11610, 2022.
- Open-Endedness is Essential for Artificial Superhuman Intelligence. arXiv preprint arXiv:2406.04268, 2024.
- Accelerating deep learning by focusing on the biggest losers. arXiv preprint arXiv:1910.00762, 2019.
- Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.
- Replay-guided adversarial environment design. Advances in Neural Information Processing Systems, 34:1884–1897, 2021a.
- Prioritized level replay. In International Conference on Machine Learning, pages 4940–4950. PMLR, 2021b.
- Ordered sgd: A new stochastic optimization framework for empirical risk minimization. In International Conference on Artificial Intelligence and Statistics, pages 669–679. PMLR, 2020.
- John Maynard Keynes. A treatise on probability. Courier Corporation, 1921.
- Synthetic data (almost) from scratch: Generalized instruction tuning for language models. arXiv preprint arXiv:2402.13064, 2024a.
- From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline. arXiv preprint arXiv:2406.11939, 2024b.
- Skywork Reward Model Series. https://huggingface.co/Skywork, September 2024.
- On llms-driven synthetic data generation, curation, and evaluation: A survey. arXiv preprint arXiv:2406.15126, 2024.
- Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
- SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv preprint arXiv:2405.14734, 2024.
- Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pages 15630–15649. PMLR, 2022.
- Active Preference Learning for Large Language Models. arXiv preprint arXiv:2402.08114, 2024.
- Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.
- John F Nash et al. Non-cooperative games. Princeton University, 1950.
- Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
- Iterative reasoning preference optimization. arXiv preprint arXiv:2404.19733, 2024.
- Evolving curricula with regret-based environment design. In International Conference on Machine Learning, pages 17473–17498. PMLR, 2022.
- Learning Formal Mathematics From Intrinsic Motivation. arXiv preprint arXiv:2407.00695, 2024.
- Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences. arXiv preprint arXiv:2404.03715, 2024.
- A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
- Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of research and development, 3(3):210–229, 1959.
- MAESTRO: Open-ended environment design for multi-agent reinforcement learning. arXiv preprint arXiv:2303.03376, 2023.
- Leonard J Savage. The theory of statistical decision. Journal of the American Statistical association, 46(253):55–67, 1951.
- Jürgen Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, 1991.
- Hans-Paul Schwefel. Evolutionsstrategien für die numerische Optimierung. Springer, 1977.
- Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
- Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Reward is enough. Artificial Intelligence, 299:103535, 2021.
- Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023.
- Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
- Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36, 2024.
- Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768, 2011.
- Preference fine-tuning of llms should leverage suboptimal, on-policy data. arXiv preprint arXiv:2404.14367, 2024.
- Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749, 2024.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
- Open-ended learning leads to generally capable agents. arXiv preprint arXiv:2107.12808, 2021.
- Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
- Iterative DPO Alignment. Technical report, Snorkel AI, 2023.
- Will we run out of data? Limits of LLM scaling based on human-generated data. arXiv preprint arXiv:2211.04325, 2024.
- Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. arXiv preprint arXiv:2406.12845, 2024.
- Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv preprint arXiv:1901.01753, 2019.
- Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Double thompson sampling for dueling bandits. Advances in neural information processing systems, 29, 2016.
- Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675, 2024.
- Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint. In Forty-first International Conference on Machine Learning, 2024.
- Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023a.
- Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682, 2023b.
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. arXiv preprint arXiv:2406.08464, 2024.
- To repeat or not to repeat: Insights from scaling llm under token-crisis. Advances in Neural Information Processing Systems, 36, 2024.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
- Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023.
- Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
- Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
- Toward Optimal LLM Alignments Using Two-Player Games. arXiv preprint arXiv:2406.10977, 2024.