
Teaching Large Language Models to Reason with Reinforcement Learning (2403.04642v1)

Published 7 Mar 2024 in cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (PPO), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (SFT) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.

Enhancing LLMs' Reasoning Capabilities with Reinforcement Learning

Performance of Reinforcement Learning Algorithms on LLM Reasoning Tasks

In this paper, Havrilla et al. examine multiple reinforcement learning (RL) algorithms for their effectiveness in improving the reasoning capabilities of LLMs. The study compares Expert Iteration (EI), Proximal Policy Optimization (PPO), and Return-Conditioned Reinforcement Learning (RCRL) across a range of settings, varying the reward structure, model size, and initialization, both with and without supervised fine-tuning (SFT) data. Notably, EI emerges as the strongest approach in most scenarios while matching PPO's sample efficiency, a result that runs contrary to conventional expectations from traditional RL applications.
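
Expert Iteration in this setting is essentially a sample-filter-finetune loop: draw many candidate solutions from the current model, keep those whose final answer is correct, and fine-tune on the survivors. Below is a minimal sketch of that loop; `generate`, `is_correct`, and `finetune` are hypothetical helpers standing in for the sampling, answer-checking, and training machinery, not the authors' code.

```python
# Minimal Expert Iteration (EI) sketch for reasoning tasks.
# `generate`, `is_correct`, and `finetune` are hypothetical placeholders.

def expert_iteration(model, problems, n_rounds=4, k_samples=96):
    for _ in range(n_rounds):
        dataset = []
        for question, reference in problems:
            # Sample k candidate solutions from the current policy.
            candidates = generate(model, question, num_samples=k_samples)
            # Keep only solutions whose final answer matches the reference
            # (a sparse, outcome-based reward).
            dataset += [(question, c) for c in candidates
                        if is_correct(c, reference)]
        # Fine-tune the model on its own filtered (correct) generations.
        model = finetune(model, dataset)
    return model
```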

Methodological Insights

Reinforcement Learning Formulation for Reasoning

The researchers formulated reasoning as an RL problem by treating question-answer tuples within a Markov Decision Process (MDP) framework. This framing allows standard RL algorithms to be applied to refine the LLMs' reasoning, using both sparse and dense rewards.
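
To make the formulation concrete, the sketch below treats a reasoning episode as an MDP under common assumptions for text generation: the state is the question plus the tokens generated so far, an action appends a token, the dynamics are deterministic, and the reward is either sparse (terminal correctness) or dense (a per-step score). `answer_is_correct` and `step_reward_model` are hypothetical placeholders.

```python
# Sketch of a reasoning task viewed as an MDP (assumed state/action/reward
# structure, not the authors' exact definitions).

from dataclasses import dataclass

@dataclass
class ReasoningState:
    question: str
    generated: list[str]  # tokens produced so far

def transition(state: ReasoningState, token: str) -> ReasoningState:
    # Deterministic dynamics: the next state is the old state with the
    # chosen token appended.
    return ReasoningState(state.question, state.generated + [token])

def reward(state: ReasoningState, reference: str,
           done: bool, dense: bool = False) -> float:
    if dense:
        # Dense reward: score each intermediate step, e.g. via a learned
        # reward model or heuristic (hypothetical `step_reward_model`).
        return step_reward_model(state.question, state.generated)
    # Sparse reward: 1 only when the episode ends with a correct answer.
    return float(done and answer_is_correct(state.generated, reference))
```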

Algorithm Comparisons and Performance Metrics

EI, PPO, and RCRL were evaluated on four primary performance metrics: maj@1, maj@96, rerank@96, and pass@96. Despite the differing complexity and theoretical advantages of these algorithms, EI performed best on most metrics. A key finding was the similar sample efficiency of EI and PPO, challenging the common expectation that PPO is more sample-efficient in complex environments; the authors attribute this to the deterministic nature of the reasoning tasks and the influence of LLM pretraining.
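
For reference, these metrics can be computed from k sampled solutions per question roughly as sketched below; `extract_answer` and the scorer `score_fn` are hypothetical placeholders, and rerank@k is assumed here to pick the sample ranked highest by a learned reward model.

```python
# Sketch of maj@k, pass@k, and rerank@k for one question, given k sampled
# solutions. `extract_answer` and `score_fn` are hypothetical placeholders.

from collections import Counter

def maj_at_k(samples, reference):
    # Majority vote over the final answers of the k samples.
    answers = [extract_answer(s) for s in samples]
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference

def pass_at_k(samples, reference):
    # Credit if any of the k samples reaches the correct answer.
    return any(extract_answer(s) == reference for s in samples)

def rerank_at_k(samples, reference, score_fn):
    # Credit if the sample ranked highest by the scorer is correct.
    best = max(samples, key=score_fn)
    return extract_answer(best) == reference
```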

Implications and Future Directions

Exploration Limitations and Role of Pretraining

A significant observation was that models do not explore much beyond the solutions already produced by SFT models or the pretrained checkpoint, suggesting a strong reliance on previously learned patterns. This underscores the critical role of pretraining in shaping LLMs' capabilities and points to limited exploration as a potential bottleneck for further gains from RL.

Theoretical and Practical RL Considerations

The paper also notes that the relative performance of RL algorithms depends on the environment: reasoning tasks have deterministic dynamics and so may not benefit from the machinery of algorithms like PPO that is designed for stochastic settings. The findings further motivate broader exploration strategies to move beyond the solutions established by pretraining and fine-tuning, for example through more sophisticated prompting or hybrid approaches that combine evolution-based methods with LLM generation.

Concluding Remarks

Havrilla et al.'s study of RL for improving LLM reasoning delivers a careful comparison of leading algorithms while highlighting the strong influence of LLM pretraining. The convergence in performance between EI and PPO, despite their theoretical differences, points to a nuanced interplay between algorithmic design and the foundational role of pretraining in LLM task performance. Looking ahead, further gains in reasoning ability may depend on strategies that promote genuine exploration and learning beyond the knowledge already captured during pretraining and fine-tuning.

Authors (9)
  1. Alex Havrilla (13 papers)
  2. Yuqing Du (28 papers)
  3. Sharath Chandra Raparthy (10 papers)
  4. Christoforos Nalmpantis (5 papers)
  5. Jane Dwivedi-Yu (26 papers)
  6. Maksym Zhuravinskyi (6 papers)
  7. Eric Hambro (11 papers)
  8. Sainbayar Sukhbaatar (53 papers)
  9. Roberta Raileanu (40 papers)
Citations (41)