Enhancing Multi-Step Reasoning Abilities of Language Models through Direct Q-Function Optimization (2410.09302v2)
Abstract: Reinforcement Learning (RL) plays a crucial role in aligning large language models (LLMs) with human preferences and improving their ability to perform complex tasks. However, current approaches either require significant computational resources, because they train multiple models and rely on extensive online sampling (e.g., PPO), or are framed as bandit problems (e.g., DPO, DRO), which often struggle with multi-step reasoning tasks such as math problem solving and other complex reasoning that involves long chains of thought. To overcome these limitations, we introduce Direct Q-function Optimization (DQO), which formulates the response-generation process as a Markov Decision Process (MDP) and uses the soft actor-critic (SAC) framework to optimize a Q-function directly parameterized by the LLM. The MDP formulation of DQO offers structural advantages over bandit-based methods, enabling more effective process supervision. Experimental results on two math problem-solving datasets, GSM8K and MATH, demonstrate that DQO outperforms previous methods, establishing it as a promising offline reinforcement-learning approach for aligning LLMs.
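For orientation, the standard maximum-entropy (SAC / soft Q-learning) relations the abstract invokes are sketched below. This is an illustrative summary under our own assumptions, not the paper's exact objective: we take the state $s_t$ to be the prompt plus the tokens generated so far, the action $a_t$ to be the next token (or reasoning step), and $\alpha$, $\gamma$ to be the temperature and discount factor.

$$
\begin{aligned}
V(s_t) &= \alpha \log \sum_{a}\exp\!\big(Q(s_t,a)/\alpha\big) && \text{(soft value)}\\
\pi^{*}(a_t \mid s_t) &= \exp\!\big((Q(s_t,a_t)-V(s_t))/\alpha\big) && \text{(softmax-optimal policy)}\\
Q(s_t,a_t) &= r(s_t,a_t) + \gamma\,\mathbb{E}_{s_{t+1}}\!\big[V(s_{t+1})\big] && \text{(soft Bellman consistency)}
\end{aligned}
$$

Rearranging the policy identity gives $Q(s_t,a_t) = \alpha \log \pi^{*}(a_t \mid s_t) + V(s_t)$, which is what makes it possible to parameterize a Q-function directly through an LLM's token log-probabilities; the specific parameterization and training loss used by DQO are defined in the paper itself.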