Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing (2402.00658v3)

Published 1 Feb 2024 in cs.AI and cs.CL

Abstract: LLMs have demonstrated significant potential in handling complex reasoning tasks through step-by-step rationale generation. However, recent studies have raised concerns about hallucinations and flaws in their reasoning processes, and substantial efforts are being made to improve the reliability and faithfulness of the generated rationales. Some approaches model reasoning as planning, while others focus on annotation for process supervision. Nevertheless, planning-based search often incurs high latency because it must repeatedly assess intermediate reasoning states over an extensive exploration space. Additionally, supervising the reasoning process with human annotation is costly and difficult to scale for LLM training. To address these issues, we propose a framework for learning planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories, which are ranked according to synthesized process rewards. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework, showing that our 7B model can surpass strong counterparts such as GPT-3.5-Turbo.
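
For context, the training signal named in the abstract is the standard Direct Preference Optimization objective; a minimal sketch of that loss is reproduced below. In this framework, the preferred and dispreferred trajectories y_w and y_l would presumably be obtained by ranking sampled reasoning trajectories with the synthesized process rewards (the exact pairing scheme is described in the paper itself, not in this abstract):

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Here \(\pi_\theta\) is the policy being fine-tuned, \(\pi_{\mathrm{ref}}\) is a frozen reference model, \(\beta\) is a scaling hyperparameter, and \(\sigma\) is the logistic function; optimizing this preference likelihood shifts probability mass toward the higher-reward trajectory without training an explicit reward model.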

Authors (5)
  1. Fangkai Jiao
  2. Chengwei Qin
  3. Zhengyuan Liu
  4. Nancy F. Chen
  5. Shafiq Joty
Citations (12)