
Advancing LLM Reasoning Generalists with Preference Trees

(2404.02078)
Published Apr 2, 2024 in cs.AI, cs.CL, and cs.LG

Abstract

We introduce Eurus, a suite of LLMs optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.

Eurus models rival larger baselines on the LeetCode and TheoremQA benchmarks, matching GPT-3.5 Turbo's reasoning performance.

Overview

  • The paper introduces Eurus, a suite of LLMs, and UltraInteract, a novel dataset designed to improve complex reasoning in LLMs through fine-tuning and preference learning.

  • Eurus models have demonstrated superior reasoning performance over both open-source and some proprietary models, excelling in benchmarks like LeetCode and TheoremQA.

  • UltraInteract employs preference trees to offer diverse reasoning strategies and multi-turn interactions, significantly enhancing the problem-solving abilities of LLMs.

  • The research highlights the development of a new reward modeling objective in preference learning, leading to improved reasoning proficiency in the Eurus models.


Introduction to Eurus and UltraInteract

Recent advances in machine learning have significantly expanded the capabilities of LLMs across diverse tasks, yet complex reasoning remains a persistent challenge. This paper introduces Eurus, a suite of LLMs finetuned from Mistral-7B and CodeLlama-70B that achieves strong results across benchmarks in mathematics, code generation, and logical reasoning, owing largely to the newly curated UltraInteract dataset. UltraInteract provides large-scale, high-quality alignment data designed specifically for complex reasoning, and supports both supervised fine-tuning and preference learning.

Eurus Models: Achievements in Reasoning

Eurus demonstrates strong capabilities relative to existing open-source models and even rivals proprietary models such as GPT-3.5 Turbo on reasoning tasks. Notably, Eurus-70B reaches 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, outperforming existing open-source models by margins of more than 13.3%. These results underscore the efficacy of UltraInteract in sharpening the reasoning skills of LLMs and establish Eurus as a leading open-source reasoning generalist.

UltraInteract: Constructing Preference Trees for Complex Reasoning

UltraInteract stands out for building a preference tree for each instruction. Each tree packages reasoning chains with diverse planning strategies in a unified format, multi-turn interaction trajectories with the environment and a critique, and pairwise correct/incorrect actions for preference learning. This breadth of reasoning trajectories and interaction patterns is instrumental in the performance leap observed with the Eurus models; a minimal sketch of the structure follows.
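The paper does not ship reference code for this structure, but its description maps naturally onto a small tree type. The sketch below is illustrative only: the names `ActionNode`, `PreferenceTree`, and `extract_pairs` are invented here and do not reflect the released dataset's actual schema; it simply shows how correct and incorrect actions at each turn can be paired off for preference learning.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ActionNode:
    """One turn in an UltraInteract-style preference tree (illustrative schema).

    Each node holds a model action (a reasoning chain or a code solution),
    the environment/critique feedback it received, and whether it was judged
    correct. Incorrect actions are expanded with refined follow-up attempts.
    """
    action: str                      # reasoning chain or code for this turn
    feedback: str = ""               # execution result / critique text
    correct: bool = False            # leaf if True; expanded further if False
    children: List["ActionNode"] = field(default_factory=list)

@dataclass
class PreferenceTree:
    instruction: str                 # the root problem statement
    root_actions: List[ActionNode] = field(default_factory=list)

def extract_pairs(tree: PreferenceTree) -> List[Tuple[str, str]]:
    """Collect (chosen, rejected) action pairs for preference learning.

    At every depth, each correct action is paired against an incorrect
    sibling, which is how the trees can be turned into pairwise data.
    """
    pairs: List[Tuple[str, str]] = []

    def walk(siblings: List[ActionNode]) -> None:
        good = [n for n in siblings if n.correct]
        bad = [n for n in siblings if not n.correct]
        pairs.extend((g.action, b.action) for g in good for b in bad)
        for node in bad:             # only failed actions spawn another turn
            walk(node.children)

    walk(tree.root_actions)
    return pairs
```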

Insights from Preference Learning Exploration

A deeper look at preference learning within Eurus yields an intriguing finding: well-established algorithms such as DPO, despite their effectiveness in general conversation, prove less suitable for reasoning tasks, suggesting that reasoning places different demands on preference learning than open-ended dialogue does. This observation motivated a novel reward modeling objective which, combined with UltraInteract, produces a strong reward model and notably improves Eurus's reasoning proficiency, underscoring the value of tailored objectives for reasoning.
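The summary above does not spell the objective out, so the following is a hedged sketch rather than the paper's exact formula: it augments the standard Bradley-Terry pairwise loss with auxiliary terms that push the chosen response's reward above zero and the rejected response's reward below zero, in line with the intuition that absolute reward values (not just margins) matter for reasoning. The exact form and weighting used for the Eurus reward model may differ.

```python
import torch
import torch.nn.functional as F

def reward_modeling_loss(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor,
                         aux_weight: float = 1.0) -> torch.Tensor:
    """Hedged sketch of a reasoning-oriented reward-model objective.

    Standard Bradley-Terry only cares about the margin r_chosen - r_rejected.
    This sketch adds terms that additionally push chosen rewards above zero
    and rejected rewards below zero; the paper's actual objective may differ.
    """
    # Pairwise Bradley-Terry term: maximize the chosen/rejected margin.
    l_bt = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Auxiliary terms on absolute rewards: chosen > 0, rejected < 0.
    l_dr = -(F.logsigmoid(r_chosen) + F.logsigmoid(-r_rejected)).mean()
    return l_bt + aux_weight * l_dr
```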

Theoretical and Practical Implications

The introduction of Eurus and UltraInteract not only sets new benchmarks for reasoning in LLMs but also opens avenues for future exploration. The detailed analysis of preference learning algorithms provides foundational insights into what constitutes an effective learning paradigm for complex reasoning. Furthermore, the public release of the Eurus models and the UltraInteract dataset equips the research community with powerful tools to continue advancing the frontier of LLM reasoning.
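For readers who want to experiment with the released artifacts, standard Hugging Face tooling should suffice. The snippet below is a usage sketch only: the repository IDs (`openbmb/Eurus-7b-sft`, `openbmb/UltraInteract_pair`) are assumptions about where the assets are hosted and should be checked against the paper's project page.

```python
# Illustrative only: the Hugging Face repo IDs below are assumptions and may
# differ from the actual release names; verify them before running.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

pairs = load_dataset("openbmb/UltraInteract_pair", split="train")      # assumed ID
tokenizer = AutoTokenizer.from_pretrained("openbmb/Eurus-7b-sft")      # assumed ID
model = AutoModelForCausalLM.from_pretrained("openbmb/Eurus-7b-sft")   # assumed ID

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```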

Concluding Remarks

In sum, Eurus represents a significant stride forward in cultivating LLMs' reasoning capacities. Through UltraInteract's meticulously designed preference trees and the exploration of tailored preference learning techniques, Eurus achieves state-of-the-art results, challenging existing paradigms and setting the stage for future innovations in LLM reasoning generalists. The findings from this research not only elevate the capabilities of open-source models but also furnish valuable strategies for enhancing LLMs' reasoning through specialized alignment and learning methodologies.


References
  1. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proc. of NAACL-HLT
  2. Program Synthesis with Large Language Models
  3. Qwen Technical Report
  4. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
  5. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39
  6. Noise Contrastive Alignment of Language Models with Explicit Rewards
  7. Evaluating large language models trained on code
  8. TheoremQA: A Theorem-driven Question Answering dataset
  9. Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
  10. Training Verifiers to Solve Math Word Problems
  11. UltraFeedback: Boosting Language Models with High-quality Feedback
  12. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
  13. Enhancing chat language models by scaling high-quality instructional conversations. In Conference on Empirical Methods in Natural Language Processing
  14. KTO: Model Alignment as Prospect Theoretic Optimization
  15. Specializing smaller language models towards multi-step reasoning. In Proceedings of the International Conference on Machine Learning
  16. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9
  17. DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
  18. Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
  19. Measuring coding challenge competence with apps. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021a.
  20. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021b.
  21. Mistral 7B
  22. Mixtral of experts. 2024.
  23. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Annual Meeting of the Association for Computational Linguistics, 2023b.
  24. RewardBench: Evaluating reward models for language modeling. 2024.
  25. Generative Judge for Evaluating Alignment
  26. GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers
  27. TACO: Topics in Algorithmic COde generation dataset
  28. Competition-Level Code Generation with AlphaCode
  29. OpenOrca: An open dataset of GPT-augmented FLAN reasoning traces. https://huggingface.co/Open-Orca/OpenOrca
  30. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. 2023.
  31. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In Proceedings of ICLR
  32. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
  33. WizardCoder: Empowering code large language models with Evol-Instruct, 2023b
  34. A diverse corpus for evaluating and developing English math word problem solvers. In Proc. of ACL
  35. NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks. In Proc. of ACL
  36. Orca 2: Teaching Small Language Models How to Reason
  37. Orca-Math: Unlocking the potential of SLMs in Grade School Math
  38. OpenAI. GPT-4 Technical Report
  39. Compositional semantic parsing on semi-structured tables. In Proc. of ACL
  40. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
  41. Creator: Tool creation for disentangling abstract and concrete reasoning of large language models. In Conference on Empirical Methods in Natural Language Processing
  42. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
  43. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  44. Code Llama: Open Foundation Models for Code
  45. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
  46. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
  47. OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
  48. Llama 2: Open Foundation and Fine-Tuned Chat Models
  49. Zephyr: Direct Distillation of LM Alignment
  50. OpenChat: Advancing Open-source Language Models with Mixed-Quality Data
  51. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
  52. Executable Code Actions Elicit Better LLM Agents
  53. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  54. Magicoder: Source code is all you need
  55. CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
  56. Perils of Self-Feedback: Self-Bias Amplifies in Large Language Models
  57. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proc. of EMNLP
  58. Reclor: A reading comprehension dataset requiring logical reasoning. In Proc. of ICLR
  59. CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets
  60. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
  61. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
  62. OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
  63. Instruction-Following Evaluation for Large Language Models
  64. Starling-7B: Improving LLM helpfulness & harmlessness with RLAIF
