Advancing LLM Reasoning Generalists with Preference Trees (2404.02078v1)
Abstract: We introduce Eurus, a suite of LLMs optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning in a comprehensive benchmarking across 12 tests covering five tasks, and achieves 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins of more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks than they are for general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
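The abstract describes each UltraInteract instruction as a preference tree: branching reasoning trajectories judged by the environment, from which (chosen, rejected) pairs are extracted for preference learning. The sketch below illustrates that idea in minimal form; the `Node` schema, field names, and sibling-pairing rule are illustrative assumptions, not UltraInteract's actual data format.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a preference tree. The schema is an assumption
# for illustration only, not the dataset's real format.
@dataclass
class Node:
    """One action (a reasoning step or response) in a multi-turn trajectory."""
    response: str
    correct: bool                 # judged correct/incorrect by the environment
    critique: str = ""            # feedback that conditions the next turn
    children: list = field(default_factory=list)

def pairwise_data(root: Node) -> list:
    """Walk the tree and pair each correct action with an incorrect sibling
    from the same turn, yielding (chosen, rejected) pairs."""
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        chosen = [c for c in node.children if c.correct]
        rejected = [c for c in node.children if not c.correct]
        pairs += [(c.response, r.response) for c in chosen for r in rejected]
        stack += node.children
    return pairs
```

For example, a root with one correct child ("4") and one incorrect child ("5") yields the single pair `("4", "5")`; deeper trees contribute one pair set per turn.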