Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards (2404.10346v4)

Published 16 Apr 2024 in cs.CL

Abstract: Training on large amounts of rationales (i.e., CoT Fine-tuning) is effective at improving the reasoning capabilities of LLMs. However, acquiring human-authored rationales or augmenting rationales from proprietary models is costly and not scalable. In this paper, we study the problem of whether LLMs could self-improve their reasoning capabilities. To this end, we propose Self-Explore, where the LLM is tasked to explore the first wrong step (i.e., the first pit) within the rationale and use such signals as fine-grained rewards for further improvement. On the GSM8K and MATH test set, Self-Explore achieves 11.57% and 2.89% improvement on average across three LLMs compared to supervised fine-tuning (SFT). Our code is available at https://github.com/hbin0701/Self-Explore.

Enhancing LLMs' Reasoning Capabilities Through Self-Training: An Insight into Self-Explore

Introduction to Self-Explore

The development of LLMs has increasingly focused on improving reasoning capabilities through various means, including Chain-of-Thought prompting and fine-tuning with human-authored rationales. Despite the effectiveness of these methods, they are often hindered by the high costs and scalability issues associated with generating and acquiring high-quality rationales. Addressing this challenge, the Self-Explore methodology presents a novel approach to enhance the reasoning faculties of LLMs through self-improvement, leveraging fine-grained rewards derived from the model's own generated rationales.

Methodological Overview

Self-Explore operates as a two-stage process: it first performs step-level exploration within the model's own generated rationales to locate the first incorrect step (the "first pit"), and then uses these signals for further fine-tuning. Concretely, the identified errors are used to build a pairwise dataset of positive and negative step-level samples, which is then trained with a preference learning objective, refining the model's reasoning path at a fine granularity. Notably, Self-Explore demonstrated consistent improvements across three distinct LLMs without depending on distillation from proprietary models.
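To make the exploration stage concrete, below is a minimal Python sketch of how the "first pit" could be located in a rejected rationale and turned into a step-level preference pair. It is an illustrative reading of the method, not the authors' released code: sample_fn, extract_answer, and the exact pairing heuristic are assumptions introduced here for clarity (the official implementation is available at https://github.com/hbin0701/Self-Explore).

```python
from typing import Callable, List, Optional, Tuple

def extract_answer(rationale: str) -> Optional[str]:
    """Rough GSM8K-style answer extractor: text after the final '####' marker."""
    if "####" in rationale:
        return rationale.split("####")[-1].strip()
    return None

def find_first_pit(
    question: str,
    rejected_steps: List[str],          # steps of a rationale whose final answer is wrong
    gold_answer: str,
    sample_fn: Callable[[str, int], List[str]],  # assumed: returns n sampled continuations
    k: int = 4,
) -> Optional[int]:
    """Return the index of the first step from which none of k sampled
    continuations recovers the gold answer, i.e. the presumed first wrong step."""
    prefix = question
    for i, step in enumerate(rejected_steps):
        prefix = prefix + "\n" + step
        completions = sample_fn(prefix, k)
        if not any(extract_answer(c) == gold_answer for c in completions):
            return i  # the "first pit": no sampled path from here reaches the answer
    return None

def build_step_level_pair(
    question: str,
    rejected_steps: List[str],
    pit_idx: int,
    correct_completion: str,            # a correct continuation sampled from the same prefix
) -> Tuple[str, str, str]:
    """Build one (prompt, chosen, rejected) triple: the shared prompt is the question
    plus the steps before the pit; the rejected continuation starts at the pit."""
    prompt = "\n".join([question] + rejected_steps[:pit_idx])
    rejected = "\n".join(rejected_steps[pit_idx:])
    return prompt, correct_completion, rejected
```

Collecting such triples over the training set yields the fine-grained pairwise dataset that the subsequent preference learning stage consumes.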

Empirical Evaluations

Self-Explore was evaluated on the GSM8K and MATH datasets, showing substantial improvements over standard Supervised Fine-Tuning (SFT) across all models. Specifically, gains of 13.19%, 10.23%, and 11.30% on GSM8K and 1.98%, 3.16%, and 3.54% on MATH were observed for Mistral-7B, Llemma-7B, and DeepSeek-Math 7B, respectively. These results underscore the method's effectiveness, particularly compared to approaches based solely on outcome-level supervision.

Theoretical Implications and Future Directions

The advent of Self-Explore not only advances the capabilities of LLMs in processing complex reasoning tasks but also illuminates the potential of self-training mechanisms in circumventing the limitations posed by the acquisition of high-quality training data. The approach suggests a promising trajectory towards realizing more autonomous and efficient methods for improving LLMs, potentially extending beyond mathematical reasoning to broader cognitive domains.

Furthermore, the methodology demonstrates the utility of fine-grained, step-level feedback in refining the reasoning processes of LLMs. By focusing on the first incorrect step in a rationale, Self-Explore provides a more targeted and effective learning signal than general outcome-based supervision. This detailed level of feedback could inspire future works to explore similar fine-grained approaches in other domains or for different types of reasoning tasks, potentially leading to broader applications for self-improvement in AI.
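Once the step-level pairs are collected, the preference learning stage can be instantiated with an off-the-shelf objective such as DPO. The sketch below shows a generic DPO loss over precomputed sequence log-probabilities; it is a standard formulation under stated assumptions (per-example log-probs summed over tokens), not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log-prob of the chosen continuation under the policy
    policy_rejected_logps: torch.Tensor,  # log-prob of the rejected continuation under the policy
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Standard DPO objective applied to (prompt, chosen, rejected) step-level pairs.
    Each input is a 1-D tensor of per-example log-probabilities."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the rejected continuation starts exactly at the first pit, the gradient concentrates on the step where the reasoning goes wrong rather than penalizing an entire incorrect rationale, which is the intended advantage over outcome-level supervision.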

Conclusion

Self-Explore represents a significant stride towards enhancing the reasoning capabilities of LLMs through self-improvement. By efficiently leveraging the model's own generated rationales for fine-tuning, it not only overcomes the practical challenges associated with rationale acquisition but also sets a precedent for future research in self-training methodologies. As we continue to explore these avenues, the potential for developing more nuanced and autonomous LLMs becomes increasingly tangible, promising new frontiers in the field of artificial intelligence reasoning capabilities.

Authors (5)
  1. Hyeonbin Hwang
  2. Doyoung Kim
  3. Seungone Kim
  4. Seonghyeon Ye
  5. Minjoon Seo