SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights (2410.09008v1)

Published 11 Oct 2024 in cs.CL

Abstract: LLMs like GPT-4, PaLM, and LLaMA have shown significant improvements in various reasoning tasks. However, smaller models such as Llama-3-8B and DeepSeekMath-Base still struggle with complex mathematical reasoning because they fail to effectively identify and correct reasoning errors. Recent reflection-based methods aim to address these issues by enabling self-reflection and self-correction, but they still face challenges in independently detecting errors in their reasoning steps. To overcome these limitations, we propose SuperCorrect, a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model by following the teacher's correction traces during training. This cross-model DPO approach teaches the student model to effectively locate and resolve erroneous thoughts with error-driven insights from the teacher model, breaking the bottleneck of its thoughts and acquiring new skills and knowledge to tackle challenging problems. Extensive experiments consistently demonstrate our superiority over previous methods. Notably, our SuperCorrect-7B model significantly surpasses powerful DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on MATH/GSM8K benchmarks, achieving new SOTA performance among all 7B models. Code: https://github.com/YangLing0818/SuperCorrect-LLM

Essay on "SuperCorrect: Supervising and Correcting LLMs with Error-Driven Insights"

The paper "SuperCorrect: Supervising and Correcting LLMs with Error-Driven Insights" addresses the challenges smaller LLMs face when handling complex mathematical reasoning tasks. Despite advancements in LLMs, smaller models like Llama-3-8B and DeepSeekMath-Base continue to exhibit limitations in this domain. The authors propose a two-stage framework, SuperCorrect, that introduces a large teacher model to guide and refine the reasoning and reflection processes of a smaller student model.

Key Contributions

  1. Two-Stage Framework:
    • The framework employs a large teacher model to supervise the smaller student model in reasoning tasks. It integrates thought templates and collaborative optimization techniques to significantly improve the student model's self-correction capabilities.
  2. Hierarchical Thought Templates:
    • In the first stage, hierarchical high-level and detailed thought templates are extracted from the teacher model. These templates guide the student model toward more fine-grained reasoning than prompting methods such as chain-of-thought (CoT) and Buffer of Thoughts (BoT), providing the deeper, error-aware guidance needed for correction (see the illustrative sketch after this list).
  3. Cross-Model Collaborative Direct Preference Optimization (DPO):
    • The second stage introduces cross-model collaborative DPO, which leverages the teacher's correction traces to strengthen the student model's self-correction abilities. By learning from the teacher's error-driven insights, the student model breaks through its reasoning bottlenecks and acquires new problem-solving skills (see the loss sketch after this list).
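
To make the first stage concrete, the sketch below shows one plausible way a hierarchical thought template could be represented and prepended to a problem when prompting the student model. The field names, example content, and the `build_prompt` helper are illustrative assumptions for this summary, not the paper's exact template format.

```python
# Hypothetical sketch of a hierarchical thought template (not the paper's exact format).
# A template pairs a coarse "high-level" strategy with fine-grained "detailed" tactics,
# both distilled from the teacher model, and is prepended to the problem for the student.

template = {
    "problem_type": "quadratic equations",
    "high_level": "Rewrite the equation in standard form, then factor or apply the quadratic formula.",
    "detailed": [
        "Move every term to one side so the equation reads ax^2 + bx + c = 0.",
        "Check whether the discriminant b^2 - 4ac is a perfect square before trying to factor.",
        "Verify each candidate root by substituting it back into the original equation.",
    ],
}

def build_prompt(template: dict, problem: str) -> str:
    """Prepend the hierarchical template to a problem for the student model (hypothetical helper)."""
    detailed = "\n".join(f"- {step}" for step in template["detailed"])
    return (
        f"High-level strategy ({template['problem_type']}):\n"
        f"{template['high_level']}\n\n"
        f"Detailed guidance:\n{detailed}\n\n"
        f"Problem: {problem}\n"
        "Solve step by step, following the guidance above."
    )

print(build_prompt(template, "Solve x^2 - 5x + 6 = 0."))
```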

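The second stage builds on the standard DPO objective. The minimal sketch below shows that loss, under the illustrative assumption that the "chosen" sequence is a teacher-corrected reasoning trace and the "rejected" sequence is the student's original erroneous trace; the paper's full cross-model collaborative objective and data construction are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss. In a cross-model setup (assumed here for illustration),
    'chosen' = teacher-corrected trace, 'rejected' = the student's erroneous trace."""
    # Implicit rewards: log-ratio of the trained policy vs. a frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a larger margin between corrected and erroneous traces.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch: summed token log-probabilities for two preference pairs.
policy_chosen = torch.tensor([-12.3, -15.1])    # student on teacher-corrected traces
policy_rejected = torch.tensor([-14.0, -13.9])  # student on its own erroneous traces
ref_chosen = torch.tensor([-13.0, -15.5])       # frozen reference (pre-DPO student)
ref_rejected = torch.tensor([-13.5, -14.2])

print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Because the reference model stays frozen, the loss rewards the student only for assigning relatively higher likelihood to the corrected traces than to its own faulty ones, rather than for shifting probability mass arbitrarily.
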
Experimental Results

The experiments confirm SuperCorrect's advantage over existing methods. Notably, SuperCorrect-7B outperforms DeepSeekMath-7B by 7.8%/5.3% and Qwen2.5-Math-7B by 15.1%/6.3% on the MATH and GSM8K benchmarks, respectively. These results establish new state-of-the-art performance among all 7B models.

Implications and Future Directions

SuperCorrect has both theoretical and practical implications. Theoretically, it advances the understanding of error correction in LLMs by using external model supervision rather than relying on self-reflection alone. Practically, it offers a way to improve the performance of smaller models, which are often more accessible because of their lower computational requirements.

Looking forward, the framework could be extended to explore:

  • Generalizations to larger models,
  • Applications across diverse reasoning tasks beyond mathematics,
  • Further optimization of cross-model corrective interactions.

Conclusion

SuperCorrect demonstrates how error-driven supervision can markedly improve the ability of smaller LLMs to handle mathematically intensive reasoning tasks. It outlines a practical pathway for employing larger models as teachers for their smaller counterparts, improving LLM performance in an efficient and scalable manner.

Authors
  1. Ling Yang
  2. Zhaochen Yu
  3. Tianjun Zhang
  4. Minkai Xu
  5. Joseph E. Gonzalez
  6. Bin Cui
  7. Shuicheng Yan