
Boosting LLM Reasoning via Spontaneous Self-Correction (2506.06923v1)

Published 7 Jun 2025 in cs.AI

Abstract: While LLMs have demonstrated remarkable success on a broad range of tasks, math reasoning remains a challenging one. One of the approaches for improving math reasoning is self-correction, which designs self-improving loops to let the model correct its own mistakes. However, existing self-correction approaches treat corrections as standalone post-generation refinements, relying on extra prompt and system designs to elicit self-corrections, instead of performing real-time, spontaneous self-corrections in a single pass. To address this, we propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference time compute. SPOC considers a multi-agent perspective by assigning dual roles -- solution proposer and verifier -- to the same model. We adopt a simple yet effective approach to generate synthetic data for fine-tuning, enabling the model to develop capabilities for self-verification and multi-agent collaboration. We further improve its solution proposal and verification accuracy through online reinforcement learning. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama-3.1-8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively.

Overview of "Boosting LLM Reasoning via Spontaneous Self-Correction"

Introduction

The paper "Boosting LLM Reasoning via Spontaneous Self-Correction" addresses the persistent challenge of mathematical reasoning within LLMs. Mathematical tasks are notably difficult due to their structured and symbolic nature. Although recent self-correction paradigms have demonstrated potential, questions remain about their efficiency and practicality, as they often require explicit external feedback mechanisms or extra prompting to elicit corrections.

SPOC Framework

To counter these challenges, this paper introduces the Spontaneous Self-Correction (SPOC) approach. Unlike traditional methods that depend on predefined prompt strategies to trigger correction, SPOC lets the model generate interleaved solutions and verifications within a single inference pass. Self-correction is initiated only when self-verification flags an error, and generation terminates dynamically once a solution passes verification, all without external prompts during generation.
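The single-pass interleaved loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `propose` and `verify` stand in for the same underlying model prompted in its two roles, and are mocked here with toy functions.

```python
def spoc_generate(problem, propose, verify, max_rounds=4):
    """Interleave solution proposals and self-verifications in one pass.

    Generation terminates as soon as a proposal passes verification;
    otherwise the failed transcript is fed back for another attempt.
    """
    transcript = [problem]
    solution = None
    for _ in range(max_rounds):
        solution = propose(transcript)   # proposer role
        transcript.append(solution)
        verdict = verify(transcript)     # verifier role, same model
        transcript.append(verdict)
        if verdict == "CORRECT":         # stop early on success
            break
    return solution, transcript

# Toy roles: the "model" answers 2+2 wrongly once, then corrects itself.
attempts = iter(["5", "4"])
propose = lambda transcript: next(attempts)
verify = lambda transcript: "CORRECT" if transcript[-1] == "4" else "INCORRECT"

solution, transcript = spoc_generate("What is 2+2?", propose, verify)
print(solution)  # -> 4, reached after one spontaneous self-correction
```

The key design point is that the stopping condition lives inside the generation loop itself, which is how SPOC scales inference-time compute only on problems where verification fails.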

Methodology

The SPOC approach leverages a multi-agent formalism in which solution proposals and verifications occur as interactions between a solution proposer and a verifier, with both roles played by the same model. This dual-role framework allows the model to engage in self-play training to enhance its capabilities without a stronger teacher model. SPOC's pipeline comprises a synthetic data generation phase, followed by supervised fine-tuning to establish the multi-turn generation style, and culminates in online reinforcement learning to further improve solution-proposal and verification accuracy.
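One way the synthetic fine-tuning data could be assembled is sketched below. This is an illustrative reconstruction under stated assumptions, not the paper's exact data format: sampled solutions are labeled by comparison against a reference answer, and wrong drafts are chained with a verifier turn and a subsequent corrected attempt, yielding one interleaved multi-turn training sequence. The helper `extract_answer` is a hypothetical stand-in for real answer parsing.

```python
def extract_answer(solution):
    # Toy answer extraction: take the last whitespace-separated token.
    return solution.split()[-1]

def build_training_sequence(problem, attempts, reference_answer):
    """Assemble one interleaved proposer/verifier training sequence.

    `attempts` is a list of sampled solutions (typically ending in a
    correct one); each gets a verifier turn labeled against the reference.
    """
    turns = [("user", problem)]
    for attempt in attempts:
        turns.append(("proposer", attempt))
        label = ("CORRECT" if extract_answer(attempt) == reference_answer
                 else "INCORRECT")
        turns.append(("verifier", label))
    return turns

# A wrong draft followed by a self-corrected solution.
seq = build_training_sequence(
    "Compute 3*7.",
    ["3*7 = 27", "3*7 = 21"],
    "21",
)
print(seq)
```

Fine-tuning on sequences like this teaches the model both behaviors at once: producing solutions in the proposer role and judging them in the verifier role.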

Experimental Results

Empirical evaluations demonstrate significant improvements in pass@1 accuracy across mathematical reasoning tasks and model sizes. Llama-3.1-8B and 70B Instruct models gain 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively. These improvements are achieved without distillation from stronger models, highlighting SPOC's robust self-correction capabilities.

Implications and Future Directions

SPOC carries important implications for optimizing reasoning in LLMs. By enabling dynamic inference-time scaling, it offers practical gains in both computational efficiency and model accuracy, and its framing suggests directions for self-correction research beyond mathematical domains. Future investigations might extend SPOC to correcting partial solutions within longer reasoning chains, or adapt its methodology to other complex reasoning tasks, broadening LLM applicability across diverse fields.

Conclusion

The SPOC framework represents a critical advancement in LLM reasoning strategies, allowing models to autonomously verify and correct their outputs efficiently. This synergy of self-play with synthetic fine-tuning offers a promising trajectory for elevating the logical and analytical capabilities of LLMs in practical deployments. As research on artificial intelligence progresses, methodologies like SPOC could become foundational to enabling increasingly sophisticated AI reasoning and decision-making tasks.

Authors (14)
  1. Xutong Zhao (4 papers)
  2. Tengyu Xu (27 papers)
  3. Xuewei Wang (14 papers)
  4. Zhengxing Chen (20 papers)
  5. Di Jin (104 papers)
  6. Liang Tan (22 papers)
  7. Yen-Ting (1 paper)
  8. Zishun Yu (7 papers)
  9. Zhuokai Zhao (21 papers)
  10. Yun He (26 papers)
  11. Sinong Wang (45 papers)
  12. Han Fang (61 papers)
  13. Sarath Chandar (93 papers)
  14. Chen Zhu (103 papers)