Enhancing LLM Reasoning through Multi-round Test-time Thinking
The paper "Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking" introduces an innovative approach to improving the reasoning capabilities of LLMs through a method termed "Multi-round Thinking." This approach seeks to overcome existing limitations in handling long texts and reinforcement learning (RL) training efficiency, which are significant challenges in the current landscape of LLM development.
Core Concept and Methodology
Multi-round Thinking operates by iteratively refining a model's responses at test time, using each round's answer as a prompt for the next round of reasoning. This technique allows the model to re-evaluate and adjust its conclusions, correcting cognitive errors embedded in prior responses. Specifically, the process discards the intermediate reasoning steps of each round and carries only the final answer forward as part of the prompt for the subsequent iteration. This iterative mechanism is designed to mimic human cognitive strategies, such as reconsidering a problem from a fresh starting point, thereby helping the model break free from entrenched reasoning patterns.
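To make the mechanism concrete, a minimal Python sketch of the loop follows. The `generate` callable, the `\boxed{...}` answer delimiter, and the exact re-prompt wording are illustrative assumptions, not the paper's verbatim template.

    import re

    def extract_final_answer(response: str) -> str:
        """Discard intermediate reasoning, keeping only the final answer.
        The \\boxed{...} delimiter is an assumed convention here, not the
        paper's specified format."""
        match = re.search(r"\\boxed\{(.+?)\}", response)
        return match.group(1) if match else response.strip()

    def multi_round_thinking(generate, question: str, rounds: int = 2) -> str:
        """Iteratively re-prompt a model with its own previous final answer.

        `generate` is any callable mapping a prompt string to a completion
        string (e.g. a thin wrapper around an LLM API)."""
        prompt = question
        answer = ""
        for _ in range(rounds):
            response = generate(prompt)
            answer = extract_final_answer(response)
            # Only the question and the previous final answer cross the
            # round boundary; earlier chains of thought are discarded.
            prompt = (
                f"{question}\n\n"
                f"A previous answer to this question was: {answer}\n"
                "Please reconsider the problem independently and answer again."
            )
        return answer

The key design choice is that only the final answer crosses round boundaries: by withholding its earlier chain of thought, the model is nudged to reason afresh rather than anchor on a possibly flawed derivation.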
Experimental Validation
The authors conducted extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, demonstrating consistent performance improvements on benchmarks such as AIME 2024, MATH-500, GPQA-diamond, and LiveCodeBench. Notably, on AIME 2024 the accuracy of QwQ-32B increased from 80.3% to 82.1% between the first and second rounds, while DeepSeek-R1 showed a comparable improvement from 79.7% to 82.0%. These results indicate that Multi-round Thinking delivers stable gains across diverse tasks.
Theoretical and Practical Implications
The findings carry both theoretical and practical weight. Theoretically, Multi-round Thinking underscores the potential of test-time scaling strategies in LLMs, offering insight into how iterative reasoning can mirror and exploit the strengths of human thought patterns. Practically, the approach provides a straightforward mechanism for improving model accuracy without any additional training overhead, which is valuable for real-world applications that require reliable and efficient reasoning.
Limitations and Future Directions
Despite the promising results, the paper acknowledges certain limitations in current methodologies, such as defining fine-grained reasoning steps and mitigating reward hacking during reinforcement learning. These challenges highlight areas for further research and optimization. Moreover, while preliminary experiments combining Multi-round Thinking with supervised fine-tuning did not yield immediate improvements, they pave the way for future investigations into hybrid approaches that harness high-quality reasoning data to enrich the iterative thinking process.
Conclusion
In conclusion, "Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking" presents a compelling advance in LLM reasoning methodology. By introducing Multi-round Thinking, the paper offers a practical, efficient, and broadly applicable strategy for enhancing model reasoning, drawing parallels to human cognitive processes. The approach not only demonstrates measurable improvements across challenging benchmarks but also points to promising avenues for further work on test-time scaling techniques. As LLMs continue to evolve, strategies like Multi-round Thinking are poised to play a pivotal role in refining AI reasoning, expanding both its theoretical foundation and its practical applications.