- The paper shows that simply scaling Chain-of-Thought (CoT) lengths can degrade LLM reasoning performance, and that an optimal CoT length exists which varies by problem domain.
- It proposes Thinking-Optimal Scaling (TOPS), a strategy in which a model first learns different levels of reasoning effort from a small seed set and then iteratively self-improves on the shortest correct responses it can generate.
- Quantitatively, TOPS-enhanced models outperform larger models and distillation-based baselines on math benchmarks (GSM8K, MATH500, AIME2024) while avoiding unnecessarily long reasoning on simpler problems.
The paper "Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning" explores the impact of scaling the Chain of Thoughts (CoTs) on the reasoning performance of LLMs, particularly in mathematical reasoning tasks. While extending CoTs has been recognized for improving complex reasoning tasks, the authors address an underexplored issue: whether excessive scaling of CoT lengths could degrade reasoning performance.
Key Findings
- Impact of Overextending CoTs:
  - Through experiments on mathematical tasks, the paper shows that longer CoTs can impair the reasoning performance of LLMs and that there is an optimal CoT length, which varies across problem domains.
- Introducing Thinking-Optimal Scaling (TOPS):
  - The authors propose a Thinking-Optimal Scaling strategy. A small set of seed data with responses of varying lengths first teaches the model different levels of reasoning effort. The model then applies these reasoning intensities to additional problems and keeps the shortest correct response for each, so the LLM is self-improved on a dataset curated from its own reasoning distribution (see the sketch following this list).
- Quantitative Results:
  - Models trained with the TOPS strategy outperform distillation-based baselines and other prevalent o1-like reasoning models such as QwQ-32B-Preview across multiple math benchmarks, including GSM8K, MATH500, and AIME2024. Notably, the TOPS-trained models outperform larger models while using fewer tokens, demonstrating improved efficiency by avoiding unnecessarily complex reasoning on simpler problems.
- Iterative Self-Improvement:
  - The paper outlines an iterative self-improvement procedure: the initial TOPS-enhanced model generates multiple responses per problem, the shortest correct response is selected for each, and the model is trained further on this data, yielding additional gains, especially on challenging datasets.
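The selection step at the heart of TOPS can be illustrated with a minimal sketch. The helper names below (`generate`, `is_correct`, the effort tags) are hypothetical placeholders, not the paper's actual prompts or pipeline; the sketch only shows the idea of keeping the shortest correct response across effort levels and curating a self-improvement set from it.

```python
EFFORT_TAGS = ["low", "medium", "high"]  # assumed effort levels

def generate(problem: str, effort: str, n_samples: int = 4) -> list[str]:
    """Placeholder: sample `n_samples` CoT responses from the tag model,
    conditioned on the given reasoning-effort tag."""
    raise NotImplementedError

def is_correct(response: str, reference_answer: str) -> bool:
    """Placeholder: extract the final answer from a response and compare it
    to the reference answer."""
    raise NotImplementedError

def shortest_correct_response(problem: str, reference_answer: str) -> str | None:
    """Return the shortest correct response found across all effort levels,
    or None if no sampled response is correct."""
    correct = [
        response
        for effort in EFFORT_TAGS
        for response in generate(problem, effort)
        if is_correct(response, reference_answer)
    ]
    # Character length is used here for simplicity; token count would be the
    # more faithful measure of reasoning effort.
    return min(correct, key=len) if correct else None

def build_self_improvement_set(problems: list[tuple[str, str]]) -> list[dict]:
    """Curate a fine-tuning set from the model's own reasoning distribution:
    keep only (problem, shortest correct response) pairs."""
    dataset = []
    for problem, answer in problems:
        best = shortest_correct_response(problem, answer)
        if best is not None:
            dataset.append({"problem": problem, "response": best})
    return dataset
```

Iterating this loop (fine-tune on the curated set, regenerate, reselect) corresponds to the self-improvement rounds described above.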
Methodology
- Preliminary Analysis: The paper first analyzes existing o1-like models to evaluate the effectiveness and efficiency of current test-time scaling methodologies.
- Training Tag Models: New models are trained on data generated under controlled reasoning efforts (low, medium, high) to study the impact of different response-length distributions (a sketch of one possible tagging format appears after this list).
- Performance Analysis: The authors analyze the scaling process in detail and identify that the accumulation of erroneous steps in longer CoTs contributes to degraded performance (see the accuracy-by-length sketch below).
- Model Evaluation: Models are evaluated under test-time compute scaling on math benchmarks such as GSM8K, MATH500, and AIME2024, reporting both accuracy and token usage.
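One plausible way to realize the effort-tagged training data from the "Training Tag Models" step is to encode the reasoning-effort level in each example's system prompt. The tag wording and chat format below are assumptions for illustration, not the paper's exact setup.

```python
# Assumed system-prompt templates for the three reasoning-effort levels;
# the paper's actual tag wording is not reproduced here.
EFFORT_PROMPTS = {
    "low": "Solve the problem with a brief chain of thought.",
    "medium": "Solve the problem with a moderately detailed chain of thought.",
    "high": "Solve the problem with an extensive, step-by-step chain of thought.",
}

def make_tagged_example(problem: str, response: str, effort: str) -> dict:
    """Format one seed example as a chat-style SFT record whose system
    prompt encodes the reasoning-effort tag."""
    return {
        "messages": [
            {"role": "system", "content": EFFORT_PROMPTS[effort]},
            {"role": "user", "content": problem},
            {"role": "assistant", "content": response},
        ]
    }
```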
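For the performance analysis and evaluation steps, a simple harness along the following lines could report accuracy, average token usage, and accuracy as a function of CoT length; `count_tokens` and the record format are placeholders rather than the paper's evaluation code.

```python
from collections import defaultdict
from typing import Callable

def evaluate(records: list[dict], count_tokens: Callable[[str], int]) -> dict:
    """Compute accuracy and mean response length (in tokens) over a benchmark.
    Each record is expected to look like {"response": str, "correct": bool}."""
    n = len(records)
    accuracy = sum(r["correct"] for r in records) / n
    avg_tokens = sum(count_tokens(r["response"]) for r in records) / n
    return {"accuracy": accuracy, "avg_tokens": avg_tokens}

def accuracy_by_length(
    records: list[dict],
    count_tokens: Callable[[str], int],
    bucket_size: int = 512,
) -> dict[int, float]:
    """Bucket responses by token length and report accuracy per bucket, to
    check whether accuracy drops off as CoTs grow longer."""
    buckets = defaultdict(list)
    for r in records:
        buckets[count_tokens(r["response"]) // bucket_size].append(r["correct"])
    return {b * bucket_size: sum(v) / len(v) for b, v in sorted(buckets.items())}
```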
Implications
This work emphasizes the significance of optimal compute distribution over mere extension of CoTs. The insights and strategies presented could have practical implications for enhancing the efficiency and reasoning capabilities of LLMs, particularly in domains where computational resources and time efficiency are pivotal.
Overall, the paper contributes a nuanced understanding of CoT scaling, challenging the notion that longer always means better in the context of LLM reasoning enhancement. The TOPS strategy presents a promising avenue for achieving optimal test-time performance through adaptive reasoning efforts.