- The paper shows that simply scaling Chain-of-Thought (CoT) lengths can degrade LLM reasoning performance, and that an optimal CoT length exists which varies by problem domain.
- It proposes Thinking-Optimal Scaling (TOPS), a strategy in which a model first learns different levels of reasoning effort from a small seed set and then iteratively self-improves on the shortest correct responses it can generate.
- Quantitatively, TOPS-enhanced models outperform larger models and distillation-based baselines on math benchmarks (GSM8K, MATH500, AIME2024) while avoiding unnecessarily long reasoning on simpler problems.
The paper "Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning" explores the impact of scaling the Chain of Thoughts (CoTs) on the reasoning performance of LLMs, particularly in mathematical reasoning tasks. While extending CoTs has been recognized for improving complex reasoning tasks, the authors address an underexplored issue: whether excessive scaling of CoT lengths could degrade reasoning performance.
Key Findings
- Impact of Overextending CoTs:
  - Through experiments on mathematical tasks, the paper shows that longer CoTs can impair the reasoning performance of LLMs and that there is an optimal CoT length, which varies across problem domains.
- Introducing Thinking-Optimal Scaling (TOPS):
  - The authors propose a Thinking-Optimal Scaling strategy. A small set of seed data with responses of varying lengths first teaches the model different levels of reasoning effort. The model then applies these reasoning intensities to additional problems and keeps the shortest correct response for each, so the LLM is self-improved on a dataset curated from its own reasoning distribution (see the sketch following this list).
- Quantitative Results:
  - Models trained with the TOPS strategy outperform distillation-based baselines and other prevalent o1-like reasoning models such as QwQ-32B-Preview across multiple math benchmarks, including GSM8K, MATH500, and AIME2024. Notably, the TOPS-trained models outperform larger models while using fewer tokens, demonstrating improved efficiency by avoiding unnecessarily complex reasoning on simpler problems.
- Iterative Self-Improvement:
  - The paper outlines an iterative self-improvement procedure: the initial TOPS-enhanced model generates multiple responses per problem, the shortest correct response is selected for each, and the model is trained further on this data, yielding additional gains, especially on challenging datasets.
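The selection step at the heart of TOPS can be illustrated with a minimal sketch. The helper names below (`generate`, `is_correct`, the effort tags) are hypothetical placeholders, not the paper's actual prompts or pipeline; the sketch only shows the idea of keeping the shortest correct response across effort levels and curating a self-improvement set from it.

```python
EFFORT_TAGS = ["low", "medium", "high"]  # assumed effort levels

def generate(problem: str, effort: str, n_samples: int = 4) -> list[str]:
    """Placeholder: sample `n_samples` CoT responses from the tag model,
    conditioned on the given reasoning-effort tag."""
    raise NotImplementedError

def is_correct(response: str, reference_answer: str) -> bool:
    """Placeholder: extract the final answer from a response and compare it
    to the reference answer."""
    raise NotImplementedError

def shortest_correct_response(problem: str, reference_answer: str) -> str | None:
    """Return the shortest correct response found across all effort levels,
    or None if no sampled response is correct."""
    correct = [
        response
        for effort in EFFORT_TAGS
        for response in generate(problem, effort)
        if is_correct(response, reference_answer)
    ]
    # Character length is used here for simplicity; token count would be the
    # more faithful measure of reasoning effort.
    return min(correct, key=len) if correct else None

def build_self_improvement_set(problems: list[tuple[str, str]]) -> list[dict]:
    """Curate a fine-tuning set from the model's own reasoning distribution:
    keep only (problem, shortest correct response) pairs."""
    dataset = []
    for problem, answer in problems:
        best = shortest_correct_response(problem, answer)
        if best is not None:
            dataset.append({"problem": problem, "response": best})
    return dataset
```

Iterating this loop (fine-tune on the curated set, regenerate, reselect) corresponds to the self-improvement rounds described above.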
Methodology
- Preliminary Analysis: The paper first analyzes existing o1-like models to evaluate the effectiveness and efficiency of current test-time scaling methodologies.
- Training Tag Models: New models are trained on data generated under controlled reasoning efforts (low, medium, high) to study the impact of different response-length distributions (a sketch of one possible tagging format appears after this list).
- Performance Analysis: The authors analyze the scaling process in detail and identify that the accumulation of erroneous steps in longer CoTs contributes to degraded performance (see the accuracy-by-length sketch below).
- Model Evaluation: Models are evaluated under test-time compute scaling on math benchmarks such as GSM8K, MATH500, and AIME2024, reporting both accuracy and token usage.
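One plausible way to realize the effort-tagged training data from the "Training Tag Models" step is to encode the reasoning-effort level in each example's system prompt. The tag wording and chat format below are assumptions for illustration, not the paper's exact setup.

```python
# Assumed system-prompt templates for the three reasoning-effort levels;
# the paper's actual tag wording is not reproduced here.
EFFORT_PROMPTS = {
    "low": "Solve the problem with a brief chain of thought.",
    "medium": "Solve the problem with a moderately detailed chain of thought.",
    "high": "Solve the problem with an extensive, step-by-step chain of thought.",
}

def make_tagged_example(problem: str, response: str, effort: str) -> dict:
    """Format one seed example as a chat-style SFT record whose system
    prompt encodes the reasoning-effort tag."""
    return {
        "messages": [
            {"role": "system", "content": EFFORT_PROMPTS[effort]},
            {"role": "user", "content": problem},
            {"role": "assistant", "content": response},
        ]
    }
```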
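For the performance analysis and evaluation steps, a simple harness along the following lines could report accuracy, average token usage, and accuracy as a function of CoT length; `count_tokens` and the record format are placeholders rather than the paper's evaluation code.

```python
from collections import defaultdict
from typing import Callable

def evaluate(records: list[dict], count_tokens: Callable[[str], int]) -> dict:
    """Compute accuracy and mean response length (in tokens) over a benchmark.
    Each record is expected to look like {"response": str, "correct": bool}."""
    n = len(records)
    accuracy = sum(r["correct"] for r in records) / n
    avg_tokens = sum(count_tokens(r["response"]) for r in records) / n
    return {"accuracy": accuracy, "avg_tokens": avg_tokens}

def accuracy_by_length(
    records: list[dict],
    count_tokens: Callable[[str], int],
    bucket_size: int = 512,
) -> dict[int, float]:
    """Bucket responses by token length and report accuracy per bucket, to
    check whether accuracy drops off as CoTs grow longer."""
    buckets = defaultdict(list)
    for r in records:
        buckets[count_tokens(r["response"]) // bucket_size].append(r["correct"])
    return {b * bucket_size: sum(v) / len(v) for b, v in sorted(buckets.items())}
```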
Implications
This work emphasizes the significance of optimal compute distribution over mere extension of CoTs. The insights and strategies presented could have practical implications for enhancing the efficiency and reasoning capabilities of LLMs, particularly in domains where computational resources and time efficiency are pivotal.
Overall, the paper contributes a nuanced understanding of CoT scaling, challenging the notion that longer always means better in the context of LLM reasoning enhancement. The TOPS strategy presents a promising avenue for achieving optimal test-time performance through adaptive reasoning efforts.