Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? (2502.12215v2)
Abstract: The advent of test-time scaling in LLMs, exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors like QwQ, Deepseek-R1 (R1) and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study found that longer CoTs of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows this phenomenon is closely related to models' self-revision capabilities - longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared to conventional majority voting approaches.
Summary
- The paper challenges the assumption that increasing Chain-of-Thought (CoT) length consistently improves accuracy in o1-like models such as QwQ and R1, finding that performance does not continuously improve as CoTs grow longer.
- Insufficient self-revision capabilities and the tendency for correct solutions to be shorter than incorrect ones are key reasons for the failure of sequential scaling in these models.
- The authors propose Shortest Majority Vote, a parallel scaling method that outperforms conventional majority voting by prioritizing shorter solutions, significantly improving test-time scalability.
The paper "Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?" investigates the test-time scaling capabilities of LLMs such as QwQ, Deepseek-R1 (R1), and LIMO, which are designed to replicate the performance of the OpenAI o1 series. The paper challenges the assumption that increasing the Chain-of-Thought (CoT) length during inference consistently improves accuracy in these models.
The authors discover that longer CoTs do not always enhance accuracy, and correct solutions are often shorter than incorrect ones for the same questions. They attribute this phenomenon to the models' self-revision capabilities, where longer CoTs contain more self-revisions that can lead to performance degradation. The paper compares sequential and parallel scaling strategies on QwQ, R1, and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these findings, the authors propose "Shortest Majority Vote," a method that combines parallel scaling strategies with CoT length characteristics, improving models' test-time scalability compared to conventional majority voting approaches.
The paper begins by questioning whether models like QwQ, R1, and LIMO truly possess test-time scaling capabilities, where performance consistently improves with longer CoTs. The authors systematically investigate the relationship between CoT length and reasoning performance, challenging the assumption that extended reasoning chains inherently lead to improved accuracy. Their analysis reveals that longer CoTs do not consistently improve accuracy and that the average length of correct solutions is shorter than that of incorrect ones.
To understand why longer CoTs do not lead to better performance, the authors compare long CoTs with short CoTs, finding that long CoTs contain more self-revisions. They iteratively prompt QwQ, R1, and LIMO for more self-revisions and observe that some models exhibit performance degradation as the length of reflection increases, while others show initial improvements followed by oscillatory behavior. The authors evaluate the models' capacity to revise incorrect answers, finding that QwQ, R1, and LIMO demonstrate limited ability to convert incorrect answers to correct ones during the revision process.
Given the limited effectiveness of sequential scaling, the authors explore parallel scaling. Their comparative analysis reveals that parallel scaling achieves better coverage and scalability than sequential scaling for QwQ and R1, indicating that these models have limited sequential-scaling capability but strong parallel-scaling capability.
Building on these findings, the authors propose Shortest Majority Vote, a test-time scaling method that incorporates parallel scaling approaches with insights on sequential scaling. This method prioritizes clusters that have more solutions and shorter solution lengths. Experimental results demonstrate that Shortest Majority Vote substantially outperforms conventional Majority Vote, significantly improving the test-time scalability of both QwQ and R1 models.
Key findings include:
- The performance of o1-like models QwQ, R1, and LIMO cannot be continuously improved by increasing CoT length.
- Insufficient self-revision capability is a primary reason for the failure of sequential scaling in these models.
- Parallel scaling achieves better coverage and scalability than sequential scaling.
The paper evaluates the models on the MATH-500, AIME, Omni-MATH, and GPQA benchmarks. The models were run with the SGLang framework, with the sampling temperature set to 0.7 and the maximum generation length set to 32k tokens. For AIME, the AIMO validation set of 90 questions was used; for GPQA, the diamond subset of 198 questions.
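As a rough illustration of this setup (not the authors' released code), the following sketch uses SGLang's offline engine API; the model path and prompts are placeholders, and the exact API surface may vary across SGLang versions.

```python
# Sketch of the evaluation setup described above (not the authors' code).
# Assumes SGLang's offline Engine API; model path and prompts are placeholders.
import sglang as sgl

llm = sgl.Engine(model_path="Qwen/QwQ-32B-Preview")  # any o1-like model

sampling_params = {
    "temperature": 0.7,       # sampling temperature used in the paper
    "max_new_tokens": 32768,  # 32k maximum generation length
}

prompts = ["<math question 1>", "<math question 2>"]  # benchmark questions
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out["text"][:200])
```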
The paper investigates whether the accuracy of QwQ, R1, and LIMO genuinely improves with increasing CoT length. The authors sample each model five times on the same question and sort the five solutions by length in ascending order. They group the solutions based on their rank in this sorted list, with the i-th ranked solutions forming a distinct group.
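A minimal sketch of this length-rank grouping, assuming a hypothetical data layout in which each question maps to its sampled solutions as (token_count, is_correct) pairs:

```python
# Group k sampled solutions per question by length rank (sketch; data layout assumed).
from statistics import mean

# solutions[q] = list of (num_tokens, is_correct) tuples, k samples per question
solutions = {
    "q1": [(1200, True), (980, True), (2400, False), (1800, True), (3100, False)],
    "q2": [(800, True), (1500, False), (900, True), (2100, False), (1700, True)],
}

k = 5
groups = [[] for _ in range(k)]  # groups[i] holds the i-th shortest solution of each question
for samples in solutions.values():
    for rank, sol in enumerate(sorted(samples, key=lambda s: s[0])):
        groups[rank].append(sol)

for i, group in enumerate(groups):
    avg_len = mean(length for length, _ in group)
    acc = mean(correct for _, correct in group)
    print(f"rank {i}: avg length {avg_len:.0f} tokens, accuracy {acc:.2f}")
```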
The average lengths of the five groups of solutions are compared. The average length of the longest solutions is approximately twice that of the shortest solutions, indicating that long-CoT models like QwQ, R1, and LIMO exhibit high diversity in the lengths of the solutions they sample.
The accuracy of the five groups of solutions is also analyzed. The results indicate no clear correlation between solution length and model size. A comparison of solution lengths across datasets shows that solutions for simpler datasets, such as MATH-500, are significantly shorter than those for harder datasets, such as AIME, suggesting that the models adjust solution length to problem difficulty. Neither QwQ nor R1 shows a consistent improvement in accuracy as solution length increases, and in some cases accuracy decreases with longer CoTs, especially on the more difficult datasets.
To clarify the relationship between CoT length and accuracy, the authors compare the lengths of correct and incorrect solutions for the same question. They identify questions that have both correct and incorrect answers and calculate the average length of correct and incorrect solutions for each question. The results show that, for QwQ, R1, and LIMO, across all model sizes and datasets, the length of correct solutions is consistently shorter than that of incorrect solutions.
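The per-question length comparison can be sketched as follows, reusing the same hypothetical (length, is_correct) layout:

```python
# Compare average lengths of correct vs. incorrect solutions per question (sketch).
from statistics import mean

def length_gap(solutions):
    """For questions with both correct and incorrect answers, return the
    overall mean length of correct and of incorrect solutions."""
    correct_lens, incorrect_lens = [], []
    for samples in solutions.values():
        right = [length for length, ok in samples if ok]
        wrong = [length for length, ok in samples if not ok]
        if right and wrong:  # keep only questions with both outcomes
            correct_lens.append(mean(right))
            incorrect_lens.append(mean(wrong))
    return mean(correct_lens), mean(incorrect_lens)
```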
The paper then investigates why long solutions are less accurate than short ones. The authors analyze how the maximum token limit affects generation performance and confirm that the observed invalid-scaling phenomenon is not caused by the cap on generation length. Examining the differences between long and short solutions, they find that long solutions contain self-revisions far more frequently, suggesting a strong link between self-revision, solution length, and accuracy.
The maximum token limit caps how many tokens a model can generate per question and plays a critical role in accuracy, especially for long solutions. The results reveal that 16k is a key threshold: below it, the limit significantly degrades model performance, while raising it beyond 16k yields diminishing returns, particularly for QwQ.
To understand why the long solutions of QwQ, R1, and LIMO are no better than the short ones, the authors analyze their differences. They observe that all three models extend solution length primarily through self-revision, marked by tokens such as "Wait" and "Alternatively". The results demonstrate a strong linear correlation between solution length and the frequency of these self-revision markers for all models, suggesting that self-revision plays a significant role in generating longer solutions.
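A small sketch of this marker analysis; the regular expression and the whitespace-based length proxy are simplifications, not the paper's exact procedure:

```python
# Sketch: count self-revision markers and check their correlation with length.
import re
from statistics import correlation  # Pearson's r, available in Python 3.10+

MARKERS = re.compile(r"\b(?:Wait|Alternatively)\b")

def count_revisions(solution: str) -> int:
    return len(MARKERS.findall(solution))

# Toy stand-ins for sampled CoT solutions
solutions = [
    "Compute directly... done.",
    "Try x=2... Wait, recheck the sign... so the answer is 7.",
    "Factor the quadratic... Wait, that fails. Alternatively, substitute "
    "y = x - 1 and expand... Wait, verify the boundary case...",
]
lengths = [len(s.split()) for s in solutions]        # crude proxy for token count
revisions = [count_revisions(s) for s in solutions]
print(correlation(lengths, revisions))               # near 1.0 => strong linear link
```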
The authors prompt the models to continue thinking from their sampled solutions. To make the continuation smoother, they remove the "final answer" portion of each solution and append either "Wait" or "Alternatively" as the self-revision prompt, choosing whichever of the two tokens the model assigns the higher next-token probability. Analyzing accuracy after sequential revision, they find that the accuracy of QwQ and R1-Distill-1.5b steadily decreases as the number of revision steps grows, while the accuracy of R1-Distill-32b, R1-Distill-14b, and LIMO initially improves and then oscillates with further steps.
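A hedged sketch of this continuation step using the Hugging Face transformers API; the model name is one plausible choice, and treating each marker as its first sub-token is a simplification:

```python
# Sketch: pick "Wait" vs "Alternatively" by next-token probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # any o1-like causal LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

def revision_prompt(solution_without_final_answer: str) -> str:
    """Append whichever marker the model itself rates as more probable."""
    ids = tok(solution_without_final_answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # logits over the next token
    probs = logits.softmax(dim=-1)
    # Compare the first sub-token of each marker (markers may span several tokens)
    wait_id = tok("Wait", add_special_tokens=False).input_ids[0]
    alt_id = tok("Alternatively", add_special_tokens=False).input_ids[0]
    marker = "Wait" if probs[wait_id] > probs[alt_id] else "Alternatively"
    return solution_without_final_answer + marker
```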
To further probe the effectiveness of self-revision, the authors measure how often, while scaling solution length, the model corrects an initially incorrect answer versus corrupting an initially correct one. They find that the proportion of incorrect-to-correct changes is extremely low, always below 10%.
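A sketch of this transition tally, with a hypothetical layout in which each question maps to the sequence of answers produced after successive revision steps:

```python
# Sketch: tally answer transitions across revision steps (data layout assumed).
# answers[q] = list of answers after each revision step; gold[q] = reference answer.
def transition_rates(answers, gold):
    i2c = c2i = total = 0
    for q, steps in answers.items():
        for prev, curr in zip(steps, steps[1:]):
            total += 1
            if prev != gold[q] and curr == gold[q]:
                i2c += 1  # incorrect -> correct
            elif prev == gold[q] and curr != gold[q]:
                c2i += 1  # correct -> incorrect
    return i2c / total, c2i / total
```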
The authors compare sequential and parallel scaling in terms of coverage (pass@k) and accuracy for QwQ and R1. For sequential scaling, they iteratively prompt models to self-revise for 40 steps; for parallel scaling, they sample 10 solutions in parallel. Coverage is measured as the proportion of questions for which the set of candidate answers contains at least one correct answer. The findings show that, for the same number of generated tokens, parallel scaling yields a significantly larger improvement in coverage than sequential scaling, for both R1-Distill-32b and QwQ.
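Coverage here reduces to a simple pass@k check; a sketch under the assumption that answers can be compared by string equality:

```python
# Sketch: coverage (pass@k) = fraction of questions where any candidate matches gold.
def coverage(candidates: dict, gold: dict) -> float:
    """candidates[q] = list of k sampled answers; gold[q] = reference answer."""
    hits = sum(any(a == gold[q] for a in answers) for q, answers in candidates.items())
    return hits / len(candidates)

print(coverage({"q1": ["3", "7", "3"], "q2": ["5", "5"]}, {"q1": "7", "q2": "4"}))  # 0.5
```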
Given the limitations of sequential scaling in current o1-like models, the authors turn to parallel scaling techniques, combine them with their insights on sequential scaling, and propose a new parallel scaling algorithm: Shortest Majority Vote.
Let the number of solutions in the $i$-th category be $c_i$ and the average solution length in that category be $l_i$. The score for category $i$ in Shortest Majority Vote is computed as:

$$s_i = \frac{c_i}{\log l_i}$$

where:
- $s_i$ is the score for category $i$
- $c_i$ is the number of solutions in the $i$-th category
- $l_i$ is the average solution length in the $i$-th category
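A minimal sketch of the method under the score reconstructed above ($s_i = c_i / \log l_i$, with the highest-scoring cluster's answer returned); clustering by exact final-answer match is an assumption:

```python
# Sketch of Shortest Majority Vote: cluster sampled solutions by final answer,
# score each cluster by count / log(avg length), return the top-scoring answer.
import math
from collections import defaultdict
from statistics import mean

def shortest_majority_vote(samples):
    """samples: list of (final_answer, solution_length) pairs."""
    clusters = defaultdict(list)
    for answer, length in samples:
        clusters[answer].append(length)
    # s_i = c_i / log(l_i): more votes and shorter solutions score higher
    scores = {ans: len(lens) / math.log(mean(lens)) for ans, lens in clusters.items()}
    return max(scores, key=scores.get)

# Example: two short solutions agree on "42", one long outlier says "7"
print(shortest_majority_vote([("42", 900), ("42", 1100), ("7", 3000)]))  # -> "42"
```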
The authors evaluate Shortest Majority Vote against Majority Vote on the AIME and GPQA benchmarks, sampling 16 solutions from the QwQ, R1, and LIMO models. The experimental results demonstrate that Shortest Majority Vote significantly outperforms both the Majority Vote and Shortest baselines, particularly on AIME.