
Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones (2505.21825v1)

Published 27 May 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Inference-time computation has emerged as a promising scaling axis for improving LLM reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of LLMs, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.

Summary

A Formal Examination of Sequential vs. Parallel Inference Scaling in LLMs

The paper, "Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones," addresses a critical problem in inference-time computation for LLMs: the optimal allocation of compute resources. It investigates whether to prioritize sequential scaling, which involves longer chains of thought, or rely on parallel scaling, using methods such as majority voting across multiple short chains of thought.

Background and Motivation

Recent advances in LLMs have shifted attention from the traditional scaling axes, such as model size and training data, toward increasing compute at inference time. This is particularly important for reasoning tasks, where models such as OpenAI's o-series and DeepSeek-R1 have shown impressive performance. A fundamental question nevertheless remains open: how should inference-time compute be allocated?

Core Contributions

The paper introduces a reasoning task based on graph connectivity to demonstrate that sequential scaling can offer exponential advantages over parallel scaling. This claim is supported by both theoretical analysis and empirical validation.
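To make the task concrete, here is a minimal sketch of a graph connectivity instance of the kind the paper studies; the particular edges, vertex labels, and the plain BFS check are illustrative assumptions rather than the paper's construction.

```python
from collections import deque

# Hypothetical connectivity instance: a graph given as an edge list plus the
# query "is vertex s connected to vertex t?". Edges and labels are made up.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6), (6, 7)]
s, t = 0, 4

adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

# Answering requires chaining hops (0 -> 1 -> 2 -> 3 -> 4), i.e. exactly the
# multi-hop reasoning a chain of thought has to spell out step by step.
seen, queue = {s}, deque([s])
while queue:
    u = queue.popleft()
    for v in adj.get(u, []):
        if v not in seen:
            seen.add(v)
            queue.append(v)

print(t in seen)  # True: 0 and 4 are in the same component; 5, 6, 7 are not.
```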

  1. Theoretical Results:
    • In certain settings, sequential scaling with a polynomial-length chain of thought can solve graph connectivity, whereas parallel scaling that aggregates polynomially many short chains of thought cannot. This result rests on known bounds on the expressivity of transformers together with complexity-theoretic assumptions.
    • The authors present a "Vertex Query Model" of computation that abstractly represents transformer operations in multi-hop reasoning tasks. Analysis of this model suggests that certain graph structures are solved most efficiently by scaling computation sequentially.
  2. Empirical Validation:
    • Experiments across a range of LLMs, including models trained from scratch for graph connectivity with different chain-of-thought strategies as well as large reasoning models, illustrate the practical benefits of sequential scaling. In "bridge graph" scenarios, for example, substantial sequential scaling is needed before parallel methods such as majority voting become effective (a toy sketch of this comparison appears after this list).
    • The paper also examines the impact of reinforcement learning (RL) on chain-of-thought generation, finding that RL can adapt the length and strategy of the generated chains of thought and thereby improve performance on reasoning tasks.
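The following toy sketch illustrates the comparison behind the bridge-graph experiments, loosely in the spirit of the vertex query abstraction: each chain of thought is modeled as a breadth-first exploration with a bounded step budget, and a chain that exhausts its budget before settling the query has to guess. The graph family, budgets, and guessing behavior are simplifying assumptions for illustration, not the paper's exact setup.

```python
import random
from collections import deque

def make_path_graph(n):
    """Illustrative hard instance: vertices 0..n on a single path, so deciding
    whether 0 and n are connected takes on the order of n reasoning hops."""
    adj = {i: [] for i in range(n + 1)}
    for i in range(n):
        adj[i].append(i + 1)
        adj[i + 1].append(i)
    return adj

def chain_of_thought(adj, s, t, step_budget):
    """One chain of thought, modeled as a BFS from s that may expand at most
    `step_budget` vertices; if the budget runs out first, it guesses."""
    seen, queue, steps = {s}, deque([s]), 0
    while queue and steps < step_budget:
        u = queue.popleft()
        steps += 1
        if u == t:
            return True
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return random.random() < 0.5  # budget exhausted: uninformed guess

def majority_vote(adj, s, t, k, step_budget):
    """Parallel scaling: k independent short chains aggregated by majority."""
    yes = sum(chain_of_thought(adj, s, t, step_budget) for _ in range(k))
    return yes * 2 > k

n = 200
adj = make_path_graph(n)

# Sequential scaling: one chain with ~n steps settles the query exactly.
print(chain_of_thought(adj, 0, n, step_budget=n + 1))  # True

# Parallel scaling: every short chain runs out of budget and guesses, so the
# vote over 1001 chains remains essentially a coin flip.
print(majority_vote(adj, 0, n, k=1001, step_budget=10))
```

In this simplified model no amount of majority voting over budget-limited chains helps; the paper's actual result is the analogous, more carefully quantified statement that polynomially many short chains of thought cannot match a single polynomial-length chain on hard graph distributions.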

Implications and Future Directions

The findings have significant implications for the development of future AI systems. They suggest that for complex reasoning and graph-based tasks, long chains of thought can offer efficiency and performance advantages. The paper also notes that sequential scaling exhibits diminishing returns past a certain threshold: once a large sequential budget has been spent, the additional benefit of parallel scaling becomes more pronounced.

Theoretically, this research advances our understanding of the computational power of LLMs, particularly for sequential reasoning. Practically, it highlights the need for more nuanced test-time scaling strategies that balance extending a single chain of thought against aggregating diverse outputs.

Future research could further investigate the conditions under which one scaling method may surpass the other across different types of tasks and datasets. It might also explore how LLM architectures can be designed or tuned to leverage the benefits of both sequential and parallel strategies more effectively.

In sum, this paper provides a comprehensive analysis of inference-time computation strategies, offering insights that could guide the development and deployment of more resource-efficient and effective LLMs.