Shorter Thinking Chains in LMs

Updated 23 October 2025
  • Shorter thinking chains are reasoning trajectories that use compressed intermediate steps to balance accuracy and computational efficiency in language models.
  • Techniques such as reinforcement learning with length penalties, supervised fine-tuning, and adaptive mode selection dynamically adjust the chain length to match task complexity.
  • These strategies improve model responsiveness by reducing error accumulation, computational load, and latency, making them ideal for resource-constrained deployments.

Shorter thinking chains are reasoning trajectories in LLMs that accomplish complex tasks using compressed or adaptively concise sequences of intermediate steps, rather than long, verbose chains. Advances in this area seek to optimize the trade-off between accuracy and computational efficiency by curtailing unnecessary reasoning, selectively compressing step-by-step explanations, or dynamically modulating process depth according to problem complexity and model capability.

1. Theoretical Foundations: Optimal Chain Length and Error Accumulation

The assumption that longer chain-of-thought (CoT) sequences always yield better reasoning has been challenged through both empirical and theoretical analysis. Experimental results on synthetic and real-world tasks reveal that task accuracy typically follows an inverted U-shaped curve with increasing CoT length: accuracy initially rises with stepwise decomposition but eventually degrades as the chain grows longer and error accumulates at each step. This dependence is formalized as

$$A(N) = \alpha \left[(1 - E(N, M, T))(1 - \sigma(T))\right]^N,$$

where $A(N)$ is the final accuracy with $N$ steps, $E(\cdot)$ is the subtask error (a function of chain length, model capability $M$, and task complexity $T$), and $\sigma(T)$ is the noise or extraction error for decomposed subtasks. The optimal number of reasoning steps $N^*$ increases with task difficulty but decreases with model capability, exposing a simplicity bias whereby stronger models favor shorter, more efficient reasoning chains (Wu et al., 11 Feb 2025). Training and inference strategies that calibrate CoT length to an optimal value (e.g., length-filtered voting) can thus outperform random or overly verbose reasoning strategies.
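To make the inverted-U behavior concrete, the sketch below plugs illustrative functional forms into the accuracy model above and locates the accuracy-maximizing chain length $N^*$. The specific choices for subtask_error and extraction_noise, as well as the values of $M$ and $T$, are assumptions for demonstration and are not taken from the cited work.

```python
import numpy as np

# Minimal sketch of A(N) = alpha * [(1 - E(N, M, T)) * (1 - sigma(T))]^N.
# The functional forms below are illustrative assumptions: per-step subtask error
# shrinks with finer decomposition (larger N) and stronger models (larger M),
# while each extra step compounds the residual error and extraction noise.

def subtask_error(N, M, T):
    # Assumed form: error falls as the task is split into more steps and as capability grows.
    return np.clip(T / (N * M), 0.0, 1.0)

def extraction_noise(T):
    # Assumed form: fixed per-step noise that grows mildly with task complexity.
    return min(0.02 * T, 0.5)

def accuracy(N, M, T, alpha=1.0):
    E = subtask_error(N, M, T)
    sigma = extraction_noise(T)
    return alpha * ((1 - E) * (1 - sigma)) ** N

if __name__ == "__main__":
    M, T = 5.0, 4.0                      # hypothetical capability / complexity values
    Ns = np.arange(1, 31)
    acc = np.array([accuracy(N, M, T) for N in Ns])
    N_star = Ns[acc.argmax()]
    print(f"optimal chain length N* = {N_star}, accuracy = {acc.max():.3f}")
    # Under these assumed forms, increasing M shifts N* downward (simplicity bias),
    # while increasing T shifts it upward.
```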

Furthermore, for some problems—most notably multi-hop graph connectivity—theoretical analysis has established that long, sequential chains yield exponentially greater expressive power than massively parallel short chains. Specifically, tasks requiring sequential exploration cannot be solved reliably by aggregating polynomially many constant-length outputs, reinforcing the necessity of adaptively scaling chain length to the computational structure of the problem (Mirtaheri et al., 27 May 2025).

2. Compression and Distillation: Techniques for Reducing Chain Length

Methods for producing shorter thinking chains comprise compression, distillation, and selective token skipping, often applied in settings where models would otherwise default to long chains. Several effective strategies are catalogued in recent surveys (Feng et al., 15 Apr 2025, Zhu et al., 13 Jul 2025):

  • Reinforcement Learning with Length Penalties: Training involves explicit reward functions that penalize unnecessary chain length while rewarding correct solutions. Implementations include O1-Pruner, DAST, and THINKPRUNE, which dynamically prune excessive tokens or assign token budgets during inference (a minimal reward sketch follows this list).
  • Supervised Fine-Tuning with Variable-Length CoT Data: Long teacher-generated rationales are post-processed to generate “compressed” versions, using techniques such as TokenSkip (eliminating redundant tokens), C3oT (compressors like GPT-4), and binary search strategies (TALE) to find minimal sufficient explanations.
  • Latent Reasoning: Models may be trained to perform implicit internal computation without emitting explicit step-by-step explanations, effectively compressing reasoning into hidden states and reducing the number of output tokens (Feng et al., 15 Apr 2025).
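As a concrete illustration of the first strategy above, the following sketch shows a generic length-penalized reward. The reference budget ref_tokens and coefficient beta are illustrative assumptions and do not reproduce the exact reward shapes of O1-Pruner, DAST, or THINKPRUNE.

```python
# Minimal sketch of a length-penalized reward of the kind used in RL-based chain shortening.
# Coefficients and the reference token budget are illustrative assumptions.

def length_penalized_reward(is_correct: bool, num_tokens: int,
                            ref_tokens: int = 512, beta: float = 0.2) -> float:
    """Reward correctness, then subtract a penalty proportional to how far the
    generated chain exceeds a reference token budget."""
    correctness = 1.0 if is_correct else 0.0
    excess = max(num_tokens - ref_tokens, 0) / ref_tokens
    return correctness - beta * excess

# A correct 800-token chain scores lower than a correct 400-token one,
# steering the policy toward shorter traces that still solve the task.
print(length_penalized_reward(True, 800))   # 1.0 - 0.2 * (288/512) = 0.8875
print(length_penalized_reward(True, 400))   # 1.0 (within budget)
print(length_penalized_reward(False, 200))  # 0.0 (brevity alone is not rewarded)
```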

For small models, chunk-wise chain-of-thought distillation and skip-thinking training are especially effective. In chunk-wise training (CWT), long rationales are split into semantically coherent chunks, with training focused on only one chunk per iteration to avoid over-smoothing gradients of core reasoning tokens. Skip-thinking training (STT) further enables models to skip non-reasoning chunks during inference, yielding both faster and, often, more accurate output (Chen et al., 24 May 2025).
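A rough sketch of the chunk-wise idea follows; the blank-line chunking heuristic and round-robin schedule are hypothetical simplifications of CWT/STT as described by Chen et al., not their actual implementation.

```python
# Illustrative sketch of chunk-wise CoT distillation (CWT): split a long teacher rationale
# into coherent chunks and supervise only one chunk per training iteration, so gradients
# concentrate on core reasoning tokens rather than being smoothed over the whole trace.

from typing import List

def split_into_chunks(rationale: str) -> List[str]:
    """Naive chunking: treat blank-line-separated segments as semantically coherent chunks."""
    return [c.strip() for c in rationale.split("\n\n") if c.strip()]

def chunk_for_iteration(chunks: List[str], iteration: int) -> str:
    """Round-robin selection of the single chunk supervised at this iteration."""
    return chunks[iteration % len(chunks)]

rationale = "Step 1: restate the problem.\n\nStep 2: set up the equation.\n\nStep 3: solve and check."
chunks = split_into_chunks(rationale)
for it in range(3):
    target = chunk_for_iteration(chunks, it)
    # the training loss would be computed only on `target` tokens (other chunks masked out)
    print(f"iter {it}: supervise -> {target}")
```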

3. Adaptive Reasoning: Dynamic Shortening Based on Problem Difficulty

Shorter thinking chains can be made adaptive to task complexity, yielding models that "think minimally" for simple problems and reserve deep chains for genuinely complex reasoning. This is accomplished through:

  • Difficulty-Aware Distillation: Assigning each problem a scalar difficulty $d(x)$ (e.g., via LLM-based grading) and compressing reasoning length so that $\mathbb{E}[T \mid d(x)] \propto d(x)$. The model is post-trained using supervised fine-tuning and direct preference optimization, preferring short, correct outputs and penalizing excessive verbosity, without altering the model architecture (Waheed et al., 5 Sep 2025); a budgeting sketch follows this list.
  • Explicit Block-Structured Reasoning: The "Think in Blocks" paradigm partitions the reasoning process into discrete blocks, with the model explicitly predicting the required number of blocks according to problem complexity. This allows for user-level or model-level control over the depth of reasoning at inference, reducing reasoning depth for simple tasks and expanding it only as necessary (Zhu et al., 21 Aug 2025).
  • Adaptive Mode Selection via Reinforcement Learning: Combining reward shaping (favoring short or long chains based on the sampled accuracy of short-chain outputs) with logit-based reasoning mode selection loss, models autonomously switch between concise and deep reasoning chains depending on difficulty (Wang et al., 26 May 2025).
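As a minimal illustration of difficulty-aware budgeting, the sketch below maps a difficulty score to a target thinking-token budget and builds a DPO-style preference pair. The grading scale, budget bounds, and helper names are assumptions for illustration, not the cited methods' exact implementations.

```python
# Sketch of difficulty-aware length budgeting: the target thinking-token budget scales
# linearly with a scalar difficulty score d(x) in [0, 1], so E[T | d(x)] is proportional
# to d(x). All constants below are illustrative assumptions.

def token_budget(difficulty: float, min_tokens: int = 64, max_tokens: int = 2048) -> int:
    """Map a difficulty score in [0, 1] to a target chain-length budget."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    return int(min_tokens + difficulty * (max_tokens - min_tokens))

def build_preference_pair(short_correct: str, long_correct: str) -> dict:
    """For DPO-style post-training: prefer the short correct trace over the verbose one."""
    return {"chosen": short_correct, "rejected": long_correct}

print(token_budget(0.1))   # easy problem -> small budget (262 tokens)
print(token_budget(0.9))   # hard problem -> large budget (1849 tokens)
```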

The result of these approaches is models that dynamically minimize computation for simple tasks and allocate resources in proportion to the reasoning required for more complex queries.

4. Serving and Inference-Time Strategies: Efficient Selection and Early Stopping

At inference time, several serving frameworks and generation strategies have been introduced to operationalize shorter thinking chains:

  • Redundant Sampling with Early Stopping: Methods like SART sample more reasoning branches than needed and terminate once a fixed number of sufficiently concise branches are completed. By leveraging order statistics, this approach skews the winning ensemble toward shorter branches, reducing both computation and latency. Dynamic, two-phase pruning further eliminates low-quality, slow branches on the fly, enhancing batch efficiency without accuracy loss (Wang et al., 19 May 2025).
  • Short-m@k Majority Voting: Instead of waiting for all $k$ sampled chains to finish, generation halts once the first $m$ chains have completed, and the answer is selected by majority vote among these, preferring the shortest candidate in the event of a tie. Experiments show accuracy gains of up to 34.5% over using the longest chain, with up to 40% fewer thinking tokens and a 33% wall-time reduction (Hassid et al., 23 May 2025); a voting sketch follows this list.
  • Connector-Aware Compact Chains: Structured use of connector phrases constrains the expansion of reasoning traces, with explicit rules forbidding consecutive connectors or unnecessary validation loops. Compact CoT (CAC-CoT) yields average traces only one-third as long as baselines while maintaining or improving accuracy across both slow ("System-2") and fast ("System-1") cognitive benchmarks (Choi et al., 26 Aug 2025).
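The short-m@k rule is simple enough to sketch directly; the Chain structure and the use of token count as a proxy for completion order are illustrative assumptions rather than details from the cited paper.

```python
# Minimal sketch of short-m@k voting: launch k chains in parallel, keep only the first m
# to finish, majority-vote their answers, and break ties toward the shortest chain.

from collections import Counter
from typing import List, NamedTuple

class Chain(NamedTuple):
    answer: str
    num_tokens: int   # proxy for completion order: shorter chains tend to finish first

def short_m_at_k(chains: List[Chain], m: int) -> str:
    # The first m chains to complete are approximately the m shortest ones.
    finished = sorted(chains, key=lambda c: c.num_tokens)[:m]
    counts = Counter(c.answer for c in finished)
    top = max(counts.values())
    candidates = [c for c in finished if counts[c.answer] == top]
    return min(candidates, key=lambda c: c.num_tokens).answer  # tie-break: shortest chain

chains = [Chain("42", 180), Chain("41", 950), Chain("42", 240), Chain("7", 1200), Chain("41", 300)]
print(short_m_at_k(chains, m=3))  # votes among the 3 shortest chains -> "42"
```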

These strategies improve both throughput and resource efficiency in real-world deployments, especially for applications where response time and memory footprint are operational constraints.

5. The Role of Exemplars, Priors, and Prompt Engineering

The ability to induce appropriately short chains depends on the interplay between in-context exemplars and pretrained model priors. Fine-grained lexical analyses demonstrate that model outputs reflect both the local structure of exemplars and the deep reasoning priors pretrained into the model (Yang et al., 1 Sep 2025). Providing high-quality, task-tailored exemplars steers models toward producing concise and accurate chains, while excessive or noisy exemplars can shift outputs toward instability or excessive verbosity.

Prompt engineering is also effective: prompts such as "Be concise" or "Skip steps," as well as heuristic zero-shot shortcuts (e.g., "Quick Conclude" prompts), can reliably induce short, effective chains or bypass unnecessary intermediate steps altogether (Ding et al., 4 Jun 2024, Zhu et al., 13 Jul 2025).
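A minimal sketch of such prompting follows; the exact wording is only an approximation of the prompts used in the cited studies, and both prompt strings are assumptions.

```python
# Illustrative zero-shot prompts for inducing shorter chains; wording is hypothetical.

CONCISE_SYSTEM_PROMPT = (
    "Solve the problem step by step, but be concise: skip steps that are obvious "
    "and do not re-verify results you have already checked."
)

def build_prompt(question: str, quick_conclude: bool = False) -> str:
    # quick_conclude=True emulates a "Quick Conclude"-style shortcut prompt.
    if quick_conclude:
        return f"{question}\n\nQuickly conclude the answer with minimal intermediate steps."
    return f"{CONCISE_SYSTEM_PROMPT}\n\n{question}"

print(build_prompt("What is 17 * 24?", quick_conclude=True))
```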

6. Practical Impact and Open Challenges

Consistent findings across diverse reasoning tasks show that shorter, well-structured chains substantially reduce computation and latency and, perhaps surprisingly, often improve accuracy, chiefly by curbing overthinking and its associated error accumulation (Wu et al., 11 Feb 2025, Hassid et al., 23 May 2025). Shorter chains are also more interpretable and easier to deploy in environments with strict resource constraints. For some classes of problems, particularly those requiring explicit sequential exploration (e.g., multi-hop graph connectivity), fully compressed chains may be inadequate, so adaptivity is crucial (Mirtaheri et al., 27 May 2025).

Research continues on challenges such as: (1) striking the optimal balance between brevity and completeness for a given model-task pair, (2) preventing loss of critical logical steps when aggressively compressing chains, (3) integrating user preferences for explanation detail, and (4) designing trustworthy, robust strategies for chain selection and length calibration in both training and deployment (Zhu et al., 13 Jul 2025, Cuesta-Ramirez et al., 1 Jul 2025).

7. Methodological Summary Table

| Approach | Core Mechanism | Impact on Chain Length/Accuracy |
| --- | --- | --- |
| RL with Length Penalty | Explicit brevity rewards | Shorter, correct traces |
| SFT/DPO on Variable Chains | Fine-tuning with compressed data | Reduces length, maintains or improves accuracy |
| Short-m@k Voting | Early stopping across parallel chains | Fewer tokens, higher accuracy |
| Difficulty-Aware Distillation | Chain length ∝ task difficulty | Proportional reasoning, lower cost |
| Connector-Aware Prompting | Structure via connectors | Roughly one-third the average token count, high accuracy |
| Adaptive Mode RL | Balances short/long chains via rewards | Automatic mode switching, minimal overthinking |
| Chunk-wise Training | Isolates core steps, skips summary chunks | Faster, more accurate small models |

All evidence converges on the conclusion that shorter thinking chains, when matched to input complexity and model capability through targeted algorithmic interventions, sustain or improve reasoning performance while yielding substantial gains in efficiency, interpretability, and deployability.
