When More is Less: Understanding Chain-of-Thought Length in LLMs (2502.07266v3)

Published 11 Feb 2025 in cs.AI, cs.CL, and cs.LG

Abstract: LLMs employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To have a deep understanding of these dynamics, we establish a simple theoretical model that formally proves these phenomena, including the optimal length's scaling laws and the emergence of simplicity bias during RL. Guided by this framework, we demonstrate significant practical benefits from training with optimally-lengthed CoTs and employing length-aware filtering at inference. These findings offer both a principled understanding of the "overthinking" phenomenon and multiple practical guidelines for CoT calibration, enabling LLMs to achieve optimal reasoning performance with adaptive CoTs tailored to task complexity and model capability.

Summary

  • The paper demonstrates that an optimal Chain-of-Thought length exists, beyond which performance declines due to noise accumulation.
  • It employs a theoretical model and empirical tests on synthetic arithmetic and real-world datasets to reveal an inverted U-shaped performance curve.
  • The study demonstrates practical gains from training with optimally-lengthed CoTs and introduces a length-aware inference strategy that outperforms plain majority voting on multi-step tasks.

The paper "When More is Less: Understanding Chain-of-Thought Length in LLMs" explores the nuanced relationship between the length of Chain-of-Thought (CoT) reasoning and the performance of LLMs. CoT involves breaking down complex tasks into smaller, sequential sub-tasks to enhance multi-step reasoning. While traditional approaches suggest that longer CoTs improve performance on challenging tasks, this paper identifies an inflection point where performance peaks and then declines with increasing CoT length.

Key Contributions

  1. Optimal Chain-of-Thought Length:
    • The paper demonstrates that there exists an optimal CoT length, beyond which additional steps degrade performance because noise and errors accumulate over longer reasoning chains.
    • The authors introduce a theoretical model that predicts this optimal CoT length based on model capability and task complexity, presenting a scaling law to describe this relationship.
  2. Empirical Evidence:
    • Experiments on synthetic arithmetic datasets reveal an inverted U-shaped performance curve: accuracy initially improves as CoT length increases but declines once the optimal point is exceeded. For example, on arithmetic tasks, models with different numbers of layers (a proxy for capacity) exhibit this trend consistently.
  3. Theoretical Analysis:
    • The authors formally prove that CoT processes are susceptible to increasing noise with longer chains and provide a mathematical framework to determine the optimal CoT length.
    • A proposition and accompanying theorem establish the existence of a unique CoT length that balances task decomposition against error accumulation (see the toy sketch after this list).
  4. Real-World Validation:
    • Experiments on real-world datasets such as MATH, using various LLMs, reinforce the findings from synthetic data: longer CoTs do not always improve performance, especially for larger models, which handle complexity in fewer steps.
  5. Influence on Model Training and Inference:
    • The paper highlights the significance of selecting optimal CoT lengths for model training and suggests that models trained with optimal CoT data outperform those trained on randomly selected CoT lengths.
    • A novel inference strategy named "Length-filtered Vote" is proposed to select an appropriate CoT length at inference time without extensive per-task or per-model evaluation, outperforming traditional majority voting (a rough sketch of the idea follows this list).
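
The trade-off behind points 1-3 can be illustrated with a toy error-accumulation model. The sketch below uses an assumed functional form, not the paper's exact formulation: splitting a task of a given total difficulty into more steps makes each step easier, but every step also contributes irreducible noise, and the two effects produce an interior optimum.

```python
# Toy sketch (assumed form, not the paper's model) of accuracy vs. CoT length.
# Per-step error = eps0 + c * (difficulty / num_steps), where
#   eps0 is irreducible noise added by every reasoning step, and
#   c is the error per unit of per-step difficulty (smaller c ~ more capable model).
# Errors compound multiplicatively across the chain.

def accuracy(num_steps, difficulty, eps0=0.005, c=0.1):
    per_step_error = min(1.0, eps0 + c * difficulty / num_steps)
    return (1.0 - per_step_error) ** num_steps

def optimal_steps(difficulty, **kw):
    return max(range(1, 200), key=lambda n: accuracy(n, difficulty, **kw))

print([optimal_steps(T) for T in (5, 10, 20)])             # optimum grows with task difficulty
print([optimal_steps(10, c=c) for c in (0.2, 0.1, 0.05)])  # optimum shrinks as capability rises
```

For any fixed difficulty, sweeping num_steps in this toy model traces the inverted U reported in the paper: accuracy rises while decomposition dominates, peaks, then decays as per-step noise accumulates.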

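Below is a rough sketch of what length-aware filtering at inference could look like: bucket sampled CoTs by length, keep the bucket whose answers agree most strongly, and majority-vote inside it. This illustrates the idea only; the paper's actual "Length-filtered Vote" may form and score groups differently.

```python
from collections import Counter

def length_filtered_vote(samples, num_groups=3):
    """Illustrative length-aware voting over sampled CoTs.

    samples: list of (cot_length, answer) pairs from repeated sampling.
    CoTs are bucketed by length; the bucket with the highest internal
    agreement is kept and its majority answer is returned.
    """
    samples = sorted(samples, key=lambda s: s[0])
    group_size = max(1, len(samples) // num_groups)
    groups = [samples[i:i + group_size]
              for i in range(0, len(samples), group_size)]

    def agreement(group):
        counts = Counter(answer for _, answer in group)
        return counts.most_common(1)[0][1] / len(group)

    best = max(groups, key=agreement)
    return Counter(answer for _, answer in best).most_common(1)[0][0]

# Short and mid-length chains agree on one answer; overly long chains drift.
print(length_filtered_vote([(120, "42"), (150, "42"), (300, "42"),
                            (350, "17"), (900, "8"), (950, "13")]))  # -> 42
```
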
Implications

The paper indicates that while longer CoTs can enhance reasoning for complex tasks, an optimal length should be carefully considered to avoid diminishing returns. The results underline the importance of aligning CoT length with specific model capabilities and task requirements. This insight provides a framework for better training and inference strategies in multi-step reasoning tasks, including LLM applications beyond synthetic settings.

This work cautions against assuming that more CoT steps universally lead to better performance and provides a robust theoretical basis for optimizing LLM reasoning capabilities.
