Chain-of-Thought Depth

Updated 23 June 2026

Chain-of-thought depth is a measure of the sequential reasoning steps (explicit, computational, or latent) that models use to decompose and solve complex problems.
It enables serial computation, allowing shallow architectures to simulate deeper reasoning processes while balancing cost, expressivity, and transparency.
Empirical and theoretical analyses reveal optimal depth regimes and trade-offs, guiding adaptive depth selection to avoid overthinking and maximize performance.

Chain-of-thought (CoT) depth quantifies the number of sequential reasoning steps, taken either in explicit natural language or in latent/internal representations, that a large language model (LLM) or related architecture uses to decompose a complex problem. CoT depth determines the expressive power and cost-efficiency of inference, as well as the interpretability and reliability of a model’s output. Theoretical, algorithmic, and empirical analyses identify the regimes where increased depth is beneficial (“serial computation”), where it induces overthinking or saturation, and how to tune or limit it for optimal performance.

1. Formal Definitions of Chain-of-Thought Depth

Depth in chain-of-thought can be precisely formalized in multiple settings:

Explicit CoT (textual reasoning): Given a reasoning trace of length $n$, depth is the number of intermediate steps (sentences or tokens) produced between input and final answer. For tree-structured reasoning, depth $n$ corresponds to the number of levels from the prompt (root) to a leaf (answer), with branching factor (“degree”) $m$, such that $m^n = N$, where $N$ is the total number of candidate answers (Nadgir et al., 10 Apr 2026).
Computational depth augmentation: In constant-depth transformer architectures, each autoregressive CoT step simulates an iteration of serial computation. A transformer of depth $L$ that uses $T$ CoT steps achieves effective depth $L+T$ (Zhang et al., 2024, Li et al., 2024). Formally, CoT depth is the number of rounds an externalized or internalized hidden state is discretized and re-embedded into the sequence.
Latent/continuous CoT: In continuous reasoning, such as in diffusion models or latent CoT frameworks, depth is the number of internal latent state updates, denoted $\mathcal{T}$ or $K$, where each iteration propagates or refines hidden “thought” vectors (Dai et al., 12 Mar 2026, Zhu et al., 18 May 2025, Li et al., 9 Feb 2026).
Composite frameworks: In multi-axis sampling (e.g., Fractured Sampling), depth is an explicit inference parameter (the truncation depth) that determines how far into the reasoning trace the process proceeds before truncation or answer extraction (Liao et al., 19 May 2025).

These definitions unify around the central idea that CoT depth reflects the length of the externally or internally traversed reasoning chain.
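The tree-structured relation $m^n = N$ can be made concrete with a small calculation. The following is a minimal sketch, assuming that when $N$ is not an exact power of $m$ the depth is rounded up to the smallest $n$ with $m^n \ge N$; the function name `tree_depth` is hypothetical and not taken from the cited work.

```python
def tree_depth(num_candidates: int, degree: int) -> int:
    """Smallest depth n such that degree**n >= num_candidates."""
    if num_candidates < 1 or degree < 2:
        raise ValueError("need num_candidates >= 1 and degree >= 2")
    depth, reachable = 0, 1
    # Each extra level multiplies the number of reachable leaves by the degree.
    while reachable < num_candidates:
        reachable *= degree
        depth += 1
    return depth


# 1024 candidate answers with 4-way branching per step: 4**5 == 1024.
print(tree_depth(1024, 4))   # 5
# 1000 candidate answers with 10-way branching per step: 10**3 == 1000.
print(tree_depth(1000, 10))  # 3
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers are not affected by rounding.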

2. Depth, Serial Computation, and Expressive Power

Chain-of-thought depth directly enables models to solve tasks with high serial (step-wise) computational requirements even when the underlying architecture is shallow:

From AC⁰/TC⁰ to P/poly: Without CoT, constant-depth, constant-precision transformers can only compute functions in $\mathsf{AC}^0$ (or $\mathsf{TC}^0$ with logarithmic precision). Introducing $T$ CoT steps enables the model to simulate size-$T$ Boolean circuits, increasing computational expressivity up to $\mathsf{P/poly}$ when $T$ is polynomial in the input length, provided the step template at each CoT round correctly exposes the relevant latent state (Li et al., 2024, Zhang et al., 2024).
Necessity versus sufficiency of depth: For compositional reasoning questions (CRQs), shallow (e.g., depth-2) transformers require a number of CoT steps that grows with the instance size $n$, matching the inherent serial structure of, for example, tree-like Boolean formulae. With too few steps, the model provably fails under standard hardness assumptions (Yehudai et al., 3 Mar 2025).
Trade-off between depth and CoT length: Effective reasoning depth can be supplied either by increased architectural depth or by extending the CoT chain; there is no shortcut when solving inherently serial or hierarchical problems (Li et al., 2024, Yehudai et al., 3 Mar 2025).
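The idea that each CoT step contributes one unit of serial computation can be illustrated with a toy circuit evaluator. This is only an analogy for the bookkeeping, assuming a hypothetical gate encoding `(op, i, j)`; it is not the transformer construction from the cited papers. Each loop iteration plays the role of one CoT step: it reads earlier positions in the sequence and appends one new value that later steps can read.

```python
from typing import List, Tuple

# A gate is (op, i, j): op is "AND", "OR", or "NOT"; i and j index earlier
# positions in the sequence (inputs first, then previously emitted values).
Gate = Tuple[str, int, int]


def run_circuit_as_cot(inputs: List[bool], gates: List[Gate]) -> List[bool]:
    """Evaluate a circuit by appending one gate value per 'CoT step'."""
    sequence = list(inputs)  # plays the role of the prompt
    for op, i, j in gates:   # one serial step per gate
        a, b = sequence[i], sequence[j]
        if op == "AND":
            value = a and b
        elif op == "OR":
            value = a or b
        elif op == "NOT":
            value = not a    # j is ignored for NOT
        else:
            raise ValueError(f"unknown gate {op!r}")
        sequence.append(value)  # later steps can read this value
    return sequence


# XOR(x0, x1) = (x0 OR x1) AND NOT (x0 AND x1), written as a size-4 circuit.
xor_gates = [("OR", 0, 1), ("AND", 0, 1), ("NOT", 3, 3), ("AND", 2, 4)]
trace = run_circuit_as_cot([True, False], xor_gates)
print(len(trace) - 2, trace[-1])  # 4 True
```

A circuit with $T$ gates takes exactly $T$ appended values here, which mirrors the statement that $T$ CoT steps suffice to simulate a size-$T$ circuit.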

3. Theoretical Scaling Laws and Optimal Depth

The efficiency and accuracy of CoT reasoning scale nonlinearly with depth in both explicit and latent regimes:

Error scaling law: For a classification task with $N$ candidates, splitting the task into $n$ sequential subtasks, each of degree $m$ (with $m^n = N$), reduces error at a rate that depends on the intrinsic input dimension and the data count (Nadgir et al., 10 Apr 2026). There exists a critical degree below which more steps are detrimental (overthinking), and an optimal depth at which total error is minimized.
Phase transitions in depth: In solvable high-dimensional regression, theoretical analyses show sharp phase transitions: for low-quality pretraining or limited in-context data, increasing depth first reduces error (exponential decay), then yields polynomial improvements, but eventually leads to saturation or error amplification (“overthinking”). The optimal reasoning depth grows rapidly near the phase boundary, whereas additional depth becomes harmful in the overthinking regime (Takanami et al., 2 Jun 2026).
Fractured/incomplete CoT: Empirical evidence shows that truncating CoT traces to modest depths can achieve near-maximal accuracy, especially when the marginal benefit of further steps is low. Beyond a task-dependent depth, additional computation brings only diminishing or even negative returns (Liao et al., 19 May 2025, Cao et al., 28 Feb 2026).
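One simple way to act on diminishing returns is to measure accuracy at several truncation depths and keep the smallest depth after which the next step adds little. The sketch below is a hedged illustration, not a procedure from the cited papers: the function name `saturation_depth`, the `min_gain` threshold, and the accuracy values are all made up for the example.

```python
from typing import Sequence


def saturation_depth(accuracy_by_depth: Sequence[float], min_gain: float = 0.01) -> int:
    """Return the first depth (1-indexed) after which one more step adds less than min_gain."""
    if not accuracy_by_depth:
        raise ValueError("need at least one measurement")
    for depth in range(1, len(accuracy_by_depth)):
        # Marginal gain from going one step beyond `depth`.
        gain = accuracy_by_depth[depth] - accuracy_by_depth[depth - 1]
        if gain < min_gain:
            return depth
    # Every additional step still helped by at least min_gain.
    return len(accuracy_by_depth)


# Illustrative (invented) validation accuracies at truncation depths 1..6.
curve = [0.42, 0.61, 0.70, 0.705, 0.70, 0.68]
print(saturation_depth(curve))  # 3
```

In this invented curve, accuracy declines after depth 4, so a negative gain also triggers the stop; this covers both the saturation and the overthinking case with a single threshold.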

4. Depth Adaptation, Budgeting, and Efficiency

Practical LLM systems must allocate depth adaptively:

Difficulty-aware reasoning: By explicitly correlating CoT depth to task difficulty (e.g. via automatic grading), models “think proportionally,” generating longer, multi-step chains for hard problems and minimal chains for easy ones. Empirically, proportional depth reduces token counts by 10–30% while maintaining or improving accuracy (Waheed et al., 5 Sep 2025).
Draft-style, dynamic, and instance-adaptive CoT: Efficient reasoning can be learned through curriculum learning with progressively increased maximum generation lengths, or directly through dynamic thresholding, where each step is valued by an importance or advantage score and reasoning halts once marginal utility falls below a threshold (Cao et al., 28 Feb 2026, Wang, 7 Feb 2025). This adaptivity strictly outperforms rigid, fixed-depth CoT on computational and time metrics.
Practical budgeting: Under small inference budgets (tokens/latency), depth allocation dominates accuracy gains versus other axes (trajectories, answer samples). Early stopping and truncation, especially when guided by output consistency, maximally conserve resources with minimal accuracy loss (Liao et al., 19 May 2025).
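Early stopping guided by output consistency can be sketched as follows. The assumptions are loud ones: a provisional answer is available after every reasoning step, and agreement over `patience` consecutive steps is used as the halting signal. The function `stop_when_consistent` and its parameters are hypothetical and do not come from the cited work.

```python
from typing import Iterable, Optional, Tuple


def stop_when_consistent(
    answers: Iterable[str], patience: int = 3, max_steps: int = 32
) -> Tuple[Optional[str], int]:
    """Consume provisional answers one step at a time; halt once the same
    answer has appeared `patience` times in a row or max_steps is reached."""
    last: Optional[str] = None
    streak = 0
    steps = 0
    for answer in answers:
        steps += 1
        if answer == last:
            streak += 1
        else:
            last, streak = answer, 1
        if streak >= patience or steps >= max_steps:
            break
    return last, steps


# Invented provisional answers extracted after each reasoning step.
provisional = ["12", "15", "14", "14", "14", "14", "14"]
print(stop_when_consistent(provisional))  # ('14', 5)
```

Passing a lazy generator instead of a list means the remaining steps are never computed once the loop breaks, which is where the token savings would come from.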

5. Latent Chain-of-Thought: Internal Depth and Causal Structure

Latent (or continuous) CoT approaches replace explicit text steps with a sequence of internal updates. The role and effect of depth in such settings exhibit non-trivial properties:

Direct correspondence to task complexity: For graph reachability, a two-layer transformer with $D$ steps of continuous CoT can solve the problem for any diameter-$D$ graph, as each latent update propagates a superpositional search frontier (Zhu et al., 18 May 2025).
Influence propagation and representational routing: Unlike explicit CoT, where influence flows locally step by step, latent CoT exhibits nonlocal information routing, with early latent steps influencing final outputs via strong skip connections. Causal interventions indicate that “staged” rather than uniformly-dialed depth is effective; often only specific steps are causally necessary (Li et al., 9 Feb 2026).
Empirical guideline: As few as 3–4 latent steps may suffice for >80% accuracy on commonsense tasks, while multi-step math problems require the full latent budget (e.g., 6–8 steps). Monitoring early answer decodability and stepwise flip rates can inform dynamic allocation (Li et al., 9 Feb 2026).
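The link between latent depth and graph diameter can be illustrated with a discrete analogue of frontier propagation. In this sketch a Python set stands in for the superposed latent state and each loop iteration stands in for one latent update; it is an assumption-laden illustration, not the construction of Zhu et al. The names `reachable_within` and `Graph` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set

Graph = Dict[Hashable, Iterable[Hashable]]


def reachable_within(graph: Graph, source: Hashable, target: Hashable, steps: int) -> bool:
    """Expand the whole search frontier `steps` times, then test the target."""
    frontier: Set[Hashable] = {source}
    for _ in range(steps):
        expanded = set(frontier)
        for node in frontier:
            # Follow every outgoing edge of every frontier node at once.
            expanded.update(graph.get(node, ()))
        if expanded == frontier:
            break  # nothing new was reached; more steps cannot help
        frontier = expanded
    return target in frontier


# A directed chain a -> b -> c -> d: reaching d from a needs three updates.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(reachable_within(chain, "a", "d", steps=2))  # False
print(reachable_within(chain, "a", "d", steps=3))  # True
```

With fewer updates than the distance to the target, the frontier has not yet reached it, which is the discrete counterpart of needing enough latent steps to cover the graph's diameter.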

6. Architectural Considerations: Opaque Serial Depth and Model Constraints

Architectural restrictions impose hard limits on the unobservable (“opaque”) serial computation depth achievable without externalized CoT:

Opaque Serial Depth (OSD): OSD is the circuit-theoretic maximum length of serial computation allowed between explicit, interpretable outputs (tokens). For transformer models, OSD sets an upper bound (e.g., OSD ≤ 4490 for Gemma 3 1B at maximum sequence length) beyond which all reasoning must be surfaced as chain-of-thought (Brown-Cohen et al., 10 Mar 2026). Raising OSD via recursion, black-box memory, or latent CoT increases capacity but sacrifices transparency.
Trade-off between efficiency and interpretability: Deep explicit CoT externalizes all serial steps but is costly; latent reasoning (e.g., via layer recurrence or continuous diffusion) is computationally cheaper per token but often harder to probe and interpret (Lu et al., 2 Jul 2025, Dai et al., 12 Mar 2026). Empirical analyses on recurrent transformers indicate that increased recurrence depth yields only modest gains and does not recover the stepwise interpretability of explicit CoT (Lu et al., 2 Jul 2025).

7. Practical Recommendations for Depth Selection

Optimal CoT depth is task- and architecture-dependent but can be tuned using the following principles:

Estimate the task’s intrinsic dimension and select a subtask branching factor $m$ near the critical degree; choose depth $n$ so that $m^n = N$ (Nadgir et al., 10 Apr 2026).
Adapt depth to difficulty: Use problem-adaptive or instance-adaptive prompting, reinforced dynamic thresholds, or difficulty-aware distillation pipelines (Waheed et al., 5 Sep 2025, Cao et al., 28 Feb 2026, Wang, 7 Feb 2025).
Favor shallow, high-quality CoT over deep, unselective reasoning for easy or homogeneous tasks; allocate greater depth for multi-step or serially-hard tasks (Takanami et al., 2 Jun 2026, Li et al., 9 Feb 2026).
Avoid overthinking: Monitor for diminishing returns or explicit overfitting/saturation regimes in high-depth tracing and prefer early stopping and truncation strategies (Takanami et al., 2 Jun 2026, Liao et al., 19 May 2025).
Incorporate supervision for step templates when prompt-space search is intractable or for algorithmic challenges, ensuring that each step meaningfully deepens the reasoning chain (Zhang et al., 2024).

These guidelines, derived from rigorous theory and broad empirical analysis, enable practitioners to tune chain-of-thought depth for maximal accuracy, efficiency, and interpretability.