Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End

Published 13 Apr 2026 in cs.LG | (2604.12013v1)

Abstract: Modern LLMs generate text autoregressively, producing tokens one at a time. To study the learnability of such systems, Joshi et al. (COLT 2025) introduced a PAC-learning framework for next-token generators, the primitive underlying autoregressive models. In this framework, an unknown next-token generator maps a sequence of tokens to the next token and is iteratively applied for $T$ steps, producing a chain of tokens whose final token constitutes the model's output. The learning task is to learn the input-output mapping induced by this autoregressive process. Depending on the available supervision, training examples may reveal only the final output (End-to-End supervision) or the entire generated chain (Chain-of-Thought supervision). This raises two natural questions: how the sample complexity depends on the generation length $T$, and how much Chain-of-Thought supervision can reduce this dependence. In this work we give a nearly complete answer to both questions by uncovering a taxonomy of how the sample complexity scales with $T$. For End-to-End learning, we show that the landscape is remarkably rich: subject to mild conditions, essentially any growth rate $r(T)$ between constant and linear can arise as the sample complexity, and combined with the linear upper bound of Joshi et al., this yields an essentially complete characterization. In contrast, under Chain-of-Thought supervision we show that the sample complexity is independent of $T$, demonstrating that access to intermediate reasoning steps can eliminate the dependence on the generation length altogether. Our analysis introduces new combinatorial tools, and as corollaries we resolve several open questions posed by Joshi et al. regarding the dependence of learnability on the generation length and the role of Chain-of-Thought supervision.

Summary

The paper establishes that for finite VC dimension, chain-of-thought supervision achieves sample complexity independent of the reasoning chain length T.
It demonstrates that end-to-end supervision exhibits a full spectrum of scaling behaviors—from constant to linear—depending on the base class complexity.
A key contribution is proving that no universal combinatorial dimension can capture sublinear E2E rates, while introducing the autoregressive tree dimension for logarithmic dependence.

Introduction and Theoretical Context

This work analyzes the PAC sample complexity of learning autoregressive sequence generators under two supervision paradigms: End-to-End (E2E) learning, where only the final output is observed, and Chain-of-Thought (CoT) learning, where the full reasoning sequence is revealed during training. The framework is formalized using next-token generators, with an emphasis on binary output spaces (though the theory extends to finite alphabets). The central question is how the growth of sample complexity with the autoregressive chain length $T$ varies across supervision regimes.

The study advances the theoretical understanding established in prior work [e.g., Joshi et al., COLT 2025], resolving several open questions related to the scaling laws of sample complexity in these settings.

Formal Model

Let $\Sigma$ be a finite alphabet, and $f\colon \Sigma^\star \to \Sigma$ a next-token generator. The generator is unrolled for $T$ autoregressive steps, producing a sequence (interpreted as a chain of thought) and a final token (the output). Two task-induced function classes of interest are:

$\mathcal{F}^{\mathsf{CoT},T}$: maps $x \mapsto$ the full $T$-step chain, i.e., the sequence of intermediate tokens.
$\mathcal{F}^{\mathsf{e2e},T}$: maps $x \mapsto$ the final token after $T$ autoregressive steps.
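
Written out, the two maps can be sketched as follows. The symbols $z_t$ (the $t$-th generated token) and $\mathcal{F}$ (the base class of next-token generators) are notation assumed here for illustration, as is the convention that the chain excludes the prompt $x$:

$$z_t = f(x\, z_1 \cdots z_{t-1}) \quad (t = 1, \dots, T), \qquad f^{\mathsf{CoT},T}(x) = (z_1, \dots, z_T), \qquad f^{\mathsf{e2e},T}(x) = z_T,$$

$$\mathcal{F}^{\mathsf{CoT},T} = \{ f^{\mathsf{CoT},T} : f \in \mathcal{F} \}, \qquad \mathcal{F}^{\mathsf{e2e},T} = \{ f^{\mathsf{e2e},T} : f \in \mathcal{F} \}.$$

Under this reading, $f^{\mathsf{e2e},T}(x)$ is simply the last coordinate of $f^{\mathsf{CoT},T}(x)$, so any hypothesis that matches a CoT label also matches the corresponding E2E label.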

In both learning protocols, examples are drawn i.i.d. from an unknown distribution over prompts. The signal given to the learner differs depending on the supervision:

End-to-End: Only the prompt $x$ and the final token are observed per training sample.
Chain-of-Thought: The entire $T$-step chain is revealed alongside $x$, i.e., all intermediate tokens.
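
The difference between the two signals can be made concrete with a minimal Python sketch. Everything in it is an illustrative assumption rather than the paper's construction: the toy base class (`make_generator`, a parity of the last few tokens selected by a 0/1 mask), the helper names, and the choice of prompts. It labels the same prompts under both protocols and lists which hypotheses remain consistent with each kind of sample.

```python
from itertools import product

def unroll(f, prompt, T):
    """Apply next-token generator f autoregressively for T steps.
    Returns the T generated tokens (the chain), excluding the prompt."""
    seq = tuple(prompt)
    for _ in range(T):
        seq = seq + (f(seq),)
    return seq[len(prompt):]

def make_generator(w):
    """Hypothetical toy generator: parity of the most recent tokens selected
    by the 0/1 mask w (w[0] weights the last token, w[1] the one before, ...)."""
    def f(seq):
        return sum(wi * tok for wi, tok in zip(w, reversed(seq))) % 2
    return f

def all_hypotheses(k):
    """Finite base class: one generator per mask in {0,1}^k, keyed by its mask."""
    return {w: make_generator(w) for w in product((0, 1), repeat=k)}

def version_space(hyps, sample, T, mode):
    """Masks of hypotheses consistent with the sample.
    mode='cot': labels are full chains; mode='e2e': labels are final tokens."""
    consistent = []
    for w, h in hyps.items():
        def predict(x):
            chain = unroll(h, x, T)
            return chain if mode == "cot" else chain[-1]
        if all(predict(x) == y for x, y in sample):
            consistent.append(w)
    return consistent

# Label the same prompts under both supervision protocols.
T = 4
hyps = all_hypotheses(3)
target = hyps[(1, 0, 1)]
prompts = [(1, 0, 0), (0, 1, 1)]
cot_sample = [(x, unroll(target, x, T)) for x in prompts]
e2e_sample = [(x, chain[-1]) for x, chain in cot_sample]

print("consistent under CoT:", version_space(hyps, cot_sample, T, "cot"))
print("consistent under E2E:", version_space(hyps, e2e_sample, T, "e2e"))
```

Because the final token is the last entry of the chain, every hypothesis consistent with the CoT sample is also consistent with the E2E sample, so CoT supervision can only shrink the set of surviving hypotheses.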

The key quantity is the PAC sample complexity as a function of $T$ and the base class complexity (VC dimension).

Main Results and Technical Contributions

1. Sample Complexity with Chain-of-Thought Supervision

Core Theorem: For any base class $\mathcal{F}$ of finite VC dimension, the sample complexity for achieving accuracy $\varepsilon$ and confidence $1 - \delta$ in the Chain-of-Thought regime is independent of the chain length $T$.

Formally, the bound is expressed in terms of $\varepsilon$, $\delta$, the VC dimension of $\mathcal{F}$, and its dual VC dimension; $T$ does not appear in it.

This result improves on the previous state of the art [Joshi et al.], whose bound depended logarithmically on $T$, and shows that full CoT supervision removes all dependence on reasoning chain length when the base class has finite VC dimension.

The improvement comes from using a compression-based multiclass reduction in place of a naive consistent learner. As a consequence, for broad classes of practical interest (e.g., linear predictors), the scaling of CoT sample complexity with $T$ is optimal: it is constant.
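
For intuition on why a compression argument can avoid any dependence on $T$, recall the classical Littlestone-Warmuth bound for sample compression schemes. It is stated here as standard background in its realizable form, not as the paper's theorem, and the symbols $m$ (sample size), $k$ (compression size) and $h$ (the learner's output) are notation assumed for this sketch. If the output hypothesis can always be reconstructed from a subset of exactly $k$ of the $m$ training examples, with no side information, and is consistent with the whole sample, then with probability at least $1 - \delta$:

$$\mathrm{err}(h) \;\le\; \frac{1}{m - k}\left( \ln \binom{m}{k} + \ln \frac{1}{\delta} \right).$$

The size of the label space does not appear in this bound, so it is indifferent to whether a label is a single token or a length-$T$ chain. The chain length can only enter through the compression size $k$; how the paper controls $k$ in terms of the base class is not reproduced here.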

2. Sample Complexity for End-to-End Supervision: Taxonomy of Possible Growth Rates

In the E2E regime, the sample complexity is characterized via the VC dimension of $\mathcal{F}^{\mathsf{e2e},T}$. The main contribution here is a fine-grained taxonomy:

For every monotone subadditive function $r$ (e.g., polylogarithmic, $\sqrt{T}$, $T^{\alpha}$ with $\alpha < 1$, or linear), there exists a class with E2E sample complexity that scales as $r(T)$.

This means that, within the domain of PAC-learnable base classes, all intermediate scaling behaviors between constant and linear in $T$ are attainable. No dichotomy (e.g., logarithmic vs. linear) exists; the sample complexity landscape is provably rich.
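
To make the governing quantity concrete, the following Python sketch computes the VC dimension of the induced E2E class by brute force for a tiny base class. The base class (parities of recent tokens selected by a 0/1 mask), the prompt domain, and all function names are assumptions chosen for illustration; this toy class does not reproduce the paper's constructions or any particular growth rate.

```python
from itertools import combinations, product

def unroll_final(f, prompt, T):
    """Final token after applying generator f autoregressively for T steps."""
    seq = tuple(prompt)
    for _ in range(T):
        seq = seq + (f(seq),)
    return seq[-1]

def make_generator(w):
    """Hypothetical toy generator: parity of recent tokens selected by mask w
    (w[0] weights the last token, w[1] the one before it, and so on)."""
    def f(seq):
        return sum(wi * tok for wi, tok in zip(w, reversed(seq))) % 2
    return f

def vc_dimension(functions, domain):
    """Brute-force VC dimension of a finite class of {0,1}-valued functions."""
    # Each function is summarized by its truth table over the domain.
    tables = {tuple(fn(x) for x in domain) for fn in functions}
    best = 0
    for d in range(1, len(domain) + 1):
        if 2 ** d > len(tables):
            break  # too few distinct behaviours to shatter d points
        shattered = any(
            len({tuple(t[i] for i in idx) for t in tables}) == 2 ** d
            for idx in combinations(range(len(domain)), d)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so stop here
        best = d
    return best

def e2e_vc(k, T):
    """VC dimension of the E2E class induced by all 2^k mask generators,
    with prompts ranging over {0,1}^k."""
    domain = list(product((0, 1), repeat=k))
    gens = [make_generator(w) for w in product((0, 1), repeat=k)]
    functions = [lambda x, g=g: unroll_final(g, x, T) for g in gens]
    return vc_dimension(functions, domain)

for T in range(1, 6):
    print(f"T={T}: VC dimension of the toy E2E class = {e2e_vc(3, T)}")
```

For $T = 1$ the induced class is the set of all linear functionals over GF(2) on three bits, whose VC dimension is 3. For larger $T$ the value is whatever the enumeration reports, bounded above by $\log_2$ of the class size. The brute-force search is exponential in the domain size and is only meant for toy cases like this one.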

Implication

The empirical and theoretical literature has often highlighted linear dependence as canonical; this work demonstrates that in fact, a continuum of rates is possible for E2E learning, given appropriate construction of base classes.

3. Impossibility of a Complete Dimension Theoretic Characterization for Sublinear Rates

This richness raises a natural question: is there a combinatorial dimension that precisely separates sublinear from linear sample complexity growth in E2E learning? The paper shows, via a diagonalization argument, that the answer is negative.

No combinatorial dimension (subject to mild criteria) can characterize sublinear E2E sample complexity behavior, nor provide tight quantitative (rate-based) upper bounds within the sublinear regime.

This resolves a principal open question and formalizes the limits of dimension-based sample complexity theory in the autoregressive E2E setting.

4. Sufficient Condition for Logarithmic E2E Sample Complexity

Nonetheless, the authors provide a sufficient condition for logarithmic dependence on $T$ that is strictly broader than finite Littlestone dimension, based on a newly introduced "autoregressive tree dimension" (ATdim). If the induced prefix tree over generations admits no large leveled perfect binary subtree, E2E sample complexity grows only logarithmically.

This condition can hold in cases where Littlestone dimension diverges, thus expanding the known class of "efficiently learnable" autoregressive generator families with sublinear scaling.

5. Pathologies and Limiting Results for Infinite VC Base Classes

For base classes with infinite VC dimension, the work constructs pathological examples. In particular, for some classes, learnability and sample complexity can alternate (e.g., be zero for all even $T$, infinite for all odd $T$), ruling out any regular characterization without base class complexity constraints.

Methodological Insights

The analysis makes key use of:

Explicit combinatorial constructions to "engineer" desired VC-growth behavior
Reductions to multiclass learning and application of sample compression bounds
Tree-based combinatorial arguments for capturing scaling in the realized sequence space, allowing improved bounds in favorable regimes
Tight lower bounds leveraging shattering arguments and reductions to classical learning hardness

Implications and Directions for Future Work

Practical Implications

Full Chain-of-Thought supervision removes the statistical cost of long chains for finite-VC generator classes: increasing the chain length $T$ does not increase the number of samples required.
In contrast, End-to-End supervision reveals only the final token, and this information bottleneck can make the sample complexity grow with $T$, up to linearly for finite-VC classes, depending on the structure of the underlying generator class.
For practical model design, the result underlines the critical benefit of fine-grained intermediate supervision and motivates dataset annotation protocols that permit this.

Theoretical Implications

The fine taxonomy of E2E complexity rates and non-existence of a universal dimension-theoretic characterization have consequences for both structural learning theory and the development of complexity measures in sequential/recursive contexts.
Tree-based combinatorial conditions (such as ATdim) can be used to classify new, practically relevant model classes that could admit efficient E2E learning despite lack of finite Littlestone dimension.

Open Questions

Extension to infinite alphabets and multilabel output spaces (e.g., in proof synthesis or program induction contexts) is non-trivial, and classical symmetries of PAC learning fail.
Relaxed base class assumptions (notably, dropping finite VC) are shown to admit pathological behaviors; thus, understanding necessary conditions for non-pathological sample complexity remains open.
Improved explicit constants and sharp dependence on the learning and class parameters in CoT regimes for wide classes of concept classes (beyond stable compression) are of practical interest.

Conclusion

This work provides a theoretical classification of the sample complexity landscape for learning autoregressive next-token generators under different supervision paradigms. For finite-VC classes, Chain-of-Thought supervision completely eliminates chain-length dependence, while End-to-End supervision admits a spectrum of possible sample complexity growth behaviors, with no universal dimension-theoretic governing principle. The autoregressive tree dimension additionally identifies conditions under which E2E learning remains efficient. The analysis informs both practical algorithm design and annotation strategies for statistically efficient learning of autoregressive models.
