- The paper establishes that, for base classes of finite VC dimension, chain-of-thought supervision achieves sample complexity independent of the reasoning chain length T.
- It demonstrates that end-to-end supervision exhibits a full spectrum of scaling behaviors—from constant to linear—depending on the base class complexity.
- A key contribution is proving that no universal combinatorial dimension can capture sublinear E2E rates, while introducing the autoregressive tree dimension for logarithmic dependence.
Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End
Introduction and Theoretical Context
This work rigorously analyzes the PAC sample complexity of learning autoregressive sequence generators under two supervision paradigms: End-to-End (E2E) learning, where only the final output is observed, and Chain-of-Thought (CoT) learning, where the full reasoning sequence is revealed during training. The framework is formalized in terms of next-token generators, with an emphasis on binary output spaces (though the theory extends to finite alphabets). The central question is how the sample complexity grows with the autoregressive chain length T under each supervision regime.
The study advances the theoretical understanding established in prior work [e.g., Joshi et al., COLT 2025], resolving several open questions related to the scaling laws of sample complexity in these settings.
Let Σ be a finite alphabet, and f:Σ⋆→Σ a next-token generator. The generator is unrolled for T autoregressive steps, producing a sequence (interpreted as a chain of thought) and a final token (the output). Two task-induced function classes of interest are:
- F_{CoT,T}: maps a prompt x to the full T-step chain, i.e., the sequence of intermediate tokens.
- F_{e2e,T}: maps a prompt x to the final token after T autoregressive steps.
In both learning protocols, examples are drawn i.i.d. from an unknown distribution over prompts. The signal given to the learner differs depending on the supervision:
- End-to-End: Only the prompt x and the final output token are observed per training sample.
- Chain-of-Thought: The entire generated sequence is revealed, i.e., all intermediate tokens.
The key quantity is the PAC sample complexity as a function of the chain length T and the base class complexity (VC dimension).
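The two supervision signals can be made concrete with a minimal Python sketch. It assumes a binary alphabet and the convention that the T-step chain consists of all T generated tokens, the last of which is the output. The generator `xor_last_two` and the helper names are hypothetical, chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Token = int  # assumption: binary alphabet {0, 1}
Generator = Callable[[Sequence[Token]], Token]


def unroll(f: Generator, prompt: Sequence[Token], T: int) -> List[Token]:
    """Apply the next-token generator f autoregressively for T steps.

    Returns the T generated tokens; the last one is the final output.
    """
    context = list(prompt)
    chain: List[Token] = []
    for _ in range(T):
        token = f(context)      # next token depends on the whole context so far
        chain.append(token)
        context.append(token)   # feed the token back in
    return chain


def cot_example(f: Generator, prompt: Sequence[Token], T: int) -> Tuple[tuple, tuple]:
    """Chain-of-Thought training example: prompt plus the full generated chain."""
    return tuple(prompt), tuple(unroll(f, prompt, T))


def e2e_example(f: Generator, prompt: Sequence[Token], T: int) -> Tuple[tuple, Token]:
    """End-to-End training example: prompt plus only the final token."""
    return tuple(prompt), unroll(f, prompt, T)[-1]


def xor_last_two(context: Sequence[Token]) -> Token:
    """Hypothetical generator: XOR of the two most recent tokens."""
    return context[-1] ^ context[-2]


if __name__ == "__main__":
    print(cot_example(xor_last_two, [1, 0], 4))  # prompt with the full chain
    print(e2e_example(xor_last_two, [1, 0], 4))  # prompt with the final token only
```

Both examples come from the same unrolled computation; the E2E learner simply sees less of it.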
Main Results and Technical Contributions
1. Sample Complexity with Chain-of-Thought Supervision
Core Theorem: For any base class F of finite VC dimension, the sample complexity for achieving accuracy ε and confidence δ in the Chain-of-Thought regime is independent of the chain length T.
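The bound is stated in terms of the accuracy and confidence parameters and combinatorial parameters of the base class, including its dual VC dimension.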
This result refines the previous state of the art, which established only a logarithmic dependence on T [Joshi et al.], and shows that full CoT supervision removes all dependence on the reasoning chain length when the base class has finite VC dimension.
The improvement comes from a compression-based reduction to multiclass learning rather than from a naive consistent learner. Consequently, for broad classes of practical interest (e.g., linear predictors), the CoT sample complexity carries no dependence on T.
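To see why full chains carry more information than final tokens, the following sketch compares the hypotheses that remain consistent with a single training example under each form of supervision. It illustrates only the naive consistent-learner view, not the compression-based reduction used in the paper. The five-element class `BASE_CLASS` is a hypothetical toy class over a binary alphabet.

```python
from typing import Callable, Dict, List, Sequence, Set

Generator = Callable[[Sequence[int]], int]


def unroll(f: Generator, prompt: Sequence[int], T: int) -> List[int]:
    """Generate T tokens autoregressively from the prompt."""
    context = list(prompt)
    chain = []
    for _ in range(T):
        token = f(context)
        chain.append(token)
        context.append(token)
    return chain


# Hypothetical toy base class of next-token generators (illustration only).
BASE_CLASS: Dict[str, Generator] = {
    "copy_last": lambda c: c[-1],
    "copy_second_last": lambda c: c[-2],
    "xor_last_two": lambda c: c[-1] ^ c[-2],
    "and_last_two": lambda c: c[-1] & c[-2],
    "or_last_two": lambda c: c[-1] | c[-2],
}


def consistent_cot(target: Generator, prompts: List[Sequence[int]], T: int) -> Set[str]:
    """Hypotheses that reproduce the target's full chain on every prompt."""
    return {
        name for name, f in BASE_CLASS.items()
        if all(unroll(f, p, T) == unroll(target, p, T) for p in prompts)
    }


def consistent_e2e(target: Generator, prompts: List[Sequence[int]], T: int) -> Set[str]:
    """Hypotheses that reproduce only the target's final token on every prompt."""
    return {
        name for name, f in BASE_CLASS.items()
        if all(unroll(f, p, T)[-1] == unroll(target, p, T)[-1] for p in prompts)
    }


if __name__ == "__main__":
    target = BASE_CLASS["xor_last_two"]
    prompts = [(1, 0)]
    print("CoT-consistent:", sorted(consistent_cot(target, prompts, T=3)))
    print("E2E-consistent:", sorted(consistent_e2e(target, prompts, T=3)))
```

With the single prompt (1, 0) and T = 3, only `xor_last_two` matches the full chain, while `copy_last` and `and_last_two` also match the final token. The CoT-consistent set is always a subset of the E2E-consistent set, since matching the whole chain implies matching its last token.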
2. Sample Complexity for End-to-End Supervision: Taxonomy of Possible Growth Rates
In the E2E regime, the sample complexity is characterized via the VC dimension of the end-to-end class F_{e2e,T}. The main contribution here is a fine-grained taxonomy:
- For every monotone subadditive function r(T) (e.g., polylogarithmic, root, polynomial, or linear), there exists a class with E2E sample complexity that scales as r(T).
This means that, among PAC-learnable base classes, all intermediate scaling behaviors between constant and linear in T are attainable. No dichotomy (e.g., logarithmic vs. linear) exists; the sample complexity landscape is provably rich.
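For small finite classes, the VC dimension of the end-to-end class can be computed by brute force and tracked as T varies. The sketch below does this for a hypothetical five-generator toy class on all length-2 binary prompts. It does not reproduce the paper's constructions, which engineer specific growth rates; it only shows how the quantity being characterized is measured.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Set, Tuple

Generator = Callable[[Sequence[int]], int]


def final_token(f: Generator, prompt: Sequence[int], T: int) -> int:
    """Final token after T >= 1 autoregressive steps from the prompt."""
    context = list(prompt)
    for _ in range(T):
        context.append(f(context))
    return context[-1]


def vc_dimension(labelings: Set[Tuple[int, ...]], n_points: int) -> int:
    """Brute-force VC dimension of a binary class given by its label vectors.

    Each labeling is the tuple of labels one hypothesis assigns to the
    n_points domain points. Stops at the first size k with no shattered
    subset, which is valid because subsets of shattered sets are shattered.
    """
    best = 0
    for k in range(1, n_points + 1):
        shattered = any(
            len({tuple(lab[i] for i in subset) for lab in labelings}) == 2 ** k
            for subset in combinations(range(n_points), k)
        )
        if not shattered:
            break
        best = k
    return best


# Hypothetical toy base class and prompt domain (illustration only).
TOY_CLASS: List[Generator] = [
    lambda c: c[-1],
    lambda c: c[-2],
    lambda c: c[-1] ^ c[-2],
    lambda c: c[-1] & c[-2],
    lambda c: c[-1] | c[-2],
]
PROMPTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def e2e_vc_dimension(T: int) -> int:
    """VC dimension of the end-to-end class induced by TOY_CLASS on PROMPTS."""
    labelings = {tuple(final_token(f, p, T) for p in PROMPTS) for f in TOY_CLASS}
    return vc_dimension(labelings, len(PROMPTS))


if __name__ == "__main__":
    for T in range(1, 7):
        print(f"T={T}: VC dimension of the E2E class = {e2e_vc_dimension(T)}")
```

Because the toy class has only five hypotheses, its end-to-end VC dimension can never exceed 2 for any T. Richer base classes are needed to observe growth with T.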
Implication
The empirical and theoretical literature has often highlighted linear dependence as canonical; this work demonstrates that in fact, a continuum of rates is possible for E2E learning, given appropriate construction of base classes.
3. Impossibility of a Complete Dimension Theoretic Characterization for Sublinear Rates
The above richness immediately motivates the following: Can there exist a combinatorial dimension that precisely separates sublinear vs. linear (or superlinear) sample complexity growth in E2E learning? The paper shows, via a diagonalization argument, that the answer is negative.
- No combinatorial dimension (subject to mild criteria) can characterize sublinear E2E sample complexity behavior, nor provide tight quantitative (rate-based) upper bounds within the sublinear regime.
This resolves a principal open question and formalizes the limits of dimension-based sample complexity theory in the autoregressive E2E setting.
4. Sufficient Condition for Logarithmic E2E Sample Complexity
Nonetheless, the authors provide a sufficient condition for logarithmic dependence on T that is strictly broader than finite Littlestone dimension, based on a newly introduced "autoregressive tree dimension" (ATdim). If the induced prefix tree over generations admits no large leveled perfect binary subtree, E2E sample complexity grows only logarithmically in T.
This condition can hold in cases where Littlestone dimension diverges, thus expanding the known class of "efficiently learnable" autoregressive generator families with sublinear scaling.
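For reference, the classical Littlestone dimension that ATdim is compared against can be computed recursively for a finite class over a finite domain. The sketch below implements only that classical notion, using the standard recursion over mistake trees. It does not implement ATdim, whose formal definition is not reproduced in this summary. The function name and the representation of hypotheses as label tuples are choices made here for illustration.

```python
from functools import lru_cache
from typing import FrozenSet, Tuple

Labeling = Tuple[int, ...]


def littlestone_dimension(hypotheses: FrozenSet[Labeling], n_points: int) -> int:
    """Littlestone dimension of a finite binary class over n_points points.

    Each hypothesis is the tuple of labels it assigns to the domain points.
    Recursion: Ldim(H) = max over points x that split H into two nonempty
    parts H0, H1 of 1 + min(Ldim(H0), Ldim(H1)); it is 0 if no point splits H.
    """

    @lru_cache(maxsize=None)
    def ldim(H: FrozenSet[Labeling]) -> int:
        best = 0
        for x in range(n_points):
            h0 = frozenset(h for h in H if h[x] == 0)
            h1 = H - h0
            if h0 and h1:  # x is a valid root of a mistake tree only if both sides survive
                best = max(best, 1 + min(ldim(h0), ldim(h1)))
        return best

    return ldim(hypotheses)


if __name__ == "__main__":
    # Thresholds on three points: VC dimension 1, Littlestone dimension 2.
    thresholds = frozenset({(1, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0)})
    print(littlestone_dimension(thresholds, 3))
```

Threshold classes are the standard example where the Littlestone dimension grows with the domain size while the VC dimension stays at 1, which is the kind of gap that makes conditions weaker than finite Littlestone dimension worth seeking.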
5. Pathologies and Limiting Results for Infinite VC Base Classes
For base classes with infinite VC dimension, the work constructs pathological examples. In particular, for some classes, learnability and sample complexity can alternate (e.g., be zero for all even T and infinite for all odd T), ruling out any regular characterization without base class complexity constraints.
Methodological Insights
The analysis makes key use of:
- Explicit combinatorial constructions to "engineer" desired VC-growth behavior
- Reductions to multiclass learning and application of sample compression bounds
- Tree-based combinatorial arguments for capturing scaling in the realized sequence space, allowing improved bounds in favorable regimes
- Tight lower bounds leveraging shattering arguments and reductions to classical learning hardness
Implications and Directions for Future Work
Practical Implications
- Full Chain-of-Thought supervision is maximally statistically efficient for finite-VC generator classes; increasing the chain length has no deleterious effect on data efficiency.
- In contrast, End-to-End supervision's information bottleneck can drive substantial sample complexity penalties: for finite-VC classes the cost can grow anywhere from constant to linear in T, depending on the structure of the underlying generator class.
- For practical model design, the result underlines the critical benefit of fine-grained intermediate supervision and motivates dataset annotation protocols that permit this.
Theoretical Implications
- The fine taxonomy of E2E complexity rates and non-existence of a universal dimension-theoretic characterization have consequences for both structural learning theory and the development of complexity measures in sequential/recursive contexts.
- Tree-based combinatorial conditions (such as ATdim) can be used to classify new, practically relevant model classes that could admit efficient E2E learning despite lack of finite Littlestone dimension.
Open Questions
- Extension to infinite alphabets and multilabel output spaces (e.g., in proof synthesis or program induction contexts) is non-trivial, and classical symmetries of PAC learning fail.
- Relaxed base class assumptions (notably, dropping finite VC) are shown to admit pathological behaviors; thus, understanding necessary conditions for non-pathological sample complexity remains open.
- Improved explicit constants and sharper parameter dependence in CoT bounds for wider families of concept classes (beyond those admitting stable compression) are of practical interest.
Conclusion
This work provides a comprehensive theoretical classification of the sample complexity landscape for learning autoregressive next-token generators under different supervision paradigms. For finite-VC classes, Chain-of-Thought supervision completely eliminates chain-length dependence, while End-to-End supervision admits a spectrum of possible sample complexity growth behaviors, with no universal dimension-theoretic governing principle. Furthermore, the introduction of autoregressive tree dimension expands understanding of efficient E2E learning conditions. The analysis sets a new bar for theoretical work on sequence model learnability and informs both practical algorithm design and annotation strategies for strong statistical efficiency in autoregressive models.