PAC Learning for Next-Token Generators

Updated 4 July 2026

The paper introduces a PAC-learning model that treats next-token generation as a multi-step rollout problem, with rollout length as an explicit complexity parameter.
It compares end-to-end and chain-of-thought supervision, showing that observing intermediate computations can remove the dependence of sample complexity on rollout length.
The work provides a sample complexity taxonomy for end-to-end learning and a new combinatorial dimension for classifying autoregressive behavior.

The PAC-learning framework for next-token generators studies autoregressive text generation as a statistical learning problem in which an unknown next-token rule is iteratively applied for $T$ steps, and the learning target is the input-output map induced by that rollout rather than merely the one-step predictor. In the formulation introduced by Joshi et al. and sharply extended in "Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End," a next-token generator maps a sequence to its next token, rollout length $T$ becomes an explicit complexity parameter, and the supervision model (observing only the final token or the entire intermediate trajectory) determines whether sample complexity can scale linearly in $T$, logarithmically in $T$, or not at all (Hanneke et al., 13 Apr 2026).

1. Formal autoregressive PAC model

The basic object is a next-token generator

$f:\Sigma^\star \to \Sigma,$

where $\Sigma$ is a finite alphabet and $\Sigma^\star$ is the set of finite strings over $\Sigma$. One autoregressive step is represented by the apply-and-append map

$\bar f:\Sigma^\star\to\Sigma^\star,\qquad \bar f(x)=x\circ f(x),$

and iterating this map for $T$ steps produces a generated chain whose length-$T$ suffix is denoted $f^{\mathrm{CoT}\text{-}T}(x)$, while its last token is denoted $f^{\mathrm{e2e}\text{-}T}(x)$ (Hanneke et al., 13 Apr 2026).

This formulation makes the PAC target explicitly rollout-based. For a base class $\mathcal F$ of next-token generators, the induced classes are

$\mathcal F^{\mathrm{CoT}\text{-}T}=\{f^{\mathrm{CoT}\text{-}T}: f\in\mathcal F\},\qquad \mathcal F^{\mathrm{e2e}\text{-}T}=\{f^{\mathrm{e2e}\text{-}T}: f\in\mathcal F\}.$

Accordingly, the learning task is not necessarily to identify the one-step mechanism $f$, but to learn the map from prompt $x$ to the final token obtained after $T$ rounds of autoregressive generation.
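
The rollout construction above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the names `apply_and_append`, `rollout`, and the toy rule `xor_last_two` are hypothetical, and tokens are assumed to be the integers 0 and 1.

```python
from typing import Callable, List, Sequence

Token = int
NextTokenRule = Callable[[Sequence[Token]], Token]


def apply_and_append(f: NextTokenRule, x: Sequence[Token]) -> List[Token]:
    """One autoregressive step: return x with f(x) appended."""
    return list(x) + [f(x)]


def rollout(f: NextTokenRule, prompt: Sequence[Token], T: int) -> List[Token]:
    """Iterate the apply-and-append map T times and return the T generated tokens."""
    seq = list(prompt)
    for _ in range(T):
        seq = apply_and_append(f, seq)
    return seq[len(prompt):]


def xor_last_two(x: Sequence[Token]) -> Token:
    """Hypothetical toy rule: XOR of the last two tokens (0 if fewer than two)."""
    return x[-1] ^ x[-2] if len(x) >= 2 else 0


chain = rollout(xor_last_two, [1, 1], T=4)  # the full generated chain
final_token = chain[-1]                     # the only token end-to-end evaluation sees
```

Here `chain` plays the role of the length-$T$ generated suffix, and `final_token` is the single token against which an end-to-end learner is judged.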

The paper studies primarily the binary case $\Sigma=\{0,1\}$, while stating that the ideas extend to any finite alphabet. The learning setup is realizable: an unknown distribution $\mathcal D$ over $\Sigma^\star$ supplies i.i.d. prompts $x_1,\dots,x_m$, and labels are generated by some unknown $f^\star\in\mathcal F$. Test-time error is always the end-to-end error

$\mathrm{err}_{\mathcal D}(h)=\Pr_{x\sim\mathcal D}\big[h(x)\neq (f^\star)^{\mathrm{e2e}\text{-}T}(x)\big].$

For finite-VC classes, the usual PAC sample complexity notation $m_{\mathcal H}(\epsilon,\delta)$ is used, with

$m_{\mathcal H}(\epsilon,\delta)=\Theta\!\left(\frac{\mathrm{VC}(\mathcal H)+\log(1/\delta)}{\epsilon}\right).$

Thus, in the end-to-end regime, the problem is exactly PAC learning of $\mathcal F^{\mathrm{e2e}\text{-}T}$, governed by $\mathrm{VC}(\mathcal F^{\mathrm{e2e}\text{-}T})$.

A common misconception is that next-token PAC theory concerns only the one-step predictor. In this framework, the relevant hypothesis class is the class induced by $T$ rounds of rollout, so the learnability question is intrinsically sequential rather than pointwise.

2. Supervision models: end-to-end and Chain-of-Thought

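The framework formalizes two supervision models. Under End-to-End supervision, a training example reveals only the final token,

$\big(x,\ (f^\star)^{\mathrm{e2e}\text{-}T}(x)\big).$

Under Chain-of-Thought supervision, a training example reveals the full generated trajectory,

$\big(x,\ (f^\star)^{\mathrm{CoT}\text{-}T}(x)\big)=(x,z_1,\dots,z_T),$

where $z_t=f^\star(x\circ z_1\cdots z_{t-1})$ for $t=1,\dots,T$ (Hanneke et al., 13 Apr 2026).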

The distinction is asymmetric. Test-time evaluation remains end-to-end even in the CoT regime: the learner is still judged only on the final token. The paper is explicit that CoT supervision is a training-time side-information model, not an ordinary symmetric PAC classification task. This point matters because observing the entire intermediate chain supplies $T$ local constraints on the latent one-step generator, whereas ordinary end-to-end supervision exposes only the terminal outcome.
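
A small enumeration makes the asymmetry concrete. The sketch below assumes a hypothetical toy base class (all 16 truth tables on the last two bits, with prompts of length at least two) and counts how many hypotheses remain consistent with a single training example under each supervision model. It is only an illustration, not a construction from the paper.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

TruthTable = Dict[Tuple[int, int], int]


def rollout(table: TruthTable, prompt: Sequence[int], T: int) -> List[int]:
    """Generate T tokens, each determined by the last two tokens via `table`."""
    seq = list(prompt)
    for _ in range(T):
        seq.append(table[(seq[-2], seq[-1])])
    return seq[len(prompt):]


INPUTS = list(product([0, 1], repeat=2))
# Hypothetical toy base class: every Boolean function of the last two bits.
BASE_CLASS: List[TruthTable] = [
    dict(zip(INPUTS, outputs)) for outputs in product([0, 1], repeat=4)
]
TARGET: TruthTable = {(a, b): a ^ b for a, b in INPUTS}  # assumed target: XOR

prompt, T = [1, 1], 4
target_chain = rollout(TARGET, prompt, T)

# End-to-end supervision: only the final token must agree with the target.
e2e_consistent = [h for h in BASE_CLASS
                  if rollout(h, prompt, T)[-1] == target_chain[-1]]
# Chain-of-thought supervision: the whole generated chain must agree.
cot_consistent = [h for h in BASE_CLASS
                  if rollout(h, prompt, T) == target_chain]
```

In this toy run the CoT example fixes the table on every input pair visited along the chain, so only the unvisited pair `(0, 0)` stays free, whereas the end-to-end example leaves a strictly larger set of consistent hypotheses.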

The central positive theorem for CoT supervision states that for every class $\mathcal F$ of binary next-token generators with $\mathrm{VC}(\mathcal F)<\infty$, there exists a constant $C_{\mathcal F}$ such that for every $T$, the CoT sample complexity is bounded in terms of $C_{\mathcal F}$, $\epsilon$, and $\delta$ alone, with no dependence on $T$.

More precisely, the proof yields a constant controlled by $d=\mathrm{VC}(\mathcal F)$ and $d^\ast$, where $d^\ast$ is the dual VC dimension. The stronger multiclass statement is a dichotomy: if the corresponding dimension of the base class is finite, then the class is learnable under CoT supervision for all $T$ with a bound of the same $T$-independent kind, while if it is infinite, then the class is not PAC learnable for any $T$ (Hanneke et al., 13 Apr 2026).

The lower bound is already present at $T=1$, where the generated chain is a single token and CoT and end-to-end supervision coincide. For classes with stable compression, the paper gives a sharper bound whenever $\mathcal F$ admits a stable sample compression scheme of size $k$; the bound is stated in terms of $k$, $\epsilon$, and $\delta$, again with no dependence on $T$. This applies in particular to the linear autoregressor class.

The paper’s conceptual interpretation is direct: CoT supervision turns one difficult $T$-step autoregressive problem into $T$ ordinary one-step supervision constraints on the base class. This suggests that observing intermediate computation can remove the statistical burden created by hidden latent rollouts.

3. Sample-complexity taxonomy as a function of rollout length

The most distinctive result of the framework is a taxonomy of how sample complexity scales with generation length $T$. In the CoT regime, the dependence on $T$ disappears entirely. In the end-to-end regime, by contrast, the space of possibilities is “remarkably rich”: subject to mild conditions, essentially any growth rate between constant and linear can occur (Hanneke et al., 13 Apr 2026).

Joshi et al. had established a general upper bound that is linear in $T$, together with examples exhibiting linear growth, while also showing that finite Littlestone dimension implies logarithmic growth. The later theory proves that intermediate behaviors are not artifacts but genuinely realizable. A function $r:\mathbb N\to\mathbb N$ is called a monotone-subadditive rate if it is monotone non-decreasing and satisfies

$r(s+t)\le r(s)+r(t)\quad\text{for all } s,t\in\mathbb N.$

Then for every such rate $r$, there exists a class $\mathcal F_r$ of binary next-token generators whose end-to-end VC dimension grows as

$\mathrm{VC}\big(\mathcal F_r^{\mathrm{e2e}\text{-}T}\big)=\Theta\big(r(T)\big).$

In the normalized case, the identity holds exactly rather than up to constants.

The realizable rates include constant growth, $\log T$, polylogarithmic growth, power laws $T^{\alpha}$ for $\alpha\in(0,1)$ after integer rounding, and rates arbitrarily close to linear. Since

$m_{\mathcal F^{\mathrm{e2e}\text{-}T}}(\epsilon,\delta)=\Theta\!\left(\frac{\mathrm{VC}(\mathcal F^{\mathrm{e2e}\text{-}T})+\log(1/\delta)}{\epsilon}\right),$

the same taxonomy transfers immediately to end-to-end sample complexity.
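
Whether a candidate rate is monotone and subadditive can be checked numerically on a finite range. The sketch below takes subadditivity in its standard sense, $r(s+t)\le r(s)+r(t)$, and tests a few integer-rounded rates. The helper `is_monotone_subadditive` and the chosen rates are illustrative assumptions, and a finite check is evidence rather than proof.

```python
import math
from typing import Callable


def is_monotone_subadditive(r: Callable[[int], int], limit: int) -> bool:
    """Check monotonicity and r(s + t) <= r(s) + r(t) for all s, t >= 1 with s + t <= limit."""
    monotone = all(r(t) <= r(t + 1) for t in range(1, limit))
    subadditive = all(
        r(s + t) <= r(s) + r(t)
        for s in range(1, limit + 1)
        for t in range(1, limit - s + 1)
    )
    return monotone and subadditive


# Illustrative integer-valued rates on T >= 1.
rates = {
    "constant": lambda T: 1,
    "log": lambda T: 1 + (T.bit_length() - 1),  # 1 + floor(log2 T)
    "sqrt": lambda T: math.isqrt(T - 1) + 1,    # ceil(sqrt(T)), i.e. T^(1/2) rounded up
    "linear": lambda T: T,
    "quadratic": lambda T: T * T,               # superlinear, so not subadditive
}

results = {name: is_monotone_subadditive(r, limit=200) for name, r in rates.items()}
```

On this range every rate except the quadratic one passes, which matches the expectation that admissible rates lie between constant and linear.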

The constructive part of the theory begins with a class of next-token generators indexed by infinite binary sequences $b\in\{0,1\}^{\mathbb N}$. This base class has VC dimension 1, yet the VC dimension of its induced end-to-end class grows linearly in $T$. Thus even a VC-dimension-1 base class can exhibit maximal linear growth in $T$. To realize arbitrary sublinear rates, the theory restricts the informative bit positions through subsets $S\subseteq\mathbb N$ and defines the correspondingly restricted classes; suitable choices of $S$ produce the rates described above.

A plausible implication is that end-to-end autoregressive learnability has no small list of canonical regimes. The framework instead identifies a broad continuum of possible statistical behaviors.

4. Combinatorial dimensions, impossibility results, and proof mechanisms

The same theory shows that no single combinatorial dimension can characterize all sublinear end-to-end rates. It defines a putative characterization as a map assigning a dimension to each base class, together with an upper bound on end-to-end growth in terms of that dimension, such that the dimension is finite iff growth is sublinear in $T$, and such that for each fixed value of the dimension the bound is itself sublinear in $T$. The resulting theorem is negative: there is no dimension that characterizes sublinear rates in this sense (Hanneke et al., 13 Apr 2026).

Although no full dimension theory exists for all sublinear behavior, the paper introduces a new sufficient parameter, the autoregressive tree dimension. For a prompt $x$ and rollout depth $T$, the set of trajectories realized by generators in the base class induces a prefix tree. The autoregressive tree dimension is the largest depth of a perfect leveled binary subtree realized somewhere in these generation trees. The main quantitative upper bound states that, when this dimension is finite, end-to-end growth is at most logarithmic in $T$ for sufficiently large $T$.

This strictly strengthens the earlier sufficient condition based on finite Littlestone dimension: finite Littlestone dimension implies finite autoregressive tree dimension, yet the paper also constructs classes for which the converse fails. So the autoregressive tree dimension can be finite even when Littlestone dimension is infinite.

The proof mechanisms are equally characteristic. For CoT upper bounds, the central idea is to inflate each chain-of-thought sample into $T$ binary examples for the base class. If the sample contains $(x,z_1,\dots,z_T)$ with $z_t=f^\star(x\circ z_1\cdots z_{t-1})$, then

$\big\{(x\circ z_1\cdots z_{t-1},\ z_t)\big\}_{t=1}^{T}$

yields an ordinary labeled sample for $\mathcal F$. The technical difficulty is then to compress the inflated sample without paying a factor of $T$. The paper adapts the majority-vote compression scheme of Moran and Yehudayoff to obtain a compression size that depends only on $\mathcal F$, independent of $T$.
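
The inflation step can be written out directly. The sketch below is a minimal illustration under assumed data structures (a CoT example is a pair of a prompt and its generated chain, both lists of bits); the function name `inflate_cot_sample` is hypothetical, and the compression argument itself is not implemented.

```python
from typing import List, Sequence, Tuple

CoTExample = Tuple[Sequence[int], Sequence[int]]   # (prompt, generated chain)
OneStepExample = Tuple[Tuple[int, ...], int]       # (prefix, next token)


def inflate_cot_sample(sample: Sequence[CoTExample]) -> List[OneStepExample]:
    """Turn each CoT example with a length-T chain into T one-step labeled examples."""
    inflated: List[OneStepExample] = []
    for prompt, chain in sample:
        prefix = list(prompt)
        for token in chain:
            # The current prefix is the input; the observed token is its label.
            inflated.append((tuple(prefix), token))
            prefix.append(token)
    return inflated


# Two toy CoT examples with T = 4, generated by XOR of the last two bits.
cot_sample = [([1, 1], [0, 1, 1, 0]), ([0, 1], [1, 0, 1, 1])]
one_step_sample = inflate_cot_sample(cot_sample)
```

A sample of $m$ chains of length $T$ becomes $mT$ one-step examples, which is why a naive analysis would pay a factor of $T$ that the compression argument is designed to avoid.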

For logarithmic end-to-end upper bounds under finite autoregressive tree dimension, the analysis relies on a new Sauer-type lemma for leveled binary trees: if a depth-$T$ tree contains no perfect leveled binary subtree beyond a fixed depth, then its number of leaves is bounded by a polynomial in $T$ whose degree is controlled by that depth. Applied to realized generation trees, this bounds the number of realized trajectories, which replaces the trivial $2^T$ branch count.
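
The gap between realized trajectories and the trivial branch count can be observed on a toy class. The sketch below assumes the same kind of hypothetical base class as a stand-in (all 16 Boolean functions of the last two bits) and builds the prefix tree of trajectories realized from one prompt. It does not compute the perfect leveled subtree depth and is not the paper's lemma.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

TruthTable = Dict[Tuple[int, int], int]


def rollout(table: TruthTable, prompt: Sequence[int], T: int) -> Tuple[int, ...]:
    """Generate T tokens, each determined by the last two tokens via `table`."""
    seq = list(prompt)
    for _ in range(T):
        seq.append(table[(seq[-2], seq[-1])])
    return tuple(seq[len(prompt):])


INPUTS = list(product([0, 1], repeat=2))
# Hypothetical toy base class: every Boolean function of the last two bits.
BASE_CLASS: List[TruthTable] = [
    dict(zip(INPUTS, outputs)) for outputs in product([0, 1], repeat=4)
]


def realized_trajectories(prompt: Sequence[int], T: int) -> Set[Tuple[int, ...]]:
    """Distinct length-T chains produced by some generator in the base class."""
    return {rollout(h, prompt, T) for h in BASE_CLASS}


def prefix_tree_nodes(trajectories: Set[Tuple[int, ...]]) -> Set[Tuple[int, ...]]:
    """All prefixes of realized trajectories: the nodes of the generation tree."""
    return {traj[:k] for traj in trajectories for k in range(len(traj) + 1)}


prompt, T = [1, 1], 6
leaves = realized_trajectories(prompt, T)
nodes = prefix_tree_nodes(leaves)
trivial_branch_count = 2 ** T
```

Because at most one trajectory arises per hypothesis, `len(leaves)` cannot exceed 16 here, well below the 64 leaves of the full binary tree of depth 6.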

The framework also emphasizes that structural assumptions are essential. Outside the finite-VC setting, pathological behavior can occur: there exists a base class $\mathcal F$ such that the end-to-end problem is learnable for every even $T$, while for every odd $T$, the problem is not learnable even under CoT supervision. This is a direct warning against extrapolating the positive theory beyond its stated hypotheses.

5. Adjacent learning-theoretic frameworks

The broader literature develops several adjacent, but non-identical, learning-theoretic frameworks for autoregressive next-token prediction.

"Auto-Regressive Next-Token Predictors are Universal Learners" defines AR Learnable as a PAC analogue for sequence prediction under realizability, proves a transfer theorem stating that if $\mathcal H_1,\dots,\mathcal H_T$ are PAC learnable with sample complexity $m(\epsilon,\delta)$, then the product class is AR learnable with sample complexity $m(\epsilon/T,\delta/T)$, and introduces length complexity as the number of intermediate CoT tokens needed to compute or approximate a target function (Malach, 2023). That theory is explicit that it is not a full PAC theory in the classical sense, but it gives a formal bridge between per-step PAC learnability and autoregressive sequence learning.

"Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization" develops a PAC-Bayesian framework for pre-trained next-token generators under a hierarchical topic-conditioned sequence model. Its population loss is expressed as an expected KL divergence between the true and model next-token conditionals, and the main result is a two-level decomposition of population loss into empirical loss, sequence-generalization gaps, and topic-generalization gaps, together with data-dependent, topic-dependent, and optimization-dependent PAC-Bayesian bounds (Gong et al., 24 Feb 2025).

"Hardness of Learning Regular Languages in the Next Symbol Prediction Setting" formalizes a PAC-style model for learning the truncated support behavior of a next-token generator from positive examples annotated with per-prefix continuation sets and acceptance bits. Its main negative result is that, despite this richer supervision, efficient PAC learning of polynomial-size acyclic DFAs in the NSP model would imply efficient PAC learning in the conventional classification model; under Kearns–Valiant cryptographic assumptions, weak learning remains hard (Bhattamishra et al., 21 Oct 2025). This corrects the possible misconception that richer token-level supervision necessarily removes computational hardness.

"Provable Long-Range Benefits of Next-Token Prediction" studies a different guarantee: minimizing next-token log loss over an appropriate class of RNN language models yields generators that are $\epsilon$-indistinguishable from the training distribution for all bounded-size next-$k$-token RNN distinguishers, with model-size bounds polynomial in $k$ and independent of document length (Cao et al., 8 Dec 2025). The guarantee is complexity-theoretic rather than PAC in the classical sample-complexity sense.

"The Role of Generator Access in Autoregressive Post-Training" shifts attention from hypothesis complexity to oracle structure. In the no-reset or root-start regime, output sampling, generated-token log probabilities, top-$k$ reports, and full next-token distributions along sampled trajectories collapse to one canonical experiment, and transcript distinguishability is bounded in terms of the on-policy probability of reaching informative prefixes. Weak prefix control breaks this barrier, and chosen-prefix sampling or logits can outperform top-1 access; for KL-regularized outcome-reward post-training, the paper proves an exponential gap between no-reset access and chosen-prefix access (Rege, 6 Apr 2026).

"Markov Chain Estimation with In-Context Learning" contributes an experimental benchmark rather than a PAC theorem, but it is directly relevant to the framework’s intuitions. Transformers trained only with next-token cross-entropy on sequences from families of Markov chains display a threshold phenomenon in model size and number of training chains: below the threshold they underfit or memorize, and above it they generalize to unseen transition matrices by estimating transition probabilities from context (Lepage et al., 5 Aug 2025).

6. Assumptions, scope, and enduring significance

The PAC-learning framework for next-token generators is deliberately narrow in several respects. The main positive structural results assume a finite alphabet, realizability, deterministic next-token generators, finite $\mathrm{VC}(\mathcal F)$, bounded rollout length $T$ as an explicit parameter, and an information-theoretic notion of learnability rather than a general computational-efficiency guarantee (Hanneke et al., 13 Apr 2026). The theory does not provide agnostic or noisy-label guarantees, does not analyze probabilistic next-token generators directly, and does not supply a general optimization theory for practical language-model training.

Within that scope, however, it provides a sharp answer to two foundational questions. First, end-to-end autoregressive learning does not have a single canonical dependence on reasoning length: essentially every monotone-subadditive rate between constant and linear can occur. Second, CoT supervision can eliminate dependence on reasoning length altogether. This suggests a precise statistical interpretation of reasoning traces: they expose latent intermediate states of autoregressive computation and thereby convert hidden multi-step inference into directly supervised local transitions.

The framework also reframes several debates. It shows that finite Littlestone dimension is not the right universal boundary for sublinear end-to-end growth, that no single dimension can characterize all sublinear regimes, and that richer supervision or richer generator outputs do not automatically remove computational or statistical barriers. In particular, the NSP hardness results and the generator-access separations show that one must distinguish between observing more labels, reaching the right prefixes, and solving the induced inference problem.

In that sense, the PAC-learning framework for next-token generators is best understood as a theory of rollout-induced hypothesis classes. Its core insight is that the learnability of autoregressive systems depends jointly on three axes: the base class $\mathcal F$, the rollout length $T$, and the supervision or access model under which intermediate computation is hidden, revealed, or directly queryable.