
PAC Learning for Next-Token Generators

Updated 4 July 2026
  • The paper introduces a PAC-learning model that treats next-token generation as a multi-step rollout problem, with rollout length as an explicit complexity parameter.
  • It compares end-to-end and chain-of-thought supervision, showing that observing intermediate computations can reduce sample complexity.
  • The work provides a taxonomy of sample-complexity growth rates and new combinatorial dimensions for classifying autoregressive behavior.

The PAC-learning framework for next-token generators studies autoregressive text generation as a statistical learning problem in which an unknown next-token rule is iteratively applied for $T$ steps, and the learning target is the input-output map induced by that rollout rather than merely the one-step predictor. In the formulation introduced by Joshi et al. and substantially extended in "Sample Complexity of Autoregressive Reasoning: Chain-of-Thought vs. End-to-End," a next-token generator maps a sequence to its next token, rollout length $T$ becomes an explicit complexity parameter, and the supervision model (observing only the final token or the entire intermediate trajectory) determines whether sample complexity can scale linearly in $T$, logarithmically in $T$, or not depend on $T$ at all (Hanneke et al., 13 Apr 2026).

1. Formal autoregressive PAC model

The basic object is a next-token generator

$$f:\Sigma^\star \to \Sigma,$$

where $\Sigma$ is a finite alphabet and $\Sigma^\star$ is the set of finite strings over $\Sigma$. One autoregressive step is represented by the apply-and-append map

$$\bar f:\Sigma^\star\to\Sigma^\star,\qquad \bar f(x)=x\circ f(x),$$

and iterating this map for $T$ steps produces a generated chain whose length-$T$ suffix is denoted $f^{\text{CoT-}T}(x)$, while its last token is denoted $f^{\text{e2e-}T}(x)$ (Hanneke et al., 13 Apr 2026).

This formulation makes the PAC target explicitly rollout-based. For a base class $\mathcal F$, the induced classes are

$$\mathcal F^{\text{CoT-}T}=\{f^{\text{CoT-}T}: f\in\mathcal F\},\qquad \mathcal F^{\text{e2e-}T}=\{f^{\text{e2e-}T}: f\in\mathcal F\}.$$

Accordingly, the learning task is not necessarily to identify the one-step mechanism $f$, but to learn the map from prompt $x$ to the final token obtained after $T$ rounds of autoregressive generation.
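The apply-and-append map and its $T$-step rollout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`apply_and_append`, `rollout`, `cot`, `e2e`) and the toy generator `xor_last_two` are hypothetical and do not come from the paper.

```python
from typing import Callable, Tuple

Tokens = Tuple[int, ...]
Generator = Callable[[Tokens], int]  # a next-token generator f: Sigma* -> Sigma


def apply_and_append(f: Generator, x: Tokens) -> Tokens:
    """One autoregressive step: return x followed by f(x)."""
    return x + (f(x),)


def rollout(f: Generator, x: Tokens, T: int) -> Tokens:
    """Iterate apply-and-append T times and return the full chain."""
    chain = x
    for _ in range(T):
        chain = apply_and_append(f, chain)
    return chain


def cot(f: Generator, x: Tokens, T: int) -> Tokens:
    """Length-T suffix of the chain, i.e. only the generated tokens."""
    return rollout(f, x, T)[len(x):]


def e2e(f: Generator, x: Tokens, T: int) -> int:
    """Last token of the chain after T >= 1 steps."""
    return rollout(f, x, T)[-1]


def xor_last_two(x: Tokens) -> int:
    """Toy binary generator (hypothetical): XOR of the last two tokens."""
    return x[-1] ^ x[-2]


prompt = (1, 1)
print(cot(xor_last_two, prompt, 4))  # generated trajectory
print(e2e(xor_last_two, prompt, 4))  # final token only
```

The end-to-end target is the map `x -> e2e(f, x, T)`, so two generators that disagree on some intermediate step may still induce the same end-to-end function.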

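The paper studies primarily the binary case $\Sigma=\{0,1\}$, while stating that the ideas extend to any finite alphabet. The learning setup is realizable: an unknown distribution $\mathcal D$ over $\Sigma^\star$ supplies i.i.d. prompts $x_1,\dots,x_m$, and labels are generated by some unknown $f_\star\in\mathcal F$. Test-time error is always the end-to-end error, that is, the probability over a fresh prompt $x\sim\mathcal D$ that the learner's prediction differs from the final token produced by $T$ rollout steps of $f_\star$.

For finite-VC classes, the usual PAC sample complexity notation $m(\varepsilon,\delta)$ is used, which for a hypothesis class of VC dimension $d$ scales as $\Theta\big((d+\log(1/\delta))/\varepsilon\big)$ in the realizable case.

Thus, in the end-to-end regime, the problem is exactly PAC learning of the end-to-end class $\mathcal F^{\text{e2e-}T}$ induced by $T$-step rollout, governed by its VC dimension $\mathrm{VC}(\mathcal F^{\text{e2e-}T})$.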

A common misconception is that next-token PAC theory concerns only the one-step predictor. In this framework, the relevant hypothesis class is the class induced by $T$ rounds of rollout, so the learnability question is intrinsically sequential rather than pointwise.

2. Supervision models: end-to-end and Chain-of-Thought

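The framework formalizes two supervision models. Under End-to-End supervision, a training example reveals only the final token produced by the target generator $f_\star$,

$$\big(x,\ f_\star^{\text{e2e-}T}(x)\big).$$

Under Chain-of-Thought supervision, a training example reveals the full generated trajectory,

$$(x,\ z),$$

where $z=f_\star^{\text{CoT-}T}(x)\in\Sigma^T$ is the sequence of $T$ generated tokens (Hanneke et al., 13 Apr 2026).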

The distinction is asymmetric. Test-time evaluation remains end-to-end even in the CoT regime: the learner is still judged only on the final token. The paper is explicit that CoT supervision is a training-time side-information model, not an ordinary symmetric PAC classification task. This point matters because observing the entire intermediate chain supplies $T$ local constraints on the latent one-step generator, whereas ordinary end-to-end supervision exposes only the terminal outcome.
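The asymmetry between training-time supervision and test-time evaluation can be made concrete with a small sketch. The function names and the toy target `xor_last_two` below are assumptions introduced for illustration; the point is only that both supervision models are scored by the same end-to-end error.

```python
from typing import Callable, List, Tuple

Tokens = Tuple[int, ...]
Generator = Callable[[Tokens], int]


def generate(f: Generator, x: Tokens, T: int) -> Tokens:
    """Return the T tokens produced by rolling f out from prompt x."""
    chain = x
    for _ in range(T):
        chain = chain + (f(chain),)
    return chain[len(x):]


def e2e_example(f: Generator, x: Tokens, T: int) -> Tuple[Tokens, int]:
    """End-to-end supervision: the prompt and the final token only."""
    return x, generate(f, x, T)[-1]


def cot_example(f: Generator, x: Tokens, T: int) -> Tuple[Tokens, Tokens]:
    """Chain-of-thought supervision: the prompt and the whole trajectory."""
    return x, generate(f, x, T)


def e2e_error(predict: Callable[[Tokens], int], target: Generator,
              prompts: List[Tokens], T: int) -> float:
    """Empirical end-to-end error: only the final token is compared."""
    mistakes = sum(predict(x) != generate(target, x, T)[-1] for x in prompts)
    return mistakes / len(prompts)


def xor_last_two(x: Tokens) -> int:
    """Toy target generator (hypothetical): XOR of the last two tokens."""
    return x[-1] ^ x[-2]


prompts = [(0, 1), (1, 1), (1, 0)]
print([e2e_example(xor_last_two, x, 3) for x in prompts])
print([cot_example(xor_last_two, x, 3) for x in prompts])
print(e2e_error(lambda x: 1, xor_last_two, prompts, 3))
```

A CoT example carries $T$ labeled transitions per prompt, while the evaluation function never looks at anything but the last token.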

The central positive theorem for CoT supervision states that for every class $\mathcal F$ of binary next-token generators with finite VC dimension, there exists a constant depending only on $\mathcal F$ such that, for every $T$, the CoT sample complexity is bounded in terms of that constant, with no dependence on $T$ [displayed bound not recoverable from the source].

More precisely, the proof yields an explicit value for the constant in terms of the VC dimension of $\mathcal F$ and its dual VC dimension. The stronger multiclass statement is a dichotomy: under a finiteness condition on the base class, the CoT problem is learnable for all $T$ with a $T$-independent bound, while if the condition fails, it is not PAC learnable for any $T$ [conditions and bound not recoverable from the source] (Hanneke et al., 13 Apr 2026).

The lower bound is already present at $T=1$ [statement not recoverable from the source]. For classes with stable compression, the paper gives a sharper bound whenever $\mathcal F$ admits a stable sample compression scheme of size $k$ [displayed bound not recoverable from the source]. This applies in particular to the linear autoregressor class, yielding a corresponding explicit bound [class definition and bound not recoverable from the source].

The paper's conceptual interpretation is direct: CoT supervision turns one difficult $T$-step autoregressive problem into $T$ ordinary one-step supervision constraints on the base class. This suggests that observing intermediate computation can remove the statistical burden created by hidden latent rollouts.

3. Sample-complexity taxonomy as a function of rollout length

The most distinctive result of the framework is a taxonomy of how sample complexity scales with generation length $T$. In the CoT regime, the dependence on $T$ disappears entirely. In the end-to-end regime, by contrast, the space of possibilities is “remarkably rich”: subject to mild conditions, essentially any growth rate between constant and linear can occur (Hanneke et al., 13 Apr 2026).

Joshi et al. had established a general upper bound that grows at most linearly in $T$ [displayed bound not recoverable from the source], together with examples with linear growth, while also showing that finite Littlestone dimension implies logarithmic growth. The later theory proves that intermediate behaviors are not artifacts but genuinely realizable. A function $r:\mathbb N\to\mathbb N$ is called a monotone-subadditive rate if it is monotone non-decreasing and satisfies

$$r(s+t)\le r(s)+r(t)\qquad\text{for all } s,t\in\mathbb N.$$

Then for every such rate $r$, there exists a class $\mathcal F$ of binary next-token generators whose end-to-end complexity grows at the rate $r(T)$ [displayed statement not recoverable from the source]. In the normalized case, an exact identity is obtained [normalization condition and identity not recoverable from the source].

The realizable rates include constant growth, logarithmic growth, polylogarithmic growth, power laws $T^{\alpha}$ for $0<\alpha<1$ after integer rounding, and rates arbitrarily close to linear. Since end-to-end sample complexity is governed by the VC dimension of the induced end-to-end class, the same taxonomy transfers immediately to end-to-end sample complexity.
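The monotone-subadditive condition is easy to test numerically on a finite range. The sketch below is a sanity check rather than a proof, and the candidate rates and the function name `is_monotone_subadditive` are illustrative choices, not objects defined in the paper.

```python
import math
from typing import Callable


def is_monotone_subadditive(r: Callable[[int], int], horizon: int) -> bool:
    """Check r(t) <= r(t+1) and r(s+t) <= r(s) + r(t) for all s, t >= 1
    with s + t <= horizon. A finite check only, not a proof."""
    values = [None] + [r(t) for t in range(1, horizon + 1)]  # 1-indexed
    monotone = all(values[t] <= values[t + 1] for t in range(1, horizon))
    subadditive = all(
        values[s + t] <= values[s] + values[t]
        for s in range(1, horizon)
        for t in range(1, horizon - s + 1)
    )
    return monotone and subadditive


# Candidate rates (illustrative), integer-valued on T >= 1.
candidates = {
    "constant": lambda T: 1,
    "log": lambda T: T.bit_length(),          # equals ceil(log2(T + 1))
    "sqrt": lambda T: math.isqrt(T - 1) + 1,  # equals ceil(sqrt(T))
    "linear": lambda T: T,
    "quadratic": lambda T: T * T,             # superlinear, should fail
}

for name, rate in candidates.items():
    print(name, is_monotone_subadditive(rate, horizon=200))
```

Rounding a concave rate up to an integer preserves subadditivity because $\lceil a+b\rceil\le\lceil a\rceil+\lceil b\rceil$, which is why the rounded logarithm and square root pass while the quadratic rate fails already at $s=t=1$.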

The constructive part of the theory begins with a class of generators indexed by infinite binary sequences [class definition not recoverable from the source]. This class has VC dimension 1, while the VC dimension of its induced end-to-end class grows linearly in $T$. Thus even a VC-dimension-1 base class can exhibit maximal linear growth in $T$. To realize arbitrary sublinear rates, the theory restricts the informative bit positions through subsets of positions, and obtains a class realizing the prescribed rate [definitions and displayed statements not recoverable from the source].

A plausible implication is that end-to-end autoregressive learnability has no small list of canonical regimes. The framework instead identifies a broad continuum of possible statistical behaviors.

4. Combinatorial dimensions, impossibility results, and proof mechanisms

The same theory shows that no single combinatorial dimension can characterize all sublinear end-to-end rates. It defines a putative characterization as a map assigning a dimension to each class, together with an upper-bound function, such that the dimension is finite iff the end-to-end rate is sublinear, and such that the bound holds for each fixed value of the dimension [formal conditions not recoverable from the source]. The resulting theorem is negative: there is no dimension that characterizes sublinear rates in this sense (Hanneke et al., 13 Apr 2026).

Although no full dimension theory exists for all sublinear behavior, the paper introduces a new sufficient parameter, the autoregressive tree dimension. For a prompt $x$ and rollout depth $T$, the set of realized trajectories, one length-$T$ continuation of $x$ per generator in the class, induces a prefix tree. The autoregressive tree dimension is the largest depth of a perfect leveled binary subtree realized somewhere in these generation trees. The main quantitative upper bound holds for sufficiently large $T$ and, in asymptotic form, is logarithmic in $T$ when the autoregressive tree dimension is finite [displayed bounds not recoverable from the source].

This strictly strengthens the earlier sufficient condition based on finite Littlestone dimension because the autoregressive tree dimension is bounded in terms of the Littlestone dimension, yet the paper also constructs classes whose autoregressive tree dimension is finite while their Littlestone dimension is infinite [displayed relations not recoverable from the source].

The proof mechanisms are equally characteristic. For CoT upper bounds, the central idea is to inflate each chain-of-thought sample into $T$ binary examples for the base class. If the sample contains $(x,z)$ with $z=(z_1,\dots,z_T)$, then

$$\big(x\circ z_{<t},\ z_t\big),\qquad t=1,\dots,T,$$

where $z_{<t}=(z_1,\dots,z_{t-1})$, yields an ordinary labeled sample for $\mathcal F$. The technical difficulty is then to compress the inflated sample without paying a factor of $T$. The paper adapts the majority-vote compression scheme of Moran and Yehudayoff to obtain a compression size that is independent of $T$ [expression not recoverable from the source].
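The inflation step can be sketched directly: each chain-of-thought example becomes one (prefix, next token) pair per generated token. The names `inflate`, `inflate_sample`, and `consistent` are hypothetical helpers for illustration, and the sketch covers only the inflation, not the compression argument.

```python
from typing import Callable, List, Tuple

Tokens = Tuple[int, ...]
Labeled = Tuple[Tokens, int]


def inflate(x: Tokens, z: Tokens) -> List[Labeled]:
    """Turn one CoT example (x, z) into len(z) one-step examples:
    the prefix x followed by z[:t] is labeled with z[t]."""
    return [(x + z[:t], z[t]) for t in range(len(z))]


def inflate_sample(sample: List[Tuple[Tokens, Tokens]]) -> List[Labeled]:
    """Inflate every CoT example and concatenate the results."""
    return [pair for x, z in sample for pair in inflate(x, z)]


def consistent(f: Callable[[Tokens], int], labeled: List[Labeled]) -> bool:
    """True if the one-step generator f agrees with every labeled pair."""
    return all(f(prefix) == token for prefix, token in labeled)


# One CoT example with T = 4 (toy data, generated by XOR of the last two tokens).
sample = [((1, 1), (0, 1, 1, 0))]
pairs = inflate_sample(sample)
for prefix, token in pairs:
    print(prefix, "->", token)

print(consistent(lambda x: x[-1] ^ x[-2], pairs))  # the generating rule
print(consistent(lambda x: 0, pairs))              # a constant generator
```

A generator reproduces every observed trajectory exactly when it is consistent with the inflated sample, which is what lets one-step learning machinery be applied to the base class.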

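For logarithmic end-to-end upper bounds under finite autoregressive tree dimension, the analysis relies on a new Sauer-type lemma for leveled binary trees: if a depth-$T$ tree contains no perfect leveled binary subtree beyond a given depth, then the number of leaves is bounded by a Sauer-type expression [formula not recoverable from the source]. Applied to realized generation trees, this gives a bound on the number of realized trajectories [formula not recoverable from the source], which replaces the trivial $2^T$ branch count.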
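To see what the leaf count of a realized generation tree looks like, the sketch below enumerates the trajectories a small finite class produces from one prompt and compares their number with $2^T$. The toy class `copy_back(i)` and all helper names are assumptions made for illustration; the sketch does not compute the autoregressive tree dimension itself.

```python
from typing import Callable, Iterable, Set, Tuple

Tokens = Tuple[int, ...]
Generator = Callable[[Tokens], int]


def trajectory(f: Generator, x: Tokens, T: int) -> Tokens:
    """The T tokens generated by rolling f out from prompt x."""
    chain = x
    for _ in range(T):
        chain = chain + (f(chain),)
    return chain[len(x):]


def realized_leaves(generators: Iterable[Generator], x: Tokens, T: int) -> Set[Tokens]:
    """Distinct length-T trajectories, i.e. the leaves of the prefix tree."""
    return {trajectory(f, x, T) for f in generators}


def prefix_tree_nodes(leaves: Set[Tokens]) -> Set[Tokens]:
    """All prefixes of realized trajectories, including the empty root."""
    return {leaf[:t] for leaf in leaves for t in range(len(leaf) + 1)}


def copy_back(i: int) -> Generator:
    """Toy generator (hypothetical): repeat the token i positions back."""
    def f(x: Tokens) -> int:
        return x[-i]
    return f


prompt = (0, 1, 1, 0)
toy_class = [copy_back(i) for i in range(1, 5)]
T = 3
leaves = realized_leaves(toy_class, prompt, T)
print(sorted(leaves))
print(len(leaves), "realized leaves versus", 2 ** T, "possible")
print(len(prefix_tree_nodes(leaves)), "nodes in the prefix tree")
```

For a finite class the leaf count is at most the class size regardless of $T$; the Sauer-type lemma described above plays the analogous role for infinite classes with finite autoregressive tree dimension.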

The framework also emphasizes that structural assumptions are essential. Outside the finite-VC setting, pathological behavior can occur: there exists a base class $\mathcal F$ such that for every even $T$ a positive learnability guarantee holds [displayed bound not recoverable from the source], while for every odd $T$, the problem is not learnable even under CoT supervision. This is a direct warning against extrapolating the positive theory beyond its stated hypotheses.

The broader literature develops several adjacent, but non-identical, learning-theoretic frameworks for autoregressive next-token prediction.

"Auto-Regressive Next-Token Predictors are Universal Learners" defines AR Learnable as a PAC analogue for sequence prediction under realizability, proves a transfer theorem stating that if the per-step hypothesis classes are PAC learnable, then the product class is AR learnable with sample complexity expressed through the per-step sample complexity [expressions not recoverable from the source], and introduces length complexity as the number of intermediate CoT tokens needed to compute or approximate a target function (Malach, 2023). That theory is explicit that it is not a full PAC theory in the classical sense, but it gives a formal bridge between per-step PAC learnability and autoregressive sequence learning.

"Towards Auto-Regressive Next-Token Prediction: In-Context Learning Emerges from Generalization" develops a PAC-Bayesian framework for pre-trained next-token generators under a hierarchical topic-conditioned sequence model. Its population loss is expressed as an expected KL divergence between the true and model next-token conditionals, and the main result is a two-level decomposition of population loss into empirical loss, sequence-generalization gaps, and topic-generalization gaps, together with data-dependent, topic-dependent, and optimization-dependent PAC-Bayesian bounds (Gong et al., 24 Feb 2025).

"Hardness of Learning Regular Languages in the Next Symbol Prediction Setting" formalizes a PAC-style model for learning the truncated support behavior of a next-token generator from positive examples annotated with per-prefix continuation sets and acceptance bits. Its main negative result is that, despite this richer supervision, efficient PAC learning of polynomial-size acyclic DFAs in the NSP model would imply efficient PAC learning in the conventional classification model; under Kearns–Valiant cryptographic assumptions, weak learning remains hard (Bhattamishra et al., 21 Oct 2025). This corrects the possible misconception that richer token-level supervision necessarily removes computational hardness.

"Provable Long-Range Benefits of Next-Token Prediction" studies a different guarantee: minimizing next-token log loss over an appropriate class of RNN language models yields generators that are $\epsilon$-indistinguishable from the training distribution for all bounded-size next-$k$-token RNN distinguishers, with model-size bounds polynomial in the relevant parameters [parameter list not recoverable from the source] and independent of the document length (Cao et al., 8 Dec 2025). The guarantee is complexity-theoretic rather than PAC in the classical sample-complexity sense.

"The Role of Generator Access in Autoregressive Post-Training" shifts attention from hypothesis complexity to oracle structure. In the no-reset or root-start regime, output sampling, generated-token log probabilities, top-$k$ reports, and full next-token distributions along sampled trajectories collapse to one canonical experiment, and transcript distinguishability is bounded in terms of the on-policy probability of reaching informative prefixes [expression not recoverable from the source]. Weak prefix control breaks this barrier, and chosen-prefix sampling or logits can outperform top-1 access; for KL-regularized outcome-reward post-training, the paper proves an exponential gap between no-reset access and chosen-prefix access (Rege, 6 Apr 2026).

"Markov Chain Estimation with In-Context Learning" contributes an experimental benchmark rather than a PAC theorem, but it is directly relevant to the framework’s intuitions. Transformers trained only with next-token cross-entropy on sequences from families of Markov chains display a threshold phenomenon in model size and number of training chains: below the threshold they underfit or memorize, and above it they generalize to unseen transition matrices by estimating transition probabilities from context (Lepage et al., 5 Aug 2025).

6. Assumptions, scope, and enduring significance

The PAC-learning framework for next-token generators is deliberately narrow in several respects. The main positive structural results assume a finite alphabet, realizability, deterministic next-token generators, finite VC dimension of the base class, bounded rollout length $T$ as an explicit parameter, and an information-theoretic notion of learnability rather than a general computational-efficiency guarantee (Hanneke et al., 13 Apr 2026). The theory does not provide agnostic or noisy-label guarantees, does not analyze probabilistic next-token generators (maps from prefixes to distributions over $\Sigma$) directly, and does not supply a general optimization theory for practical language-model training.

Within that scope, however, it provides a sharp answer to two foundational questions. First, end-to-end autoregressive learning does not have a single canonical dependence on reasoning length: essentially every monotone-subadditive rate between constant and linear can occur. Second, CoT supervision can eliminate dependence on reasoning length altogether. This suggests a precise statistical interpretation of reasoning traces: they expose latent intermediate states of autoregressive computation and thereby convert hidden multi-step inference into directly supervised local transitions.

The framework also reframes several debates. It shows that finite Littlestone dimension is not the right universal boundary for sublinear end-to-end growth, that no single dimension can characterize all sublinear regimes, and that richer supervision or richer generator outputs do not automatically remove computational or statistical barriers. In particular, the NSP hardness results and the generator-access separations show that one must distinguish between observing more labels, reaching the right prefixes, and solving the induced inference problem.

In that sense, the PAC-learning framework for next-token generators is best understood as a theory of rollout-induced hypothesis classes. Its core insight is that the learnability of autoregressive systems depends jointly on three axes: the base class $\mathcal F$, the rollout length $T$, and the supervision or access model under which intermediate computation is hidden, revealed, or directly queryable.
