
Agnostic Multiclass PAC Sample Complexity

Updated 23 November 2025
  • The paper establishes tight sample complexity bounds using a three-stage procedure that integrates improper covers, multiplicative weights, and sample-compression.
  • It demonstrates that agnostic multiclass learning hinges on two key dimensions—Natarajan and DS—with Natarajan dominating high-accuracy regimes and DS determining overall learnability.
  • The results resolve longstanding questions by showing that both dimensions are crucial: the Natarajan term drives the 1/ε² bound while the DS term governs the 1/ε regime, even under bandit feedback.

The agnostic multiclass PAC (Probably Approximately Correct) sample complexity problem seeks to characterize the number of labeled examples required to learn a multiclass hypothesis class, with no assumption that the target hypothesis is included in the class. Unlike the binary case, which is governed by a single combinatorial parameter—the VC dimension—multiclass learning in the agnostic setting is controlled by two distinct combinatorial dimensions. Recent breakthroughs have precisely delineated the roles of the Natarajan and the Daniely–Shalev-Shwartz (DS) dimensions in governing agnostic sample complexity, resolving foundational questions regarding which structural parameters dictate learnability and asymptotic rates.

1. Fundamental Combinatorial Dimensions

Let $\mathcal{X}$ denote the instance space and $\mathcal{Y}$ the (possibly infinite) label space. For any multiclass hypothesis class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$, three key dimensions are fundamental:

  • Natarajan Dimension ($d_N$): The maximum $d$ such that there exist points $x_1,\ldots,x_d\in\mathcal{X}$ and two labelings $f,g:\{1,\ldots,d\}\to\mathcal{Y}$ with $f(i)\neq g(i)$ for all $i$, where for every subset $S\subseteq\{1,\ldots,d\}$ a hypothesis $h\in\mathcal{H}$ exists with $h(x_i)=f(i)$ for $i\in S$ and $h(x_j)=g(j)$ for $j\notin S$.
  • DS Dimension ($d_{DS}$): The largest $d$ for which the class can realize all labelings corresponding to a $d$-dimensional “pseudo-cube” in $\mathcal{Y}^d$. For binary labels, this coincides with the VC dimension.
  • Realizable Dimension ($d_{RE}$): The least sample size $n$ ensuring that, for every realizable distribution, a deterministic learner achieves error at most $r<1/2$ in expectation; one sets $d_{RE}(\mathcal{H})=\inf_A d_{RE}(A,\mathcal{H},1/(9e))$, the infimum ranging over learners $A$. It is established that $\Omega(d_{DS})\le d_{RE}(\mathcal{H})\le \tilde{O}(d_{DS}^{1.5})$.

These dimensions capture different aspects of combinatorial complexity for multiclass classes and are provably not equivalent; $d_{DS}$ can be arbitrarily larger than $d_N$ (Cohen et al., 16 Nov 2025).
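To make the shattering condition behind the Natarajan dimension concrete, the following is a minimal brute-force sketch for a finite class over a finite domain. The tuple representation of hypotheses and the helper names are illustrative choices, not constructs from the cited papers, and the search is exponential, so it is only viable for toy classes.

```python
from itertools import chain, combinations, product

def _subsets(items):
    """All subsets of a sequence of indices."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def n_shattered(H, S):
    """Is the index set S Natarajan-shattered by the finite class H?
    Each h in H is a tuple of labels, one per domain point."""
    # Candidate (f(i), g(i)) pairs: distinct labels H actually uses at point i.
    per_point = [[(a, b) for a in {h[i] for h in H}
                  for b in {h[i] for h in H} if a != b]
                 for i in S]
    for choice in product(*per_point):      # a joint choice of labelings f, g
        f, g = [c[0] for c in choice], [c[1] for c in choice]
        if all(any(all(h[S[j]] == (f[j] if j in T else g[j]) for j in range(len(S)))
                   for h in H)
               for T in map(set, _subsets(range(len(S))))):
            return True
    return False

def natarajan_dim(H):
    """Largest d such that some d-subset of the domain is N-shattered."""
    n = len(next(iter(H)))
    shattered = [d for d in range(1, n + 1)
                 if any(n_shattered(H, S) for S in combinations(range(n), d))]
    return max(shattered, default=0)

# Example: the class of ALL functions from 2 points into {0, 1, 2}.
H = set(product(range(3), repeat=2))
print(natarajan_dim(H))  # -> 2
```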

2. Tight Bounds on Agnostic Sample Complexity

The agnostic sample complexity $m_{AG}(\epsilon,\delta)$ is the minimum sample size at which a learner outputs some $\hat{h}$ that, with probability at least $1-\delta$, achieves risk at most $\inf_{h\in\mathcal{H}}\mathrm{er}_P(h)+\epsilon$ under every distribution $P$ over $\mathcal{X}\times\mathcal{Y}$. The main result is:

$$m_{AG}(\epsilon,\delta) \asymp \frac{d_N}{\epsilon^2} + \frac{d_{RE}}{\epsilon} + \frac{\ln(1/\delta)}{\epsilon^2}$$

Substituting $d_{RE}=\tilde{O}(d_{DS}^{1.5})$ gives (up to logarithmic factors):

$$m_{AG}(\epsilon,\delta) = O\left(\frac{d_N}{\epsilon^2} + \frac{d_{DS}^{1.5}}{\epsilon} + \frac{\ln(1/\delta)}{\epsilon^2}\right)$$

Both the $d_N/\epsilon^2$ and $d_{RE}/\epsilon$ terms are necessary: the first dominates in high-accuracy (small-$\epsilon$) regimes, while the second dominates in high-noise scenarios (Cohen et al., 16 Nov 2025). This dual dependence is unique to the agnostic multiclass setting, in contrast with the binary case, where the VC dimension suffices.
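The two regimes can be made tangible with a quick back-of-the-envelope computation. The dimension values below are arbitrary illustrations; constants and logarithmic factors are suppressed, as in the bound itself.

```python
import math

def agnostic_bound_terms(d_n, d_ds, eps, delta):
    """Evaluate the three terms of the upper bound
    m_AG = O(d_N/eps^2 + d_DS^1.5/eps + ln(1/delta)/eps^2),
    suppressing constants and log factors (illustration only)."""
    return {
        "natarajan": d_n / eps**2,
        "ds": d_ds**1.5 / eps,
        "confidence": math.log(1 / delta) / eps**2,
    }

# d_N/eps^2 >= d_DS^1.5/eps exactly when eps <= d_N / d_DS^1.5, so with
# d_N = 10 and d_DS = 1000 the crossover sits near eps ~ 3.2e-4: the DS
# term dominates at moderate accuracy, the Natarajan term as eps -> 0.
for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    terms = agnostic_bound_terms(d_n=10, d_ds=1000, eps=eps, delta=0.05)
    print(f"eps={eps:.0e}: dominant={max(terms, key=terms.get)}")
```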

3. Algorithmic Methodology and Proof Outline

The proof establishes the upper bound via a three-stage procedure integrating improper learning, online multiplicative weights, and sample-compression:

  • Stage 1: Construct a finite improper cover $F$ of the hypothesis class using $m_1 = \tilde{O}(d_{RE}/\epsilon)$ samples and a realizable learner. Every $h\in\mathcal{H}$ is closely approximated by some $f\in F$, guaranteeing agreement except on an $O(\epsilon)$ fraction of points.
  • Stage 2: Reduce the effective label space by running an online, self-adaptive multiplicative-weights process over $F$ for $T=\Theta((\log|F|)/\epsilon)$ rounds. The MW algorithm constructs a “menu” $\mu(x)$ of candidate predictions which, with appropriate regret guarantees, covers the true label on nearly every instance for every $f\in F$ (a simplified sketch of this mechanism follows the list).
  • Stage 3: Restrict attention to predictions from $\mu(x)$ and apply sample compression under the partial-concept loss, using $m_3 = \tilde{O}(d_N \log T / \epsilon^2)$ fresh samples and an ERM or one-inclusion approach.
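The following is a heavily simplified sketch of the multiplicative-weights mechanism in Stage 2, assuming a finite cover $F$ of hashable predictors: experts that disagree with the observed label are downweighted, and the menu $\mu(x)$ retains only labels still backed by non-negligible total weight. The learning rate, threshold, and update rule are generic MW choices, not the paper's actual procedure.

```python
import math
from collections import defaultdict

def mw_menu(F, samples, eta=0.5, threshold=0.01):
    """Run a multiplicative-weights pass over the finite cover F
    (each f in F maps x -> label) and return a menu function mu."""
    w = {f: 1.0 for f in F}
    for x, y in samples:                          # T online rounds
        for f in F:
            if f(x) != y:                         # penalize mistaken experts
                w[f] *= math.exp(-eta)
        total = sum(w.values())
        w = {f: v / total for f, v in w.items()}  # renormalize

    def menu(x):
        """Labels whose surviving-expert weight exceeds the threshold."""
        mass = defaultdict(float)
        for f, v in w.items():
            mass[f(x)] += v
        return {label for label, m in mass.items() if m >= threshold}

    return menu

# Toy usage: one consistent expert quickly dominates the menu.
F = [lambda x: x % 3, lambda x: (x + 1) % 3, lambda x: 0]
mu = mw_menu(F, samples=[(i, i % 3) for i in range(20)])
print(mu(7))  # -> {1}, the prediction of the surviving expert
```

The point of the menu is that subsequent learning only needs to distinguish among the few labels in $\mu(x)$, which is what lets the Stage-3 compression argument proceed with the Natarajan dimension.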

Notably, traditional uniform convergence techniques are insufficient in the presence of unbounded label sets, and reductions to the realizable case also break down for improper multiclass learners. The new MW-based reduction avoids these obstacles and is inherently improper and adaptive.

4. Prior Work and Resolution of the Natarajan Dimension Question

Historical approaches (e.g., Daniely–Shalev-Shwartz 2014; Brukhim et al. 2022) established that a finite DS dimension characterizes multiclass learnability but left open whether the Natarajan dimension materially impacts agnostic rates. Earlier quantitative bounds were of the form $d_{DS}^{1.5}/\epsilon^2$, suggesting no direct role for $d_N$ in achievable rates. The new results definitively show that, in the low-noise, high-accuracy regime ($\epsilon\to 0$), the Natarajan term $d_N/\epsilon^2$ is not only present but leading. This recovers classical lower bounds and clarifies that agnostic multiclass PAC learning fundamentally depends on both the DS and Natarajan dimensions (Cohen et al., 16 Nov 2025).

5. Comparisons: Bandit Feedback and Full-Information Models

Extensions to settings with limited feedback provide further context. In the agnostic bandit-feedback setting with a finite class $H$ and a label set of size $K$, the sample complexity is $m=O((\mathrm{poly}(K) + 1/\epsilon^2)\ln(|H|/\delta))$, a rate that, up to logarithmic factors, matches the optimal $O((1/\epsilon^2)\ln|H|)$ bound for full-information PAC learning. The bandit model introduces only an $O(1)$ multiplicative overhead as $\epsilon\to 0$, in contrast with the realizable case, where the gap is $\Theta(K)$ (Erez et al., 18 Jun 2024). Generalizing to infinite classes with finite Natarajan dimension, the sample complexity becomes $O((\mathrm{poly}(K)+1/\epsilon^2)\,d_N\ln(1/\delta))$, further cementing the primacy of $d_N/\epsilon^2$ in the agnostic small-$\epsilon$ regime.
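A standard device behind such bandit-to-full-information comparisons is importance weighting: query a uniformly random label, observe only the one-bit correctness feedback, and reweight to obtain unbiased loss estimates for every hypothesis at once. The sketch below illustrates this generic estimator; it is not the algorithm of Erez et al. (2024), whose exploration scheme is more refined.

```python
import random

def iw_loss_estimates(H, samples, K):
    """Unbiased 0/1-loss estimates for every h in H from bandit feedback.
    Each round, guess a uniform label and observe only whether it was
    correct; weighting by K makes the estimate unbiased, since
    E[K * 1{h(x)=guess} * 1{guess != y}] = 1{h(x) != y}."""
    est = {h: 0.0 for h in H}
    for x, y in samples:                # y itself is never revealed
        guess = random.randrange(K)     # uniform exploration
        correct = (guess == y)          # the only feedback observed
        for h in H:
            if h(x) == guess and not correct:
                est[h] += K             # importance weight 1 / P[guess = h(x)]
    return {h: v / len(samples) for h, v in est.items()}
```

The factor-$K$ importance weights inflate the variance of these estimates, which is one way to see where the $\mathrm{poly}(K)$ overhead in the bandit bound comes from.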

6. Implications and Structural Consequences

Multiclass agnostic PAC learning uniquely involves two structural parameters—DS dimension and Natarajan dimension—which exert control over different accuracy regimes:

  • DS Dimension ($d_{DS}$): Governs learnability (whether a class is agnostically PAC-learnable at all) and the $1/\epsilon$ sample complexity regime.
  • Natarajan Dimension ($d_N$): Governs the $1/\epsilon^2$ term, crucial in high-accuracy/low-noise regimes, and controls the uniform-convergence cost once the label space is effectively bounded.

As $d_{DS}$ can be much larger than $d_N$, there exist hypothesis classes for which the $d_N/\epsilon^2$ term is eventually dominant. The key methodological innovation, an online multiplicative-weights label-space reduction combined with sample compression, circumvents obstacles that thwart the uniform-convergence and proper-reduction arguments typical of binary and online settings.

A plausible implication is that related list-bounded or partial-concept loss approaches may apply to other non-ERM multiclass settings with similar combinatorial pathologies.

7. Summary Table: Sample Complexity Dependence

| Regime | Leading Term | Dimension Involved |
|---|---|---|
| Low noise / high accuracy ($\epsilon\to 0$) | $d_N/\epsilon^2$ | Natarajan ($d_N$) |
| High noise / moderate accuracy | $d_{RE}/\epsilon$ or $d_{DS}^{1.5}/\epsilon$ | DS / realizable ($d_{DS}$, $d_{RE}$) |
| Bandit feedback (finite class) | $O((\mathrm{poly}(K)+1/\epsilon^2)\ln(|H|/\delta))$ | Primarily $d_N$, plus $K$ factors |
Bandit feedback (finite class) O((poly(K)+1/ϵ2)ln(H/δ))O((\mathrm{poly}(K)+1/\epsilon^2)\ln(|H|/\delta)) Primarily dNd_N, plus KK factors

This framework resolves the longstanding question of whether the Natarajan dimension matters for agnostic multiclass PAC learning: it does, dictating the dominant $1/\epsilon^2$ term, while the DS dimension dictates the $1/\epsilon$ regime and overall learnability (Cohen et al., 16 Nov 2025; Erez et al., 18 Jun 2024).
