
Agnostic Universal Rates of ERM

Updated 30 June 2025
  • The paper establishes a trichotomy: exponential decay for finite H, super-root for infinite H with finite VC-dimension, and arbitrarily slow rates for infinite VC-dimension.
  • It characterizes the decay of excess risk purely in terms of hypothesis class cardinality and VC dimension, giving practical guidance on what ERM can guarantee.
  • These universal rates refine classical PAC analysis by demonstrating how ERM learning curves adapt to inherent agnostic noise and model misspecification.

Agnostic universal rates of Empirical Risk Minimization (ERM) refer to the fundamental speed at which the expected excess classification risk of ERM decays with sample size for all possible data distributions, without assuming that the hypothesis class contains the true labeling function. Recent work, notably "Universal Rates of ERM for Agnostic Learning" (Hanneke & Xu, 2025), delivers a precise trichotomy of possible universal rates for ERM in the agnostic setting. These results clarify the behavior and limitations of ERM-based learning curves for broad concept classes and fixed target distributions, providing a complete combinatorial characterization and several structural refinements for finer, target-dependent rates.

1. Agnostic Universal Rate Trichotomy

ERM in the agnostic binary classification setting outputs a classifier $h$ by minimizing the sample mean of the $0$-$1$ error over a class $H$. The universal agnostic rate is defined via the decay rate of the expected excess risk:

$$\mathbb{E}\left[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h' \in H}\mathrm{er}(h')\right]$$

as a function of the sample size $n$, for every data distribution $P$. The main result establishes a classification into exactly three rate regimes:

| Universal Rate | Characterization |
| --- | --- |
| Exponential, $e^{-n}$ | $\lvert H\rvert < \infty$ |
| Super-root, $o(n^{-1/2})$ | $\lvert H\rvert = \infty$ and $\mathrm{VC}(H) < \infty$ |
| Arbitrarily slow | $\mathrm{VC}(H) = \infty$ |
  • If $H$ is finite, ERM achieves an exponentially fast rate in $n$ for every distribution.
  • If $H$ is infinite but has finite VC dimension, ERM achieves a super-root rate: for every $P$, the expected excess risk converges faster than $1/\sqrt{n}$, though no longer exponentially fast.
  • If $H$ has infinite VC dimension, there exist distributions for which the expected excess risk of ERM decays more slowly than any prescribed positive function tending to zero.

2. Formal Definitions and Mathematical Formulations

Let $S_n = \{(x_i, y_i)\}_{i=1}^n$ be i.i.d. samples from $P$. ERM selects

$$h_{\mathrm{ERM}} \in \arg\min_{h \in H} \frac{1}{n} \sum_{i=1}^n \mathbb{I}\left[h(x_i)\ne y_i\right].$$

The universal learning curve is the function $n \mapsto \mathbb{E}[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h\in H}\mathrm{er}(h)]$, where the expectation is with respect to the sample, and "universal" means the rate bound holds for every fixed (but arbitrary) distribution $P$.
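
The selection rule above can be sketched in a few lines of Python. The callable-hypothesis representation and the small threshold class are illustrative assumptions, not constructions from the paper:

```python
import numpy as np

def erm(hypotheses, X, y):
    """Return a hypothesis in H minimizing the empirical 0-1 risk on (X, y).

    hypotheses: list of callables h(X) -> array of 0/1 predictions.
    Ties are broken by list order, mirroring the arbitrary tie-breaking
    permitted in the definition of ERM.
    """
    emp_risks = [np.mean(h(X) != y) for h in hypotheses]
    return hypotheses[int(np.argmin(emp_risks))]

# A small finite class of 1-D threshold classifiers h_t(x) = 1[x >= t].
thresholds = [0.2, 0.5, 0.8]
H = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]

X = np.array([0.1, 0.3, 0.6, 0.9])
y = np.array([0, 0, 1, 1])        # labels consistent with threshold 0.5
h_hat = erm(H, X, y)
print(np.mean(h_hat(X) != y))     # empirical risk of the ERM output: 0.0
```

Here the threshold $t = 0.5$ attains zero empirical risk, so ERM selects it; in the agnostic setting the same rule applies even when no hypothesis fits the sample perfectly.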

The trichotomy states:

  • If $|H| < \infty$, for any $P$:

$$\mathbb{E}[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h\in H}\mathrm{er}(h)] \leq Ce^{-cn}$$

for constants $C, c > 0$ depending on $H$ and $P$.

  • If $|H| = \infty$ and $\mathrm{VC}(H) < \infty$:

$$\mathbb{E}[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h\in H}\mathrm{er}(h)] = o(n^{-1/2})$$

(i.e., $\sqrt{n}\cdot\mathbb{E}[\text{excess risk}] \to 0$ as $n\to\infty$).

  • If $\mathrm{VC}(H) = \infty$, then for any pre-specified rate $R(n)\to 0$, there is a distribution $P$ such that

$$\mathbb{E}[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h\in H}\mathrm{er}(h)] \geq R(n)$$

for infinitely many $n$.
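
The finite-class regime can be illustrated by Monte Carlo estimation of the learning curve. The distribution below (uniform inputs, a threshold target with 10% label noise, and a three-element threshold class) is an illustrative assumption, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE = 0.1                              # label-flip probability (agnostic noise)
thresholds = [0.25, 0.5, 0.75]           # finite class, |H| = 3

def true_risk(t):
    # For X ~ Uniform(0,1) and y = 1[x >= 0.5] flipped w.p. NOISE:
    # er(h_t) = NOISE + (1 - 2*NOISE) * |t - 0.5|.
    return NOISE + (1 - 2 * NOISE) * abs(t - 0.5)

best_risk = min(true_risk(t) for t in thresholds)   # attained at t = 0.5

def expected_excess_risk(n, trials=200):
    """Average excess risk of ERM over repeated samples of size n."""
    total = 0.0
    for _ in range(trials):
        X = rng.uniform(0, 1, n)
        flips = rng.uniform(0, 1, n) < NOISE
        y = (X >= 0.5).astype(int) ^ flips
        emp = [np.mean((X >= t).astype(int) != y) for t in thresholds]
        t_hat = thresholds[int(np.argmin(emp))]      # ERM over H
        total += true_risk(t_hat) - best_risk
    return total / trials

curve = [expected_excess_risk(n) for n in (5, 20, 100)]
print(curve)   # the expected excess risk shrinks rapidly as n grows
```

With only three hypotheses and a constant risk gap, the probability that ERM selects a suboptimal threshold decays exponentially in $n$, which is visible in how quickly the estimated curve collapses toward zero.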

3. Combinatorial Characterization

The rate regime is dictated entirely by two class properties:

  • Finite hypothesis space: Implies exponential rate.
  • Infinite but VC-finite: Implies a super-root rate, always strictly faster than the classical uniform PAC rate $n^{-1/2}$, but never exponential.
  • Infinite VC-dimension: Allows for distributions where the rate cannot be bounded by any prescribed function—there is no universal guarantee on the speed of error decay.

This classification is sharp and exhaustive for all classes with $|H| \geq 3$.
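
Since the regime is determined by VC dimension, it may help to recall how that quantity is checked. On a small finite domain it can be computed by brute force; the helper below is an illustrative sketch, not code from the paper:

```python
from itertools import combinations

def shatters(hypotheses, points):
    """True if the class realizes all 2^len(points) labelings of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    """Largest size of a shattered subset of `domain` (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, S) for S in combinations(domain, k)):
            d = k
        else:
            break   # if no k-set is shattered, no larger set is either
    return d

# 1-D thresholds h_t(x) = 1[x >= t]: the classical example with VC dimension 1.
domain = [1, 2, 3, 4]
H = [lambda x, t=t: int(x >= t) for t in [0, 1.5, 2.5, 3.5, 5]]
print(vc_dimension(H, domain))   # → 1
```

Thresholds can realize both labels on any single point but never the labeling $(1, 0)$ on an ordered pair, so no two-point set is shattered; this finite VC dimension places the (infinite) threshold class in the super-root regime.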

4. Target- and Bayes-Dependent Universal Rates

The paper supplies further subdivisions into target-dependent and Bayes-dependent universal rates, refining the analysis to the structure around the best-in-class or Bayes-optimal function.

  • Target-dependent trichotomy: For a given "target" $h^* \in H$, the achievable rate over distributions for which $h^*$ is best-in-class can be classified by how the "near-best" competitors of $h^*$ are organized (specifically, whether their set has finite VC dimension or not).
  • Bayes-dependent trichotomy: Considering the true Bayes classifier $\bar{h}$, the rates depend on whether there exists an infinite (VC-)eluder sequence centered at $\bar{h}$, i.e., a set witnessing sustained disagreement among functions in $H$.

These refinements provide a more finely grained view of distribution-dependent universal rates and clarify when ERM may adapt to "easy" or "hard" distributions for a fixed HH.

5. Contrast to the Realizable Universal Rates

In the realizable case (where $P$ is supported on some $h^*_0 \in H$), "Universal Rates of Empirical Risk Minimization" (2412.02810) establishes a tetrachotomy (exponential, $1/n$, $(\log n)/n$, arbitrarily slow), with a richer spectrum of rates possible in the absence of agnostic noise. The agnostic trichotomy is more compact: in the non-realizable case, unavoidable noise or model misspecification rules out the intermediate $(\log n)/n$ rate for ERM; unless $H$ is finite, nothing beyond the super-root rate $o(n^{-1/2})$ can be guaranteed. This confirms that agnostic learning is strictly harder: fast rates require both limited class complexity and zero noise.

6. Implications and Applications

  • Practical interpretation: For any $H$ used in applied classification, the agnostic universal learning rate achieved by ERM is dictated purely by finiteness and VC dimension. Thus, practitioners using ERM should expect, for general distributions:
    • Exponential error decay if the class is finite (rare in practice).
    • Strictly super-root convergence (but neither exponential nor $1/n$) if $H$ is infinite but has finite VC dimension.
    • No universal guarantees if $H$ has infinite VC dimension.
  • Model selection: These trichotomy results justify the central importance of controlling class complexity—not just for worst-case PAC theory, but for universal, every-distribution (distribution-wise) learning curve guarantees.
  • Theory: These results provide definitive negative evidence against the possibility of $1/n$ universal excess risk rates for ERM in agnostic learning—regardless of the particular infinite VC class chosen.
  • Algorithm design: For improper learners or data-dependent methods, faster rates may sometimes be achievable in special noise/margin or target-adaptive regimes, but ERM is fundamentally limited according to this trichotomy.

Summary Table: Agnostic Universal Rates for ERM

| Rate Regime | Class Characterization | Example |
| --- | --- | --- |
| $e^{-n}$ | $\lvert H\rvert < \infty$ | Finite threshold/interval classes |
| $o(n^{-1/2})$ | $\lvert H\rvert = \infty$, $\mathrm{VC}(H) < \infty$ | Halfspaces, finite-degree polynomials |
| Arbitrarily slow | $\mathrm{VC}(H) = \infty$ | Unrestricted indicator classes |

References to Key Results

  • The main trichotomy theorem: Section 3, "Agnostic universal rates for ERM, target-independent case."
  • Target-dependent and Bayes-dependent trichotomies: Section 4, Theorem 2 (target-dependent), Theorem 3 (Bayes-dependent).
  • Proofs, combinatorial definitions, and concrete class examples: throughout Sections 3–5; for formal statements and detailed definitions of (VC-)eluder sequences centered at $h^*$ or the Bayes classifier, see the supplemental definitions.

Conclusion:

The work establishes a precise trichotomy for agnostic universal rates of ERM: exponential, strictly super-root (but not $1/n$), or arbitrarily slow, fully characterized by standard class complexity measures (finiteness of $H$, VC dimension) together with new structural properties of the neighborhood of optimal classifiers. This resolves the structure of agnostic universal learning for ERM and provides clear guidelines for statistical learning, model complexity control, and the design of robust empirical risk minimization systems.
