
Agnostic Universal Rates of ERM

Updated 30 June 2025
  • The paper establishes a trichotomy: exponential decay for finite H, super-root for infinite H with finite VC-dimension, and arbitrarily slow rates for infinite VC-dimension.
  • It characterizes the decay of excess risk purely in terms of hypothesis class cardinality and VC dimension, giving practical guidance on what ERM can guarantee.
  • These universal rates refine classical PAC analysis by demonstrating how ERM learning curves adapt to inherent agnostic noise and model misspecification.

Agnostic universal rates of Empirical Risk Minimization (ERM) refer to the fundamental speed at which the expected excess classification risk of ERM decays with sample size for all possible data distributions, without assuming that the hypothesis class contains the true labeling function. Recent work, notably "Universal Rates of ERM for Agnostic Learning" (Hanneke & Xu, 2025), delivers a precise trichotomy of possible universal rates for ERM in the agnostic setting. These results clarify the behavior and limitations of ERM-based learning curves for broad concept classes and fixed target distributions, providing a complete combinatorial characterization and several structural refinements for finer, target-dependent rates.

1. Agnostic Universal Rate Trichotomy

ERM in the agnostic binary classification setting outputs a classifier $h$ by minimizing the sample mean of the $0$-$1$ error over a class $H$. The universal agnostic rate is defined via the decay rate of the expected excess risk:

$$\mathbb{E}\left[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h' \in H}\mathrm{er}(h')\right]$$

as a function of the sample size $n$, for every data distribution $P$. The main result establishes a classification into exactly three rate regimes:

| Universal Rate | Characterization |
| --- | --- |
| Exponential, $e^{-n}$ | $\lvert H\rvert < \infty$ |
| Super-root, $o(n^{-1/2})$ | $\lvert H\rvert = \infty$ and $\mathrm{VC}(H) < \infty$ |
| Arbitrarily slow | $\mathrm{VC}(H) = \infty$ |
  • If $H$ is finite, ERM achieves an exponentially fast rate in $n$ for every distribution.
  • If $H$ is infinite but has finite VC dimension, ERM achieves a super-root rate: for every $P$, the expected excess risk converges faster than $1/\sqrt{n}$, though no longer exponentially fast.
  • If $H$ has infinite VC dimension, there exist distributions for which the expected excess risk of ERM decays more slowly than any prescribed positive function tending to zero.

2. Formal Definitions and Mathematical Formulations

Let $S_n = \{(x_i, y_i)\}_{i=1}^n$ be i.i.d. samples from $P$. ERM selects

$$h_{\mathrm{ERM}} \in \arg\min_{h \in H} \frac{1}{n} \sum_{i=1}^n \mathbb{I}\left[h(x_i)\ne y_i\right].$$

The universal learning curve is the function $n \mapsto \mathbb{E}[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h\in H}\mathrm{er}(h)]$, where the expectation is with respect to the sample, and "universal" means the rate bound holds for every fixed (but arbitrary) distribution $P$.
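
The selection rule above can be sketched in a few lines of Python. The callable-hypothesis representation and the small threshold class are illustrative assumptions, not constructions from the paper:

```python
import numpy as np

def erm(hypotheses, X, y):
    """Return a hypothesis in H minimizing the empirical 0-1 risk on (X, y).

    hypotheses: list of callables h(X) -> array of 0/1 predictions.
    Ties are broken by list order, mirroring the arbitrary tie-breaking
    permitted in the definition of ERM.
    """
    emp_risks = [np.mean(h(X) != y) for h in hypotheses]
    return hypotheses[int(np.argmin(emp_risks))]

# A small finite class of 1-D threshold classifiers h_t(x) = 1[x >= t].
thresholds = [0.2, 0.5, 0.8]
H = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]

X = np.array([0.1, 0.3, 0.6, 0.9])
y = np.array([0, 0, 1, 1])        # labels consistent with threshold 0.5
h_hat = erm(H, X, y)
print(np.mean(h_hat(X) != y))     # empirical risk of the ERM output: 0.0
```

Here the threshold $t = 0.5$ attains zero empirical risk, so ERM selects it; in the agnostic setting the same rule applies even when no hypothesis fits the sample perfectly.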

The trichotomy states:

  • If $|H| < \infty$, for any $P$:

$$\mathbb{E}[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h\in H}\mathrm{er}(h)] \leq Ce^{-cn}$$

for constants $C, c > 0$ depending on $H$ and $P$.

  • If $|H| = \infty$ and $\mathrm{VC}(H) < \infty$:

$$\mathbb{E}[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h\in H}\mathrm{er}(h)] = o(n^{-1/2})$$

(i.e., $\sqrt{n}\cdot\mathbb{E}[\text{excess risk}] \to 0$ as $n\to\infty$).

  • If $\mathrm{VC}(H) = \infty$, then for any pre-specified rate $R(n)\to 0$, there is a distribution $P$ such that

$$\mathbb{E}[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h\in H}\mathrm{er}(h)] \geq R(n)$$

for infinitely many $n$.
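
The finite-class regime can be illustrated by Monte Carlo estimation of the learning curve. The distribution below (uniform inputs, a threshold target with 10% label noise, and a three-element threshold class) is an illustrative assumption, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE = 0.1                              # label-flip probability (agnostic noise)
thresholds = [0.25, 0.5, 0.75]           # finite class, |H| = 3

def true_risk(t):
    # For X ~ Uniform(0,1) and y = 1[x >= 0.5] flipped w.p. NOISE:
    # er(h_t) = NOISE + (1 - 2*NOISE) * |t - 0.5|.
    return NOISE + (1 - 2 * NOISE) * abs(t - 0.5)

best_risk = min(true_risk(t) for t in thresholds)   # attained at t = 0.5

def expected_excess_risk(n, trials=200):
    """Average excess risk of ERM over repeated samples of size n."""
    total = 0.0
    for _ in range(trials):
        X = rng.uniform(0, 1, n)
        flips = rng.uniform(0, 1, n) < NOISE
        y = (X >= 0.5).astype(int) ^ flips
        emp = [np.mean((X >= t).astype(int) != y) for t in thresholds]
        t_hat = thresholds[int(np.argmin(emp))]      # ERM over H
        total += true_risk(t_hat) - best_risk
    return total / trials

curve = [expected_excess_risk(n) for n in (5, 20, 100)]
print(curve)   # the expected excess risk shrinks rapidly as n grows
```

With only three hypotheses and a constant risk gap, the probability that ERM selects a suboptimal threshold decays exponentially in $n$, which is visible in how quickly the estimated curve collapses toward zero.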

3. Combinatorial Characterization

The rate regime is dictated entirely by two class properties:

  • Finite hypothesis space: Implies exponential rate.
  • Infinite but VC-finite: Implies a super-root rate, always strictly faster than the classical uniform PAC rate $n^{-1/2}$, but never exponential.
  • Infinite VC-dimension: Allows for distributions where the rate cannot be bounded by any prescribed function—there is no universal guarantee on the speed of error decay.

This classification is sharp and exhaustive for all classes with $|H| \geq 3$.
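
Since the regime is determined by VC dimension, it may help to recall how that quantity is checked. On a small finite domain it can be computed by brute force; the helper below is an illustrative sketch, not code from the paper:

```python
from itertools import combinations

def shatters(hypotheses, points):
    """True if the class realizes all 2^len(points) labelings of `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    """Largest size of a shattered subset of `domain` (brute force)."""
    d = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, S) for S in combinations(domain, k)):
            d = k
        else:
            break   # if no k-set is shattered, no larger set is either
    return d

# 1-D thresholds h_t(x) = 1[x >= t]: the classical example with VC dimension 1.
domain = [1, 2, 3, 4]
H = [lambda x, t=t: int(x >= t) for t in [0, 1.5, 2.5, 3.5, 5]]
print(vc_dimension(H, domain))   # → 1
```

Thresholds can realize both labels on any single point but never the labeling $(1, 0)$ on an ordered pair, so no two-point set is shattered; this finite VC dimension places the (infinite) threshold class in the super-root regime.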

4. Target- and Bayes-Dependent Universal Rates

The paper supplies further subdivisions into target-dependent and Bayes-dependent universal rates, refining the analysis to the structure around the best-in-class or Bayes-optimal function.

  • Target-dependent trichotomy: For a given "target" $h^* \in H$, the achievable rate over distributions for which $h^*$ is best-in-class can be classified by how the "near-best" competitors of $h^*$ are organized (specifically, whether their set has finite VC dimension or not).
  • Bayes-dependent trichotomy: Considering the true Bayes classifier $\bar{h}$, the rates depend on whether there exists an infinite (VC-)eluder sequence centered at $\bar{h}$, i.e., a set witnessing sustained disagreement among functions in $H$.

These refinements provide a more finely grained view of distribution-dependent universal rates and clarify when ERM may adapt to "easy" or "hard" distributions for a fixed HH.

5. Contrast to the Realizable Universal Rates

In the realizable case (where $P$ is supported on some $h^*_0 \in H$), "Universal Rates of Empirical Risk Minimization" (2412.02810) establishes a tetrachotomy (exponential, $1/n$, $(\log n)/n$, arbitrarily slow), with a richer spectrum of rates possible in the absence of agnostic noise. The agnostic trichotomy is more compact: in the non-realizable case, unavoidable noise or model misspecification rules out the intermediate $(\log n)/n$ rate for ERM; unless $H$ is finite, nothing beyond the super-root rate $o(n^{-1/2})$ can be guaranteed. This confirms that agnostic learning is strictly harder: fast rates require both limited class complexity and zero noise.

6. Implications and Applications

  • Practical interpretation: For any $H$ used in applied classification, the agnostic universal learning rate achieved by ERM is dictated purely by finiteness and VC dimension. Thus, practitioners using ERM should expect, for general distributions:
    • Exponential error decay if the class is finite (rare in practice).
    • Strictly super-root convergence (but neither exponential nor $1/n$) if $H$ is infinite but has finite VC dimension.
    • No universal guarantees if $H$ has infinite VC dimension.
  • Model selection: These trichotomy results justify the central importance of controlling class complexity—not just for worst-case PAC theory, but for universal, every-distribution (distribution-wise) learning curve guarantees.
  • Theory: These results provide definitive negative evidence against the possibility of $1/n$ universal excess risk rates for ERM in agnostic learning—regardless of the particular infinite VC class chosen.
  • Algorithm design: For improper learners or data-dependent methods, faster rates may sometimes be achievable in special noise/margin or target-adaptive regimes, but ERM is fundamentally limited according to this trichotomy.

Summary Table: Agnostic Universal Rates for ERM

| Rate Regime | Class Characterization | Example |
| --- | --- | --- |
| $e^{-n}$ | $\lvert H\rvert < \infty$ | Finite threshold/interval classes |
| $o(n^{-1/2})$ | $\lvert H\rvert = \infty$, $\mathrm{VC}(H) < \infty$ | Halfspaces, finite-degree polynomials |
| Arbitrarily slow | $\mathrm{VC}(H) = \infty$ | Unrestricted indicator classes |

References to Key Results

  • The main trichotomy theorem: Section 3, "Agnostic universal rates for ERM, target-independent case."
  • Target-dependent and Bayes-dependent trichotomies: Section 4, Theorem 2 (target-dependent), Theorem 3 (Bayes-dependent).
  • Proofs, combinatorial definitions, and concrete class examples: throughout Sections 3–5; for formal statements and detailed definitions of (VC-)eluder sequences centered at $h^*$ or the Bayes classifier, see the supplemental definitions.

Conclusion:

The work establishes a precise trichotomy for agnostic universal rates of ERM: exponential, strictly super-root (but not $1/n$), or arbitrarily slow, fully characterized by standard class complexity measures (finiteness of $H$, VC dimension) together with new structural properties of the neighborhood of optimal classifiers. This resolves the structure of agnostic universal learning for ERM and provides clear guidelines for statistical learning, model complexity control, and the design of robust empirical risk minimization systems.
