Agnostic Universal Rates of ERM

Updated 30 June 2025
  • The paper establishes a trichotomy: exponential decay for finite H, super-root for infinite H with finite VC-dimension, and arbitrarily slow rates for infinite VC-dimension.
  • It quantifies error decay purely based on hypothesis space size and VC-dimension, offering practical insights for optimizing empirical risk minimization.
  • These universal rates refine classical PAC analysis by demonstrating how ERM learning curves adapt to inherent agnostic noise and model misspecification.

Agnostic universal rates of Empirical Risk Minimization (ERM) refer to the fundamental speed at which the expected excess classification risk of ERM decays with sample size for all possible data distributions, without assuming that the hypothesis class contains the true labeling function. Recent work, notably "Universal Rates of ERM for Agnostic Learning" (Hanneke & Xu, 2025), delivers a precise trichotomy of possible universal rates for ERM in the agnostic setting. These results clarify the behavior and limitations of ERM-based learning curves for broad concept classes and fixed target distributions, providing a complete combinatorial characterization and several structural refinements for finer, target-dependent rates.

1. Agnostic Universal Rate Trichotomy

ERM in the agnostic binary classification setting outputs a classifier $h$ by minimizing the sample mean of the 0-1 error with respect to a class $H$. The universal agnostic rate is defined via the decay rate of the expected excess risk:

$$\mathbb{E}\left[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h' \in H} \mathrm{er}(h')\right]$$

as a function of the sample size $n$, for every data distribution $P$. The main result establishes a classification into exactly three rate regimes:

Universal Rate Characterization
  • Exponential, $e^{-n}$: $|H| < \infty$
  • Super-root, $o(n^{-1/2})$: $|H| = \infty$ and $\mathrm{VC}(H) < \infty$
  • Arbitrarily slow: $\mathrm{VC}(H) = \infty$
  • If $H$ is finite, ERM achieves an exponentially fast rate in $n$ for every distribution.
  • If $H$ is infinite but has finite VC-dimension, ERM achieves a super-root rate: for every distribution $P$, the expected excess risk converges faster than $n^{-1/2}$ (and strictly slower than any exponential).
  • If $H$ has infinite VC-dimension, there exist distributions for which, under ERM, the expected excess risk decays slower than any prescribed positive function tending to zero.

2. Formal Definitions and Mathematical Formulations

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. samples from $P$. ERM selects

$$h_{\mathrm{ERM}} \in \operatorname*{arg\,min}_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[h(X_i) \neq Y_i\right].$$

The universal learning curve is the function $n \mapsto \mathbb{E}\left[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h' \in H} \mathrm{er}(h')\right]$, where the expectation is with respect to the sample, and "universal" means the rate holds for every fixed (but arbitrary) distribution $P$.
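The ERM rule can be made concrete with a short simulation. The sketch below runs ERM over a small finite class of threshold classifiers on synthetic agnostic data; the specific class, the 20% label-noise level, and the uniform marginal are illustrative assumptions, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite class H: threshold classifiers h_t(x) = 1[x >= t].
thresholds = np.linspace(0.0, 1.0, 11)  # |H| = 11, so H is finite

def empirical_risk(t, x, y):
    """Sample mean of the 0-1 loss of h_t on the data (x, y)."""
    return np.mean((x >= t).astype(int) != y)

def erm(x, y):
    """ERM rule: pick the h in H minimizing the empirical 0-1 risk."""
    risks = [empirical_risk(t, x, y) for t in thresholds]
    return thresholds[int(np.argmin(risks))]

# Agnostic data: the best-in-class threshold is 0.5, but labels are
# flipped with probability 0.2, so no h in H attains zero error.
n = 2000
x = rng.uniform(0.0, 1.0, n)
y_clean = (x >= 0.5).astype(int)
flips = rng.uniform(0.0, 1.0, n) < 0.2
y = np.where(flips, 1 - y_clean, y_clean)

t_hat = erm(x, y)
print(t_hat)  # typically at (or adjacent to) the best-in-class threshold 0.5
```

Even though the minimum achievable risk here is 0.2 rather than 0, ERM still concentrates on the best-in-class threshold, which is exactly the excess-risk quantity the universal rates measure.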

The trichotomy states:

  • If $|H| < \infty$, then for any $P$:

$$\mathbb{E}\left[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h' \in H} \mathrm{er}(h')\right] \le C e^{-cn}$$

for constants $C, c > 0$ depending on $H$ and $P$.

  • If $|H| = \infty$ and $\mathrm{VC}(H) < \infty$:

$$\mathbb{E}\left[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h' \in H} \mathrm{er}(h')\right] = o\left(n^{-1/2}\right)$$

(i.e., for every $P$, $\sqrt{n}\,\mathbb{E}\left[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h' \in H} \mathrm{er}(h')\right] \to 0$ as $n \to \infty$).

  • If $\mathrm{VC}(H) = \infty$, then for any pre-specified rate function $R(n) \to 0$, there is a distribution $P$ such that

$$\mathbb{E}\left[\mathrm{er}(h_{\mathrm{ERM}}) - \inf_{h' \in H} \mathrm{er}(h')\right] \ge R(n)$$

for infinitely many $n$.
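The middle regime can be probed numerically. Thresholds over the reals form an infinite class of VC dimension 1, so the trichotomy predicts an $o(n^{-1/2})$ excess-risk rate; the Monte Carlo sketch below estimates $\sqrt{n}$ times the mean excess risk at increasing $n$ and checks that it shrinks. The uniform marginal and 20% label noise are illustrative assumptions, and a finite simulation can only suggest, never prove, an asymptotic rate.

```python
import numpy as np

rng = np.random.default_rng(1)

def erm_threshold(x, y):
    """Exact ERM over all thresholds h_t(x) = 1[x >= t], via prefix sums."""
    order = np.argsort(x)
    ys = y[order]
    n = len(ys)
    ones_before = np.concatenate(([0], np.cumsum(ys)))  # count of y=1 in first k
    # errors if we predict 0 on the first k sorted points and 1 on the rest
    errs = ones_before + (n - np.arange(n + 1)) - (ones_before[-1] - ones_before)
    k = int(np.argmin(errs))
    xs = x[order]
    if k == 0:
        return -np.inf
    if k == n:
        return np.inf
    return 0.5 * (xs[k - 1] + xs[k])

def excess_risk(t):
    """True excess risk: uniform X on [0,1], 20% noise, best threshold 0.5."""
    return 0.6 * abs(np.clip(t, 0.0, 1.0) - 0.5)

def mean_excess(n, reps=200):
    total = 0.0
    for _ in range(reps):
        x = rng.uniform(0.0, 1.0, n)
        y = (x >= 0.5).astype(int)
        flips = rng.uniform(0.0, 1.0, n) < 0.2
        y = np.where(flips, 1 - y, y)
        total += excess_risk(erm_threshold(x, y))
    return total / reps

scaled = [np.sqrt(n) * mean_excess(n) for n in (100, 400, 1600)]
print(scaled)  # a shrinking sequence is consistent with an o(n^{-1/2}) rate
```

That the scaled sequence drifts toward zero rather than plateauing is what distinguishes the super-root regime from a plain $\Theta(n^{-1/2})$ uniform-PAC rate.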

3. Combinatorial Characterization

The rate regime is dictated entirely by two class properties:

  • Finite hypothesis space: Implies exponential rate.
  • Infinite but VC-finite: Implies super-root rate, with the exact speed always strictly better than $n^{-1/2}$ (as in classical uniform PAC rates), but never exponential.
  • Infinite VC-dimension: Allows for distributions where the rate cannot be bounded by any prescribed function—there is no universal guarantee on the speed of error decay.
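Since the rate regime is governed entirely by finiteness and VC dimension, it helps to recall the VC definition operationally. The brute-force check below computes the VC dimension of two small classes over a finite domain; these example classes (thresholds and intervals on $\{0, \ldots, 9\}$) are illustrative assumptions, not examples taken from the paper.

```python
from itertools import combinations

def shatters(hypotheses, points):
    """True iff every binary labeling of `points` is realized by some h."""
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)

def vc_dimension(hypotheses, domain, max_d=5):
    """Largest d such that some d-point subset of `domain` is shattered."""
    d = 0
    for k in range(1, max_d + 1):
        if any(shatters(hypotheses, s) for s in combinations(domain, k)):
            d = k
        else:
            break  # shattering a k-set requires shattering all its subsets
    return d

domain = range(10)

# Thresholds 1[x >= t]: monotone predictions, so no pair is shattered.
thresholds = [lambda x, t=t: int(x >= t) for t in range(11)]

# Intervals 1[a <= x <= b]: pairs are shattered, triples never (1,0,1 fails).
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(10) for b in range(a, 10)]

print(vc_dimension(thresholds, domain))  # 1
print(vc_dimension(intervals, domain))   # 2
```

Both classes land in the finite-VC regime; by the trichotomy, their infinite (continuous-parameter) analogues would give ERM a super-root universal rate.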

This classification is sharp and exhaustive for all hypothesis classes $H$.

4. Target- and Bayes-Dependent Universal Rates

The paper supplies further subdivisions into target-dependent and Bayes-dependent universal rates, refining the analysis to the structure around the best-in-class or Bayes-optimal function.

  • Target-dependent trichotomy: For a given "target" $h^* \in H$, the achievable rate over distributions for which $h^*$ is best-in-class is determined by how the "near-best" competitors of $h^*$ are organized (specifically, whether their set has finite VC dimension or not).
  • Bayes-dependent trichotomy: Considering the true Bayes classifier $f^*$, the rates depend on whether there exists an infinite (VC-)eluder sequence centered at $f^*$: a set witnessing sustained disagreement among functions in $H$.

These refinements provide a more finely grained view of distribution-dependent universal rates and clarify when ERM may adapt to "easy" or "hard" distributions for a fixed class $H$.

5. Contrast to the Realizable Universal Rates

In the realizable case (where $P$ is consistent with some $h^* \in H$, so the best-in-class error is zero), "Universal Rates of Empirical Risk Minimization" (Hanneke et al., 2024) establishes a tetrachotomy (exponential, $1/n$, $\log(n)/n$, arbitrarily slow), with a richer spectrum of rates possible due to the absence of agnostic noise. The agnostic trichotomy is more compact: in the non-realizable case, the presence of unavoidable noise or model misspecification rules out the intermediate $1/n$-type rates for ERM; unless $H$ is finite, nothing faster than a super-root rate can be guaranteed. This confirms that agnostic learning is strictly harder: fast rates require both limited class complexity and zero noise.
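The gap between the two settings can also be sampled directly. Reusing an exact threshold ERM, the sketch below compares mean excess risk at a fixed sample size with zero label noise (realizable) and with 20% label noise (agnostic); the distribution and noise level are illustrative assumptions, and the comparison only illustrates the qualitative point that noise slows ERM down, not the precise rates.

```python
import numpy as np

rng = np.random.default_rng(2)

def erm_threshold(x, y):
    """Exact ERM over thresholds 1[x >= t], via prefix sums."""
    order = np.argsort(x)
    ys = y[order]
    n = len(ys)
    ones_before = np.concatenate(([0], np.cumsum(ys)))
    # errors if we predict 0 on the first k sorted points and 1 on the rest
    errs = ones_before + (n - np.arange(n + 1)) - (ones_before[-1] - ones_before)
    k = int(np.argmin(errs))
    xs = x[order]
    if k == 0:
        return 0.0
    if k == n:
        return 1.0
    return 0.5 * (xs[k - 1] + xs[k])

def mean_excess(n, noise, reps=400):
    """Mean excess risk of ERM; uniform X on [0,1], true threshold 0.5."""
    gap = 1.0 - 2.0 * noise  # excess risk per unit of threshold error
    total = 0.0
    for _ in range(reps):
        x = rng.uniform(0.0, 1.0, n)
        y = (x >= 0.5).astype(int)
        flips = rng.uniform(0.0, 1.0, n) < noise
        y = np.where(flips, 1 - y, y)
        total += gap * abs(np.clip(erm_threshold(x, y), 0.0, 1.0) - 0.5)
    return total / reps

realizable = mean_excess(200, noise=0.0)
agnostic = mean_excess(200, noise=0.2)
print(realizable, agnostic)  # noise inflates the excess risk at the same n
```

At the same sample size, the noisy runs pay a visibly larger excess risk, mirroring the collapse of the realizable fast-rate regimes in the agnostic trichotomy.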

6. Implications and Applications

  • Practical interpretation: For any hypothesis class $H$ in applied classification, the agnostic universal learning rate achieved by ERM is dictated purely by the finiteness of $H$ and its VC dimension. Thus, practitioners using ERM should expect, for general distributions:
    • Exponential error decay if the class is finite (rare in practice).
    • Strictly super-root convergence (faster than $n^{-1/2}$, but not exponential) if $H$ is infinite with finite VC dimension.
    • No universal guarantees if $H$ has infinite VC-dimension.
  • Model selection: These trichotomy results justify the central importance of controlling class complexity: not just for worst-case PAC theory, but for universal, every-distribution learning-curve guarantees.
  • Theory: These results provide definitive negative evidence against any nontrivial universal excess-risk rate for ERM in agnostic learning once the VC dimension is infinite, regardless of the particular infinite-VC class chosen.
  • Algorithm design: For improper learners or data-dependent methods, faster rates may sometimes be achievable in special noise/margin or target-adaptive regimes, but ERM is fundamentally limited according to this trichotomy.

Summary Table: Agnostic Universal Rates for ERM

  • Exponential, $e^{-n}$: $|H| < \infty$ (example: finite threshold/interval classes)
  • Super-root, $o(n^{-1/2})$: $|H| = \infty$ with $\mathrm{VC}(H) < \infty$ (example: halfspaces, finite-degree polynomials)
  • Arbitrarily slow: $\mathrm{VC}(H) = \infty$ (example: unrestricted indicator classes)

References to Key Results

  • The main trichotomy theorem: Section 3, "Agnostic universal rates for ERM, target-independent case."
  • Target-dependent and Bayes-dependent trichotomies: Section 4, Theorem 2 (target-dependent), Theorem 3 (Bayes-dependent).
  • Proofs, combinatorial definitions, and concrete class examples: throughout Sections 3–5; for formal statements and detailed definitions of (VC-)eluder sequences centered at the target or the Bayes classifier, see the supplemental definitions.

Conclusion:

The work establishes a precise trichotomy for agnostic universal rates of ERM: exponential, strictly super-root (faster than $n^{-1/2}$ but not exponential), or arbitrarily slow, fully characterized by standard class complexity (finiteness of $H$, VC-dimension) and new structural properties of the neighborhood of optimal classifiers. This definitively resolves the structure of agnostic universal learning under ERM and provides clear guidelines for statistical learning, model-complexity control, and the design of robust empirical risk minimization systems.
