
Empirical Risk Minimization: Rate Tetrachotomy

Updated 13 November 2025
  • Empirical Risk Minimization (ERM) is a fundamental principle that fits predictive models by minimizing average loss over observed data, encompassing both parametric and nonparametric methods.
  • The universal rate tetrachotomy classifies ERM learning curves into four regimes—exponential, linear, almost-linear, and arbitrarily slow—based on combinatorial properties like VC and star-eluder dimensions.
  • Refined combinatorial dimensions such as eluder, star-eluder, and VC-eluder provide sharp bounds that quantify how hypothesis class complexity governs the decay of prediction error.

Empirical Risk Minimization (ERM) is a foundational principle in statistical learning theory and modern machine learning, providing the central framework for fitting a predictive model to data by minimizing the average loss over observed samples. Formally, given a sample of $n$ i.i.d. data points $(x_1, y_1), \dots, (x_n, y_n)$ and a hypothesis class $\mathcal{F}$, ERM selects $\hat{f}_n \in \mathcal{F}$ to minimize the empirical risk $\frac{1}{n}\sum_{i=1}^n \ell(\hat{f}_n(x_i), y_i)$. The principle underlies both classical parametric estimation and nonparametric function fitting, encompassing regression, classification, and diverse extensions. Its theoretical properties, convergence rates, and the impact of hypothesis-class complexity, data dependencies, loss structure, and algorithmic constraints are subjects of ongoing research. ERM’s performance is now characterized by sharp rate theorems connecting statistical error decay to combinatorial, geometric, and probabilistic properties of the hypothesis class and the data-generating mechanism.
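
As a minimal illustration of this definition, the sketch below implements ERM with the zero-one loss over a hypothetical finite class of threshold classifiers; the class, the data distribution, and the grid of candidate thresholds are illustrative choices rather than part of the formal setup.

```python
# Minimal ERM sketch with 0-1 loss over a small threshold class.
# The class, data distribution, and threshold grid are illustrative choices.
import numpy as np

def erm(xs, ys, thresholds):
    """Return the threshold minimizing empirical 0-1 risk, and that risk."""
    best_t, best_risk = None, float("inf")
    for t in thresholds:
        preds = (xs >= t).astype(int)
        risk = np.mean(preds != ys)        # empirical risk: (1/n) * sum of 0-1 losses
        if risk < best_risk:
            best_t, best_risk = t, risk
    return best_t, best_risk

# Toy realizable data: true threshold 0.5, so y = 1[x >= 0.5].
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=200)
ys = (xs >= 0.5).astype(int)
t_hat, risk_hat = erm(xs, ys, thresholds=np.linspace(0.0, 1.0, 101))
print(f"ERM threshold ~ {t_hat:.2f}, empirical risk = {risk_hat:.3f}")
```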

1. Universal Rate Tetrachotomy for ERM

The realizable-case error decay for ERM is governed by a four-way classification ("tetrachotomy"), as proved in (Hanneke et al., 3 Dec 2024). Given a binary concept class $H \subseteq \{0,1\}^{\mathcal{X}}$ and realizable distributions ($P$ such that there exists $h^* \in H$ with $y = h^*(x)$ a.s.), the possible exact rates for the expected error $\operatorname{er}(n) := \mathbb{E}_{S_n \sim P^n}\left[P\{x : h_n(x) \ne h^*(x)\}\right]$ of any ERM output $h_n$ are:

| Rate | Combinatorial condition | Examples |
|---|---|---|
| $e^{-n}$ | $\lvert H\rvert < \infty$ | Finite classes, constant functions |
| $1/n$ | Infinite $H$, no infinite star-eluder sequence | Threshold classifiers on $\mathbb{N}$ |
| $(\log n)/n$ | Infinite star-eluder sequence, $\mathrm{VC}(H) < \infty$ | Singleton classifiers, 2D halfspaces |
| Arbitrarily slow | $\mathrm{VC}(H) = \infty$ | Highly symmetric block classes |

This “tetrachotomy theorem” states that ERM achieves exponential error decay iff $H$ is finite; a linear rate iff $H$ is infinite but admits no infinite star-eluder sequence; an almost-linear rate iff infinite star-eluder sequences exist but the VC dimension is finite; and arbitrarily slow rates otherwise. These regimes are characterized by new combinatorial dimensions (eluder, star-eluder, VC-eluder), which provide sharp, fine-grained constant-factor bounds.
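
To make the first row of the table concrete, the following hedged Monte Carlo sketch simulates the exponential regime for a two-element class whose hypotheses disagree on a region of probability $p$: a worst-case ERM (breaking ties against the target) errs exactly when no training point lands in the disagreement region, so $\operatorname{er}(n) = p(1-p)^n$. The value of $p$, the distribution, and the tie-breaking rule are illustrative assumptions.

```python
# Hedged Monte Carlo sketch: exponential decay of er(n) for a finite class.
# H = {h_star, h_alt}; they disagree exactly on x in [0, p) under Uniform[0,1].
# A worst-case ERM breaks ties in favor of h_alt, so it errs iff no training
# point falls in the disagreement region, giving er(n) = p * (1 - p)**n.
import numpy as np

p = 0.1                                     # mass of the disagreement region (assumption)
rng = np.random.default_rng(1)

def er_hat(n, trials=20000):
    xs = rng.uniform(0.0, 1.0, size=(trials, n))
    no_hit = (xs >= p).all(axis=1)          # no sample in the disagreement region
    # error of the tie-broken ERM: p when h_alt is chosen, 0 otherwise
    return np.mean(np.where(no_hit, p, 0.0))

for n in (5, 10, 20, 40):
    print(n, er_hat(n), p * (1 - p) ** n)   # empirical estimate vs closed form
```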

2. Combinatorial Structures Governing ERM Rates

Three structural notions are critical:

  • Eluder sequence and dimension $\mathrm{E}(H)$: a sequence $((x_i, y_i))_i$ such that for each $k$, some $h_k \in H$ fits all previous labels $y_1, \dots, y_{k-1}$ but disagrees with $y_k$ at $x_k$. $\mathrm{E}(H) < \infty$ iff $H$ is finite (a brute-force sketch appears at the end of this section).
  • Star-eluder sequence and dimension $\mathrm{SE}(H)$: for a target $h^*$, blocks of data points form “star sets” in which every point can be flipped by some $h$ in the version space. An infinite star-eluder sequence implies the sub-linear regime; $\mathrm{SE}(H) < \infty$ refines the bounds.
  • VC-eluder sequence and dimension $\mathrm{VCE}(H)$: a block structure requiring shattering; an infinite VC-eluder sequence is equivalent to $\mathrm{VC}(H) = \infty$. Whenever $\mathrm{VCE}(H) < \infty$, for large $n$:

$$\alpha \cdot \frac{\mathrm{VCE}(H)}{n} \;\leq\; \sup_P \mathbb{E}[\operatorname{er}(n)] \;\leq\; \beta \cdot \frac{\mathrm{VCE}(H)\,\log n}{n} + o(1)$$

with sharp constants.

These dimensions quantify the minimal combinatorial obstacles to faster learning. Classical VC theory is of particular relevance in the $(\log n)/n$ and arbitrarily slow regimes.
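
The eluder-sequence definition in the first bullet above can be explored by brute force on toy classes. The sketch below (the domain, the class encoding, and the recursive search are illustrative choices, not code from the paper) finds the longest sequence in which each new point can be "eluded" by a hypothesis that fits all earlier labels; for a finite class such as thresholds on five points the search terminates, consistent with $\mathrm{E}(H) < \infty$ for finite $H$.

```python
# Brute-force sketch of the eluder-sequence length for a tiny class, following
# the informal definition in the bullet above (the domain, class, and search
# strategy are illustrative choices).
from itertools import product

DOMAIN = range(5)
# H = threshold classifiers 1[x >= t] on {0,...,4}, encoded as label tuples.
H = [tuple(int(x >= t) for x in DOMAIN) for t in range(6)]

def extendable(prefix):
    """Eluder condition for the newest pair: some h fits all earlier labels
    but disagrees at the newest point."""
    *earlier, (xk, yk) = prefix
    return any(all(h[x] == y for x, y in earlier) and h[xk] != yk for h in H)

def longest_eluder(prefix=()):
    best = len(prefix)
    for x, y in product(DOMAIN, (0, 1)):
        cand = prefix + ((x, y),)
        if extendable(cand):
            best = max(best, longest_eluder(cand))
    return best

print(longest_eluder())   # finite, since H is finite
```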

3. ERM Rate Regimes: Canonical Examples

Specific example classes display prototypical rates:

  • Finite classes ($\lvert H\rvert < \infty$): exponential convergence $e^{-n}$.
  • Infinite threshold classes (no infinite star-eluder sequence): linear $1/n$ rate; e.g., for $H = \{h_t(x) = 1[x \geq t]\}$, the worst-case ERM rule yields only $1/n$ decay (see the simulation sketch below).
  • Singleton functions ($\mathrm{VC} = 1$, infinite star-eluder sequence): $H = \{h_t(x) = 1[x = t]\}$ achieves the $(\log n)/n$ rate; no faster universal guarantee is possible for ERM.
  • Classes with infinite VC dimension: artificial unions of large, symmetric blocks ($\mathrm{VCE}(H) = \infty$) admit arbitrarily slow ERM convergence, even in cases where optimal (possibly improper) learners attain faster universal rates.

These examples show that the VC dimension alone does not yield a sharp rate separation: the star-eluder and VC-eluder properties determine the precise cutoff between the linear and almost-linear ($(\log n)/n$) regimes.
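
As a simulation sketch of the threshold example above: under the uniform distribution on $[0,1]$ with target threshold $0.5$, an adversarially tie-broken ERM returns the consistent threshold farthest from the target, and its error, the larger of the two empirical gaps around $0.5$, decays on the order of $1/n$. The distribution, the target, and the tie-breaking rule are illustrative assumptions.

```python
# Hedged simulation sketch: worst-case ERM over threshold classifiers
# H = {1[x >= t]} on [0,1] with uniform data and target threshold 0.5.
# Among all empirical-risk minimizers, we take the consistent threshold
# farthest from the target; its error equals the larger gap around 0.5
# and scales like 1/n (the whole setup is illustrative).
import numpy as np

rng = np.random.default_rng(2)
t_star = 0.5

def worst_case_erm_error(n):
    xs = rng.uniform(0.0, 1.0, size=n)
    left = xs[xs < t_star]
    right = xs[xs >= t_star]
    gap_left = t_star - (left.max() if left.size else 0.0)
    gap_right = (right.min() if right.size else 1.0) - t_star
    return max(gap_left, gap_right)      # error of the adversarially chosen ERM

for n in (50, 100, 200, 400, 800):
    errs = [worst_case_erm_error(n) for _ in range(2000)]
    print(n, np.mean(errs), 1.0 / n)     # roughly a constant multiple of 1/n
```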

4. Comparison to Classical Uniform PAC Rates

Traditional PAC learning theory (uniform guarantees over all distributions) invokes the VC dimension:

  • $\mathrm{VC}(H) < \infty$ yields the uniform rate $O(\mathrm{VC}(H) \log n / n)$;
  • $\mathrm{VC}(H) = \infty$ precludes uniform consistency.

Recent universal learning models [Bousquet et al. 2021] reveal, for optimal (possibly improper) algorithms, a trichotomy: exponential ($e^{-n}$), linear ($1/n$), or arbitrarily slow. ERM introduces a fourth regime, $(\log n)/n$, because it is not always optimal: some classes admit optimal rates of $e^{-n}$, yet every ERM must pay a $1/n$ or $(\log n)/n$ penalty under the worst-case distribution. The star-eluder structure demarcates this suboptimality, refining both classical VC theory and universal learning theory.

In practical learning curves, ERM often converges much faster than the pessimistic $O(\mathrm{VC}(H) \log n / n)$ PAC bound suggests, but the combinatorial structure explains the persistence of sub-linear rates for certain classes and targets.
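
For a rough sense of how these curves compare, the snippet below tabulates the uniform PAC-style envelope $\mathrm{VC} \log n / n$ against the universal $1/n$ and $(\log n)/n$ shapes at a few sample sizes; all constants are set to 1 as placeholders rather than the sharp constants from the theory.

```python
# Illustrative comparison of rate shapes (constants set to 1 as placeholders,
# not the sharp constants from the theory).
import math

def pac_envelope(n, vc):       # uniform PAC-style shape: VC * log(n) / n
    return vc * math.log(n) / n

def linear_rate(n):            # universal linear regime: 1/n
    return 1.0 / n

def almost_linear_rate(n):     # universal almost-linear regime: log(n)/n
    return math.log(n) / n

for n in (10**2, 10**3, 10**4, 10**5):
    print(n, pac_envelope(n, vc=10), almost_linear_rate(n), linear_rate(n))
```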

5. Fine-Grained Bounds and Distribution-Dependent Behavior

The sharp constant-factor bounds give:

  • For classes with finite VC-eluder dimension,

$$\alpha \cdot \frac{\mathrm{VCE}(H)}{n} \;\leq\; \sup_P \mathbb{E}[\operatorname{er}(n)] \;\leq\; \beta \cdot \frac{\mathrm{VCE}(H)\,\log n}{n} + o(1).$$

  • If additionally $\mathrm{SE}(H) < \infty$, the upper bound improves to

$$\beta' \cdot \frac{\mathrm{VCE}(H)}{n} \cdot \log\!\left(\frac{\mathrm{SE}(H)}{\mathrm{VCE}(H)}\right).$$

  • No distribution-free constant exists when $\mathrm{VCE}(H) = \infty$.

These bounds quantify how combinatorial complexity governs excess-risk decay; the distribution and the target affect the constants but not the rate within each regime.
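
As a worked illustration (the values $\mathrm{VCE}(H) = 3$ and $\mathrm{SE}(H) = 48$ are hypothetical, chosen only to show the shape of the improvement), finiteness of the star-eluder dimension replaces the $\log n$ factor by a constant:

$$\beta \cdot \frac{3\,\log n}{n} \quad \text{versus} \quad \beta' \cdot \frac{3}{n}\,\log\!\left(\frac{48}{3}\right) = \beta' \cdot \frac{3\,\log 16}{n},$$

so with finite $\mathrm{SE}(H)$ the upper bound matches the $\mathrm{VCE}(H)/n$ lower bound up to a constant factor, whereas the generic bound retains a $\log n$ gap.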

6. Significance, Practical Implications, and Theoretical Limits

  • The four-regime classification fundamentally answers which learning curves are possible for ERM in the realizable case.
  • ERM is sometimes provably suboptimal compared to distribution-aware/improper algorithms.
  • The presence of infinite VC-eluder sequences marks classes for which no universal rate guarantee is possible and ERM convergence can be arbitrarily slow; for many practical classes, faster rates are achievable.
  • Real-world learning curves often improve on uniform PAC expectations, underscoring the importance of distribution-dependent and target-dependent combinatorial structure (eluder and star-eluder sequences).
  • These results enable precise rate calculations for concrete concept classes, guide expectations for algorithmic performance, and delimit the scope of ERM versus optimal procedures.

In summary, the universal rate tetrachotomy for ERM provides a combinatorial and statistical foundation for understanding the possible learning curves and their sharp boundaries, connecting ERM’s behavior to target-dependent, class-dependent, and distribution-dependent properties that refine and generalize classical VC theory and PAC learning principles.
