
Empirical Risk Minimization Explained

Updated 12 July 2025
  • Empirical Risk Minimization (ERM) is a principle that selects hypotheses by minimizing the average loss over observed data.
  • Recent research shows that ERM’s universal error decay falls into one of four regimes—exponential, linear, logarithmically slowed linear, or arbitrarily slow—based on the complexity of the hypothesis class.
  • New combinatorial measures like the star-eluder and VC-eluder dimensions provide sharp bounds on learning rates, guiding when to expect rapid convergence or inherent performance limitations.

Empirical risk minimization (ERM) is a foundational principle in statistical learning theory in which one selects a hypothesis from a class by minimizing the average (empirical) loss over observed data. While ERM has provided the basis for many widely used machine learning algorithms and underpins classical PAC theory, a precise understanding of the rates at which ERM's learning curves decay with increasing sample size in the universal (distribution-dependent) sense has remained an open area. Recent research shows that, even in the realizable case (i.e., there is a target function with zero error), the universal rate of error decay for ERM exhibits a fundamental tetrachotomy: only four decay regimes are possible—exponential, linear, logarithmically slowed linear, or arbitrarily slow rates. These regimes are characterized by novel combinatorial structures (eluder, star-eluder, and VC-eluder sequences), and the introduction of new dimensions (particularly the star-eluder and VC-eluder dimensions) enables sharp asymptotic bounds whenever feasible.

1. Tetrachotomy of Universal ERM Rates

ERM's universal learning rate describes the decay of expected error as $n \to \infty$ for each fixed realizable distribution (one under which some hypothesis has zero risk), with distribution-dependent constants. This rate falls into exactly one of four distinct regimes, depending on the structure of the hypothesis (concept) class $\mathcal{H}$:

  1. Exponential rate ($\exp(-n)$): If and only if $\mathcal{H}$ is finite, ERM achieves error that decays exponentially fast with the sample size $n$. For finite classes, a uniform union bound yields a generalization error of order $|\mathcal{H}| e^{-n\gamma}$ for some constant $\gamma > 0$ (a union-bound sketch appears after this list).
  2. Linear rate ($1/n$): If $\mathcal{H}$ is infinite but does not admit an infinite star-eluder sequence (defined below), ERM achieves a linear $1/n$ error decay.
  3. Logarithmically slowed linear rate ($(\log n)/n$): If $\mathcal{H}$ has an infinite star-eluder sequence but no infinite VC-eluder sequence (so $\mathcal{H}$ has finite VC dimension but is more complex than parametric classes), the universal rate slows to $(\log n)/n$.
  4. Arbitrarily slow rate: If $\mathcal{H}$ admits an infinite VC-eluder sequence (i.e., has infinite VC dimension), then the universal ERM rate can be as slow as any function $R(n) \to 0$; no uniform polynomial or exponential rate can be assured.

Formally, for any nontrivial concept class, ERM's universal rate is exactly one of these four, as stated in Theorems 1 and 2 of the cited work (2412.02810).
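
To see where the exponential regime comes from, here is a standard union-bound sketch (a textbook-style argument, not quoted from the paper; $\hat{h}_n$ denotes any empirical risk minimizer and $\mathrm{er}_P$ the error under the fixed realizable distribution $P$):

```latex
% Finite \mathcal{H}, fixed realizable distribution P.
% Let \gamma := \min\{\mathrm{er}_P(h) : h \in \mathcal{H},\ \mathrm{er}_P(h) > 0\} > 0.
% Each h with \mathrm{er}_P(h) \ge \gamma is consistent with n i.i.d. samples
% with probability at most (1-\gamma)^n, so by a union bound
\Pr\big[\mathrm{er}_P(\hat{h}_n) > 0\big]
  \;\le\; |\mathcal{H}|\,(1-\gamma)^n \;\le\; |\mathcal{H}|\, e^{-\gamma n},
% and, since \mathrm{er}_P(\hat{h}_n) \le 1,
\mathbb{E}\big[\mathrm{er}_P(\hat{h}_n)\big] \;\le\; |\mathcal{H}|\, e^{-\gamma n}.
```

Note that $\gamma$ depends on the distribution, which is exactly why this is a universal (distribution-dependent) rate rather than a uniform one.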

2. Combinatorial Characterization of Concept Classes

The position of a concept class $\mathcal{H}$ in the tetrachotomy relies on new combinatorial structures that extend classical complexity notions like VC dimension. The paper introduces three types of sequences:

  • Eluder sequences: A sequence $(x_1, y_1), (x_2, y_2), \ldots$ such that for each $k$, some $h \in \mathcal{H}$ matches all previous examples but errs on the $k$th point. The existence of an infinite eluder sequence rules out the exponential regime (a concrete instance is sketched after this list).
  • Star-eluder sequences: Eluder sequences structured in blocks, each forming a star set in the version space, i.e., for each point in the block there is a hypothesis differing only at that point.
  • VC-eluder sequences: Chains of blocks, each not only star sets but also shattered by the version space, capturing the richness required for arbitrarily slow rates.
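
As a concrete instance, the class of singleton indicators over an infinite domain admits an infinite eluder sequence. The following minimal Python sketch (illustrative only, not from the paper) verifies the defining property on the first ten points, with the all-zeros function as the target concept:

```python
# Hypothesis class: singletons h_a(x) = 1 if x == a else 0, one per domain point.
# Target concept: the all-zeros function, so every label in the sequence is 0.
def h(a):
    return lambda x: int(x == a)

points = [(k, 0) for k in range(10)]  # candidate eluder sequence (x_k, y_k)

for k in range(10):
    prev, (xk, yk) = points[:k], points[k]
    witness = h(xk)
    # The witness agrees with all earlier examples (predicts 0 there) ...
    assert all(witness(x) == y for x, y in prev)
    # ... yet errs on the k-th point (predicts 1, true label 0).
    assert witness(xk) != yk

print("first 10 points form an eluder sequence for the singleton class")
```

Over an infinite domain the same construction extends indefinitely, so the infinite singleton class cannot lie in the exponential regime.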

The presence or absence of these infinite sequences partitions $\mathcal{H}$ as follows:

| Infinite Eluder? | Infinite Star-Eluder? | Infinite VC-Eluder? | Universal ERM Rate |
| --- | --- | --- | --- |
| No | No | No | $\exp(-n)$ (finite case) |
| Yes | No | No | $1/n$ |
| Yes | Yes | No | $(\log n)/n$ |
| Yes | Yes | Yes | Arbitrarily slow |

A class with an infinite VC-eluder sequence is exactly one with infinite VC dimension, meaning ERM cannot guarantee any fixed rate of improvement.

3. Complexity Dimensions and Sharp Constants

In order to precisely quantify the learning rates, the paper introduces two new dimensions:

  • Star-Eluder Dimension (SE): Measures the maximal size of a block in a star-eluder sequence, directly influencing the constant in $(\log n)/n$ rates. When SE is finite and the VC-eluder dimension is also finite, ERM's error is bounded, up to constants, as:

$$\mathbb{E}[\mathrm{er}] \leq \frac{\mathrm{const} \cdot \mathrm{VCE} \cdot \log n}{n}$$

  • VC-Eluder Dimension (VCE): Captures the maximal length of a VC-eluder chain. If VCE is finite, the above linear or slowed rate applies with sharp constants; if infinite, arbitrarily slow rates result.

This refinement enables the removal of loose logarithmic factors known from classical PAC/VC theory and, in favorable cases, yields sharp asymptotic prefactors.
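
To make the separation between regimes concrete, here is a quick numeric comparison of the three nontrivial decay curves (illustrative only; constants and distribution-dependent factors are suppressed):

```python
import math

# Compare the idealized decay curves from the tetrachotomy (constants dropped).
for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(f"n={n:>9,}  exp(-n)={math.exp(-n):.1e}  "
          f"1/n={1/n:.1e}  log(n)/n={math.log(n)/n:.1e}")
```

Even at $n = 10^6$ the logarithmic slowdown costs roughly a factor of $\log(10^6) \approx 14$ over the linear regime, which is the kind of loose factor the sharper dimension-based bounds eliminate when it is not warranted.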

4. Regime Examples and Implications for Algorithm Design

  • Finite classes: Standard multiclass classification with $K$ labels, $K < \infty$, is in the exponential regime.
  • Parametric models (e.g., threshold functions, intervals): These typically lack infinite star-eluder sequences, so ERM achieves the "fast" $1/n$ rate (see the simulation sketch after this list).
  • Rich structured classes (finite VC dimension, but admitting "star" sequences): Examples include classes of unions of intervals or polygons, with the error rate slowing to $(\log n)/n$.
  • Infinite VC dimension (e.g., the class of all binary functions on an infinite domain): ERM may learn arbitrarily slowly; in such problems, no polynomial or exponential guarantee can be obtained.
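
The following simulation sketch contrasts the first two regimes on threshold classifiers $h_t(x) = \mathbf{1}[x \ge t]$, with $X$ uniform on $[0,1]$ and target threshold $0.5$ (an illustrative construction, not from the paper; "worst-case ERM" returns the consistent hypothesis with the largest true error):

```python
import numpy as np

rng = np.random.default_rng(0)
T_STAR = 0.5                       # target threshold; labels y = 1[x >= 0.5]
GRID = np.linspace(0.0, 1.0, 11)   # finite class of 11 thresholds, contains T_STAR

def worst_erm_error_finite(xs, ys, grid):
    """True error of the worst consistent threshold in a finite class."""
    # Under uniform X, the error of h_t is |t - T_STAR|.
    errs = [abs(t - T_STAR) for t in grid
            if np.array_equal((xs >= t).astype(int), ys)]
    return max(errs)               # grid contains T_STAR, so errs is nonempty

def worst_erm_error_all(xs, ys):
    """Worst consistent threshold when every t in [0, 1] is allowed."""
    lo = xs[ys == 0].max(initial=0.0)     # largest negatively labeled point
    hi = xs[ys == 1].min(initial=1.0)     # smallest positively labeled point
    return max(T_STAR - lo, hi - T_STAR)  # consistent thresholds fill (lo, hi]

def mean_error(n, erm, trials=2000):
    total = 0.0
    for _ in range(trials):
        xs = rng.random(n)
        ys = (xs >= T_STAR).astype(int)
        total += erm(xs, ys)
    return total / trials

for n in (10, 40, 160, 640):
    fin = mean_error(n, lambda xs, ys: worst_erm_error_finite(xs, ys, GRID))
    inf = mean_error(n, worst_erm_error_all)
    print(f"n={n:4d}  finite class: {fin:.5f}   all thresholds: {inf:.5f}")
```

With the finite class, the worst consistent hypothesis is eliminated at an exponential rate, while over all thresholds the residual error shrinks only like $1/n$, matching regimes 1 and 2 above.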

For practitioners, this tetrachotomy means that for certain classes, even when the problem is realizable and optimization is exact, ERM cannot guarantee learning faster than the rate determined by the class's position in the hierarchy defined by these combinatorial dimensions.

5. Separation from Optimal Universal Rates and Limitations of ERM

The results highlight a gap between the best possible (sometimes improper) universal learning rates and those achieved specifically by ERM. For some classes, ERM is universally suboptimal: other procedures may achieve exponentially faster convergence, even when ERM's best possible rate is only $(\log n)/n$. This explains observed discrepancies in learning-curve behavior between different algorithms and underscores the importance of understanding the combinatorial structure underlying $\mathcal{H}$.

The refined analysis provided by the star-eluder and VC-eluder dimensions also suggests that improvements to ERM may be possible when these structures are absent, while in other cases, alternative algorithms or regularization may be necessary.

6. Broader Impact and Theoretical Significance

By rigorously characterizing all possible universal rates for ERM using novel combinatorial structures, these results clarify the fundamental limits of ERM-based learning in the realizable setting. This guides both the theoretical analysis of learning curves and the practical design of learning algorithms, showing when ERM's performance is inherently limited by class complexity, and when it can be expected to attain rapid error decay. The analysis also connects to and refines classical results in distribution-free learning, suggesting further areas of exploration for improved or alternative learning procedures.

Summary Table: Universal ERM Rates and Complexity Structures

| Rate Regime | Property of $\mathcal{H}$ | Characterization |
| --- | --- | --- |
| $\exp(-n)$ | Finite class | No infinite eluder sequence |
| $1/n$ | Infinite, but no infinite star-eluder sequence | Infinite eluder sequence; no infinite star-eluder sequence |
| $(\log n)/n$ | Infinite star-eluder, but no infinite VC-eluder sequence | Infinite star-eluder sequence; no infinite VC-eluder sequence (finite VC dimension) |
| Arbitrarily slow | Infinite VC-eluder sequence (infinite VC dimension) | No polynomial or exponential rate can be guaranteed |
References

  1. arXiv:2412.02810