
Empirical Risk Minimization Explained

Updated 12 July 2025
  • Empirical Risk Minimization (ERM) is a principle that selects hypotheses by minimizing the average loss over observed data.
  • Recent research shows that ERM’s universal error decay falls into one of four regimes—exponential, linear, logarithmically slowed linear, or arbitrarily slow—based on the complexity of the hypothesis class.
  • New combinatorial measures like the star-eluder and VC-eluder dimensions provide sharp bounds on learning rates, guiding when to expect rapid convergence or inherent performance limitations.

Empirical risk minimization (ERM) is a foundational principle in statistical learning theory in which one selects a hypothesis from a class by minimizing the average (empirical) loss over observed data. While ERM has provided the basis for many widely used machine learning algorithms and underpins classical PAC theory, a precise understanding of the rates at which ERM's learning curves decay with increasing sample size in the universal (distribution-dependent) sense has remained an open area. Recent research shows that, even in the realizable case (i.e., there is a target function with zero error), the universal rate of error decay for ERM exhibits a fundamental tetrachotomy: only four decay regimes are possible—exponential, linear, logarithmically slowed linear, or arbitrarily slow rates. These regimes are characterized by novel combinatorial structures (eluder, star-eluder, and VC-eluder sequences), and the introduction of new dimensions (particularly the star-eluder and VC-eluder dimensions) enables sharp asymptotic bounds whenever feasible.

1. Tetrachotomy of Universal ERM Rates

ERM's universal learning rate describes the decay of expected error as $n \to \infty$ for each fixed realizable distribution (one under which some hypothesis has zero risk), with distribution-dependent constants. This rate falls into exactly one of four distinct regimes, depending on the structure of the hypothesis (concept) class $\mathcal{H}$:

  1. Exponential rate ($\exp(-n)$): If and only if $\mathcal{H}$ is finite, ERM achieves error that decays exponentially fast with the sample size $n$. For finite classes, a uniform union bound yields a generalization error of order $|\mathcal{H}| e^{-n\gamma}$ for some constant $\gamma > 0$ (a union-bound sketch appears after this list).
  2. Linear rate ($1/n$): If $\mathcal{H}$ is infinite but does not admit an infinite star-eluder sequence (defined below), ERM achieves a linear $1/n$ error decay.
  3. Logarithmically slowed linear rate ($(\log n)/n$): If $\mathcal{H}$ has an infinite star-eluder sequence but no infinite VC-eluder sequence (so $\mathcal{H}$ has finite VC dimension but is more complex than parametric classes), the universal rate slows to $(\log n)/n$.
  4. Arbitrarily slow rate: If $\mathcal{H}$ admits an infinite VC-eluder sequence (i.e., has infinite VC dimension), then the universal ERM rate can be as slow as any function $R(n) \to 0$; no uniform polynomial or exponential rate can be assured.

Formally, for any nontrivial concept class, ERM's universal rate is exactly one of these four, as stated in Theorems 1 and 2 of the cited work (2412.02810).
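
To see where the exponential regime comes from, here is a standard union-bound sketch (a textbook-style argument, not quoted from the paper; $\hat{h}_n$ denotes any empirical risk minimizer and $\mathrm{er}_P$ the error under the fixed realizable distribution $P$):

```latex
% Finite \mathcal{H}, fixed realizable distribution P.
% Let \gamma := \min\{\mathrm{er}_P(h) : h \in \mathcal{H},\ \mathrm{er}_P(h) > 0\} > 0.
% Each h with \mathrm{er}_P(h) \ge \gamma is consistent with n i.i.d. samples
% with probability at most (1-\gamma)^n, so by a union bound
\Pr\big[\mathrm{er}_P(\hat{h}_n) > 0\big]
  \;\le\; |\mathcal{H}|\,(1-\gamma)^n \;\le\; |\mathcal{H}|\, e^{-\gamma n},
% and, since \mathrm{er}_P(\hat{h}_n) \le 1,
\mathbb{E}\big[\mathrm{er}_P(\hat{h}_n)\big] \;\le\; |\mathcal{H}|\, e^{-\gamma n}.
```

Note that $\gamma$ depends on the distribution, which is exactly why this is a universal (distribution-dependent) rate rather than a uniform one.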

2. Combinatorial Characterization of Concept Classes

The position of a concept class $\mathcal{H}$ in the tetrachotomy relies on new combinatorial structures that extend classical complexity notions like VC dimension. The paper introduces three types of sequences:

  • Eluder sequences: A sequence $(x_1, y_1), (x_2, y_2), \ldots$ such that for each $k$, some $h \in \mathcal{H}$ matches all previous examples but errs on the $k$th point. The existence of an infinite eluder sequence rules out the exponential regime (a concrete instance is sketched after this list).
  • Star-eluder sequences: Eluder sequences structured in blocks, each forming a star set in the version space, i.e., for each point in the block there is a hypothesis differing only at that point.
  • VC-eluder sequences: Chains of blocks, each not only star sets but also shattered by the version space, capturing the richness required for arbitrarily slow rates.
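
As a concrete instance, the class of singleton indicators over an infinite domain admits an infinite eluder sequence. The following minimal Python sketch (illustrative only, not from the paper) verifies the defining property on the first ten points, with the all-zeros function as the target concept:

```python
# Hypothesis class: singletons h_a(x) = 1 if x == a else 0, one per domain point.
# Target concept: the all-zeros function, so every label in the sequence is 0.
def h(a):
    return lambda x: int(x == a)

points = [(k, 0) for k in range(10)]  # candidate eluder sequence (x_k, y_k)

for k in range(10):
    prev, (xk, yk) = points[:k], points[k]
    witness = h(xk)
    # The witness agrees with all earlier examples (predicts 0 there) ...
    assert all(witness(x) == y for x, y in prev)
    # ... yet errs on the k-th point (predicts 1, true label 0).
    assert witness(xk) != yk

print("first 10 points form an eluder sequence for the singleton class")
```

Over an infinite domain the same construction extends indefinitely, so the infinite singleton class cannot lie in the exponential regime.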

The presence or absence of these infinite sequences partitions $\mathcal{H}$ as follows:

| Infinite Eluder? | Infinite Star-Eluder? | Infinite VC-Eluder? | Universal ERM Rate |
| --- | --- | --- | --- |
| No | No | No | $\exp(-n)$ (finite case) |
| Yes | No | No | $1/n$ |
| Yes | Yes | No | $(\log n)/n$ |
| Yes | Yes | Yes | Arbitrarily slow |

A class with an infinite VC-eluder sequence is exactly one with infinite VC dimension, meaning ERM cannot guarantee any fixed rate of improvement.

3. Complexity Dimensions and Sharp Constants

In order to precisely quantify the learning rates, the paper introduces two new dimensions:

  • Star-Eluder Dimension (SE): Measures the maximal size of a block in a star-eluder sequence, directly influencing the constant in $(\log n)/n$ rates. When SE is finite and the VC-eluder dimension is also finite, ERM's error is bounded, up to constants, as:

$$\mathbb{E}[\mathrm{er}] \leq \frac{\mathrm{const} \cdot \mathrm{VCE} \cdot \log n}{n}$$

  • VC-Eluder Dimension (VCE): Captures the maximal length of a VC-eluder chain. If VCE is finite, the above linear or slowed rate applies with sharp constants; if infinite, arbitrarily slow rates result.

This refinement enables the removal of loose logarithmic factors known from classical PAC/VC theory and, in favorable cases, yields sharp asymptotic prefactors.
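
To make the separation between regimes concrete, here is a quick numeric comparison of the three nontrivial decay curves (illustrative only; constants and distribution-dependent factors are suppressed):

```python
import math

# Compare the idealized decay curves from the tetrachotomy (constants dropped).
for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(f"n={n:>9,}  exp(-n)={math.exp(-n):.1e}  "
          f"1/n={1/n:.1e}  log(n)/n={math.log(n)/n:.1e}")
```

Even at $n = 10^6$ the logarithmic slowdown costs roughly a factor of $\log(10^6) \approx 14$ over the linear regime, which is the kind of loose factor the sharper dimension-based bounds eliminate when it is not warranted.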

4. Regime Examples and Implications for Algorithm Design

  • Finite classes: Standard multiclass classification with $K$ labels, $K < \infty$, is in the exponential regime.
  • Parametric models (e.g., threshold functions, intervals): These typically lack infinite star-eluder sequences, so ERM achieves the "fast" $1/n$ rate (see the simulation sketch after this list).
  • Rich structured classes (finite VC dimension, but admitting "star" sequences): Examples include classes of unions of intervals or polygons, with the error rate slowing to $(\log n)/n$.
  • Infinite VC dimension (e.g., the class of all binary functions on an infinite domain): ERM may learn arbitrarily slowly; in such problems, no polynomial or exponential guarantee can be obtained.
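
The following simulation sketch contrasts the first two regimes on threshold classifiers $h_t(x) = \mathbf{1}[x \ge t]$, with $X$ uniform on $[0,1]$ and target threshold $0.5$ (an illustrative construction, not from the paper; "worst-case ERM" returns the consistent hypothesis with the largest true error):

```python
import numpy as np

rng = np.random.default_rng(0)
T_STAR = 0.5                       # target threshold; labels y = 1[x >= 0.5]
GRID = np.linspace(0.0, 1.0, 11)   # finite class of 11 thresholds, contains T_STAR

def worst_erm_error_finite(xs, ys, grid):
    """True error of the worst consistent threshold in a finite class."""
    # Under uniform X, the error of h_t is |t - T_STAR|.
    errs = [abs(t - T_STAR) for t in grid
            if np.array_equal((xs >= t).astype(int), ys)]
    return max(errs)               # grid contains T_STAR, so errs is nonempty

def worst_erm_error_all(xs, ys):
    """Worst consistent threshold when every t in [0, 1] is allowed."""
    lo = xs[ys == 0].max(initial=0.0)     # largest negatively labeled point
    hi = xs[ys == 1].min(initial=1.0)     # smallest positively labeled point
    return max(T_STAR - lo, hi - T_STAR)  # consistent thresholds fill (lo, hi]

def mean_error(n, erm, trials=2000):
    total = 0.0
    for _ in range(trials):
        xs = rng.random(n)
        ys = (xs >= T_STAR).astype(int)
        total += erm(xs, ys)
    return total / trials

for n in (10, 40, 160, 640):
    fin = mean_error(n, lambda xs, ys: worst_erm_error_finite(xs, ys, GRID))
    inf = mean_error(n, worst_erm_error_all)
    print(f"n={n:4d}  finite class: {fin:.5f}   all thresholds: {inf:.5f}")
```

With the finite class, the worst consistent hypothesis is eliminated at an exponential rate, while over all thresholds the residual error shrinks only like $1/n$, matching regimes 1 and 2 above.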

For practitioners, this tetrachotomy means that for certain classes, even when the problem is realizable and optimization is exact, ERM cannot guarantee learning faster than the rate determined by the class's position in the hierarchy defined by these combinatorial dimensions.

5. Separation from Optimal Universal Rates and Limitations of ERM

The results highlight a gap between the best possible (sometimes improper) universal learning rates and those achieved specifically by ERM. For some classes, ERM is universally suboptimal: other procedures may achieve exponentially faster convergence, even when ERM's best possible rate is only $(\log n)/n$. This explains observed discrepancies in learning-curve behavior between different algorithms and underscores the importance of understanding the combinatorial structure underlying $\mathcal{H}$.

The refined analysis provided by the star-eluder and VC-eluder dimensions also suggests that improvements to ERM may be possible when these structures are absent, while in other cases, alternative algorithms or regularization may be necessary.

6. Broader Impact and Theoretical Significance

By rigorously characterizing all possible universal rates for ERM using novel combinatorial structures, these results clarify the fundamental limits of ERM-based learning in the realizable setting. This guides both the theoretical analysis of learning curves and the practical design of learning algorithms, showing when ERM's performance is inherently limited by class complexity, and when it can be expected to attain rapid error decay. The analysis also connects to and refines classical results in distribution-free learning, suggesting further areas of exploration for improved or alternative learning procedures.

Summary Table: Universal ERM Rates and Complexity Structures

| Rate Regime | Property of $\mathcal{H}$ | Characterization |
| --- | --- | --- |
| $\exp(-n)$ | Finite class | No infinite eluder sequence |
| $1/n$ | Infinite, but no infinite star-eluder sequence | Infinite eluder sequence; no infinite star-eluder sequence |
| $(\log n)/n$ | Infinite star-eluder, but no infinite VC-eluder sequence | Infinite star-eluder sequence; no infinite VC-eluder sequence (finite VC dimension) |
| Arbitrarily slow | Infinite VC-eluder sequence (infinite VC dimension) | No polynomial or exponential rate can be guaranteed |
References

  1. arXiv:2412.02810