
Empirical Risk Minimization: Rate Tetrachotomy

Updated 13 November 2025
  • Empirical Risk Minimization (ERM) is a fundamental principle that fits predictive models by minimizing average loss over observed data, encompassing both parametric and nonparametric methods.
  • The universal rate tetrachotomy classifies ERM learning curves into four regimes—exponential, linear, almost-linear, and arbitrarily slow—based on combinatorial properties like VC and star-eluder dimensions.
  • Refined combinatorial dimensions such as eluder, star-eluder, and VC-eluder provide sharp bounds that quantify how hypothesis class complexity governs the decay of prediction error.

Empirical Risk Minimization (ERM) is a foundational principle in statistical learning theory and modern machine learning, providing the central framework for fitting a predictive model to data by minimizing the average loss over observed samples. Formally, given a sample of $n$ i.i.d. data points $(x_1, y_1), \dots, (x_n, y_n)$ and a hypothesis class $\mathcal{F}$, ERM selects $\hat{f}_n \in \mathcal{F}$ to minimize the empirical risk $\frac{1}{n}\sum_{i=1}^n \ell(\hat{f}_n(x_i), y_i)$. The principle underlies both classical parametric estimation and nonparametric function fitting, encompassing regression, classification, and diverse extensions. Its theoretical properties, convergence rates, and the impact of hypothesis-class complexity, data dependencies, loss structure, and algorithmic constraints are subjects of ongoing research. ERM’s performance is now characterized by sharp rate theorems connecting statistical error decay to combinatorial, geometric, and probabilistic properties of the hypothesis class and the data-generating mechanism.
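
As a minimal illustration of this definition, the sketch below implements ERM with the zero-one loss over a hypothetical finite class of threshold classifiers; the class, the data distribution, and the grid of candidate thresholds are illustrative choices rather than part of the formal setup.

```python
# Minimal ERM sketch with 0-1 loss over a small threshold class.
# The class, data distribution, and threshold grid are illustrative choices.
import numpy as np

def erm(xs, ys, thresholds):
    """Return the threshold minimizing empirical 0-1 risk, and that risk."""
    best_t, best_risk = None, float("inf")
    for t in thresholds:
        preds = (xs >= t).astype(int)
        risk = np.mean(preds != ys)        # empirical risk: (1/n) * sum of 0-1 losses
        if risk < best_risk:
            best_t, best_risk = t, risk
    return best_t, best_risk

# Toy realizable data: true threshold 0.5, so y = 1[x >= 0.5].
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=200)
ys = (xs >= 0.5).astype(int)
t_hat, risk_hat = erm(xs, ys, thresholds=np.linspace(0.0, 1.0, 101))
print(f"ERM threshold ~ {t_hat:.2f}, empirical risk = {risk_hat:.3f}")
```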

1. Universal Rate Tetrachotomy for ERM

The realizable-case error decay for ERM is governed by a four-way classification ("tetrachotomy"), as proved in (Hanneke et al., 3 Dec 2024). Given a binary concept class $H \subseteq \{0,1\}^{\mathcal{X}}$ and realizable distributions ($P$ such that there exists $h^* \in H$ with $y = h^*(x)$ a.s.), the possible exact rates for the expected error $\operatorname{er}(n) := \mathbb{E}_{S_n \sim P^n}\left[P\{x : h_n(x) \ne h^*(x)\}\right]$ of any ERM output $h_n$ are:

| Rate | Combinatorial condition | Examples |
|---|---|---|
| $e^{-n}$ | $\lvert H\rvert < \infty$ | Finite classes, constant functions |
| $1/n$ | Infinite $H$, no infinite star-eluder sequence | Threshold classifiers on $\mathbb{N}$ |
| $(\log n)/n$ | Infinite star-eluder sequence, $\mathrm{VC}(H) < \infty$ | Singleton classifiers, 2D halfspaces |
| Arbitrarily slow | $\mathrm{VC}(H) = \infty$ | Highly symmetric block classes |

This “tetrachotomy theorem” states that ERM achieves exponential error decay iff $H$ is finite; a linear rate iff $H$ is infinite but admits no infinite star-eluder sequence; an almost-linear rate iff infinite star-eluder sequences exist but the VC dimension is finite; and arbitrarily slow rates otherwise. These regimes are characterized by new combinatorial dimensions (eluder, star-eluder, VC-eluder), which provide sharp, fine-grained constant-factor bounds.
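
To make the first row of the table concrete, the following hedged Monte Carlo sketch simulates the exponential regime for a two-element class whose hypotheses disagree on a region of probability $p$: a worst-case ERM (breaking ties against the target) errs exactly when no training point lands in the disagreement region, so $\operatorname{er}(n) = p(1-p)^n$. The value of $p$, the distribution, and the tie-breaking rule are illustrative assumptions.

```python
# Hedged Monte Carlo sketch: exponential decay of er(n) for a finite class.
# H = {h_star, h_alt}; they disagree exactly on x in [0, p) under Uniform[0,1].
# A worst-case ERM breaks ties in favor of h_alt, so it errs iff no training
# point falls in the disagreement region, giving er(n) = p * (1 - p)**n.
import numpy as np

p = 0.1                                     # mass of the disagreement region (assumption)
rng = np.random.default_rng(1)

def er_hat(n, trials=20000):
    xs = rng.uniform(0.0, 1.0, size=(trials, n))
    no_hit = (xs >= p).all(axis=1)          # no sample in the disagreement region
    # error of the tie-broken ERM: p when h_alt is chosen, 0 otherwise
    return np.mean(np.where(no_hit, p, 0.0))

for n in (5, 10, 20, 40):
    print(n, er_hat(n), p * (1 - p) ** n)   # empirical estimate vs closed form
```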

2. Combinatorial Structures Governing ERM Rates

Three structural notions are critical:

  • Eluder sequence and dimension $\mathrm{E}(H)$: a sequence $((x_i, y_i))_i$ such that for each $k$, some $h_k \in H$ fits all previous labels $y_1, \dots, y_{k-1}$ but disagrees with $y_k$ at $x_k$. $\mathrm{E}(H) < \infty$ iff $H$ is finite (a brute-force sketch appears at the end of this section).
  • Star-eluder sequence and dimension $\mathrm{SE}(H)$: for a target $h^*$, blocks of data points form “star sets” in which every point can be flipped by some $h$ in the version space. An infinite star-eluder sequence implies the sub-linear regime; $\mathrm{SE}(H) < \infty$ refines the bounds.
  • VC-eluder sequence and dimension $\mathrm{VCE}(H)$: a block structure requiring shattering; an infinite VC-eluder sequence is equivalent to $\mathrm{VC}(H) = \infty$. Whenever $\mathrm{VCE}(H) < \infty$, for large $n$:

$$\alpha \cdot \frac{\mathrm{VCE}(H)}{n} \;\leq\; \sup_P \mathbb{E}[\operatorname{er}(n)] \;\leq\; \beta \cdot \frac{\mathrm{VCE}(H)\,\log n}{n} + o(1)$$

with sharp constants.

These dimensions quantify the minimal combinatorial obstacles to faster learning. Classical VC theory is of particular relevance in the $(\log n)/n$ and arbitrarily slow regimes.
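
The eluder-sequence definition in the first bullet above can be explored by brute force on toy classes. The sketch below (the domain, the class encoding, and the recursive search are illustrative choices, not code from the paper) finds the longest sequence in which each new point can be "eluded" by a hypothesis that fits all earlier labels; for a finite class such as thresholds on five points the search terminates, consistent with $\mathrm{E}(H) < \infty$ for finite $H$.

```python
# Brute-force sketch of the eluder-sequence length for a tiny class, following
# the informal definition in the bullet above (the domain, class, and search
# strategy are illustrative choices).
from itertools import product

DOMAIN = range(5)
# H = threshold classifiers 1[x >= t] on {0,...,4}, encoded as label tuples.
H = [tuple(int(x >= t) for x in DOMAIN) for t in range(6)]

def extendable(prefix):
    """Eluder condition for the newest pair: some h fits all earlier labels
    but disagrees at the newest point."""
    *earlier, (xk, yk) = prefix
    return any(all(h[x] == y for x, y in earlier) and h[xk] != yk for h in H)

def longest_eluder(prefix=()):
    best = len(prefix)
    for x, y in product(DOMAIN, (0, 1)):
        cand = prefix + ((x, y),)
        if extendable(cand):
            best = max(best, longest_eluder(cand))
    return best

print(longest_eluder())   # finite, since H is finite
```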

3. ERM Rate Regimes: Canonical Examples

Specific example classes display prototypical rates:

  • Finite classes ($\lvert H\rvert < \infty$): exponential convergence $e^{-n}$.
  • Infinite threshold classes (no infinite star-eluder sequence): linear $1/n$ rate; e.g., for $H = \{h_t(x) = 1[x \geq t]\}$, the worst-case ERM rule yields only $1/n$ decay (see the simulation sketch below).
  • Singleton functions ($\mathrm{VC} = 1$, infinite star-eluder sequence): $H = \{h_t(x) = 1[x = t]\}$ achieves the $(\log n)/n$ rate; no faster universal guarantee is possible for ERM.
  • Classes with infinite VC dimension: artificial unions of large, symmetric blocks ($\mathrm{VCE}(H) = \infty$) admit arbitrarily slow ERM convergence, even in cases where optimal (possibly improper) learners attain faster universal rates.

These examples show that the VC dimension alone does not yield a sharp rate separation: the star-eluder and VC-eluder properties determine the precise cutoff between the linear and almost-linear ($(\log n)/n$) regimes.
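
As a simulation sketch of the threshold example above: under the uniform distribution on $[0,1]$ with target threshold $0.5$, an adversarially tie-broken ERM returns the consistent threshold farthest from the target, and its error, the larger of the two empirical gaps around $0.5$, decays on the order of $1/n$. The distribution, the target, and the tie-breaking rule are illustrative assumptions.

```python
# Hedged simulation sketch: worst-case ERM over threshold classifiers
# H = {1[x >= t]} on [0,1] with uniform data and target threshold 0.5.
# Among all empirical-risk minimizers, we take the consistent threshold
# farthest from the target; its error equals the larger gap around 0.5
# and scales like 1/n (the whole setup is illustrative).
import numpy as np

rng = np.random.default_rng(2)
t_star = 0.5

def worst_case_erm_error(n):
    xs = rng.uniform(0.0, 1.0, size=n)
    left = xs[xs < t_star]
    right = xs[xs >= t_star]
    gap_left = t_star - (left.max() if left.size else 0.0)
    gap_right = (right.min() if right.size else 1.0) - t_star
    return max(gap_left, gap_right)      # error of the adversarially chosen ERM

for n in (50, 100, 200, 400, 800):
    errs = [worst_case_erm_error(n) for _ in range(2000)]
    print(n, np.mean(errs), 1.0 / n)     # roughly a constant multiple of 1/n
```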

4. Comparison to Classical Uniform PAC Rates

Traditional PAC learning theory (uniform guarantees over all distributions) invokes the VC dimension:

  • $\mathrm{VC}(H) < \infty$ yields the uniform rate $O(\mathrm{VC}(H) \log n / n)$;
  • $\mathrm{VC}(H) = \infty$ precludes uniform consistency.

Recent universal learning models [Bousquet et al. 2021] reveal, for optimal (possibly improper) algorithms, a trichotomy: exponential ($e^{-n}$), linear ($1/n$), or arbitrarily slow. ERM introduces a fourth regime, $(\log n)/n$, because it is not always optimal: some classes admit optimal rates of $e^{-n}$, yet every ERM must pay a $1/n$ or $(\log n)/n$ penalty under the worst-case distribution. The star-eluder structure demarcates this suboptimality, refining both classical VC theory and universal learning theory.

In practical learning curves, ERM often converges much faster than the pessimistic $O(\mathrm{VC}(H) \log n / n)$ PAC bound suggests, but the combinatorial structure explains the persistence of sub-linear rates for certain classes and targets.
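
For a rough sense of how these curves compare, the snippet below tabulates the uniform PAC-style envelope $\mathrm{VC} \log n / n$ against the universal $1/n$ and $(\log n)/n$ shapes at a few sample sizes; all constants are set to 1 as placeholders rather than the sharp constants from the theory.

```python
# Illustrative comparison of rate shapes (constants set to 1 as placeholders,
# not the sharp constants from the theory).
import math

def pac_envelope(n, vc):       # uniform PAC-style shape: VC * log(n) / n
    return vc * math.log(n) / n

def linear_rate(n):            # universal linear regime: 1/n
    return 1.0 / n

def almost_linear_rate(n):     # universal almost-linear regime: log(n)/n
    return math.log(n) / n

for n in (10**2, 10**3, 10**4, 10**5):
    print(n, pac_envelope(n, vc=10), almost_linear_rate(n), linear_rate(n))
```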

5. Fine-Grained Bounds and Distribution-Dependent Behavior

The sharp constant-factor bounds give:

  • For classes with finite VC-eluder dimension,

$$\alpha \cdot \frac{\mathrm{VCE}(H)}{n} \;\leq\; \sup_P \mathbb{E}[\operatorname{er}(n)] \;\leq\; \beta \cdot \frac{\mathrm{VCE}(H)\,\log n}{n} + o(1).$$

  • If additionally $\mathrm{SE}(H) < \infty$, the upper bound improves to

$$\beta' \cdot \frac{\mathrm{VCE}(H)}{n} \cdot \log\!\left(\frac{\mathrm{SE}(H)}{\mathrm{VCE}(H)}\right).$$

  • No distribution-free constant exists when $\mathrm{VCE}(H) = \infty$.

These bounds quantify how combinatorial complexity governs excess-risk decay; the distribution and the target affect the constants but not the rate within each regime.
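
As a worked illustration (the values $\mathrm{VCE}(H) = 3$ and $\mathrm{SE}(H) = 48$ are hypothetical, chosen only to show the shape of the improvement), finiteness of the star-eluder dimension replaces the $\log n$ factor by a constant:

$$\beta \cdot \frac{3\,\log n}{n} \quad \text{versus} \quad \beta' \cdot \frac{3}{n}\,\log\!\left(\frac{48}{3}\right) = \beta' \cdot \frac{3\,\log 16}{n},$$

so with finite $\mathrm{SE}(H)$ the upper bound matches the $\mathrm{VCE}(H)/n$ lower bound up to a constant factor, whereas the generic bound retains a $\log n$ gap.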

6. Significance, Practical Implications, and Theoretical Limits

  • The four-regime classification fundamentally answers which learning curves are possible for ERM in the realizable case.
  • ERM is sometimes provably suboptimal compared to distribution-aware/improper algorithms.
  • The presence of infinite VC-eluder sequences marks classes for which no universal rate guarantee is possible and ERM convergence can be arbitrarily slow; for many practical classes, faster rates are achievable.
  • Real-world learning curves often improve on uniform PAC expectations, underscoring the importance of distribution-dependent and target-dependent combinatorial structure (eluder and star-eluder sequences).
  • These results enable precise rate calculations for concrete concept classes, guide expectations for algorithmic performance, and delimit the scope of ERM versus optimal procedures.

In summary, the universal rate tetrachotomy for ERM provides a combinatorial and statistical foundation for understanding the possible learning curves and their sharp boundaries, connecting ERM’s behavior to target-dependent, class-dependent, and distribution-dependent properties that refine and generalize classical VC theory and PAC learning principles.
