Empirical Risk Minimization Objective

Updated 3 June 2026

Empirical Risk Minimization (ERM) is a framework that approximates the unknown population risk by minimizing the average loss over observed data samples.
It underlies various machine learning paradigms, including supervised, robust, fairness-aware, and explainable models, with proven convergence rates under specific conditions.
Advanced variations like Tilted ERM and Robust ERM introduce mechanisms for outlier resistance and fairness, broadening its applicability to non-i.i.d., time-dependent, and relational data settings.

Empirical risk minimization (ERM) is the central objective in modern statistical learning theory and machine learning, defining the principle by which algorithms select predictive models based on finite data samples. ERM formalizes the substitution of intractable population risk minimization—optimization over an unknown data-generating distribution—by tractable minimization over the sample mean loss computed from observed data. The ERM framework underlies not only standard supervised learning but also recurrent, dynamical, robust, relational, fairness-aware, and explainable machine learning paradigms.

1. Formal Definition and Basic Objective

Given observed data $(x_i, y_i)_{i=1}^n$, a loss function $\ell(f(x), y)$, and a hypothesis class $\mathcal{F}$, ERM seeks

$\hat f_{\rm ERM} = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$

where the sum is the empirical risk, approximating the population risk

$R(f) = \mathbb{E}_{(x, y) \sim P}\left[\ell(f(x), y)\right]$

In the case of parameterized models $f_\theta$, the optimizer becomes

$\hat\theta_n = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^n \ell(f_\theta(x_i), y_i)$
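
As a minimal sketch of the parametric objective above, assume a linear model $f_\theta(x) = x^\top \theta$ with squared loss, for which the empirical risk minimizer has a closed form. The synthetic data, `theta_true`, and the function names are illustrative assumptions, not part of any cited method.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average squared loss (1/n) * sum_i (x_i @ theta - y_i)^2."""
    residuals = X @ theta - y
    return np.mean(residuals ** 2)

def erm_least_squares(X, y):
    """Minimizer of the empirical risk above over all theta (least squares)."""
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat

rng = np.random.default_rng(0)
n, d = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])   # hypothetical ground truth for the demo
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta_hat = erm_least_squares(X, y)
print("ERM estimate:", theta_hat)
print("empirical risk at estimate:", empirical_risk(theta_hat, X, y))
```

By construction, the empirical risk at `theta_hat` is no larger than at any other parameter, including `theta_true`; the gap between the two is what generalization bounds control.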

For time series prediction with recursive forecasters $f_{\theta, t}$, ERM selects

$\hat\theta = \arg\min_{\theta \in \Theta} \frac{1}{T} \sum_{t=1}^T L(Y_t, f_{\theta, t})$

where $L$ is typically a Bregman-type loss, e.g., the squared error

$L(Y_t, f_{\theta, t}) = (Y_t - f_{\theta, t})^2$

(Brownlees et al., 2021).
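
The time series objective can be sketched on a toy case. The example below assumes a first-order autoregressive forecaster $f_{\theta, t} = \theta Y_{t-1}$ with squared loss and a grid over $\Theta = [-0.99, 0.99]$; the simulated series and the grid search are illustrative choices, not the model classes studied in the cited work.

```python
import numpy as np

def forecast_risk(theta, y):
    """(1/T) * sum_t (Y_t - theta * Y_{t-1})^2 for one-step-ahead forecasts."""
    preds = theta * y[:-1]
    return np.mean((y[1:] - preds) ** 2)

rng = np.random.default_rng(1)
T, theta_true = 2000, 0.6            # hypothetical series length and AR coefficient
y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = theta_true * y[t - 1] + rng.normal()

# ERM over a finite grid approximating the parameter set Theta.
grid = np.linspace(-0.99, 0.99, 199)
risks = np.array([forecast_risk(th, y) for th in grid])
theta_hat = grid[np.argmin(risks)]
print("ERM estimate of theta:", theta_hat)
```

For this quadratic loss a closed-form minimizer exists; the grid search is used only to make the arg-min over $\Theta$ explicit.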

2. Fundamental Theoretical Properties

ERM provides consistency and optimality guarantees under suitable conditions:

For i.i.d. data and bounded or sub-Gaussian loss functions, the excess risk decays at rate $O(1/\sqrt{n})$ (Brownlees et al., 2014, Frostig et al., 2014, Bibaut et al., 2021).
In time series settings, assuming stationarity and strong mixing, ERM satisfies an oracle inequality of the form $R(\hat\theta) \le \inf_{\theta \in \Theta} R(\theta) + \varepsilon_T$, with a remainder $\varepsilon_T$ that vanishes as $T \to \infty$, indicating rate-optimal, non-asymptotic convergence of ERM to the best risk achievable in the model class (Brownlees et al., 2021).

For convex, smooth losses and strongly convex risk, the ERM estimator matches the statistical minimax rate, $O(1/n)$, with explicit finite-sample constants (Frostig et al., 2014).

3. Generalizations and Robustifications

ERM is the basis for several generalizations:

Tilted ERM (TERM): Introduces a tilt parameter $t \in \mathbb{R} \setminus \{0\}$ to weight individual losses, defining the tilted risk:

$\tilde R(t; \theta) = \frac{1}{t} \log\left(\frac{1}{n} \sum_{i=1}^n e^{t\,\ell(f_\theta(x_i), y_i)}\right)$

As $t \to 0$, the tilted risk recovers ERM; $t > 0$ emphasizes large losses (worst-case/fairness); $t < 0$ suppresses outliers (robustness). In the limits $t \to +\infty$ and $t \to -\infty$ it approaches the max and min loss, so TERM interpolates between mean, max, and min loss, offering a smooth mechanism for robustness and fairness control (Li et al., 2020).
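
To make the role of the tilt concrete, the following sketch evaluates the tilted risk $\frac{1}{t}\log\left(\frac{1}{n}\sum_i e^{t \ell_i}\right)$ on a fixed vector of losses for several tilts. The loss values and the helper name `tilted_risk` are assumptions for illustration; the log-sum-exp shift is a standard numerical-stability device rather than part of the definition.

```python
import numpy as np

def tilted_risk(losses, t):
    """Tilted empirical risk; t == 0 returns the plain mean (the ERM limit)."""
    losses = np.asarray(losses, dtype=float)
    if t == 0.0:
        return losses.mean()
    z = t * losses
    z_max = z.max()                      # shift to avoid overflow in exp
    return (z_max + np.log(np.mean(np.exp(z - z_max)))) / t

losses = np.array([0.1, 0.2, 0.3, 5.0])  # last entry plays the role of an outlier
for t in (-50.0, -1.0, 1e-6, 1.0, 50.0):
    print(f"t = {t:>8}: tilted risk = {tilted_risk(losses, t):.4f}")
```

Strongly negative tilts track the smallest loss, tilts near zero track the mean, and strongly positive tilts track the largest loss.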

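Robust ERM via Catoni's Estimator: Standard ERM's arithmetic mean is replaced with robust M-estimation. For each $f$, Catoni's estimator $\hat\mu_f$ of the mean loss satisfies:

$\sum_{i=1}^n \phi\left(\alpha\left(\ell(f(x_i), y_i) - \hat\mu_f\right)\right) = 0$

where $\phi$ is a non-decreasing influence function with logarithmic growth and $\alpha > 0$ is a scale parameter. Minimizing $\hat\mu_f$ over $\mathcal{F}$ leads to excess risk bounds under heavy-tailed losses, maintaining $O(1/\sqrt{n})$-type rates even without boundedness (Brownlees et al., 2014).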
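
A minimal sketch of the Catoni-type mean estimate for a fixed vector of losses is given below. It assumes the widest admissible influence function, $\phi(x) = \log(1 + x + x^2/2)$ for $x \ge 0$ and $\phi(x) = -\log(1 - x + x^2/2)$ otherwise, and solves the estimating equation by bisection. The value of `alpha` and the toy losses are arbitrary illustrative choices; in the theory $\alpha$ is tuned to the sample size and variance.

```python
import numpy as np

def catoni_phi(x):
    """Influence function: log(1 + x + x^2/2) for x >= 0, odd extension for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x ** 2)

def catoni_mean(losses, alpha, tol=1e-10, max_iter=200):
    """Solve sum_i phi(alpha * (loss_i - mu)) = 0 for mu by bisection."""
    losses = np.asarray(losses, dtype=float)
    lo, hi = losses.min(), losses.max()   # the root lies between the extremes
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        # The sum is decreasing in mu: a positive value means mu is too small.
        if np.sum(catoni_phi(alpha * (losses - mid))) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

losses = np.concatenate([np.full(99, 1.0), [1000.0]])  # one gross outlier
mu_catoni = catoni_mean(losses, alpha=0.1)
mu_mean = losses.mean()
print("arithmetic mean:", mu_mean, " Catoni estimate:", mu_catoni)
```

Because $\phi$ grows only logarithmically, the single outlier shifts the Catoni estimate far less than it shifts the arithmetic mean. Robust ERM would then minimize this estimate over the hypothesis class instead of the average loss.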

Robust Newton Methods: Second-order ERM optimization can use robust mean estimators for both gradient and Hessian, enhancing statistical and algorithmic robustness to contamination (Ioannou et al., 2023).
Functional Risk Minimization (FRM): A strict generalization, where the loss is computed in function space, allowing per-sample functional perturbations and capturing richer noise models. ERM emerges as a special case when the functional variability is restricted to output noise (Alet et al., 2024).
Explainable ERM (EERM): Incorporates an information-theoretic regularization term, such as the conditional entropy of the predictions given user feedback, to balance predictive performance with subjective, user-dependent explainability (Zhang et al., 2020).

4. Empirical Risk Minimization in Structured and Non-i.i.d. Settings

Time Series and Dynamical Models: ERM is extended to settings where data exhibit dependence (e.g., stationary time series, ergodic processes). Here, the empirical risk is computed over recursive predictors, and the model class can be highly structured (e.g., regime-switching autoregressive models) (Brownlees et al., 2021). For dynamical systems, the empirical risk includes minimization over unknown initial conditions and addresses signal-noise separation under complexity (entropy) constraints on the model class (McGoff et al., 2016).
Relational and Graph Data: In non-i.i.d., relational data contexts (e.g., graphs), ERM is based on sampling subgraphs (via random walks, edge/vertex sampling), defining risk as the expected loss over sampled substructures (Veitch et al., 2018):

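$\hat\theta = \arg\min_{\theta} \mathbb{E}_{G_k \sim \mathrm{Sample}(G_n, k)}\left[L(G_k; \theta)\right]$

where $G_k$ is a size-$k$ subgraph drawn from the observed graph $G_n$ by the chosen sampling scheme. This enables mini-batch SGD procedures with unbiased gradient estimators over relational data.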
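
The idea of defining risk as an expectation over sampled substructures can be sketched with the simplest scheme, uniform edge sampling. The toy below assumes a random graph, a dot-product embedding model, and a logistic loss on observed edges only; all of these are illustrative assumptions and are much simpler than the losses and samplers in the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 30, 4
# Hypothetical random graph: each pair (i, j) is an edge with probability 0.2.
edges = np.array([(i, j) for i in range(num_nodes) for j in range(i + 1, num_nodes)
                  if rng.random() < 0.2])
emb = rng.normal(scale=0.5, size=(num_nodes, dim))  # toy node embeddings (parameters)

def edge_loss(emb, edge_batch):
    """Mean logistic loss -log sigmoid(<z_u, z_v>) over the edges in the batch."""
    scores = np.sum(emb[edge_batch[:, 0]] * emb[edge_batch[:, 1]], axis=1)
    return np.mean(np.logaddexp(0.0, -scores))

def sampled_risk(emb, edges, k, rng):
    """Loss on a subgraph formed by k edges drawn uniformly without replacement."""
    idx = rng.choice(len(edges), size=k, replace=False)
    return edge_loss(emb, edges[idx])

full_risk = edge_loss(emb, edges)
mc_risk = np.mean([sampled_risk(emb, edges, k=10, rng=rng) for _ in range(5000)])
print("full-graph risk:", full_risk, " Monte Carlo sampled risk:", mc_risk)
```

Under this particular sampler the sampled loss is an unbiased estimate of the full edge-averaged loss, which is what makes mini-batch SGD on sampled subgraphs meaningful. Other samplers (random walks, vertex sampling) define different expectations and hence different objectives.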

Adaptively Collected Data: In bandit or adaptive designs, empirical risk minimization must incorporate inverse-propensity or importance weights to correct for non-uniform sampling. The IS-weighted ERM objective is

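$\hat f = \arg\min_{f \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T w_t\, \ell(f(x_t), y_t)$

where $w_t$ is the importance weight of observation $t$ (the ratio of the target sampling probability to the probability under the adaptive design), with theory providing rates parameterized by the maximum and average importance weight inflation (Bibaut et al., 2021).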
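
A small sketch of the weighted objective follows. It assumes two actions with different mean outcomes, a logging probability that drifts over time (a stand-in for an adaptive design; here the drift is deterministic rather than history-dependent), a uniform target policy, and a constant predictor fitted with squared loss. All names and numbers are illustrative.

```python
import numpy as np

def isw_empirical_risk(c, y, w):
    """(1/T) * sum_t w_t * (y_t - c)^2 for the constant predictor c."""
    return np.mean(w * (y - c) ** 2)

def isw_erm_constant(y, w):
    """Closed-form minimizer of the weighted risk above over c."""
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(0)
T = 20000
mean_outcome = np.array([0.0, 1.0])        # hypothetical per-action mean outcomes
p1 = np.linspace(0.5, 0.9, T)              # logging probability of action 1 over time
actions = (rng.random(T) < p1).astype(int)
y = mean_outcome[actions] + 0.1 * rng.normal(size=T)

logging_prob = np.where(actions == 1, p1, 1.0 - p1)  # probability of the action taken
target_prob = 0.5                                     # uniform target policy
w = target_prob / logging_prob                        # importance weights w_t

c_weighted = isw_erm_constant(y, w)   # aims at the mean outcome under the target policy
c_unweighted = y.mean()               # pulled toward the over-sampled action
print("weighted ERM:", c_weighted, " unweighted ERM:", c_unweighted)
```

The unweighted fit reflects the logging distribution, whereas the weighted fit targets the value under the uniform policy. Large weights (small logging probabilities) inflate the variance, which is why the rates depend on the size of the weights.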

5. Optimization and Algorithmic Implementations

ERM objectives are predominantly optimized via first-order (gradient descent, stochastic gradient descent) or second-order (Newton-type) methods. Key algorithmic aspects:

Batch and Stochastic Procedures: For convex, smooth losses, SGD and variance-reduced methods (e.g., Streaming SVRG) efficiently minimize the ERM objective, attaining optimal sample complexity, with the precise rate depending on algorithm and problem structure (Frostig et al., 2014, Jadbabaie et al., 2020).
Single-Pass and Memory-Efficient Solvers: Streaming SVRG realizes the statistical rates of batch ERM using a single pass through the data and memory linear in the size of a single sample, by combining staged reference gradient computation with variance-reduced updates (Frostig et al., 2014).
Handling Heavy-Tailed and Noisy Gradients: Robustification of gradient and Hessian via estimators such as Huber-aggregation or median-of-means manages outliers and non-Gaussian sampling noise (Ioannou et al., 2023).
Nonparametric Gradient Learning: If the loss admits smoothness in data, local polynomial regression can be exploited to approximate the gradient, yielding ERM solvers with improved oracle complexity when data dimension is low (Jadbabaie et al., 2020).
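
As one illustration of robust gradient aggregation, the sketch below runs gradient descent on a least-squares ERM objective but replaces the average gradient with a coordinate-wise median-of-means estimate. This is a simplified first-order variant under assumed settings (block count, step size, synthetic contamination), not the Newton-type method of the cited work.

```python
import numpy as np

def median_of_means(vectors, num_blocks):
    """Coordinate-wise median of the block means of the rows of `vectors`."""
    blocks = np.array_split(vectors, num_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])
    return np.median(block_means, axis=0)

def robust_gd_least_squares(X, y, num_blocks=11, step=0.1, iters=300):
    """Gradient descent on squared loss with median-of-means gradient aggregation."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = X @ theta - y
        per_sample_grads = 2.0 * residuals[:, None] * X  # gradient of (x @ theta - y)^2
        theta -= step * median_of_means(per_sample_grads, num_blocks)
    return theta

rng = np.random.default_rng(0)
n, d = 550, 2
theta_true = np.array([2.0, -1.0])     # hypothetical ground truth
X = rng.normal(size=(n, d))
y = X @ theta_true + 0.1 * rng.normal(size=n)
y[:3] += 500.0                         # three grossly corrupted responses

theta_robust = robust_gd_least_squares(X, y)
theta_plain, *_ = np.linalg.lstsq(X, y, rcond=None)   # standard ERM solution
print("robust:", theta_robust, " plain ERM:", theta_plain)
```

The corrupted samples fall into a single block, so the median across blocks largely ignores them, while the plain least-squares solution is pulled away from `theta_true`.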

6. Limitations, Extensions, and Open Problems

Risk Monotonicity: Counter to expectation, ERM does not guarantee monotonic improvement in expected risk as sample size increases; "risk curves" for ERM may be non-monotonic even for standard loss/hypothesis pairs (linear regression, classification, density estimation). Risk monotonicity, defined for the ERM solution $\hat f_n$ trained on $n$ samples as

$\mathbb{E}\left[R(\hat f_{n+1})\right] \le \mathbb{E}\left[R(\hat f_n)\right] \quad \text{for all } n$

often fails for ERM (Loog et al., 2019).
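
One familiar instance of a non-monotonic risk curve is the peaking behaviour of minimum-norm least squares when the sample size approaches the dimension. The simulation below is a sketch of that phenomenon under assumed settings (isotropic Gaussian inputs, `d = 20`, noise level `sigma = 0.5`); it is not one of the specific constructions analysed in the cited work.

```python
import numpy as np

def expected_excess_risk(n, d, sigma, trials, rng):
    """Monte Carlo estimate of E||theta_hat_n - theta*||^2 for min-norm least squares.

    For isotropic inputs and squared loss this equals the expected excess risk.
    """
    theta_star = np.ones(d) / np.sqrt(d)
    errors = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        y = X @ theta_star + sigma * rng.normal(size=n)
        theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # an empirical risk minimizer
        errors.append(np.sum((theta_hat - theta_star) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
d = 20
curve = {n: expected_excess_risk(n, d, sigma=0.5, trials=200, rng=rng)
         for n in (5, 10, 15, 20, 40, 80)}
for n, risk in curve.items():
    print(f"n = {n:>3}: estimated excess risk = {risk:.3f}")
```

In this setup the estimated risk is larger near `n = d` than at smaller sample sizes, before decreasing again for `n` well above `d`, so adding samples can hurt over part of the curve.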

Complexity-Driven Signal-Noise Separation: In dynamical, non-i.i.d. settings, the ability of ERM to recover the underlying signal is contingent on the entropy of the model class; zero-entropy (or bounded mean-width) ensures model selection is driven by signal, while high complexity classes can overfit additive noise (McGoff et al., 2016).
Consistency Under Misspecification and Structured Noise: Under heavy-tailed losses, only robustified (e.g., Catoni-based) ERM attains favorable concentration, suggesting the necessity of robustification in practical, non-sub-Gaussian regimes (Brownlees et al., 2014).
Function-Space Generalization: Modern over-parameterized models, such as deep neural nets, empirically benefit from extensions of ERM to function-space regularization (FRM), explaining generalization and robustness beyond classical statistical learning theory (Alet et al., 2024).

7. Summary Table: ERM Variants and Generalizations

Formulation	Objective Structure	Key Property/Context
Classical ERM	$\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$	i.i.d. data, bounded/sub-Gaussian loss
Tilted ERM (TERM)	$\min_{\theta} \frac{1}{t} \log\left(\frac{1}{n} \sum_{i=1}^{n} e^{t\,\ell(f_\theta(x_i), y_i)}\right)$	Tunable robustness/fairness via $t$
Robust ERM (Catoni)	$\arg\min_{f \in \mathcal{F}} \hat\mu_f$, $\hat\mu_f$ solves $\sum_{i=1}^{n} \phi\left(\alpha\left(\ell(f(x_i), y_i) - \hat\mu_f\right)\right) = 0$	Finite-variance, heavy-tailed loss
Explainable ERM (EERM)	$\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i) + \lambda H(f(x) \mid u)$, with user feedback $u$ and weight $\lambda$	Subjective explainability-accuracy trade-off
Relational ERM	$\mathbb{E}\left[L(G_k; \theta)\right]$ over sampled subgraphs	Dependent/relational (graph) data structures
Weighted ERM (ISWERM)	$\frac{1}{T} \sum_{t=1}^{T} w_t\, \ell(f(x_t), y_t)$ with $w_t$ correcting for adaptivity	Off-policy, adaptively collected data
Functional RM (FRM)	Minimize average function-space divergence	Over-parameterized, structured noise, DNNs

Each of these extends the ERM principle to address specific statistical or application-driven desiderata, retaining or extending the original statistical optimality guarantees under new regimes and constraints.