ERM with f-Divergence Regularization

Updated 7 August 2025
  • The paper's main contribution is deriving a unique closed-form solution for ERM-fDR by integrating an f-divergence penalty with the empirical risk objective.
  • The methodology unifies multiple divergence schemes, enabling a dual optimization framework that simplifies the infinite-dimensional problem to a one-dimensional scalar equation.
  • Key implications include enhanced robustness, privacy, and fairness, with applications extending to semi-supervised learning and robust risk minimization.

Empirical Risk Minimization with f-Divergence Regularization (ERM-fDR) is a broad, mathematically rigorous framework in which a conventional empirical risk objective is augmented by a penalty term: the f-divergence between the probability measure induced by a candidate predictor and a reference measure. The approach unifies a wide class of regularization and robustness schemes, including those based on the Kullback–Leibler (KL) divergence, total variation, χ² divergence, and Jensen–Shannon divergence, each prescribing distinct inductive biases or robustness properties, and has substantial applications in privacy, fairness, robustness, and semi-supervised learning.

1. Mathematical Formulation and Solution Characterization

In ERM-fDR, the optimization problem is

$$\min_{P \ll Q}~ \int L_z(\theta)\, dP(\theta) + \lambda D_f(P\|Q)$$

where $L_z(\theta)$ is the empirical risk (e.g., the loss on the dataset $z$ at model parameter $\theta$), $Q$ is a reference probability measure over model space, $\lambda > 0$ is the regularization strength, and $D_f(P\|Q)$ denotes the f-divergence
$$D_f(P\|Q) = \int f\!\left( \frac{dP}{dQ} \right) dQ,$$
with $f$ a strictly convex, differentiable function satisfying $f(1)=0$.
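As a concrete check of this definition, the divergence can be evaluated directly for discrete measures. The following is a minimal sketch; the distributions, function names, and generator choices are illustrative, not taken from the cited papers:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_θ q(θ) f(p(θ)/q(θ)) for discrete p, q with full support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([1/3, 1/3, 1/3])

kl   = f_divergence(p, q, lambda x: x * np.log(x))   # Kullback–Leibler
rkl  = f_divergence(p, q, lambda x: -np.log(x))      # reverse KL
chi2 = f_divergence(p, q, lambda x: (x - 1.0)**2)    # chi-squared
```

Each generator above is strictly convex with $f(1)=0$, so all three values are nonnegative and vanish exactly when $P = Q$.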

Unique Solution and Closed-form Densities

Under the mild condition that $f$ is strictly convex and differentiable, the unique solution $P^*$ is characterized by a density with respect to $Q$ of the form

$$\frac{dP^*}{dQ}(\theta) = \dot{f}^{-1}\!\left( - \frac{\beta + L_z(\theta)}{\lambda} \right)$$

where $\dot{f}^{-1}$ denotes the inverse of the derivative of $f$, and the normalization term $\beta$ is chosen such that $P^*$ integrates to one:
$$\int \dot{f}^{-1}\!\left( - \frac{\beta + L_z(\theta)}{\lambda} \right) dQ(\theta) = 1.$$

Specific choices for ff recover well-known schemes:

  • For $f(x) = x \log x$, the solution is the Gibbs measure with density $\exp\!\left( - \frac{\beta + L_z(\theta)}{\lambda} \right)$;
  • For $f(x) = -\log x$, the solution has density $\lambda / (\beta + L_z(\theta))$;
  • For other $f$ (e.g., Jensen–Shannon, Hellinger), the solution is dictated by the corresponding $\dot{f}^{-1}$ (Daunas et al., 1 Feb 2024).
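For the KL case, the closed-form density can be evaluated directly when $Q$ is supported on finitely many models. A minimal sketch, where the loss values, $\lambda$, and function name are illustrative:

```python
import numpy as np

def gibbs_solution(L, lam):
    """KL case f(x) = x log x: dP*/dQ(θ) ∝ exp(-L_z(θ)/λ).

    The additive constant β only normalizes the density, so we normalize
    explicitly; subtracting min(L) avoids underflow without changing the result.
    """
    w = np.exp(-(np.asarray(L, float) - np.min(L)) / lam)
    return w / w.sum()

L = np.array([0.1, 0.5, 2.0, 5.0])   # empirical risks of four candidate models
p = gibbs_solution(L, lam=0.5)        # probability masses under P* (uniform Q)
```

Smaller $\lambda$ concentrates $P^*$ on low-loss models; larger $\lambda$ pulls it back toward $Q$.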

Support and Inductive Bias

A defining property, irrespective of $f$, is that $dP^*/dQ$ remains strictly positive on $\operatorname{supp}(Q)$, so the learned measure $P^*$ cannot assign probability outside the support of the reference. This "support preservation" induces a hard inductive bias determined by $Q$, one that persists even against strong evidence from the data (Daunas et al., 1 Feb 2024, Daunas et al., 2023).

2. Dual Optimization and the Role of the Normalization Function

The primal ERM-fDR problem is computationally challenging due to the presence of a normalization constant. By duality, it can be reduced to a scalar optimization over a Lagrange multiplier $\beta$, leveraging the Legendre–Fenchel (LF) transform
$$f^*(t) = \sup_{x} \{ t x - f(x) \}.$$
The dual objective function is

$$G(\beta) = \lambda \int f^*\!\left(-\frac{\beta + L_z(\theta)}{\lambda}\right) dQ(\theta) + \beta$$

The minimizer $\beta = N_{Q,z}(\lambda)$ (the normalization function) is the unique $\beta$ satisfying

$$\int \dot{f}^{-1}\!\left( - \frac{\beta + L_z(\theta)}{\lambda} \right) dQ(\theta) = 1.$$

This dual approach is both theoretically rigorous and highly efficient computationally, as it reduces the often infinite-dimensional problem to a one-dimensional nonlinear equation (Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).
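The scalar dual equation lends itself to one-dimensional root-finding. The sketch below does this for the reverse-KL case $f(x) = -\log x$, where $\dot{f}^{-1}(-(\beta+L)/\lambda) = \lambda/(\beta+L)$; the uniform reference, loss values, and bracketing choices are illustrative:

```python
import numpy as np

def solve_beta(L, lam, iters=200):
    """Find β solving  Σ_θ q(θ) λ/(β + L(θ)) = 1  (reverse-KL case, uniform Q)."""
    L = np.asarray(L, float)
    q = np.full(L.size, 1.0 / L.size)
    g = lambda b: np.sum(q * lam / (b + L)) - 1.0   # strictly decreasing in β
    lo = -L.min() + 1e-9    # positivity of the density requires β + L(θ) > 0
    hi = lam - L.min()      # g(hi) <= 0 since λ/(hi + L(θ)) <= 1 pointwise
    for _ in range(iters):  # bisection on the monotone function g
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

L = np.array([0.2, 1.0, 3.0])
beta = solve_beta(L, lam=1.0)
p = (1.0 / L.size) * 1.0 / (beta + L)   # masses q(θ)·λ/(β + L(θ)) with λ = 1
```

The returned masses are strictly positive on the support of $Q$ and sum to one, matching the closed-form density for this choice of $f$.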

Nonlinear ODE for the Normalization Function

Through the implicit function theorem, a nonlinear ODE for the normalization function is derived:
$$N_{Q,z}(\lambda) = \lambda \frac{d}{d\lambda} N_{Q,z}(\lambda) - R_z(P_N),$$
where $R_z(P_N)$ is the empirical risk under a reweighted measure $P_N$, whose density with respect to $Q$ is a function of $\lambda$, $N_{Q,z}(\lambda)$, and $L_z$. Existence and strict monotonicity are guaranteed under mild regularity assumptions, and efficient numerical schemes (e.g., root-finding, ODE solvers) can be applied (Daunas et al., 5 Aug 2025).
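In the KL case the normalization function has the closed form $N_{Q,z}(\lambda) = \lambda \log \int e^{-L_z/\lambda}\, dQ$, and the reweighted measure coincides with the Gibbs solution, which permits a quick numerical sanity check of the ODE on a finite reference. This is an illustrative verification sketch, not an algorithm from the cited papers:

```python
import numpy as np

# Finite uniform reference Q over four models; L holds their empirical risks.
L = np.array([0.3, 1.2, 2.5, 4.0])
q = np.full(L.size, 1.0 / L.size)

def N(lam):
    """KL normalization function: N(λ) = λ log Σ_θ q(θ) exp(-L(θ)/λ)."""
    return lam * np.log(np.sum(q * np.exp(-L / lam)))

def risk(lam):
    """Empirical risk under the Gibbs solution (the reweighted measure here)."""
    w = q * np.exp(-L / lam)
    p = w / w.sum()
    return float(np.sum(p * L))

lam, h = 1.5, 1e-6
dN = (N(lam + h) - N(lam - h)) / (2.0 * h)      # central-difference N'(λ)
residual = N(lam) - (lam * dN - risk(lam))      # ~0 if the ODE holds
```

The residual is limited only by finite-difference error, consistent with the identity $N = \lambda N' - R_z(P_N)$.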

3. Equivalence Transformations and Relation to Constrained Problems

An important theoretical result is that for any two choices of $f$ and $g$ (strictly convex and differentiable), a transformation $v$ exists such that

$$\min_{P \ll Q}~ \int L_z\, dP + \lambda D_f(P\|Q) \quad\text{is equivalent to}\quad \min_{P \ll Q}~ \int v(L_z)\, dP + \lambda D_g(P\|Q)$$

with vv defined by

$$v(x) = \lambda\, \dot{g}\Big(\dot{f}^{-1}\!\left( -\frac{N_{Q,z}(\lambda) + x}{\lambda} \right)\Big) - N'_{Q,z}(\lambda)$$

This establishes that the choice of divergence primarily induces a transformation (reweighting) on the loss function, not a fundamentally distinct regularization effect (Daunas et al., 1 Feb 2024).

Furthermore, under nontriviality of the loss, the solution of the regularized ERM-fDR problem is equivalent to that of the constrained optimization
$$\min_{P \ll Q}~ \int L_z\, dP \qquad \text{s.t.}~~ D_f(P\|Q) \leq \eta,$$
obtained by setting the regularization parameter $\lambda$ such that $D_f(P^*\|Q) = \eta$ (Daunas et al., 20 Feb 2025).
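This equivalence suggests a simple calibration loop: adjust $\lambda$ until the realized divergence of the solution meets the budget $\eta$. A sketch for the KL case on a finite uniform $Q$; the losses, $\eta$, and the log-scale bisection are illustrative, and what makes the search valid is the monotone decrease of $D_f(P^*\|Q)$ in $\lambda$:

```python
import numpy as np

L = np.array([0.1, 0.7, 1.5, 3.0])
q = np.full(L.size, 1.0 / L.size)

def kl_of_solution(lam):
    """D_KL(P*||Q) for the Gibbs solution at regularization strength λ."""
    w = q * np.exp(-L / lam)
    p = w / w.sum()
    mask = p > 0                       # guard 0·log 0 at very small λ
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def calibrate(eta, lo=1e-3, hi=1e3, iters=200):
    """Bisect in log-λ: the divergence falls from ~log|supp Q| toward 0 as λ grows."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if kl_of_solution(mid) > eta else (lo, mid)
    return np.sqrt(lo * hi)

lam = calibrate(eta=0.1)   # λ at which the divergence budget is met
```

The same loop works for other divergences once the corresponding solution density is available.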

4. Applications: Privacy, Fairness, Robustness, and Beyond

ERM-fDR encompasses several advanced applications across machine learning:

Differential Privacy

In differential privacy, adding an f-divergence term to the ERM objective enables the practitioner to control the proximity of the learned distribution to an invariant or prior distribution, which can be leveraged to "smooth" sensitivity and improve privacy–utility tradeoffs. When extending objective-perturbation mechanisms, the added f-divergence term must be verified for convexity, differentiability, and bounded derivatives to ensure privacy guarantees and tractable sensitivity analysis (0912.0071).

Fairness via f-divergence

Fair empirical risk minimization with f-divergence regularization (notably in the f-FERM and related frameworks) penalizes statistical dependence between predictions and sensitive features via a divergence between the joint output–sensitive distribution and the product of its marginals. The variational (convex dual) form of the divergence makes unbiased stochastic optimization possible and allows a principled treatment of fairness–accuracy tradeoffs that is robust to mini-batch size, enforceable even under nontrivial fairness constraints, and provably generalizes at the $O(1/\sqrt{n})$ rate observed in unconstrained ERM (Baharlouei et al., 2023, Fukuchi et al., 2015).

Robustness and Tail Sensitivity

Replacing the expectation in ERM by an f-divergence–induced risk measure enables "tailoring to the tails", i.e., upweighting extreme losses, by choosing $f$ according to a risk-aversion profile encoded via a reference distribution; this construction is tightly connected to Orlicz and Lorentz norms. Ambiguity sets defined by f-divergences lead to distributionally robust optimization schemes; for example, KL-divergence ambiguity sets yield solutions with subexponential tail control (Fröhlich et al., 2022, Coppens et al., 2023).

Semi-supervised and Noisy Label Learning

In semi-supervised self-training, divergence-based empirical risk functions (using f-divergences or the α-Rényi divergence) yield losses that are notably robust to noisy pseudo-labels, owing to the boundedness or insensitivity of certain divergences, and can regularize soft assignments toward uniformity, improving performance and stability over classical methods (Aminian et al., 1 May 2024).

Overparameterized Regimes and Functional Risk

Functional Risk Minimization (FRM) interprets ERM-fDR as a special case where loss is defined over distributions in function space. The natural f-divergence regularizer in FRM penalizes complexity in the space of functions, providing an explicit mechanism for choosing "simplest" solutions among many overfit predictors (Alet et al., 30 Dec 2024).

5. Technical and Practical Considerations

Several critical technical ingredients arise in ERM-fDR:

Support Restriction and Inductive Bias

Regardless of the empirical data, the solution's support is always contained within the reference measure's support. This can be highly desirable (if $Q$ encodes valid prior knowledge), but may result in severe bias if $Q$ is poorly specified (Daunas et al., 2023, Daunas et al., 1 Feb 2024).

Choice of Divergence

The selection of $f$ influences both mathematical tractability and practical properties. For instance, KL regularization yields exponential weights (the Gibbs measure), while total variation or other bounded divergences may confer robustness to noise (as in the fairness and robust-risk contexts above) (Aminian et al., 1 May 2024).

Duality and Algorithmic Solvability

Utilizing the dual formulation not only yields theoretical insight but also greatly simplifies optimization: normalization reduces to one-dimensional root-finding (on $\beta$), and the ODE-based approach provides an efficient algorithm under broad regularity conditions (Daunas et al., 5 Aug 2025, Daunas et al., 20 Feb 2025).

Table 1: Primal and Dual Formulations in ERM-fDR

| Formulation | Optimization variable | Constraint/normalization |
| --- | --- | --- |
| Primal | Probability measure $P \ll Q$ | $\int dP = 1$; support in $\operatorname{supp}(Q)$ |
| Dual | Scalar $\beta = N_{Q,z}(\lambda)$ | $\int \dot{f}^{-1}\!\left(-(\beta + L_z(\theta))/\lambda\right) dQ = 1$ |

Conditions for Existence and Uniqueness

Strict convexity and differentiability of $f$, together with "separability" of $L_z$ (nonconstancy over the support of $Q$), ensure existence and uniqueness of the solution, monotonicity of the normalization function, and dual–primal equivalence (Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).

6. Relationships to Alternative Regularization and Extensions

ERM-fDR subsumes and generalizes classical regularization paradigms:

  • L₂-regularization can be seen as a specific (quadratic) f-divergence;
  • Reverse relative entropy and Jeffrey’s divergence produce differing bias and tail properties through their corresponding solutions (Daunas et al., 1 Feb 2024);
  • Alternative divergences induce equivalent regularization upon suitable transformation of the empirical risk (Daunas et al., 1 Feb 2024).

Type-II regularization, which reverses the arguments of the KL divergence, leads to solutions equivalent to those of Type-I after a logarithmic transformation of the empirical risk. However, the bias toward the prior remains intrinsic in both cases (Daunas et al., 2023).

7. Summary and Theoretical Implications

ERM-fDR provides a formally unified and flexible foundation for incorporating structured inductive bias, robustness, privacy, and fairness in risk minimization problems. Its key theoretical implications include

  • uniqueness and an explicit closed-form density for the solution (in terms of $f$ and the empirical risk),
  • support preservation that imposes strong prior-dominated constraints,
  • equivalence results between unconstrained and constrained divergence formulations,
  • tractable dual optimization reducing to scalar non-linear equations or ODEs,
  • the ability to tailor learning objectives to marginal, tail, or dependency properties using suitable divergence and reference measure choices.

Ongoing research has continued to elaborate efficient algorithms, practical calibrations, and connections to broader function-space learning and modern over-parameterized regimes (Alet et al., 30 Dec 2024, Daunas et al., 1 Feb 2024, Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).


Key citations: (Daunas et al., 1 Feb 2024, Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025, 0912.0071, Fukuchi et al., 2015, Baharlouei et al., 2023, Daunas et al., 2023, Fröhlich et al., 2022, Coppens et al., 2023, Aminian et al., 1 May 2024, Alet et al., 30 Dec 2024)
