ERM with f-Divergence Regularization

Updated 7 August 2025
  • The paper's main contribution is deriving a unique closed-form solution for ERM-fDR by integrating an f-divergence penalty with the empirical risk objective.
  • The methodology unifies multiple divergence schemes, enabling a dual optimization framework that simplifies the infinite-dimensional problem to a one-dimensional scalar equation.
  • Key implications include enhanced robustness, privacy, and fairness, with applications extending to semi-supervised learning and robust risk minimization.

Empirical Risk Minimization with f-Divergence Regularization (ERM-fDR) is a broad, mathematically rigorous framework in which a conventional empirical risk objective is augmented by a penalty term: the f-divergence between the probability measure induced by a candidate predictor and a reference measure. The approach unifies a wide class of regularization and robustness schemes, including those based on the Kullback–Leibler (KL) divergence, total variation, χ² divergence, and Jensen–Shannon divergence, each prescribing distinct inductive biases or robustness properties, and has substantial applications in privacy, fairness, robustness, and semi-supervised learning.

1. Mathematical Formulation and Solution Characterization

In ERM-fDR, the optimization problem is

$$\min_{P \ll Q}~ \int L_z(\theta)\, dP(\theta) + \lambda D_f(P\|Q)$$

where $L_z(\theta)$ is the empirical risk (e.g., the loss on the dataset $z$ at model parameter $\theta$), $Q$ is a reference probability measure over model space, $\lambda > 0$ is the regularization strength, and $D_f(P\|Q)$ denotes the f-divergence
$$D_f(P\|Q) = \int f\!\left( \frac{dP}{dQ} \right) dQ,$$
with $f$ a strictly convex, differentiable function satisfying $f(1)=0$.
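As a concrete check of this definition, the divergence can be evaluated directly for discrete measures. The following is a minimal sketch; the distributions, function names, and generator choices are illustrative, not taken from the cited papers:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_θ q(θ) f(p(θ)/q(θ)) for discrete p, q with full support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([1/3, 1/3, 1/3])

kl   = f_divergence(p, q, lambda x: x * np.log(x))   # Kullback–Leibler
rkl  = f_divergence(p, q, lambda x: -np.log(x))      # reverse KL
chi2 = f_divergence(p, q, lambda x: (x - 1.0)**2)    # chi-squared
```

Each generator above is strictly convex with $f(1)=0$, so all three values are nonnegative and vanish exactly when $P = Q$.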

Unique Solution and Closed-form Densities

Under the mild condition that $f$ is strictly convex and differentiable, the unique solution $P^*$ is characterized by a density with respect to $Q$ of the form

$$\frac{dP^*}{dQ}(\theta) = \dot{f}^{-1}\!\left( - \frac{\beta + L_z(\theta)}{\lambda} \right)$$

where $\dot{f}^{-1}$ denotes the inverse of the derivative of $f$, and the normalization term $\beta$ is chosen such that $P^*$ integrates to one:
$$\int \dot{f}^{-1}\!\left( - \frac{\beta + L_z(\theta)}{\lambda} \right) dQ(\theta) = 1.$$

Specific choices for ff recover well-known schemes:

  • For $f(x) = x \log x$, the solution is the Gibbs measure with density $\exp\!\left( - \frac{\beta + L_z(\theta)}{\lambda} \right)$;
  • For $f(x) = -\log x$, the solution has density $\lambda / (\beta + L_z(\theta))$;
  • For other $f$ (e.g., Jensen–Shannon, Hellinger), the solution is dictated by the corresponding $\dot{f}^{-1}$ (Daunas et al., 1 Feb 2024).
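For the KL case, the closed-form density can be evaluated directly when $Q$ is supported on finitely many models. A minimal sketch, where the loss values, $\lambda$, and function name are illustrative:

```python
import numpy as np

def gibbs_solution(L, lam):
    """KL case f(x) = x log x: dP*/dQ(θ) ∝ exp(-L_z(θ)/λ).

    The additive constant β only normalizes the density, so we normalize
    explicitly; subtracting min(L) avoids underflow without changing the result.
    """
    w = np.exp(-(np.asarray(L, float) - np.min(L)) / lam)
    return w / w.sum()

L = np.array([0.1, 0.5, 2.0, 5.0])   # empirical risks of four candidate models
p = gibbs_solution(L, lam=0.5)        # probability masses under P* (uniform Q)
```

Smaller $\lambda$ concentrates $P^*$ on low-loss models; larger $\lambda$ pulls it back toward $Q$.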

Support and Inductive Bias

A defining property, irrespective of $f$, is that $dP^*/dQ$ remains strictly positive on $\operatorname{supp}(Q)$, so the learned measure $P^*$ cannot assign probability outside the support of the reference. This "support preservation" induces a hard inductive bias determined by $Q$, one that persists even against strong evidence from the data (Daunas et al., 1 Feb 2024, Daunas et al., 2023).

2. Dual Optimization and the Role of the Normalization Function

The primal ERM-fDR problem is computationally challenging due to the presence of a normalization constant. By duality, it can be reduced to a scalar optimization over a Lagrange multiplier $\beta$, leveraging the Legendre–Fenchel (LF) transform
$$f^*(t) = \sup_{x} \{ t x - f(x) \}.$$
The dual objective function is

$$G(\beta) = \lambda \int f^*\!\left(-\frac{\beta + L_z(\theta)}{\lambda}\right) dQ(\theta) + \beta$$

The minimizer $\beta = N_{Q,z}(\lambda)$ (the normalization function) is the unique $\beta$ satisfying

$$\int \dot{f}^{-1}\!\left( - \frac{\beta + L_z(\theta)}{\lambda} \right) dQ(\theta) = 1.$$

This dual approach is both theoretically rigorous and highly efficient computationally, as it reduces the often infinite-dimensional problem to a one-dimensional nonlinear equation (Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).
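The scalar dual equation lends itself to one-dimensional root-finding. The sketch below does this for the reverse-KL case $f(x) = -\log x$, where $\dot{f}^{-1}(-(\beta+L)/\lambda) = \lambda/(\beta+L)$; the uniform reference, loss values, and bracketing choices are illustrative:

```python
import numpy as np

def solve_beta(L, lam, iters=200):
    """Find β solving  Σ_θ q(θ) λ/(β + L(θ)) = 1  (reverse-KL case, uniform Q)."""
    L = np.asarray(L, float)
    q = np.full(L.size, 1.0 / L.size)
    g = lambda b: np.sum(q * lam / (b + L)) - 1.0   # strictly decreasing in β
    lo = -L.min() + 1e-9    # positivity of the density requires β + L(θ) > 0
    hi = lam - L.min()      # g(hi) <= 0 since λ/(hi + L(θ)) <= 1 pointwise
    for _ in range(iters):  # bisection on the monotone function g
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

L = np.array([0.2, 1.0, 3.0])
beta = solve_beta(L, lam=1.0)
p = (1.0 / L.size) * 1.0 / (beta + L)   # masses q(θ)·λ/(β + L(θ)) with λ = 1
```

The returned masses are strictly positive on the support of $Q$ and sum to one, matching the closed-form density for this choice of $f$.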

Nonlinear ODE for the Normalization Function

Through the implicit function theorem, a nonlinear ODE for the normalization function is derived:
$$N_{Q,z}(\lambda) = \lambda \frac{d}{d\lambda} N_{Q,z}(\lambda) - R_z(P_N),$$
where $R_z(P_N)$ is the empirical risk under a reweighted measure $P_N$, whose density with respect to $Q$ is a function of $\lambda$, $N_{Q,z}(\lambda)$, and $L_z$. Existence and strict monotonicity are guaranteed under mild regularity assumptions, and efficient numerical schemes (e.g., root-finding, ODE solvers) can be applied (Daunas et al., 5 Aug 2025).
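In the KL case the normalization function has the closed form $N_{Q,z}(\lambda) = \lambda \log \int e^{-L_z/\lambda}\, dQ$, and the reweighted measure coincides with the Gibbs solution, which permits a quick numerical sanity check of the ODE on a finite reference. This is an illustrative verification sketch, not an algorithm from the cited papers:

```python
import numpy as np

# Finite uniform reference Q over four models; L holds their empirical risks.
L = np.array([0.3, 1.2, 2.5, 4.0])
q = np.full(L.size, 1.0 / L.size)

def N(lam):
    """KL normalization function: N(λ) = λ log Σ_θ q(θ) exp(-L(θ)/λ)."""
    return lam * np.log(np.sum(q * np.exp(-L / lam)))

def risk(lam):
    """Empirical risk under the Gibbs solution (the reweighted measure here)."""
    w = q * np.exp(-L / lam)
    p = w / w.sum()
    return float(np.sum(p * L))

lam, h = 1.5, 1e-6
dN = (N(lam + h) - N(lam - h)) / (2.0 * h)      # central-difference N'(λ)
residual = N(lam) - (lam * dN - risk(lam))      # ~0 if the ODE holds
```

The residual is limited only by finite-difference error, consistent with the identity $N = \lambda N' - R_z(P_N)$.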

3. Equivalence Transformations and Relation to Constrained Problems

An important theoretical result is that for any two choices of $f$ and $g$ (strictly convex and differentiable), a transformation $v$ exists such that

$$\min_{P \ll Q}~ \int L_z\, dP + \lambda D_f(P\|Q) \quad\text{is equivalent to}\quad \min_{P \ll Q}~ \int v(L_z)\, dP + \lambda D_g(P\|Q)$$

with vv defined by

$$v(x) = \lambda\, \dot{g}\Big(\dot{f}^{-1}\!\left( -\frac{N_{Q,z}(\lambda) + x}{\lambda} \right)\Big) - N'_{Q,z}(\lambda)$$

This establishes that the choice of divergence primarily induces a transformation (reweighting) on the loss function, not a fundamentally distinct regularization effect (Daunas et al., 1 Feb 2024).

Furthermore, under nontriviality of the loss, the solution of the regularized ERM-fDR problem is equivalent to that of the constrained optimization
$$\min_{P \ll Q}~ \int L_z\, dP \qquad \text{s.t.}~~ D_f(P\|Q) \leq \eta,$$
obtained by setting the regularization parameter $\lambda$ such that $D_f(P^*\|Q) = \eta$ (Daunas et al., 20 Feb 2025).
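This equivalence suggests a simple calibration loop: adjust $\lambda$ until the realized divergence of the solution meets the budget $\eta$. A sketch for the KL case on a finite uniform $Q$; the losses, $\eta$, and the log-scale bisection are illustrative, and what makes the search valid is the monotone decrease of $D_f(P^*\|Q)$ in $\lambda$:

```python
import numpy as np

L = np.array([0.1, 0.7, 1.5, 3.0])
q = np.full(L.size, 1.0 / L.size)

def kl_of_solution(lam):
    """D_KL(P*||Q) for the Gibbs solution at regularization strength λ."""
    w = q * np.exp(-L / lam)
    p = w / w.sum()
    mask = p > 0                       # guard 0·log 0 at very small λ
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def calibrate(eta, lo=1e-3, hi=1e3, iters=200):
    """Bisect in log-λ: the divergence falls from ~log|supp Q| toward 0 as λ grows."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if kl_of_solution(mid) > eta else (lo, mid)
    return np.sqrt(lo * hi)

lam = calibrate(eta=0.1)   # λ at which the divergence budget is met
```

The same loop works for other divergences once the corresponding solution density is available.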

4. Applications: Privacy, Fairness, Robustness, and Beyond

ERM-fDR encompasses several advanced applications across machine learning:

Differential Privacy

In differential privacy, adding an f-divergence term to the ERM objective enables the practitioner to control the proximity of the learned distribution to an invariant or prior distribution, which can be leveraged to "smooth" sensitivity and improve privacy–utility tradeoffs. When extending objective-perturbation mechanisms, the added f-divergence term must be verified for convexity, differentiability, and bounded derivatives to ensure privacy guarantees and tractable sensitivity analysis (0912.0071).

Fairness via f-divergence

Fair empirical risk minimization with f-divergence regularization (notably in the f-FERM and related frameworks) penalizes statistical dependence between predictions and sensitive features via a divergence between the joint output–sensitive distribution and the product of its marginals. The variational (convex dual) form of the divergence makes unbiased stochastic optimization possible and allows a principled treatment of fairness–accuracy tradeoffs that is robust to mini-batch size, enforceable even under nontrivial fairness constraints, and provably generalizes at the $O(1/\sqrt{n})$ rate observed in unconstrained ERM (Baharlouei et al., 2023, Fukuchi et al., 2015).

Robustness and Tail Sensitivity

Replacing the expectation in ERM by an f-divergence–induced risk measure enables "tailoring to the tails", i.e., upweighting extreme losses, by choosing $f$ according to a risk-aversion profile encoded via a reference distribution; this construction is tightly connected to Orlicz and Lorentz norms. Ambiguity sets defined by f-divergences lead to distributionally robust optimization schemes; for example, KL-divergence ambiguity sets yield solutions with subexponential tail control (Fröhlich et al., 2022, Coppens et al., 2023).

Semi-supervised and Noisy Label Learning

In semi-supervised self-training, divergence-based empirical risk functions (using f-divergences or the α-Rényi divergence) yield losses that are notably robust to noisy pseudo-labels, owing to the boundedness or insensitivity of certain divergences, and can regularize soft assignments toward uniformity, improving performance and stability over classical methods (Aminian et al., 1 May 2024).

Overparameterized Regimes and Functional Risk

Functional Risk Minimization (FRM) interprets ERM-fDR as a special case where loss is defined over distributions in function space. The natural f-divergence regularizer in FRM penalizes complexity in the space of functions, providing an explicit mechanism for choosing "simplest" solutions among many overfit predictors (Alet et al., 30 Dec 2024).

5. Technical and Practical Considerations

Several critical technical ingredients arise in ERM-fDR:

Support Restriction and Inductive Bias

Regardless of the empirical data, the solution's support is always contained within the reference measure's support. This can be highly desirable (if $Q$ encodes valid prior knowledge), but may result in severe bias if $Q$ is poorly specified (Daunas et al., 2023, Daunas et al., 1 Feb 2024).

Choice of Divergence

The selection of $f$ influences both mathematical tractability and practical properties. For instance, KL regularization yields exponential weights (the Gibbs measure), while total variation or other bounded divergences may confer robustness to noise (as in the fairness and robust-risk contexts above) (Aminian et al., 1 May 2024).

Duality and Algorithmic Solvability

Utilizing the dual formulation not only yields theoretical insight but also greatly simplifies optimization: normalization reduces to one-dimensional root-finding (on $\beta$), and the ODE-based approach provides an efficient algorithm under broad regularity conditions (Daunas et al., 5 Aug 2025, Daunas et al., 20 Feb 2025).

Table 1: Primal and Dual Formulations in ERM-fDR

| Formulation | Optimization variable | Constraint/normalization |
| --- | --- | --- |
| Primal | Probability measure $P \ll Q$ | $\int dP = 1$; support in $\operatorname{supp}(Q)$ |
| Dual | Scalar $\beta = N_{Q,z}(\lambda)$ | $\int \dot{f}^{-1}\!\left(-(\beta + L_z(\theta))/\lambda\right) dQ = 1$ |

Conditions for Existence and Uniqueness

Strict convexity and differentiability of $f$, together with "separability" of $L_z$ (nonconstancy over the support of $Q$), ensure existence and uniqueness of the solution, monotonicity of the normalization function, and dual–primal equivalence (Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).

6. Relationships to Alternative Regularization and Extensions

ERM-fDR subsumes and generalizes classical regularization paradigms:

  • L₂-regularization can be seen as a specific (quadratic) f-divergence;
  • Reverse relative entropy and Jeffrey’s divergence produce differing bias and tail properties through their corresponding solutions (Daunas et al., 1 Feb 2024);
  • Alternative divergences induce equivalent regularization upon suitable transformation of the empirical risk (Daunas et al., 1 Feb 2024).

Type-II regularization, which reverses the arguments of the KL divergence, leads to solutions equivalent to those of Type-I after a logarithmic transformation of the empirical risk. However, the bias toward the prior remains intrinsic in both cases (Daunas et al., 2023).

7. Summary and Theoretical Implications

ERM-fDR provides a formally unified and flexible foundation for incorporating structured inductive bias, robustness, privacy, and fairness in risk minimization problems. Its key theoretical implications include

  • uniqueness and an explicit closed-form density for the solution (in terms of $f$ and the empirical risk),
  • support preservation that imposes strong prior-dominated constraints,
  • equivalence results between unconstrained and constrained divergence formulations,
  • tractable dual optimization reducing to scalar non-linear equations or ODEs,
  • the ability to tailor learning objectives to marginal, tail, or dependency properties using suitable divergence and reference measure choices.

Ongoing research has continued to elaborate efficient algorithms, practical calibrations, and connections to broader function-space learning and modern over-parameterized regimes (Alet et al., 30 Dec 2024, Daunas et al., 1 Feb 2024, Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).


Key citations: (Daunas et al., 1 Feb 2024, Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025, 0912.0071, Fukuchi et al., 2015, Baharlouei et al., 2023, Daunas et al., 2023, Fröhlich et al., 2022, Coppens et al., 2023, Aminian et al., 1 May 2024, Alet et al., 30 Dec 2024)
