ERM with f-Divergence Regularization
- The paper's main contribution is deriving a unique closed-form solution for ERM-fDR by integrating an f-divergence penalty with the empirical risk objective.
- The methodology unifies multiple divergence schemes, enabling a dual optimization framework that simplifies the infinite-dimensional problem to a one-dimensional scalar equation.
- Key implications include enhanced robustness, privacy, and fairness, with applications extending to semi-supervised learning and robust risk minimization.
Empirical Risk Minimization with f-Divergence Regularization (ERM-fDR) is a broad, mathematically rigorous framework in which a conventional empirical risk objective is augmented by a penalty term given by an f-divergence between a candidate, predictor-induced probability measure and a reference measure. The approach unifies a wide class of regularization and robustness schemes, including those based on the Kullback–Leibler divergence, total variation, the χ² divergence, and the Jensen–Shannon divergence, each prescribing distinct inductive biases or robustness properties, and it has substantial applications in privacy, fairness, robustness, and semi-supervised learning.
1. Mathematical Formulation and Solution Characterization
In ERM-fDR, the optimization problem is

$$
P^\star \in \arg\min_{P \ll Q} \; \int L(\theta)\,\mathrm{d}P(\theta) + \lambda\, D_f(P \| Q),
$$

where $L(\theta)$ is the empirical risk (e.g., the average loss over the dataset at model parameter $\theta$), $Q$ is a reference probability measure over the model space, $\lambda > 0$ is the regularization strength, and $D_f(P \| Q)$ denotes the f-divergence

$$
D_f(P \| Q) = \int f\!\left(\frac{\mathrm{d}P}{\mathrm{d}Q}(\theta)\right)\mathrm{d}Q(\theta),
$$

with $f$ a strictly convex, differentiable function satisfying $f(1) = 0$.
Unique Solution and Closed-form Densities
Under the mild condition that $f$ is strictly convex and differentiable, the unique solution $P^\star$ is characterized by a density with respect to $Q$ of the form

$$
\frac{\mathrm{d}P^\star}{\mathrm{d}Q}(\theta) = \big(f'\big)^{-1}\!\left(-\frac{L(\theta) + N_Q(\lambda)}{\lambda}\right),
$$

where $(f')^{-1}$ denotes the inverse of the derivative of $f$, and the normalization term $N_Q(\lambda)$ is chosen such that the density integrates to one:

$$
\int \big(f'\big)^{-1}\!\left(-\frac{L(\theta) + N_Q(\lambda)}{\lambda}\right)\mathrm{d}Q(\theta) = 1.
$$
Specific choices of $f$ recover well-known schemes:
- For $f(x) = x \log x$ (relative entropy), the solution is the Gibbs measure $\frac{\mathrm{d}P^\star}{\mathrm{d}Q}(\theta) \propto \exp\!\big(-L(\theta)/\lambda\big)$;
- For $f(x) = -\log x$ (reverse relative entropy), the solution is $\frac{\mathrm{d}P^\star}{\mathrm{d}Q}(\theta) = \frac{\lambda}{L(\theta) + N_Q(\lambda)}$;
- For other choices of $f$ (e.g., Jensen–Shannon, Hellinger), the solution is dictated by the corresponding $(f')^{-1}$ (Daunas et al., 1 Feb 2024).
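To make the closed form concrete, the following minimal sketch (illustrative values only, not taken from the cited papers) evaluates the ERM-fDR solution over a small finite model space in the KL case, where the density reduces to a Gibbs reweighting of the reference measure.

```python
import numpy as np

# Illustrative finite model space with 5 candidate models (hypothetical values).
L = np.array([0.9, 0.4, 0.1, 0.7, 1.3])   # empirical risk of each model
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])   # reference measure Q
lam = 0.5                                  # regularization strength lambda

# KL case, f(x) = x log x: the solution is the Gibbs measure
#   dP*/dQ(theta) proportional to exp(-L(theta) / lambda),
# with the normalization function absorbed into the proportionality constant.
w = q * np.exp(-L / lam)
p_star = w / w.sum()

print("ERM-fDR (KL) solution:", np.round(p_star, 4))
# Support preservation: any model with q[i] = 0 keeps p_star[i] = 0, and every
# model with q[i] > 0 retains strictly positive mass, regardless of its risk.
```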
Support and Inductive Bias
A defining property, irrespective of the choice of $f$, is that the density $\frac{\mathrm{d}P^\star}{\mathrm{d}Q}$ remains strictly positive on the support of $Q$, so the learned measure cannot assign probability mass outside the support of the reference. This "support preservation" induces a hard inductive bias determined by $Q$, one that persists even in the face of strong evidence from the data (Daunas et al., 1 Feb 2024, Daunas et al., 2023).
2. Dual Optimization and the Role of the Normalization Function
The primal ERM-fDR problem is computationally challenging due to the presence of the normalization constant. By duality, the problem can be reduced to a scalar optimization over a Lagrange multiplier $\beta$, leveraging the Legendre–Fenchel (LF) transform $f^*$ of $f$. The dual objective function is

$$
g(\beta) = \beta + \lambda \int f^*\!\left(-\frac{L(\theta) + \beta}{\lambda}\right)\mathrm{d}Q(\theta).
$$

The minimizer $\beta^\star = N_Q(\lambda)$ (the normalization function) is the unique $\beta$ for which

$$
\int \big(f'\big)^{-1}\!\left(-\frac{L(\theta) + \beta}{\lambda}\right)\mathrm{d}Q(\theta) = 1,
$$

which is exactly the normalization condition of the primal solution.
This dual approach is both theoretically rigorous and highly efficient computationally, as it reduces the often infinite-dimensional problem to a one-dimensional nonlinear equation (Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).
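The scalar structure of the dual can be exercised directly: the sketch below (a minimal illustration under the KL instantiation; variable names and data are hypothetical) recovers the normalization function $N_Q(\lambda)$ by one-dimensional root-finding on the normalization condition, and the same pattern applies to any $f$ for which $(f')^{-1}$ is available.

```python
import numpy as np
from scipy.optimize import brentq

# Illustrative finite model space (hypothetical values).
L = np.array([0.9, 0.4, 0.1, 0.7, 1.3])   # empirical risks
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])   # reference measure Q
lam = 0.5                                  # regularization strength

# KL case: f(x) = x log x, hence (f')^{-1}(y) = exp(y - 1).
f_prime_inv = lambda y: np.exp(y - 1.0)

# Normalization condition: find beta = N_Q(lambda) such that
#   sum_i q_i * (f')^{-1}( -(L_i + beta) / lambda ) = 1.
def normalization_gap(beta):
    return np.sum(q * f_prime_inv(-(L + beta) / lam)) - 1.0

beta_star = brentq(normalization_gap, -10.0, 10.0)   # 1-D root-finding
p_star = q * f_prime_inv(-(L + beta_star) / lam)

print("N_Q(lambda) =", round(beta_star, 4))
print("density integrates to", round(p_star.sum(), 6))
```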
Nonlinear ODE for the Normalization Function
Through the implicit function theorem, a nonlinear ODE for the normalization function is derived:

$$
\lambda\,\frac{\mathrm{d}N_Q(\lambda)}{\mathrm{d}\lambda} = \bar{L}(\lambda) + N_Q(\lambda),
$$

where $\bar{L}(\lambda)$ is the empirical risk under a reweighted measure $\hat{P}_\lambda$, whose density with respect to $Q$ is a function of $\lambda$ and $N_Q(\lambda)$, namely proportional to $\big((f')^{-1}\big)'\!\big(-\tfrac{L(\theta) + N_Q(\lambda)}{\lambda}\big)$. Existence and strict monotonicity are guaranteed under mild regularity assumptions, and efficient numerical schemes (e.g., root-finding, ODE solvers) can be applied (Daunas et al., 5 Aug 2025).
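As a numerical sanity check of the ODE (a sketch under the assumptions above; the KL case is convenient because the reweighted measure coincides with the solution itself and the normalization function admits a closed form to compare against):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative finite model space (hypothetical values, as in the earlier sketches).
L = np.array([0.9, 0.4, 0.1, 0.7, 1.3])
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

def dN_dlambda(lam, N):
    # KL case: (f')^{-1}(y) = exp(y - 1), so the reweighted measure equals the
    # ERM-fDR solution itself; the ODE reads lam * dN/dlam = E_phat[L] + N.
    w = q * np.exp(-(L + N) / lam - 1.0)
    p_hat = w / w.sum()
    risk_hat = np.dot(p_hat, L)
    return (risk_hat + N) / lam

# Initial condition from the KL closed form N_Q(lam) = lam * (log sum_i q_i e^{-L_i/lam} - 1).
S = lambda lam: np.sum(q * np.exp(-L / lam))
lam0, lam1 = 0.25, 2.0
N0 = lam0 * (np.log(S(lam0)) - 1.0)

sol = solve_ivp(dN_dlambda, (lam0, lam1), [N0], rtol=1e-8, atol=1e-10)
print("ODE-integrated N_Q(2.0):", round(float(sol.y[0, -1]), 5))
print("closed-form    N_Q(2.0):", round(2.0 * (np.log(S(2.0)) - 1.0), 5))
```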
3. Equivalence Transformations and Relation to Constrained Problems
An important theoretical result is that for any two choices $f_1$ and $f_2$ (both strictly convex and differentiable), a transformation $v$ of the empirical risk exists such that the ERM-fDR solution under $f_1$ with risk $L$ coincides with the ERM-fDR solution under $f_2$ with the transformed risk $v \circ L$,

with $v$ defined (up to the normalization constants $N_{Q,1}(\lambda)$ and $N_{Q,2}(\lambda)$ of the two problems) by equating the optimal densities:

$$
\big(f_2'\big)^{-1}\!\left(-\frac{v(L(\theta)) + N_{Q,2}(\lambda)}{\lambda}\right)
= \big(f_1'\big)^{-1}\!\left(-\frac{L(\theta) + N_{Q,1}(\lambda)}{\lambda}\right),
\quad\text{i.e.,}\quad
v(L(\theta)) = -\lambda\, f_2'\!\left(\frac{\mathrm{d}P^\star_1}{\mathrm{d}Q}(\theta)\right) - N_{Q,2}(\lambda).
$$
This establishes that the choice of divergence primarily induces a transformation (reweighting) on the loss function, not a fundamentally distinct regularization effect (Daunas et al., 1 Feb 2024).
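The equivalence can be verified numerically. The sketch below (illustrative data; the additive constant in $v$ is set to zero, which is absorbed by the second problem's normalization) checks that the reverse-KL problem applied to the transformed risk reproduces the KL solution for the original risk.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical finite model space, reused from the earlier sketches.
L = np.array([0.9, 0.4, 0.1, 0.7, 1.3])
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
lam = 0.5

def solve_fdr(risk, f_prime_inv, lo, hi):
    """Root-find the normalization and return the probability vector q * dP/dQ."""
    gap = lambda b: np.sum(q * f_prime_inv(-(risk + b) / lam)) - 1.0
    beta = brentq(gap, lo, hi)
    return q * f_prime_inv(-(risk + beta) / lam)

# f1(x) = x log x (KL): (f1')^{-1}(y) = exp(y - 1).
p1 = solve_fdr(L, lambda y: np.exp(y - 1.0), -10.0, 10.0)

# f2(x) = -log x (reverse KL): f2'(x) = -1/x, so (f2')^{-1}(y) = -1/y (for y < 0).
# Transformed risk: v(L)(theta) = -lam * f2'( dP1/dQ(theta) ), additive constant 0.
density1 = p1 / q
v_of_L = lam / density1

p2 = solve_fdr(v_of_L, lambda y: -1.0 / y, -0.9 * v_of_L.min(), 10.0)
print("max |p1 - p2| =", np.max(np.abs(p1 - p2)))   # ~0: identical solutions
```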
Furthermore, under nontriviality of the loss, the solution of the regularized ERM-fDR problem is equivalent to that of the constrained optimization

$$
\min_{P \ll Q} \int L(\theta)\,\mathrm{d}P(\theta) \quad \text{subject to} \quad D_f(P \| Q) \le c,
$$

by setting the regularization parameter $\lambda$ such that $D_f(P^\star_\lambda \| Q) = c$ (Daunas et al., 20 Feb 2025).
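In practice, this equivalence suggests calibrating $\lambda$ so that the induced divergence meets a prescribed budget $c$. The sketch below (KL case, hypothetical values) does this by one-dimensional root-finding, exploiting the fact that $D_f(P^\star_\lambda \| Q)$ decreases as $\lambda$ grows.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical finite model space (illustrative values).
L = np.array([0.9, 0.4, 0.1, 0.7, 1.3])
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

def kl_solution(lam):
    w = q * np.exp(-L / lam)          # Gibbs reweighting of Q
    return w / w.sum()

def divergence(lam):
    p = kl_solution(lam)
    return np.sum(p * np.log(p / q))  # D_f(P*_lam || Q) for f(x) = x log x

c = 0.05                              # divergence budget of the constrained problem
# The divergence is monotonically decreasing in lam, so the root is unique.
lam_star = brentq(lambda lam: divergence(lam) - c, 1e-2, 1e2)
print("calibrated lambda:", round(lam_star, 4),
      "-> D_f(P*||Q) =", round(divergence(lam_star), 4))
```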
4. Applications: Privacy, Fairness, Robustness, and Beyond
ERM-fDR encompasses several advanced applications across machine learning:
Differential Privacy
In differential privacy, adding an f-divergence term to the ERM objective enables the practitioner to control the proximity of the learned distribution to a reference or prior, which can be leveraged to "smooth" sensitivity and improve privacy-utility tradeoffs. When extending objective perturbation mechanisms, the addition of an f-divergence term requires verifying convexity, differentiability, and bounded derivatives to ensure privacy guarantees and a tractable sensitivity analysis (0912.0071).
Fairness via f-divergence
Fair empirical risk minimization with f-divergence regularization (notably, in the f-FERM and related frameworks) penalizes statistical dependence between predictions and sensitive features via a divergence between the joint distribution of outputs and sensitive attributes and the product of its marginals. The variational (convex dual) form of the divergence makes unbiased stochastic optimization possible and allows for a principled treatment of fairness–accuracy tradeoffs that is robust to mini-batch size, enforceable even with nontrivial fairness constraints, and provably generalizes at the rate observed for unconstrained ERM (Baharlouei et al., 2023, Fukuchi et al., 2015).
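For intuition, the sketch below computes a plug-in version of such a penalty on a single mini-batch: the KL divergence (an f-divergence) between the empirical joint distribution of soft predictions and a binary sensitive attribute and the product of its marginals, i.e., their mutual information. This is only illustrative; f-FERM itself relies on the variational dual form to obtain unbiased stochastic gradients, and all names and data here are hypothetical.

```python
import numpy as np

# Hypothetical mini-batch: soft binary predictions and a binary sensitive attribute.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, size=64)   # P(Y_hat = 1 | x_i)
s = rng.integers(0, 2, size=64)            # sensitive attribute s_i in {0, 1}

# Empirical joint distribution over (Y_hat, S) built from the soft predictions.
joint = np.zeros((2, 2))
for a in (0, 1):
    mask = (s == a)
    joint[1, a] = p_hat[mask].sum() / len(s)
    joint[0, a] = (1.0 - p_hat[mask]).sum() / len(s)

marg_y = joint.sum(axis=1, keepdims=True)
marg_s = joint.sum(axis=0, keepdims=True)
product = marg_y * marg_s

# KL instance of D_f(P(Y_hat, S) || P(Y_hat) P(S)): the mutual information
# between predictions and the sensitive attribute, used as a fairness penalty.
penalty = np.sum(joint * np.log(joint / product))
print("fairness penalty (nats):", round(penalty, 5))
```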
Robustness and Tail Sensitivity
Replacing the expectation in ERM by an f-divergence–induced risk measure enables "tailoring to the tails" (i.e., upweighting extreme losses) by choosing $f$ corresponding to a risk aversion profile encoded via a reference distribution, which is tightly connected to Orlicz and Lorentz norms. Ambiguity sets defined by f-divergences lead to distributionally robust optimization schemes; for example, KL-divergence ambiguity sets result in solutions with subexponential tail control (Fröhlich et al., 2022, Coppens et al., 2023).
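As a concrete example of the ambiguity-set view, the sketch below evaluates the worst-case expected loss over a KL ball around the empirical distribution using its standard scalar dual, $\inf_{t>0}\, t\log \mathbb{E}_{P_n}[e^{\ell/t}] + t\rho$; the data and radius are hypothetical, and this is not the exact construction of the cited papers.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical sample of losses and KL-ball radius.
rng = np.random.default_rng(1)
losses = rng.exponential(scale=1.0, size=500)
rho = 0.1

def kl_robust_risk(losses, rho):
    """Worst-case E_Q[loss] over {Q : KL(Q || P_n) <= rho}, via the scalar dual."""
    def dual(t):
        z = losses / t
        m = z.max()                                   # log-mean-exp, computed stably
        return t * (m + np.log(np.mean(np.exp(z - m)))) + t * rho
    res = minimize_scalar(dual, bounds=(1e-3, 1e3), method="bounded")
    return res.fun

print("empirical risk:", round(losses.mean(), 4))
print("KL-robust risk:", round(kl_robust_risk(losses, rho), 4))
```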
Semi-supervised and Noisy Label Learning
In semi-supervised self-training, divergence-based empirical risk functions, using f-divergences or the α-Rényi divergence, yield losses that are notably robust to noisy pseudo-labels (due to the boundedness or insensitivity of certain divergences) and can regularize soft assignments toward uniformity, improving performance and stability over classical methods (Aminian et al., 1 May 2024).
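The boundedness argument is easy to see with the Jensen–Shannon divergence: the JS-based discrepancy between a prediction and even a confidently wrong pseudo-label never exceeds $\log 2$ nats, whereas a cross-entropy loss would diverge. The snippet below is a minimal illustration, not the exact loss construction of the cited work.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

prediction   = np.array([0.98, 0.01, 0.01])   # model's soft prediction
pseudo_label = np.array([0.00, 1.00, 0.00])   # confidently wrong pseudo-label
print("JS loss:", round(js_divergence(prediction, pseudo_label), 4),
      "<= log 2 =", round(float(np.log(2.0)), 4))
```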
Overparameterized Regimes and Functional Risk
Functional Risk Minimization (FRM) interprets ERM-fDR as a special case where loss is defined over distributions in function space. The natural f-divergence regularizer in FRM penalizes complexity in the space of functions, providing an explicit mechanism for choosing "simplest" solutions among many overfit predictors (Alet et al., 30 Dec 2024).
5. Technical and Practical Considerations
Several critical technical ingredients arise in ERM-fDR:
Support Restriction and Inductive Bias
Regardless of the empirical data, the support of the solution is always contained within the support of the reference measure. This can be highly desirable (if $Q$ encodes valid prior knowledge), but may result in severe bias if $Q$ is poorly specified (Daunas et al., 2023, Daunas et al., 1 Feb 2024).
Choice of Divergence
The selection of $f$ influences both mathematical tractability and practical properties. For instance, KL regularization yields exponential weights (the Gibbs measure), while total variation or bounded divergences may confer robustness to noise (as in fairness and robust-risk contexts) (Aminian et al., 1 May 2024).
Duality and Algorithmic Solvability
Utilizing the dual formulation not only yields theoretical insight but also greatly simplifies optimization and normalization via one-dimensional root-finding (on the normalization function $N_Q(\lambda)$), and the ODE-based approach provides an efficient algorithm under broad regularity conditions (Daunas et al., 5 Aug 2025, Daunas et al., 20 Feb 2025).
Table 1: Primal and Dual Formulations in ERM-fDR
| Formulation | Optimization Variable | Constraint / Normalization |
|---|---|---|
| Primal | Probability measure $P$ | $\int \mathrm{d}P = 1$; support contained in the support of $Q$ |
| Dual | Scalar $\beta = N_Q(\lambda)$ | Scalar equation $\int (f')^{-1}\!\big(-\tfrac{L(\theta)+\beta}{\lambda}\big)\,\mathrm{d}Q(\theta) = 1$ |
Conditions for Existence and Uniqueness
Strict convexity and differentiability of $f$, together with "separability" of the empirical risk $L$ (nonconstancy over the support of $Q$), ensure solution existence, uniqueness, monotonicity of the normalization function, and dual-primal equivalence (Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).
6. Relationships to Alternative Regularization and Extensions
ERM-fDR subsumes and generalizes classical regularization paradigms:
- L₂-regularization can be seen as a specific (quadratic) f-divergence;
- Reverse relative entropy and Jeffrey’s divergence produce differing bias and tail properties through their corresponding solutions (Daunas et al., 1 Feb 2024);
- Alternative divergences induce equivalent regularization upon suitable transformation of the empirical risk (Daunas et al., 1 Feb 2024).
Type-II regularization, which reverses the arguments of the KL divergence, leads to solutions equivalent to those of Type-I after a logarithmic transformation of the empirical risk. However, the bias toward the prior remains intrinsic in both cases (Daunas et al., 2023).
7. Summary and Theoretical Implications
ERM-fDR provides a formally unified and flexible foundation for incorporating structured inductive bias, robustness, privacy, and fairness in risk minimization problems. Its key theoretical implications include:
- uniqueness and explicit closed-form density of solutions (in terms of $(f')^{-1}$ and the empirical risk),
- support preservation that imposes strong prior-dominated constraints,
- equivalence results between unconstrained and constrained divergence formulations,
- tractable dual optimization reducing to scalar non-linear equations or ODEs,
- the ability to tailor learning objectives to marginal, tail, or dependency properties using suitable divergence and reference measure choices.
Ongoing research has continued to elaborate efficient algorithms, practical calibrations, and connections to broader function-space learning and modern over-parameterized regimes (Alet et al., 30 Dec 2024, Daunas et al., 1 Feb 2024, Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025).
Key citations: (Daunas et al., 1 Feb 2024, Daunas et al., 20 Feb 2025, Daunas et al., 5 Aug 2025, 0912.0071, Fukuchi et al., 2015, Baharlouei et al., 2023, Daunas et al., 2023, Fröhlich et al., 2022, Coppens et al., 2023, Aminian et al., 1 May 2024, Alet et al., 30 Dec 2024)