Wasserstein-Bregman Divergence
- Wasserstein-Bregman divergence is defined as a generalization of the classical Wasserstein distance by incorporating Bregman divergences as transport cost functions.
- It employs strictly convex functions to enable asymmetric, nonlinear penalization, enhancing applications in robust optimization and deep representation learning.
- The framework supports efficient computational algorithms and duality principles, facilitating improved Bayesian inference and generative modeling.
The Wasserstein-Bregman divergence is a statistical and geometric generalization of the classical Wasserstein distance, incorporating Bregman divergences as transport costs and combining optimal transport theory with information geometry. The divergence arises naturally in statistics, machine learning, robust optimization, and deep representation learning, where it enables asymmetry, adaptivity, and refined control over penalization mechanisms compared to purely metric-based distances.
1. Mathematical Definition
The Wasserstein-Bregman divergence is constructed by using a Bregman divergence as the ground cost in the optimal transport framework. Given a strictly convex and continuously differentiable function $\phi$, the Bregman divergence between $x$ and $y$ is

$$D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle.$$

The Wasserstein-Bregman divergence between probability measures $\mu$ and $\nu$ is

$$W_\phi(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int D_\phi(x, y)\, \mathrm{d}\pi(x, y),$$

where $\Pi(\mu, \nu)$ is the set of all couplings of $\mu$ and $\nu$ (Guo et al., 2017, Kainth et al., 2023).
When $\phi(x) = \|x\|^2$, the cost is $D_\phi(x, y) = \|x - y\|^2$ and $W_\phi$ coincides with the squared 2-Wasserstein distance (optimal transport with quadratic cost). For other choices of $\phi$ (e.g., generators yielding the relative entropy or Itakura-Saito divergences), $D_\phi$ can be asymmetric and nonlinear, generalizing the transport geometry.
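To make the definition concrete, here is a minimal sketch (assuming discrete measures; the helper names `bregman_cost` and `wasserstein_bregman` and the toy data are illustrative, not from the cited papers) that forms the pairwise Bregman cost matrix and solves the Kantorovich linear program with SciPy:

```python
# Minimal sketch: Wasserstein-Bregman divergence between two discrete measures,
# computed by solving the Kantorovich linear program over couplings.
import numpy as np
from scipy.optimize import linprog

def bregman_cost(x, y, phi, grad_phi):
    """Pairwise Bregman divergences D_phi(x_i, y_j) as an (n, m) matrix."""
    X, Y = np.meshgrid(x, y, indexing="ij")
    return phi(X) - phi(Y) - grad_phi(Y) * (X - Y)

def wasserstein_bregman(x, y, mu, nu, phi, grad_phi):
    """inf over couplings pi of sum_ij pi_ij * D_phi(x_i, y_j)."""
    n, m = len(x), len(y)
    C = bregman_cost(x, y, phi, grad_phi).ravel()
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0      # row i of pi sums to mu[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0               # column j of pi sums to nu[j]
    b_eq = np.concatenate([mu, nu])
    res = linprog(C, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Toy discrete measures on the positive half-line.
x = np.array([0.5, 1.0, 2.0]); mu = np.array([0.2, 0.5, 0.3])
y = np.array([0.8, 1.5, 2.5]); nu = np.array([0.4, 0.4, 0.2])

# Quadratic generator: recovers the squared 2-Wasserstein distance.
w_quad = wasserstein_bregman(x, y, mu, nu, lambda t: t**2, lambda t: 2 * t)
# Negative-entropy generator: asymmetric, KL-type transport cost.
w_kl = wasserstein_bregman(x, y, mu, nu,
                           lambda t: t * np.log(t), lambda t: np.log(t) + 1)
print(w_quad, w_kl)
```

With the quadratic generator the optimal value equals the squared 2-Wasserstein distance between the two discrete measures; with the negative-entropy generator the transport cost is asymmetric, so exchanging the roles of the two measures generally changes the value.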
2. Fundamental Properties and Generalizations
- Nonnegativity: $W_\phi(\mu, \nu) \ge 0$, and it vanishes only when $\mu = \nu$.
- Asymmetry: Except when $\phi$ is quadratic, $D_\phi$ and therefore $W_\phi$ are not generally symmetric (Pesenti et al., 27 Nov 2024).
- Metric Reduction: For quadratic $\phi$, $W_\phi$ reduces to the standard (squared) Wasserstein distance (Guo et al., 2017).
- Convexity in the First Argument: $D_\phi(x, y)$ is convex in $x$; more generally, $W_\phi$ may exhibit convexity when viewed through optimal quantile functions (Pesenti et al., 27 Nov 2024).
Table: Bregman generator and corresponding cost function properties

| Symmetry | Generator $\phi$ | Induced cost $D_\phi(x, y)$ |
|---|---|---|
| Symmetric | $\phi(x) = x^2$ (quadratic) | $(x - y)^2$ |
| Asymmetric | $\phi(x) = x \log x$ (negative entropy) | $x \log(x/y) - x + y$ (generalized KL) |
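Both table rows follow directly from the definition $D_\phi(x, y) = \phi(x) - \phi(y) - \phi'(y)(x - y)$:

$$\phi(x) = x^2:\quad D_\phi(x, y) = x^2 - y^2 - 2y(x - y) = (x - y)^2,$$

$$\phi(x) = x \log x:\quad D_\phi(x, y) = x \log x - y \log y - (\log y + 1)(x - y) = x \log\frac{x}{y} - x + y.$$

The first cost is symmetric in $(x, y)$; the second is not.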
3. Probabilistic and Information-Geometric Interpretations
Bregman-Wasserstein divergences promote a generalized geometry for probability measures, extending the dualistic and dually flat structures from finite-dimensional Bregman manifolds to infinite-dimensional statistical manifolds (Kainth et al., 2023). For exponential families with cumulant generating function $\psi$, the canonical divergence coincides with the Bregman divergence generated by $\psi$. The divergence between two exponential family distributions is expressible via the Bregman-Wasserstein framework.
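As a standard worked instance of this correspondence, for a regular exponential family $p_\theta(x) = h(x)\exp(\langle \theta, T(x)\rangle - \psi(\theta))$ one has, using $\mathbb{E}_{\theta_1}[T(X)] = \nabla\psi(\theta_1)$,

$$\mathrm{KL}(p_{\theta_1}\,\|\,p_{\theta_2}) = \langle \theta_1 - \theta_2, \nabla\psi(\theta_1)\rangle - \psi(\theta_1) + \psi(\theta_2) = \psi(\theta_2) - \psi(\theta_1) - \langle \nabla\psi(\theta_1),\, \theta_2 - \theta_1\rangle = D_\psi(\theta_2, \theta_1).$$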
Generalized displacement interpolations compatible with Bregman geometry allow for the formulation of generalized geodesics, optimal transport maps, and barycenters, which are essential for Bayesian learning and variance-bias tradeoffs in statistical inference (Kainth et al., 2023).
4. Statistical and Computational Implications
A. Concentration and Asymptotic Results
The Wasserstein-Bregman divergence admits novel concentration inequalities and asymptotic chi-squared-type distributions for the divergence between empirical and target distributions. For parametric families, asymptotic results of the form

$$n\, W_\phi\big(\mu_{\hat\theta_n}, \mu_{\theta_0}\big) \xrightarrow{d} \sum_i \lambda_i \chi^2_i$$

hold, with $\chi^2_i$ independent chi-squared variables and $\lambda_i$ the eigenvalues of the product of the Hessian of $\phi$ and the Fisher information (Guo et al., 2017). These results underpin ambiguity set calibration in robust optimization.
B. Computational Schemes
Algorithms designed for the quadratic (usual Wasserstein) case, including the Sinkhorn algorithm and primal-dual accelerated methods, can be adapted or extended using Bregman divergences, such as entropy-based kernels. Scaled entropy functions improve numerical stability and sparsity in solutions (Chambolle et al., 2022), and neural approaches (e.g., input convex neural networks) can approximate Bregman-Wasserstein optimal transport maps (Kainth et al., 2023, Cilingir et al., 2020).
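As a minimal sketch of the entropic approach (the grid, the negative-entropy generator, and the helper name `sinkhorn_bregman` are illustrative assumptions, not an algorithm from the cited papers), the standard Sinkhorn iteration can be run unchanged on a Bregman cost matrix:

```python
# Minimal sketch: entropically regularized optimal transport with a Bregman
# cost matrix, solved by the standard Sinkhorn matrix-scaling iteration.
import numpy as np

def sinkhorn_bregman(mu, nu, C, eps=0.05, n_iter=500):
    """Approximate inf <pi, C> over couplings, with entropic regularization eps."""
    K = np.exp(-C / eps)                    # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)                  # match the nu-marginal
        u = mu / (K @ v)                    # match the mu-marginal
    pi = u[:, None] * K * v[None, :]        # approximate optimal coupling
    return float((pi * C).sum()), pi

# Bregman cost for the negative-entropy generator phi(t) = t log t on a grid.
x = np.linspace(0.5, 2.0, 50); y = np.linspace(0.5, 2.0, 50)
X, Y = np.meshgrid(x, y, indexing="ij")
C = X * np.log(X / Y) - X + Y               # D_phi(x_i, y_j), asymmetric
mu = np.exp(-(x - 1.0) ** 2 / 0.1); mu /= mu.sum()
nu = np.exp(-(y - 1.4) ** 2 / 0.1); nu /= nu.sum()
cost, plan = sinkhorn_bregman(mu, nu, C)
print(cost)
```

Because the iteration only sees the cost matrix through the Gibbs kernel, replacing the quadratic cost by a Bregman cost leaves the algorithm unchanged; numerical stability then hinges on the regularization and scaling choices discussed above.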
C. Duality and Optimization
Relaxed Wasserstein distances, which use a Bregman divergence cost, possess dual linear programs analogous to Kantorovich–Rubinstein duality. The admissible function class is modulated by Lipschitz-type constraints induced by $\phi$ (Guo et al., 2017). This renders, e.g., GAN training more adaptive and stable.
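In the standard Kantorovich form (a schematic statement; the precise admissible function class and the $\phi$-induced Lipschitz-type constraint are as characterized in Guo et al., 2017), the dual reads

$$W_\phi(\mu, \nu) = \sup_{f(x) + g(y) \le D_\phi(x, y)} \int f\, \mathrm{d}\mu + \int g\, \mathrm{d}\nu.$$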
5. Applications and Statistical Modeling
- Robust Optimization: Construction of ambiguity sets as balls with respect to $W_\phi$ enables one to guard against misspecification, with penalties tuned for over- versus under-performance (Guo et al., 2017, Pesenti et al., 27 Nov 2024). Constraints of the form $W_\phi(\hat\mu, \nu) \le \epsilon$ calibrate both absolute and relative risk (a schematic formulation is sketched after this list).
- Bayesian Learning and Barycenters: Bregman-Wasserstein barycenters generalize Wasserstein barycenters, allowing for aggregation of posterior distributions and mixture models (Kainth et al., 2023).
- Deep Representation Learning: Empirical Bregman divergence, parameterized via deep neural networks, is positioned as a flexible similarity measure in deep metric learning (Cilingir et al., 2020, Li et al., 2023). Divergence loss functions may be constructed for semi-supervised clustering or unsupervised generation.
- Utility Maximization with Divergence Constraints: Imposing a Bregman-Wasserstein divergence constraint between a target (e.g., benchmark wealth distribution) and the actual payoff distribution yields quantile-based formulas for the optimal strategy. When $\phi$ is chosen non-quadratic, the penalty for deviation can be made asymmetric, addressing behavioral phenomena such as loss aversion (Pesenti et al., 27 Nov 2024).
- Generative Modeling: Relaxed Wasserstein GANs (RWGANs) using a KL-type Bregman cost outperform classical WGANs, showing improved adaptation to data geometry, training stability, and sample quality (Guo et al., 2017).
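A schematic statement of the robust-optimization formulation referenced above (the loss $\ell$, radius $\epsilon$, and reference measure $\hat\mu$ are illustrative notation, not taken from the cited papers) is

$$\sup_{\nu \in \mathcal{B}_\epsilon(\hat\mu)} \mathbb{E}_\nu[\ell(X)], \qquad \mathcal{B}_\epsilon(\hat\mu) = \{\nu : W_\phi(\hat\mu, \nu) \le \epsilon\},$$

i.e., the worst case is taken over all models within a Bregman-Wasserstein ball of radius $\epsilon$ around the reference $\hat\mu$, with the asymmetry of $\phi$ controlling how over- and under-performance are penalized.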
6. Asymmetry, Interpretive Flexibility, and Behavioral Implications
The asymmetry intrinsic to the Bregman generator (for non-quadratic choices of $\phi$) is exploited in modeling settings where overperformance and underperformance should be penalized differently. For example, in portfolio theory, a BW divergence with a suitably asymmetric generator $\phi$ applies a higher penalty for falling below a benchmark than for surpassing it. This formalizes investor preferences regarding relative risk and reward, and renders the optimal distribution of payoffs highly tunable.
Numerical examples in optimal payoff selection illustrate that BW-constrained solutions, particularly for asymmetric $\phi$, maintain close alignment with benchmarks when exposed to stress scenarios, limiting excessive downside risk while allowing asymmetric room for upside performance (Pesenti et al., 27 Nov 2024).
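A small numerical illustration of the asymmetric penalization described above (the generator $\phi(t) = t \log t$ and the benchmark value are assumptions for illustration, not the specification used in the cited work):

```python
# Minimal sketch: with the negative-entropy generator phi(t) = t log t, the
# Bregman penalty D_phi(x, b) relative to a benchmark b is larger for a
# shortfall of a given size than for an excess of the same size.
import numpy as np

def bregman_negentropy(x, y):
    """D_phi(x, y) = x log(x/y) - x + y for phi(t) = t log t (x, y > 0)."""
    return x * np.log(x / y) - x + y

benchmark = 1.0
shortfall = bregman_negentropy(0.5, benchmark)   # end 0.5 below the benchmark
excess = bregman_negentropy(1.5, benchmark)      # end 0.5 above the benchmark
print(f"shortfall penalty: {shortfall:.3f}")     # ~0.153
print(f"excess penalty:    {excess:.3f}")        # ~0.108 (strictly smaller)
```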
7. Future Directions and Computational Methods
Empirical Bregman divergence learning can be extended to unsupervised and self-supervised contexts, constructed using generalized nonlinear model layers and convex link functions (e.g., Softplus) (Li et al., 2023). Bregman-Wasserstein JKO schemes discretize Riemannian gradient flows over probability measures and offer efficient numerical strategies (Kainth et al., 2023). There is ongoing interest in neural optimal transport algorithms, scaled entropy kernels for improved stability, and hybrid divergence measures bridging Bregman and Wasserstein structures.
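As an illustration of the learnable-divergence direction, the sketch below keeps a neural potential convex in its input (nonnegative mixing weights, convex and nondecreasing Softplus activations) and differentiates it to obtain a valid Bregman divergence; the architecture and names are illustrative assumptions, not the construction of Li et al. (2023) or Kainth et al. (2023).

```python
# Minimal sketch: a learnable Bregman divergence
# D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>,
# where f is convex in x because it is a nonnegative combination of
# Softplus functions applied to affine maps of x.
import torch
import torch.nn.functional as F

class ConvexPotential(torch.nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.lin = torch.nn.Linear(dim, hidden)
        self.w_raw = torch.nn.Parameter(torch.randn(hidden))   # unconstrained

    def forward(self, x):
        w = F.softplus(self.w_raw)                  # nonnegative mixing weights
        return (w * F.softplus(self.lin(x))).sum(dim=-1)

def bregman(f, x, y):
    """D_f(x, y); nonnegative whenever f is convex."""
    y = y.detach().requires_grad_(True)
    fy = f(y)
    (grad_y,) = torch.autograd.grad(fy.sum(), y, create_graph=True)
    return f(x) - fy - (grad_y * (x - y)).sum(dim=-1)

torch.manual_seed(0)
f = ConvexPotential(dim=2)
x, y = torch.randn(5, 2), torch.randn(5, 2)
print(bregman(f, x, y))   # elementwise >= 0 (up to numerics), generally asymmetric
```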
A plausible implication is that further bridging of these methods with deep learning architectures and Bayesian inference will expand the role of Wasserstein-Bregman divergences in large-scale generative modeling, robust learning under uncertainty, and financial risk analytics. The flexibility in tuning divergence asymmetry and geometry, supported by rigorous statistical foundations and efficient solvers, positions Wasserstein-Bregman divergence as a core tool for modern probabilistic modeling and optimization.