Fenchel–Young Loss

Updated 2 June 2026

Fenchel–Young losses are a class of convex loss functions defined via convex duality, unifying traditional losses such as the squared, logistic, and hinge losses.
They support modular prediction maps and allow control over smoothness, margin, and sparsity, facilitating applications in structured prediction, variational inference, and robust modeling.
Their flexible design underpins efficient algorithms and supports consistency, calibration, and regret guarantees in both finite- and infinite-dimensional settings.

Fenchel–Young losses form a broad class of convex loss functions defined via convex duality. Central to modern statistical learning, these losses generalize and unify numerous traditional objectives, including the squared, logistic, hinge, and sparsemax losses as well as more structured and continuous-domain losses. Defined for any proper, closed, convex function (regularizer) over a prediction space, a Fenchel–Young loss induces a canonical prediction map and enables control over smoothness, margin, sparsity, and regret properties. This flexibility underpins a diverse range of applications, including finite- and infinite-dimensional estimation, structured prediction, inverse optimization, variational inference, robust modeling, and associative memory dynamics.

1. Definition, General Form, and Basic Properties

Let $f: \mathcal{U} \to \mathbb{R} \cup \{+\infty\}$ be a proper, closed, convex function with convex conjugate $f^*: \mathcal{V} \to \mathbb{R} \cup \{+\infty\}$,

$f^*(v) = \sup_{u \in \mathcal{U}} \left\{ \langle u, v \rangle - f(u) \right\}.$

For $u \in \mathcal{U}$ (inputs/parameters) and $v \in \mathcal{V}$ (targets/scores/dual outputs), the Fenchel–Young loss is defined as

$L_{\mathrm{FY}}(u, v) = f(u) + f^*(v) - \langle u, v \rangle.$

This “duality gap” is nonnegative by the Fenchel–Young inequality, and vanishes iff $v \in \partial f(u)$ (or, equivalently, $u \in \partial f^*(v)$).
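
As a one-dimensional illustration (an example chosen here, not taken from the cited works), take $f(u) = e^u$ on $\mathbb{R}$. Its conjugate is $f^*(v) = v \log v - v$ for $v > 0$ (with $f^*(0) = 0$ and $f^*(v) = +\infty$ for $v < 0$), so

$L_{\mathrm{FY}}(u, v) = e^u + v \log v - v - uv.$

Setting $v = e^u = f'(u)$ gives $e^u + u e^u - e^u - u e^u = 0$, as the equality condition predicts; for any other $v > 0$ the loss is strictly positive.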

Key properties:

Convexity: $L_{\mathrm{FY}}(u, v)$ is convex in $u$ and in $v$ separately. It is not jointly convex in general, because of the bilinear term $-\langle u, v \rangle$; the quadratic case $f = \tfrac{1}{2}\|\cdot\|^2$, where $L_{\mathrm{FY}}(u, v) = \tfrac{1}{2}\|u - v\|^2$, is an exception.
Gradient: For differentiable $f$ and $f^*$, $\nabla_u L_{\mathrm{FY}}(u, v) = \nabla f(u) - v$ and $\nabla_v L_{\mathrm{FY}}(u, v) = \nabla f^*(v) - u$.
Minimizer: $L_{\mathrm{FY}}(u, v)$ is minimized over $u$ at the prediction $u = \nabla f^*(v)$; similarly, it is minimized over $v$ at $v = \nabla f(u)$.
When $f$ is strictly convex and essentially smooth (of Legendre type), $f^*$ is differentiable and strictly convex, and the loss has Bregman divergence structure: $L_{\mathrm{FY}}(u, v) = D_f(u, \nabla f^*(v))$, where $D_f$ denotes the Bregman divergence generated by $f$ (Blondel et al., 2019, Sakaue et al., 2024).
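
The properties above can be checked numerically on a concrete pair. The following is a minimal sketch, assuming $f$ is the Shannon negentropy restricted to the probability simplex, so that $f^*$ is the log-sum-exp function and $\nabla f^*$ is the softmax. The function names are illustrative and do not come from any particular library.

```python
import numpy as np

def negentropy(u):
    """f(u) = sum_i u_i log u_i for u on the simplex (with 0 log 0 := 0)."""
    u = np.asarray(u, dtype=float)
    nz = u > 0
    return float(np.sum(u[nz] * np.log(u[nz])))

def logsumexp(v):
    """f*(v) = log sum_i exp(v_i), computed with the usual max shift."""
    v = np.asarray(v, dtype=float)
    m = np.max(v)
    return float(m + np.log(np.sum(np.exp(v - m))))

def softmax(v):
    """Gradient of f*, i.e. the prediction map induced by the negentropy."""
    v = np.asarray(v, dtype=float)
    e = np.exp(v - np.max(v))
    return e / e.sum()

def fy_loss(u, v):
    """L_FY(u, v) = f(u) + f*(v) - <u, v>."""
    return negentropy(u) + logsumexp(v) - float(np.dot(u, v))

def fy_grad_v(u, v):
    """Gradient in v: grad f*(v) - u = softmax(v) - u."""
    return softmax(v) - np.asarray(u, dtype=float)

if __name__ == "__main__":
    v = np.array([1.0, -0.5, 2.0])   # scores
    u = np.array([0.0, 0.0, 1.0])    # one-hot target on the simplex
    print("loss at one-hot target:", fy_loss(u, v))
    print("loss at u = softmax(v):", fy_loss(softmax(v), v))
    print("gradient in v:", fy_grad_v(u, v))
```

With a one-hot $u$ the loss reduces to the familiar multiclass logistic loss, and evaluating it at $u = \mathrm{softmax}(v)$ returns zero up to floating-point error, matching the equality condition of the Fenchel–Young inequality.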

2. Examples and Unified Interpretation

The Fenchel–Young loss framework subsumes a broad range of classical and modern losses (below, $y$ is the target and $\theta$ the vector of model outputs, so that $u = y$ and $v = \theta$ in the notation of Section 1):

Loss Function	Regularizer $f(p)$	Prediction Link	FY Loss Expression (for $y = e_k$, $\theta \in \mathbb{R}^d$)
Squared	$\tfrac{1}{2}\|p\|^2$	$\theta$	$\tfrac{1}{2}\|y - \theta\|^2$
Softmax (logistic)	$-H(p) + I_{\Delta}(p)$	$\mathrm{softmax}(\theta)$	$\log \sum_i e^{\theta_i} - \theta_k$
Sparsemax	$\tfrac{1}{2}\|p\|^2 + I_{\Delta}(p)$	$\mathrm{sparsemax}(\theta)$	$\tfrac{1}{2}\|y - \theta\|^2 - \tfrac{1}{2}\|\mathrm{sparsemax}(\theta) - \theta\|^2$
Hinge (perceptron)	$I_{\Delta}(p)$	$\operatorname{argmax}_{p \in \Delta} \langle \theta, p \rangle$	$\max_i \theta_i - \theta_k$
Tsallis-$\alpha$	$-H_{\alpha}(p) + I_{\Delta}(p)$	$\alpha$-entmax$(\theta)$	$H_{\alpha}(\hat{y}) + \langle \theta, \hat{y} \rangle - \theta_k$, with $\hat{y} = \alpha$-entmax$(\theta)$

Here, $I_{\Delta}$ is an indicator (zero on $\Delta$, $+\infty$ elsewhere), $H$ and $H_{\alpha}$ are (generalized) entropies, and $\Delta$ is the probability simplex. This table illustrates the ability of the framework to continuously interpolate between hard (argmax) and smooth (softmax) responses by tuning, for example, the entropy regularizer (Blondel et al., 2018, Blondel et al., 2019, Rakotomandimby et al., 2024).
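
Three of the losses listed above can be written in a few lines each. This is a minimal sketch, assuming a score vector `theta` and a one-hot target given by its class index `k`, and using the standard closed forms for the logistic, sparsemax, and perceptron losses. The sort-based sparsemax projection and all function names are illustrative choices, not an API from the cited works.

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex (sort-based)."""
    theta = np.asarray(theta, dtype=float)
    z = np.sort(theta)[::-1]                 # scores in decreasing order
    cssv = np.cumsum(z) - 1.0                # cumulative sums minus the simplex mass
    k = np.arange(1, z.size + 1)
    support = z - cssv / k > 0               # coordinates kept in the support (a prefix)
    k_star = k[support][-1]
    tau = cssv[support][-1] / k_star         # threshold such that the output sums to 1
    return np.maximum(theta - tau, 0.0)

def logistic_loss(theta, k):
    """Softmax FY loss: logsumexp(theta) - theta_k."""
    theta = np.asarray(theta, dtype=float)
    m = theta.max()
    return float(m + np.log(np.sum(np.exp(theta - m))) - theta[k])

def sparsemax_loss(theta, k):
    """Sparsemax FY loss: 0.5||y - theta||^2 - 0.5||sparsemax(theta) - theta||^2."""
    theta = np.asarray(theta, dtype=float)
    y = np.zeros_like(theta)
    y[k] = 1.0
    p = sparsemax(theta)
    return float(0.5 * np.sum((y - theta) ** 2) - 0.5 * np.sum((p - theta) ** 2))

def perceptron_loss(theta, k):
    """FY loss with the simplex indicator as regularizer: max_i theta_i - theta_k."""
    theta = np.asarray(theta, dtype=float)
    return float(theta.max() - theta[k])

if __name__ == "__main__":
    for theta in (np.array([0.5, 0.0, 0.0]), np.array([2.0, 0.0, 0.0])):
        print(theta, logistic_loss(theta, 0), sparsemax_loss(theta, 0), perceptron_loss(theta, 0))
```

On well-separated scores the sparsemax and perceptron losses reach exactly zero, while the logistic loss stays strictly positive; this is the hard-versus-smooth contrast the table describes.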

3. Regret, Margins, and Statistical Consistency

Fenchel–Young losses exhibit several central theoretical properties:

Separation margin: For losses generated from strongly convex $f$, the loss is said to have margin $m > 0$ if $L_{\mathrm{FY}} = 0$ whenever the model output separates the target by at least $m$, that is, whenever the scores $\theta$ satisfy $\theta_k \geq m + \max_{j \neq k} \theta_j$ for the target class $k$. The margin can be characterized analytically in terms of entropy derivatives for separable entropies $H(p) = \sum_j h(p_j)$: $m = h'(0) - h'(1)$ (Blondel et al., 2018). Losses with finite margin enable faster convergence in optimization and improved robustness (Bao et al., 7 Feb 2025).
Calibration and consistency: FY losses induce proper scoring rules on probabilistic outputs and admit Fisher consistency for the underlying target loss, provided the prediction link covers the simplex (Cao et al., 14 May 2025, Sakaue et al., 2024, Blondel et al., 2018).
Regret bounds: Surrogate excess risk in terms of FY loss tightly upper bounds true excess target loss. Smooth convex surrogates built from infimal convolution of negentropy and the Bayes risk provide linear surrogate regret bounds, circumventing the classical smoothness-vs-regret trade-off (Cao et al., 14 May 2025).
Online learning and OCO: FY losses are natural surrogates in online convex optimization for structured prediction and inverse problems, offering explicit or gap-dependent regret rates depending on problem geometry (e.g., in inverse linear optimization) (Sakaue et al., 23 Jan 2025).
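
As a worked instance of the margin characterization $m = h'(0) - h'(1)$ for separable entropies $H(p) = \sum_j h(p_j)$, consider the Tsallis entropy under the normalization $H_{\alpha}(p) = \frac{1}{\alpha(\alpha - 1)} \sum_j (p_j - p_j^{\alpha})$ with $\alpha > 1$ (an assumed convention; other normalizations rescale the result). Here $h(t) = \frac{t - t^{\alpha}}{\alpha(\alpha - 1)}$, so

$h'(t) = \frac{1 - \alpha t^{\alpha - 1}}{\alpha(\alpha - 1)}, \qquad h'(0) = \frac{1}{\alpha(\alpha - 1)}, \qquad h'(1) = -\frac{1}{\alpha},$

$m = \frac{1}{\alpha(\alpha - 1)} + \frac{1}{\alpha} = \frac{1}{\alpha - 1}.$

For $\alpha = 2$ (sparsemax) this gives $m = 1$, while $m \to \infty$ as $\alpha \to 1$, consistent with the softmax (logistic) loss having no finite margin.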

4. Generalizations: Energy-Based and Infinite-Dimensional Settings

Fenchel–Young theory generalizes beyond bilinear or finite-dimensional settings. In the generalized energy-based setting, the bilinear pairing $\langle u, v \rangle$ is replaced by a general energy $\Phi(u, v)$:

$L_{\mathrm{FY}}^{\Phi}(u, v) = f(u) + f^{\Phi}(v) - \Phi(u, v), \qquad f^{\Phi}(v) = \sup_{u \in \mathcal{U}} \left\{ \Phi(u, v) - f(u) \right\}.$

This abstract setup enables direct training of deep energy networks and structured prediction over complex domains, and it avoids explicit differentiation through argmax/argmin solvers thanks to envelope theorems (Blondel et al., 2022).
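
The envelope-theorem argument can be illustrated on a toy problem. The sketch below assumes a hypothetical energy $\Phi(u, v) = \langle u, \tanh(v) \rangle$ and regularizer $f(u) = \tfrac{1}{2}\|u\|^2$, with the generalized conjugate $f^{\Phi}(v) = \sup_u \{\Phi(u, v) - f(u)\}$ approximated by an inner gradient-ascent solver. None of these choices come from the cited paper; they are picked so that the result can be verified against finite differences.

```python
import numpy as np

def energy(u, v):
    """Hypothetical non-bilinear energy Phi(u, v) = <u, tanh(v)>."""
    return float(np.dot(u, np.tanh(v)))

def energy_grad_v(u, v):
    """Partial gradient of Phi with respect to v, holding u fixed."""
    return u * (1.0 - np.tanh(v) ** 2)

def regularizer(u):
    """f(u) = 0.5 ||u||^2."""
    return 0.5 * float(np.dot(u, u))

def inner_argmax(v, steps=100, lr=0.5):
    """Approximate argmax_u Phi(u, v) - f(u) by plain gradient ascent."""
    u = np.zeros_like(v)
    for _ in range(steps):
        u = u + lr * (np.tanh(v) - u)   # gradient of <u, tanh(v)> - 0.5 ||u||^2 in u
    return u

def generalized_fy_loss(u, v):
    """f(u) + f^Phi(v) - Phi(u, v), with f^Phi evaluated at the inner solution."""
    u_star = inner_argmax(v)
    f_phi = energy(u_star, v) - regularizer(u_star)
    return regularizer(u) + f_phi - energy(u, v)

def generalized_fy_grad_v(u, v):
    """Envelope gradient: d/dv f^Phi(v) = grad_v Phi(u*, v), with u* held fixed.

    The inner solver is never differentiated through.
    """
    u_star = inner_argmax(v)
    return energy_grad_v(u_star, v) - energy_grad_v(u, v)

if __name__ == "__main__":
    u = np.array([0.5, 0.1, -0.4])
    v = np.array([0.3, -1.2, 0.8])
    print("loss:", generalized_fy_loss(u, v))
    print("envelope gradient:", generalized_fy_grad_v(u, v))
```

For this particular energy the inner problem has the closed-form solution $u^\star = \tanh(v)$ and the loss equals $\tfrac{1}{2}\|u - \tanh(v)\|^2$, which gives an independent check on both functions.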

For continuous or measure-valued predictions (e.g., variational inference, continuous attention, or inverse OT problems), FY losses are defined using convex functionals over the space of probability measures. They retain convexity, nonnegativity, and differentiability, and they admit sample-complexity bounds and local strong convexity after “sharpening” via additional data-dependent curvature (Martins et al., 2021, Andrade et al., 11 May 2025).

5. Optimization, Algorithmic, and Computational Aspects

FY losses admit efficient minimization and prediction procedures:

Prediction maps correspond to regularized argmax computations (proximal maps, energy minimization, etc.), which, in separable settings, reduce to root-finding over a single scalar threshold, with each function evaluation costing time linear in the number of classes (Blondel et al., 2018).
Gradient computation for generalized-Φ losses leverages envelope theorems, providing efficient backpropagation and avoiding expensive differentiation through argmax layers (Blondel et al., 2022).
Duality and Bregman structure: In regular settings (e.g., Legendre), $L_{\mathrm{FY}}(u, v)$ serves as a Bregman divergence between target and prediction (Blondel et al., 2019, Nielsen et al., 5 Mar 2026).
Surrogate risk minimization for target losses is compatible with linear decoding/probability estimation links, maintaining tight statistical and computational guarantees even in high dimensions or structured domains (Cao et al., 14 May 2025, Sakaue et al., 2024).
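
The root-finding view of prediction maps can be sketched for the Tsallis/entmax family. The code below assumes the standard form $p_j = [(\alpha - 1)\theta_j - \tau]_+^{1/(\alpha - 1)}$ for $\alpha > 1$ and finds the scalar threshold $\tau$ by bisection so that the entries sum to one. The function name, iteration count, and final renormalization are illustrative choices rather than a reference implementation.

```python
import numpy as np

def entmax_bisect(theta, alpha=1.5, n_iter=60):
    """alpha-entmax prediction map via bisection on a scalar threshold (alpha > 1)."""
    theta = np.asarray(theta, dtype=float)
    if alpha <= 1.0:
        raise ValueError("this sketch assumes alpha > 1")
    d = theta.size
    z = (alpha - 1.0) * theta
    power = 1.0 / (alpha - 1.0)
    # Bracket for tau: at lo the largest entry alone has mass 1 (total >= 1);
    # at hi every entry is at most 1/d (total <= 1).
    lo = z.max() - 1.0
    hi = z.max() - d ** (1.0 - alpha)
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(np.maximum(z - tau, 0.0) ** power)   # O(d) per evaluation
        if mass >= 1.0:
            lo = tau     # total mass too large: raise the threshold
        else:
            hi = tau     # total mass too small: lower the threshold
    p = np.maximum(z - lo, 0.0) ** power
    return p / p.sum()   # absorb the remaining bisection error

if __name__ == "__main__":
    scores = np.array([1.0, 0.5, -1.0])
    for a in (1.5, 2.0):
        print(a, entmax_bisect(scores, alpha=a))
```

With `alpha=2.0` the map coincides with sparsemax, and larger gaps between scores drive more coordinates to exactly zero, which is the sparsity control mentioned earlier in the article.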

6. Key Applications and Recent Advances

Structured and sparse prediction: FY losses underpin CRFs, SparseMAP, and energy-based models for structured outputs, enabling convex surrogates for MAP and marginal inference (Santos et al., 2024, Blondel et al., 2022).
Inverse (linear and non-linear) optimization: FY losses quantify suboptimality and calibrate parameter estimation for inverse LPs and broader parametric inference, with robust gap-dependent guarantees (Sakaue et al., 23 Jan 2025, Li et al., 22 Feb 2025).
Variational inference and learning: Generalizing evidence lower bounds (ELBOs), FY variational methods support latent-variable models with adaptive sparsity, efficient EM variants, and tractable convex optimization (Sklaviadis et al., 14 Feb 2025).
Distributional robustness: Incorporation into Wasserstein DRO frameworks, using Lipschitz continuity of FY losses, enables tractable, safe robustification and recovers classical norm regularization and hinge losses as limits (Lin et al., 24 Feb 2026).
Associative memory and neural networks: The difference of two FY losses yields a general Hopfield energy functional supporting attractor dynamics, sparse retrieval, and normalization layers within a convex-dual framework (Santos et al., 2024).
Continuous domains (attention, density estimation): FY losses support deformed exponential families and sparse continuous distributions via Tsallis/entmax regularizers, with closed-form solutions for β-Gaussians and continuous fusedmax/smoothmax (Martins et al., 2021).

Fitzpatrick losses: These tighten the Fenchel–Young inequality by using the Fitzpatrick function, delivering strictly tighter convex surrogates associated with the same prediction link as the underlying FY loss (e.g., for softmax or sparsemax). Each Fitzpatrick loss is itself a modified FY loss for a target-dependent generator, combining tight calibration and computational tractability (Rakotomandimby et al., 2024).
Polar and geometric perspectives: Viewed through projective geometry, FY divergences can be generalized to “polar” Fenchel–Young divergences via matrix-induced quadratic polarities, unifying FY and Bregman divergences and their “total” variants in a geometric setting (Nielsen et al., 5 Mar 2026).
Smoothness–margin trade-offs: Construction via convolutional negentropy enables arbitrarily smooth FY surrogates with linear regret transfer, defying classical trade-off beliefs (Cao et al., 14 May 2025).
Margin vs. self-bounding: The faster rates for gradient descent under margin-based FY losses, compared to smooth self-bounding losses (e.g., softmax), are rooted in the separation-margin property rather than in any special local curvature (Bao et al., 7 Feb 2025).