
Learned Loss Functions

Updated 23 October 2025
  • Learned loss functions are adaptive objective functions that derive their parameters from data using meta-learning and bilevel optimization, enabling flexible, task-specific loss design.
  • They employ various methodologies such as adversarial setups, Fenchel–Young losses, and surrogate networks to co-optimize model and loss parameters.
  • Empirical evidence shows that learned loss functions improve convergence, robustness, and accuracy compared to static, hand-crafted losses in tasks like semi-supervised and structured prediction.

A learned loss function is a loss function whose form and parameters are derived from data or meta-learning, as opposed to being fixed a priori or hand-engineered. This approach treats the loss as a learnable or adaptable object—either implicitly, as in adversarial setups, or explicitly, via parameterized neural networks, symbolic search, or bilevel optimization. Learned loss functions enable models to better capture complex task-specific requirements, handle non-differentiable user metrics, adapt to label noise, and improve both generalization and convergence, spanning applications from semi-supervised learning and robust classification to automated design in scientific modeling.

1. Principles and Mathematical Formulation

At its core, a learned loss function is parameterized and adapted to the target problem in one of several forms: neural networks, Bregman divergences, Fenchel–Young losses, Taylor polynomial expansions, or symbolic mathematical expressions. The learning mechanism optimizes not only the model parameters but also the loss parameters, often in a nested bilevel or meta-learning setup:

$$\min_{\phi} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \mathcal{L}_{\phi}\big(y, f_{\theta^*(\phi)}(x)\big) \right], \quad \text{where} \quad \theta^*(\phi) = \arg\min_{\theta} \mathbb{E}_{(x, y) \sim \mathcal{D}_{\text{train}}} \left[ \mathcal{L}_{\phi}\big(y, f_\theta(x)\big) \right]$$

Here, $\mathcal{L}_\phi$ is the learned loss parameterized by $\phi$. The mathematical formulation depends on the specific learning paradigm:

  • Adversarial (DAN): The loss emerges as the optimal solution of a two-player minimax game, e.g.

$$V(J, P) = \mathbb{E}_{(x, y) \sim p_{\text{data}}} \left[ \log J(x, y) \right] + \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log\big(1 - J(x, P(x))\big) \right]$$

where $J$ (the Judge) implicitly defines the loss for $P$ (the Predictor) (Santos et al., 2017).

  • Fenchel–Young losses: The loss is defined using the Fenchel conjugate of a regularizer $\Omega$, unifying the construction of numerous classical losses:

$$L_{\Omega}(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle$$

enabling the creation of novel, convex, and often sparse-inducing losses (Blondel et al., 2019).

  • Surrogate Loss Networks: A surrogate $\hat{\ell}_\beta$ is learned (often as a neural network) to approximate task metrics that are non-smooth or aggregate over sets:

$$\hat{\ell}(y, \hat{y}) = h\left(\frac{1}{N} \sum_{i=1}^N g(y_i, \hat{y}_i)\right)$$

where $g$ and $h$ are neural networks trained via bilevel optimization (Grabocka et al., 2019).

  • Symbolic/Neural Search: Hybrid frameworks search for loss structures using genetic programming or symbolic regression, then parameterize and fine-tune them via gradient-based meta-learning (Raymond et al., 2022, Raymond et al., 1 Mar 2024).
  • Online Meta-Learning: The loss is continuously updated “in lockstep” with model training, rather than being fixed after a short meta-training window, via a coupled optimization scheme (Raymond et al., 2023).

Key mathematical themes are the use of convex conjugacy, support functions of convex sets, Bregman divergences, and bilevel differentiable programming.
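The bilevel scheme at the heart of this section can be made concrete with a deliberately tiny sketch: a scalar linear model $f_\theta(x) = \theta x$ and a learned loss $\mathcal{L}_\phi(y, \hat{y}) = \phi\,(y - \hat{y})^2$, with the meta-gradient obtained by differentiating the validation loss through a single unrolled inner step. All names, step sizes, and the hand-written derivatives are illustrative, not taken from any of the cited systems.

```python
# Minimal sketch of bilevel loss learning with one unrolled inner step.
# Toy setting: scalar linear model f_theta(x) = theta * x and learned loss
# L_phi(y, yhat) = phi * (y - yhat)^2, so phi effectively rescales the
# inner learning rate. All gradients are written out by hand.

def inner_step(theta, phi, x, y, alpha=0.1):
    """One gradient step on the learned loss L_phi."""
    yhat = theta * x
    dL_dtheta = phi * 2.0 * (yhat - y) * x   # d/dtheta of phi*(y - yhat)^2
    return theta - alpha * dL_dtheta

def meta_grad(theta, phi, x_tr, y_tr, x_val, y_val, alpha=0.1):
    """Gradient of the outer (validation MSE) objective w.r.t. phi,
    obtained by differentiating through inner_step."""
    yhat_tr = theta * x_tr
    # theta' = theta - alpha * phi * 2 * (yhat_tr - y_tr) * x_tr
    dtheta_dphi = -alpha * 2.0 * (yhat_tr - y_tr) * x_tr
    theta_new = inner_step(theta, phi, x_tr, y_tr, alpha)
    # outer objective: (theta' * x_val - y_val)^2, chain rule through theta'
    return 2.0 * (theta_new * x_val - y_val) * x_val * dtheta_dphi

# Toy data consistent with y = 2x; start from theta = 0, phi = 1.
theta, phi = 0.0, 1.0
x_tr, y_tr, x_val, y_val = 1.0, 2.0, 1.5, 3.0

before = (inner_step(theta, phi, x_tr, y_tr) * x_val - y_val) ** 2
phi -= 0.5 * meta_grad(theta, phi, x_tr, y_tr, x_val, y_val)  # outer update
after = (inner_step(theta, phi, x_tr, y_tr) * x_val - y_val) ** 2
print(before, after)  # the outer (validation) loss shrinks after updating phi
```

In practice the inner loop is unrolled over many steps with automatic differentiation rather than hand-derived gradients, but the nesting structure is exactly this.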

2. Methodological Frameworks and Implementation

Several dominant methodologies for learning loss functions have emerged:

  • Adversarial/Discriminative Loss Learning: DAN replaces the GAN generator with a predictor network and the discriminator with a Judge network, treating the loss as the adversary’s response in a discriminative game. This mechanism is notable for semi-supervised and structured prediction tasks (Santos et al., 2017).
  • Meta-Learning via Bilevel Optimization: Loss parameterization (using a neural net, polynomial, or symbolic expression) is learned in a bilevel framework, updating loss parameters to maximize validation or meta-test performance across tasks. Meta-optimizers unroll through multiple steps of base-learner updates using the learned loss (Grabocka et al., 2019, Bechtle et al., 2019, Raymond et al., 2022, Raymond et al., 2023).
  • Neuro-symbolic Loss Search: Symbolic loss forms are generated by GP or symbolic regression; local parameters are then learned via gradient descent on a computational graph representation (Raymond et al., 2022, Raymond et al., 1 Mar 2024).
  • Error Loss Networks: The loss is defined as a parametric (e.g., RBF) neural network mapping from the error variable, often trained to match an information-theoretic criterion such as negative density or correntropy (Chen et al., 2021).
  • Fenchel–Young and Bregman Divergence-based Losses: Losses are constructed via convex duality; their properties (e.g., separation margin, sparsity) follow directly from the regularizer’s choice (Blondel et al., 2019, Nock et al., 2020).
  • Online/Adaptive Loss Update: Rather than learning a fixed loss during meta-training, the loss is updated at each model update step, always aligning with the model’s current learning needs and mitigating short-horizon bias (Raymond et al., 2023).

Implementation typically involves automatic differentiation over nested or unrolled optimization paths, care with stability (e.g., a smooth leaky ReLU activation for the loss network (Raymond et al., 2023)), and ensuring permutation invariance for batchwise, set-based metrics.
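Both implementation concerns just mentioned—smooth activations and set invariance—can be illustrated with a toy loss network. The particular smooth leaky variant below (a blend of a linear term and softplus) and all weights are illustrative assumptions, not the exact form used in the cited work; the point is that mean pooling makes the batchwise loss indifferent to example order.

```python
import math

def smooth_leaky(x, gamma=0.01):
    # One possible smooth leaky-ReLU variant (an illustrative choice, not
    # necessarily the cited paper's): behaves like gamma*x for very negative
    # x and like x for large x, with no kink at zero.
    softplus = math.log1p(math.exp(-abs(x))) + max(x, 0.0)  # stable softplus
    return gamma * x + (1.0 - gamma) * softplus

def loss_net(y, yhat, w):
    """Toy learned loss: a per-example map g, mean-pooled over the batch,
    followed by a scalar head h. Mean pooling makes the loss
    permutation-invariant over batch order."""
    pooled = sum(
        smooth_leaky(w["g1"] * (yi - yh) + w["g0"]) for yi, yh in zip(y, yhat)
    ) / len(y)
    return smooth_leaky(w["h1"] * pooled + w["h0"])

w = {"g1": 1.3, "g0": -0.2, "h1": 0.8, "h0": 0.1}  # illustrative weights
y    = [1.0, 0.0, 2.0, -1.0]
yhat = [0.8, 0.4, 1.5, -0.5]

base = loss_net(y, yhat, w)
perm = [3, 1, 0, 2]  # shuffle the batch
shuf = loss_net([y[i] for i in perm], [yhat[i] for i in perm], w)
print(abs(base - shuf))  # ~0: the order of examples in the batch is irrelevant
```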

3. Empirical Performance and Comparative Impact

Learned loss functions have repeatedly demonstrated superior or competitive performance compared to standard hand-crafted losses:

  • Semi-supervised and Structured Prediction: Adversarially-learned losses in DAN outperform pairwise hinge and negative log-likelihood losses, especially with few labeled examples (e.g., MAP of 0.6891 for answer selection on SelQA with 10 labels vs. 0.4610 with hinge loss) (Santos et al., 2017).
  • Meta-Learned Losses: Methods such as ML³, VIABLE, and NPBML yield marked gains in few-shot and reinforcement learning—NPBML achieves 1-shot accuracy of 57.5% (4-CONV, mini-ImageNet) vs. ~49% for MAML, with larger margins in 5-shot and deeper networks (Raymond, 14 Jun 2024).
  • Surrogate Loss Networks: Learning set-based surrogate losses enhances AUC, F1, Jaccard, and other metrics across benchmark datasets, outperforming handcrafted surrogates and converging faster with improved sample efficiency (Grabocka et al., 2019).
  • Noise Robustness: Polynomial parameterizations or symbolic search with evolutionary strategies yield losses that resist label noise, outperforming robust baselines (e.g., Generalized Cross-Entropy) both in final performance and stability over architectures and datasets (Gao et al., 2021).
  • Adaptive Loss (Online): AdaLFL achieves persistently lower inference error, responding to training dynamics rather than being biased toward early-stage optimization; e.g., 9.03% error vs. 10.36% for VGG-16/CIFAR-10 (Raymond et al., 2023).
  • Feasibility and Optimality Trade-offs: In learning-based OPF, training with decision loss yields lower “regret” (cost gap versus optimal) and improved feasibility over MSE, especially under discontinuous mappings (Chen et al., 1 Feb 2024).

4. Theoretical Guarantees, Expressiveness, and Regularity

Theoretical contributions include:

  • Convexity and Calibration: Fenchel–Young losses are always convex w.r.t. the score argument; properness and strong properness connect directly to risk minimization and separation margins. For distribution learning, calibration constraints enable broader loss function classes to achieve concentration and robust risk minimization (Blondel et al., 2019, Haghtalab et al., 2019).
  • Universal Approximation: Permutation-invariant loss networks (Kolmogorov–Arnold representation) can, in principle, approximate any set-based loss function up to the complexity of the network (Grabocka et al., 2019).
  • Convergence Guarantees: BregmanTron generalizes SLIsotron and jointly learns the classifier and loss, with proven reduction in loss at every step under mild regularity conditions (Nock et al., 2020).
  • Loss Transferability: Learned losses (especially canonical ones from Bregman divergences) can be “recycled” across domains and tasks, with empirical support for successful loss transfer (Nock et al., 2020).
  • Combinatorial Calculus: The geometry and calculus of losses based on convex sets and their polars facilitate systematic construction, interpolation, and inverse mapping between loss functions, with closure under M-sums and dual operations (Williamson et al., 2022).
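The Bregman-divergence construction underlying several of these results is mechanical enough to sketch directly: any differentiable, strictly convex generator $F$ induces a nonnegative loss $D_F(y, \hat{y}) = F(y) - F(\hat{y}) - F'(\hat{y})(y - \hat{y})$. The two generators below are standard textbook examples, not specific to any cited paper.

```python
import math

def bregman(F, dF, y, yhat):
    """Bregman divergence D_F(y, yhat) = F(y) - F(yhat) - F'(yhat)*(y - yhat).
    Any differentiable, strictly convex generator F yields a nonnegative
    loss that is zero exactly when y == yhat."""
    return F(y) - F(yhat) - dF(yhat) * (y - yhat)

# Generator F(u) = u^2 recovers squared error: D_F(y, yhat) = (y - yhat)^2.
sq, dsq = (lambda u: u * u), (lambda u: 2.0 * u)
squared = bregman(sq, dsq, 0.3, 0.9)

# Generator F(u) = u*log(u) + (1-u)*log(1-u) (negative binary entropy)
# yields the binary KL divergence -- the canonical proper loss behind
# logistic regression in the Bregman / proper-loss view.
ent = lambda u: u * math.log(u) + (1 - u) * math.log(1 - u)
dent = lambda u: math.log(u / (1 - u))
kl = bregman(ent, dent, 0.3, 0.9)

print(squared)  # 0.36, i.e. (0.3 - 0.9)^2
print(kl)       # > 0; zero only when the two arguments coincide
```

Learning the generator (as BregmanTron does with the link function) therefore amounts to searching this family of automatically valid losses.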

5. Practical Applications and Design Choices

Applications of learned loss functions span:

  • Natural Language Processing and Text Ranking: DAN is applied to answer selection and classification (Santos et al., 2017).
  • CV, Tabular, and Time Series: Loss function learning frameworks (e.g., EvoMAL, ML³) demonstrate impact in regression (sine, tabular), vision (MNIST, CIFAR-10), and NLP (surname classification), often improving sample efficiency or generalization (Raymond et al., 1 Mar 2024, Raymond et al., 2022).
  • Time-sensitive or Domain-specific Tasks: Adaptive losses are used for dynamic system control (RL setting), scientific computing (PINN for parameterized PDEs), and power systems (real-time OPF surrogates) (Bechtle et al., 2019, Psaros et al., 2021, Chen et al., 1 Feb 2024).
  • Robustness and Noise Handling: Taylor polynomial and RBF network-based learned losses excel in high-noise, label-corrupted, or outlier-prone settings (Gao et al., 2021, Chen et al., 2021).
  • Composite and Multi-modal Losses: Geometry-based frameworks permit the modular synthesis of new losses by combining existing ones using convex set calculus and M-sums, enabling customization for multi-modal and hierarchical tasks (Williamson et al., 2022).

Critical implementation considerations include the choice of parameterization (network, symbol, or convex function), preservation of convexity for stable training, computational efficiency, and alignment with target metrics.
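One concrete instance of the Taylor-polynomial parameterization mentioned above: truncating the series $-\log p = \sum_{k \ge 1} (1-p)^k / k$ at order $K$ yields a loss that stays bounded as $p \to 0$, which is the basic mechanism behind robustness to mislabeled examples. The sketch below fixes the coefficients at $1/k$ for simplicity; the cited evolutionary approach learns the coefficients instead.

```python
import math

def taylor_ce(p, K=2):
    """Truncated Taylor expansion of cross-entropy around p = 1:
    -log(p) = sum_{k>=1} (1 - p)^k / k.  Truncating at order K bounds the
    loss (and its gradient) for confidently wrong predictions (p -> 0),
    which is the basic source of robustness to label noise."""
    return sum((1.0 - p) ** k / k for k in range(1, K + 1))

p = 0.01  # model is confidently wrong about a (possibly mislabeled) class
print(-math.log(p))       # ~4.61: cross-entropy penalizes this heavily
print(taylor_ce(p, K=2))  # ~1.48: bounded above by sum_{k<=K} 1/k = 1.5
```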

6. Ongoing Challenges and Future Directions

Open challenges and research opportunities include:

  • Mitigating Short-Horizon Bias: Online or adaptive loss update schemes are crucial to ensure that learned losses generalize throughout the entire training trajectory rather than overfitting early-stage dynamics (Raymond et al., 2023).
  • Explaining and Interpreting Loss Landscapes: As learned losses grow in complexity, interpretability remains a challenge, motivating symbolic or white-box parameterizations (Raymond et al., 2022, Raymond et al., 1 Mar 2024).
  • Scalability to Multi-modal/Data-rich Problems: Efficient and expressive loss parameterizations are required to handle the combinatorial explosion in multi-modal and structured output settings (Elharrouss et al., 5 Apr 2025).
  • Integrating Physical and Domain Knowledge: In scientific applications (e.g., PINNs), losses must encode calibration, stationarity, and task-specific invariants, possibly via regularization or parameterized density family choices (Psaros et al., 2021).
  • Automated/NAS Joint Optimization: Simultaneous meta-learning of loss, optimizer, and architecture represents a frontier for model auto-tuning informed by procedural bias (Raymond, 14 Jun 2024).
  • Loss Function Compositionality: Convex set calculus and polarity theory provide the groundwork for systematic construction and dualization of losses, suggesting a general theory of loss combination for complex objectives (Williamson et al., 2022).

7. Summary Table: Representative Learned Loss Paradigms

| Framework | Loss Parameterization | Key Mechanism |
|---|---|---|
| DAN | Implicit (Judge as loss) | Adversarial min-max game |
| Fenchel–Young | Regularizer $\Omega$, Fenchel dual | Convex duality, Bregman |
| Surrogate Loss Networks | Neural network (set-wise) | Meta/bilevel optimization |
| Symbolic/GP Loss Search | Expression trees + parameters | Evolution + meta-gradient |
| Error Loss Networks (ELN) | RBF basis, PDF matching | Data-adaptive error mapping |
| Online AdaLFL | Neural net, online update | Joint $\theta$/$\phi$ bilevel updates |
| BregmanTron | Bregman divergence, link function | Simultaneous loss/classifier learning |
| PINN Meta-Learning | Neural net / parameterized family | Outer MSE, physics prior |

In conclusion, learned loss functions are a fundamental innovation that shifts loss design from static, handcrafted surrogates to flexible, data- or task-adaptive objectives. They utilize meta-learning, adversarial games, convex-analytic constructions, and evolutionary search to align the learning signal more closely with application- or metric-specific desiderata, yielding tangible benefits in robustness, generalization, efficiency, and the ability to optimize non-traditional metrics. As architectures grow in expressiveness and application domains widen, the development and theoretical understanding of learned losses will remain central to next-generation machine learning systems.
