Loss-Based Sampling: Methods & Applications

Updated 6 May 2026

Loss-based sampling is a technique that assigns sampling probabilities proportional to sample loss, focusing training on the most informative data.
It leverages metrics like uncertainty, TD errors, and local sensitivity to efficiently manage data selection in active learning, robust learning, and extreme classification.
The method has proven theoretical guarantees and is applied in areas such as reinforcement learning, quantile regression, and negative mining for improved convergence.

Loss-based sampling refers to a broad set of strategies in machine learning and statistical optimization where the probability of sampling data points is determined explicitly or implicitly by their contribution to a loss function. These methods exploit the fact that examples with higher loss, uncertainty, sensitivity, or related statistics can be the most informative for efficient model training, estimation, or data reduction. Loss-based sampling has found central utility in active learning, robust learning with noisy labels, experience replay in reinforcement learning, large-scale ERM, quantile regression, extreme classification, and metric learning. The mathematical basis and implementation details vary by application area, but the unifying principle is allocation of computational or labeling resources in proportion to (or in sophisticated response to) loss-based signals.

1. Core Principles and Mathematical Foundations

Loss-based sampling strategies select points with probability determined by their loss value (or a monotone transformation thereof) under a current model or a proxy/compressing model. In the most direct setting, the per-sample selection probability $p_i$ is set according to a normalized or clipped function of the loss $\ell_i = \ell(f(x_i), y_i)$:

$p_i = \min\{1, \max\{P_{\min}, \lambda \ell_i\}\}$

where $P_{\min} > 0$ (to control variance) and $\lambda>0$ are hyperparameters (Mineiro et al., 2013). Selected points are typically assigned importance weights $1/p_i$ to ensure unbiased estimation in subsampled objectives.
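The clipping rule and the $1/p_i$ weighting can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of Mineiro et al.; the function names (`sampling_probabilities`, `subsample`, `weighted_total`) are hypothetical and chosen only for this example.

```python
import random


def sampling_probabilities(losses, p_min, lam):
    """Clipped loss-proportional probabilities p_i = min(1, max(p_min, lam * loss_i))."""
    return [min(1.0, max(p_min, lam * loss)) for loss in losses]


def subsample(losses, p_min, lam, rng):
    """Keep example i with probability p_i and attach the importance weight 1 / p_i."""
    probs = sampling_probabilities(losses, p_min, lam)
    return [(i, 1.0 / p) for i, p in enumerate(probs) if rng.random() < p]


def weighted_total(values, kept):
    """Importance-weighted estimate of sum(values) computed from the subsample only."""
    return sum(weight * values[i] for i, weight in kept)
```

Because each kept example is weighted by $1/p_i$, `weighted_total` is an unbiased estimate of the full sum: averaged over many independent subsamples it approaches `sum(values)`, while any single subsample touches only part of the data.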

More advanced strategies incorporate local or global sensitivity analysis, second-order (Hessian-based) leverage scores, or nonlinear transformations of the loss for sampling distributions (Raj et al., 2019). In reinforcement learning, transition sampling can be modulated by the TD error $\delta$, with priorities $p(i) \propto |\delta(i)|^\alpha$ (Fujimoto et al., 2020). Loss-based sampling can also arise via uncertainty or entropy proxies, as in active learning (Mussmann et al., 2018; Cai et al., 2019).

The expected gradient of the sampled objective can be matched to the original (full-batch) loss provided the appropriate loss transform or importance weighting is applied, establishing theoretical equivalence between certain types of loss-based sampling and loss modification (Fujimoto et al., 2020).

2. Canonical Algorithms and Implementations

Subsampling for Empirical Risk Minimization

A three-step procedure is common: (1) fit a simple or compressed model $\tilde{h}$ to obtain losses $\ell_i$, (2) subsample the data with probabilities $p_i$ given by the clipped, scaled losses $\ell_i$, and (3) retrain the final model on the weighted subsample (Mineiro et al., 2013). This scheme comes with excess risk bounds expressed in terms of the expected subsample size.
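The subsample-then-retrain procedure can be illustrated on a toy one-dimensional regression problem. Everything in this sketch is an assumption made for illustration: the proxy is a constant predictor, the final model is a weighted least-squares line, and the synthetic data, `fit_line`, and `loss_proportional_refit` are hypothetical. The source applies the idea to far larger problems and other learners.

```python
import random


def fit_line(points, weights):
    """Weighted least-squares fit of y = slope * x + intercept."""
    sw = sum(weights)
    sx = sum(w * x for w, (x, _) in zip(weights, points))
    sy = sum(w * y for w, (_, y) in zip(weights, points))
    sxx = sum(w * x * x for w, (x, _) in zip(weights, points))
    sxy = sum(w * x * y for w, (x, y) in zip(weights, points))
    slope = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    return slope, (sy - slope * sx) / sw


def loss_proportional_refit(points, p_min, lam, rng):
    # Step 1: proxy model (a constant predictor) and its per-example squared losses.
    proxy = sum(y for _, y in points) / len(points)
    losses = [(y - proxy) ** 2 for _, y in points]
    # Step 2: keep each example with probability p_i = min(1, max(p_min, lam * loss_i)).
    kept, weights = [], []
    for point, loss in zip(points, losses):
        p = min(1.0, max(p_min, lam * loss))
        if rng.random() < p:
            kept.append(point)
            weights.append(1.0 / p)
    # Step 3: retrain the final model on the importance-weighted subsample.
    return fit_line(kept, weights), len(kept)


# Synthetic data (assumed for the demo): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(5000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))

(slope, intercept), kept_count = loss_proportional_refit(data, p_min=0.05, lam=0.5, rng=rng)
print(f"kept {kept_count} of {len(data)} points; slope={slope:.3f}, intercept={intercept:.3f}")
```

The points the proxy fits poorly (here, those far from the mean of $y$) are kept with high probability, while well-fit points are thinned out and up-weighted, so the refit sees fewer examples but targets the same weighted objective in expectation.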

Sensitivity and Leverage-Score Based Schemes

In loss functions expressible as finite sums, the sensitivity $\sigma_i$ of a point $x_i$ is the maximum (over feasible parameters $\theta$) of its fractional contribution to the total loss. Sampling in proportion to local sensitivities within a trust region can match the uniform approximation bounds for the objective in that region. These local sensitivities are efficiently approximated by ridge leverage scores of the Hessian of the quadratic Taylor expansion at the current point (Raj et al., 2019).

Raj et al. (2019) also characterize how the gradient and Hessian evaluations scale and how large the per-iteration sample must be.
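To make the leverage-score idea concrete, the sketch below computes standard ridge leverage scores $\tau_i = a_i^\top (A^\top A + \lambda I)^{-1} a_i$ for the rows $a_i$ of a generic matrix $A$ and turns them into sampling probabilities. It assumes the rows stand in for per-example factors of the Hessian; how Raj et al. build that matrix is not reproduced here. The helpers `solve`, `ridge_leverage_scores`, and `leverage_sampling_probabilities` are hypothetical names.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ridge_leverage_scores(rows, lam):
    """tau_i = a_i^T (A^T A + lam * I)^{-1} a_i for each row a_i of A."""
    d = len(rows[0])
    gram = [
        [sum(r[j] * r[k] for r in rows) + (lam if j == k else 0.0) for k in range(d)]
        for j in range(d)
    ]
    return [sum(a * b for a, b in zip(row, solve(gram, list(row)))) for row in rows]


def leverage_sampling_probabilities(rows, lam):
    """Normalize the scores so that they sum to one."""
    scores = ridge_leverage_scores(rows, lam)
    total = sum(scores)
    return [s / total for s in scores]
```

With `lam = 0` and a full-column-rank matrix, the scores sum to the number of columns; a positive `lam` shrinks every score and keeps the linear system well conditioned.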

Active Learning and Loss-Prediction

In sequence labeling (e.g., Chinese word segmentation (CWS) and electronic health record (EHR) mining), the NE-LP algorithm combines normalized entropy and a predicted loss from a shared BiLSTM-CRF/attention model (Cai et al., 2019). The acquisition score is built from two quantities: the normalized entropy of the model's predictions and the output of a loss prediction network.

Sample Selection under Label Noise

In robust learning with noisy labels, the CNLCU approach uses robust estimators of the mean loss together with confidence bounds (via soft- and hard-truncated estimators). It selects examples whose robust loss estimate minus an uncertainty penalty is small, thereby balancing exploitation and exploration. This helps rescue difficult or underrepresented but clean data points (Xia et al., 2021). The confidence intervals are built around the robust mean estimators, which exploit adaptive windowing and KNN-detected outlier removal.
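The selection rule "small robust loss minus an uncertainty penalty" can be sketched generically as follows. This is not the CNLCU algorithm: the trimmed mean stands in for the paper's truncated estimators, and the penalty `penalty / sqrt(times_selected)` is an assumed lower-confidence-bound form chosen only to show the mechanism. All function names are hypothetical.

```python
import math


def trimmed_mean(values, trim_count=1):
    """Mean after dropping the trim_count smallest and largest values (assumed estimator)."""
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count] or ordered
    return sum(kept) / len(kept)


def selection_scores(loss_histories, times_selected, penalty=1.0):
    """Robust loss estimate minus a bonus that is larger for rarely selected examples."""
    scores = []
    for history, count in zip(loss_histories, times_selected):
        bonus = penalty / math.sqrt(max(count, 1))
        scores.append(trimmed_mean(history) - bonus)
    return scores


def select_smallest(scores, k):
    """Indices of the k examples with the smallest scores."""
    return sorted(range(len(scores)), key=scores.__getitem__)[:k]


histories = [
    [0.2, 0.2, 0.2, 0.2, 5.0],  # low loss with one outlier spike
    [0.9, 0.9, 0.9, 0.9, 0.9],  # steady, moderately high loss
    [1.0, 1.0, 1.0, 1.0, 1.0],  # higher loss, but rarely selected so far
]
chosen = select_smallest(selection_scores(histories, times_selected=[5, 5, 1]), k=2)
print(chosen)
```

In this toy case the trimmed mean ignores the single spike in the first history, and the rarely selected third example is picked ahead of the second despite its higher loss. A plain mean with no penalty would instead select the second and third examples.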

Negative and Hard Example Mining

In extreme classification, retrieval, or recommendation (e.g., TRON), loss-based negative sampling targets the hardest negatives per anchor or query by taking the top-$k$ scoring negatives after neural inference over a large candidate pool. Only the most confounding negatives, those that increase the (sampled) loss, are retained for gradient computation (Wilm et al., 2023). This regime is reported to accelerate convergence and stabilize gradients in large output spaces.
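Top-$k$ hard negative selection can be sketched with the standard library alone. The example assumes a flat list of model scores for one query and a sampled softmax loss over the positive and the retained negatives. It is a generic illustration rather than the TRON implementation, and both function names are hypothetical.

```python
import heapq
import math


def top_k_hard_negatives(scores, positive_ids, k):
    """Indices of the k highest-scoring candidates that are not positives."""
    positives = set(positive_ids)
    negatives = ((score, i) for i, score in enumerate(scores) if i not in positives)
    return [i for _, i in heapq.nlargest(k, negatives)]


def sampled_softmax_loss(scores, positive_id, negative_ids):
    """Cross-entropy of the positive against only the retained negatives."""
    logits = [scores[positive_id]] + [scores[i] for i in negative_ids]
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - scores[positive_id]


scores = [2.0, 0.1, 1.5, -3.0, 1.9]  # candidate 0 is the positive
hard = top_k_hard_negatives(scores, positive_ids=[0], k=2)
hard_loss = sampled_softmax_loss(scores, 0, hard)
easy_loss = sampled_softmax_loss(scores, 0, [1, 3])
print(hard, hard_loss > easy_loss)
```

The hard negatives are the candidates the model currently confuses with the positive, so they produce a larger sampled loss, and therefore a stronger gradient signal, than low-scoring negatives do.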

3. Theoretical Guarantees and Equivalence Results

A key insight is the equivalence between non-uniform (loss-based or prioritized) sampling and uniform sampling under a transformed or reweighted loss (Fujimoto et al., 2020). In the context of prioritized experience replay (PER):

Sampling transition $i$ with probability $p(i) \propto |\delta(i)|^\alpha$ and using the MSE loss $\frac{1}{2}\delta(i)^2$ yields the same expected update, up to a positive scale factor, as
uniform sampling with the loss $\frac{1}{\alpha+2}|\delta(i)|^{\alpha+2}$.

The closed-form uniform-equivalent loss (for MSE or Huber) matches the effect of the sampling, so explicit importance sampling (IS) weights can be omitted if this loss is used. This generalizes to arbitrary per-example priorities beyond TD error.
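The equivalence can be checked numerically in a small case. The sketch assumes a scalar linear value estimate $\theta x_i$, TD errors $\delta_i = y_i - \theta x_i$ with fixed targets $y_i$, the MSE loss, priorities $|\delta_i|^\alpha$, and no importance weights. The data and function names are made up for the check. Under these assumptions the prioritized expected gradient should equal the uniform expected gradient of $\frac{1}{\alpha+2}|\delta|^{\alpha+2}$ multiplied by $N / \sum_j |\delta_j|^\alpha$.

```python
def per_expected_gradient(deltas, features, alpha):
    """Expected MSE gradient when transition i is drawn with p(i) proportional to |delta_i|**alpha."""
    z = sum(abs(d) ** alpha for d in deltas)
    # With delta = target - theta * x, the gradient of 0.5 * delta**2 in theta is -delta * x.
    return sum((abs(d) ** alpha / z) * (-d * x) for d, x in zip(deltas, features))


def uniform_expected_gradient(deltas, features, alpha):
    """Expected gradient of |delta|**(alpha + 2) / (alpha + 2) under uniform sampling."""
    n = len(deltas)
    return sum(abs(d) ** alpha * (-d * x) for d, x in zip(deltas, features)) / n


theta, alpha = 0.5, 0.6
features = [1.0, -2.0, 0.5, 3.0]
targets = [1.0, 0.0, 2.0, -1.0]
deltas = [y - theta * x for x, y in zip(features, targets)]

scale = len(deltas) / sum(abs(d) ** alpha for d in deltas)
prioritized = per_expected_gradient(deltas, features, alpha)
uniform = uniform_expected_gradient(deltas, features, alpha)
print(abs(prioritized - scale * uniform) < 1e-9)
```

The scale factor depends only on the current batch of priorities, so it acts like a per-batch learning-rate adjustment rather than a change in update direction.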

For quantile regression, loss-based row sampling using Lewis weights achieves $(1 \pm \epsilon)$ approximation guarantees for the quantile loss, with a sample complexity near-linear in the dimension $d$ (Li et al., 2020).

In local sensitivity frameworks, sampling with probabilities proportional to local sensitivity ensures, with high probability, a $(1 \pm \epsilon)$ uniform approximation to the objective over a ball around the current iterate (Raj et al., 2019). These theoretical results indicate that aggressive data reduction via loss-based sampling does not induce uncontrolled bias.

4. Extensions to Structured, Sequential, and Noisy Settings

Loss-based sampling is prevalent in contexts with structured outputs, labeling noise, extreme data imbalance, or large-output spaces.

Noisy Labels: Selecting samples with the smallest estimated loss is robust to label corruption, but large-loss data may be either noisy or genuinely underrepresented, so recent methods also account for uncertainty in the loss estimate. Confidence-calibrated selection scores can improve generalization and minority class inclusion (Xia et al., 2021).
Memory Replay in RL: TD-error based sampling in prioritized experience replay, loss-adjusted replay via Huber-based priorities, and their analytic loss transformation greatly improve learning stability and efficiency (Fujimoto et al., 2020).
Extreme Classification/Long-Tail: Theoretical connections show that negative sampling schemes in extreme classification implicitly define a rescaled loss. This modulates head/tail trade-offs and allows explicit control: the sampling distribution and the weights can be chosen to target a desired rebalancing (Rawat et al., 2021).

The paradigm extends to deep metric learning, where negative anchor selection based on distance in the embedding space (loss-informed selection) affects intra/inter-class separation and final performance (Rajoli et al., 2023).

5. Empirical Performance and Practical Guidelines

Loss-based sampling, when designed and tuned appropriately (e.g., clipping rates, choosing local sensitivity radii, or calibrating negative sample batch sizes), has outperformed uniform or random sampling in several reported settings:

In terascale DNA sequence classification, loss-proportional subsampling enabled the use of boosted trees and matched or exceeded full-data performance using only a fraction of the data (Mineiro et al., 2013).
In session-based recommendation, hard negative mining by model-score top-$k$ selection accelerated convergence and improved click-through rates in live A/B testing (Wilm et al., 2023).
In noisy-label learning, uncertainty-calibrated loss selection improved minority class accuracy by up to 30% over baselines (Xia et al., 2021).
Quantile regression sampling with Lewis weights scaled experiments to tens of millions of rows and exceeded prior methods both in accuracy and runtime (Li et al., 2020).

Hyperparameters such as the minimum sampling rate $P_{\min}$, the sampling slope $\lambda$, or the hardness threshold $k$ in negative mining should be chosen to balance variance, computational cost, and statistical fidelity.

6. Unified Frameworks and Ongoing Directions

There is a growing appreciation that loss-based sampling and explicit loss modification are two facets of the same principle: one can design a target rebalancing objective, then choose the sampling distribution and weighting accordingly to implement it efficiently (Rawat et al., 2021, Fujimoto et al., 2020). This has led to unified schemes in large-output spaces and beyond, where sampling is optimized both for efficiency and to induce desired head/tail trade-offs. The field continues to expand toward more sophisticated priorities based on feature uncertainty, adaptive sensitivity, and structured outputs.

A plausible implication is that as models, datasets, and loss landscapes become even more complex, loss-based sampling and its equivalence to tailored losses will provide foundational tools for both scalable and robust learning.

Selected Key References with Domains:

Reference	Domain / Setting	Sampling Statistic
(Mineiro et al., 2013)	Empirical risk minimization, data reduction	Loss under proxy model
(Fujimoto et al., 2020)	RL/Experience replay	TD error
(Xia et al., 2021)	Robust/Noisy labels	Robust loss + uncertainty
(Raj et al., 2019)	Smooth/convex optimization	Local sensitivity, leverage
(Cai et al., 2019)	Active learning (CWS)	Entropy + predicted loss
(Li et al., 2020)	Quantile regression	Lewis weights of loss
(Rajoli et al., 2023)	Metric learning/embedding	Hardest negative distances
(Wilm et al., 2023)	Recommender systems	Top-$k$ model loss scores
(Rawat et al., 2021)	Extreme classification	Negative loss weighting
(Mussmann et al., 2018)	Uncertainty sampling	Margin/loss-based density