Self-Paced Learning (SPL): Methods & Applications

Updated 21 January 2026

Self-paced learning is a machine learning paradigm that gradually introduces training samples from easy to hard using dynamic sample weighting.
It employs a self-paced regularizer and pace parameter to adaptively select samples, grounded in MM algorithms and nonconvex robust loss formulations.
SPL has broad applications including deep metric learning, distributed optimization, fairness-aware learning, robust hashing, and ensemble methods.

Self-paced learning (SPL) is a machine learning paradigm in which training proceeds from “easy” samples towards increasingly “hard” ones, mimicking the learning behavior observed in humans and animals. By introducing a dynamic sample-weighting mechanism—governed by a “self-paced” regularizer and a pace (age) parameter—SPL adaptively selects which data are emphasized at each training stage, which robustifies the learner against noise and provides a principled “easy-to-hard” curriculum within the objective. SPL is broadly applicable across supervised learning, unsupervised clustering, deep representation learning, robust hashing, ensemble methods, and large-scale distributed optimization. Its theoretical foundation is intimately connected to nonconvex robust loss minimization, majorization-minimization (MM) algorithms, and concave conjugacy theory. The following sections elaborate the mathematical formulations, theoretical underpinnings, algorithmic instantiations, and representative SPL extensions.

1. Mathematical Formulation and Latent Objective

At its core, SPL introduces continuous or binary sample weights $v_i \in [0,1]$ into empirical risk minimization, together with a self-paced regularizer $f_\lambda(v)$ parameterized by a pace parameter $\lambda$: $\min_{w, \; v \in [0,1]^n} \; \sum_{i=1}^n v_i \, \ell_i(w) \;+\; f_\lambda(v) \;+\; R(w)$, where $\ell_i(w)$ is the per-sample loss, $R(w)$ is a standard model regularizer, and $f_\lambda(v) = \sum_i g(v_i;\lambda)$ is chosen such that for each $i$ the minimizer $v_i^*(\ell,\lambda) = \arg\min_{v_i \in [0,1]} v_i \ell + g(v_i;\lambda)$ is nonincreasing in $\ell$ and nondecreasing in $\lambda$, with $\lim_{\ell \to 0} v_i^*(\ell,\lambda) = 1$ and $\lim_{\ell \to \infty} v_i^*(\ell,\lambda) = 0$ (Ma et al., 2017, Meng et al., 2015, Liu et al., 2018).

Typical choices for the self-paced regularizer include:

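Hard selection: $f_\lambda(v) = -\lambda \sum_{i=1}^n v_i$
Linear soft weighting: $f_\lambda(v) = \lambda \sum_{i=1}^n \left( \tfrac{1}{2} v_i^2 - v_i \right)$
Polynomial or mixture: $f_\lambda(v) = \lambda \sum_{i=1}^n \left( \tfrac{1}{t} v_i^t - v_i \right)$ for $t > 1$ (Wang et al., 2017, Meng et al., 2015).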
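As a rough illustration of how these regularizers translate into sample weights, the sketch below evaluates the closed-form minimizers $v_i^*$ under assumed standard forms: hard $-\lambda \sum_i v_i$, linear $\lambda \sum_i (\tfrac{1}{2} v_i^2 - v_i)$, and polynomial $\lambda \sum_i (\tfrac{1}{t} v_i^t - v_i)$ with $t > 1$. The function names are illustrative and do not come from any SPL library.

```python
import numpy as np

def hard_weights(losses, lam):
    """Binary weights: keep a sample only if its loss is below the pace lam."""
    return (np.asarray(losses, dtype=float) < lam).astype(float)

def linear_weights(losses, lam):
    """Soft weights that decay linearly from 1 (zero loss) to 0 (loss >= lam)."""
    return np.clip(1.0 - np.asarray(losses, dtype=float) / lam, 0.0, 1.0)

def polynomial_weights(losses, lam, t=2.0):
    """Weights (1 - loss/lam)^(1/(t-1)), clipped to [0, 1]; assumes t > 1."""
    base = np.clip(1.0 - np.asarray(losses, dtype=float) / lam, 0.0, 1.0)
    return base ** (1.0 / (t - 1.0))

losses = np.array([0.0, 0.5, 1.0, 2.0])
print(hard_weights(losses, lam=1.0))               # losses at or above lam get weight 0
print(linear_weights(losses, lam=1.0))
print(polynomial_weights(losses, lam=1.0, t=3.0))
```

With $t = 2$ the polynomial weights coincide with the linear ones; larger $t$ keeps weights closer to 1 until the loss approaches $\lambda$.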

Alternating minimization in $(w, v)$ on the above joint objective is equivalent to an MM procedure on the “implicit” or “latent” SPL objective: $\min_{w} \; \sum_{i=1}^n F_\lambda(\ell_i(w)) + R(w)$, where $F_\lambda(\ell) = \int_0^{\ell} v^*(u,\lambda)\,du$ is concave and saturates for large $\ell$, directly conferring nonconvex robust loss properties (Meng et al., 2015, Liu et al., 2018, Ma et al., 2017).
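The alternating scheme can be sketched for a concrete case. The example below assumes a squared loss, a ridge penalty for $R(w)$, the linear soft-weighting regularizer, and a fixed pace $\lambda$; the function name `spl_alternating` and the toy data are hypothetical. Because each step minimizes the joint objective exactly in one block of variables, the recorded objective values should never increase.

```python
import numpy as np

def spl_alternating(X, y, lam, n_iters=20, alpha=1e-3):
    """Alternate a weighted ridge fit (w-step) with closed-form soft weights (v-step)."""
    n, d = X.shape
    v = np.ones(n)                       # start with every sample fully weighted
    history = []                         # joint objective after each full sweep
    for _ in range(n_iters):
        # w-step: minimize sum_i v_i * (x_i^T w - y_i)^2 + alpha * ||w||^2
        A = X.T @ (v[:, None] * X) + alpha * np.eye(d)
        w = np.linalg.solve(A, X.T @ (v * y))
        # v-step: minimizer of v * loss + lam * (v^2 / 2 - v) over [0, 1]
        losses = (X @ w - y) ** 2
        v = np.clip(1.0 - losses / lam, 0.0, 1.0)
        joint = np.sum(v * losses) + lam * np.sum(0.5 * v**2 - v) + alpha * (w @ w)
        history.append(joint)
    return w, v, history

# Toy data: y = 1 + 2x with small noise, and every tenth target shifted by +10.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(50)
y[4::10] += 10.0

w, v, history = spl_alternating(X, y, lam=9.0)
print(w, v[4::10])
```

In a full SPL run, $\lambda$ would additionally be increased between stages so that harder samples are admitted over time; it is held fixed here to isolate the alternating step.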

2. Theoretical Foundations and Robustness

SPL’s alternating minimization scheme is theoretically justified via the MM perspective and Zangwill’s global convergence theorem. Under mild regularity conditions (e.g., coercivity of the regularizer, differentiability of the per-sample losses $\ell_i$, and continuity of the weight function $v^*$), the SPL iterates provably converge to a critical point of the latent SPL objective (Ma et al., 2017).
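The MM reading can be made explicit with a short derivation. It is a sketch under the assumption that the latent per-sample loss, written here as $F_\lambda$, is concave with derivative equal to the weight function $v^*$ wherever it is differentiable; $w^k$ denotes the current iterate and $C^k$ collects terms that do not depend on $w$.

```latex
\begin{aligned}
F_\lambda(\ell) &\le F_\lambda(\ell_i^k) + v_i^k \,\bigl(\ell - \ell_i^k\bigr),
  \qquad \ell_i^k = \ell_i(w^k), \quad v_i^k = v_i^*(\ell_i^k, \lambda), \\
\sum_{i=1}^n F_\lambda(\ell_i(w)) + R(w)
  &\le \sum_{i=1}^n v_i^k \,\ell_i(w) + R(w) + C^k,
  \qquad C^k = \sum_{i=1}^n \bigl( F_\lambda(\ell_i^k) - v_i^k \ell_i^k \bigr), \\
w^{k+1} &\in \arg\min_{w} \; \sum_{i=1}^n v_i^k \,\ell_i(w) + R(w).
\end{aligned}
```

The bound is tight at $w = w^k$, so minimizing the weighted surrogate cannot increase the latent objective, which is the descent property that the convergence argument builds on.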

The concave nature of the latent loss $F_\lambda$ implies that SPL effectively minimizes a nonconvex robust penalty on the losses. Notably, hard and linear SPL correspond, respectively, to the capped-$\ell_1$ penalty and to MCP/SCAD class penalties in robust statistics:

Hard: $F_\lambda^{\text{hard}}(\ell) = \min(\ell, \lambda)$
Linear: $F_\lambda^{\text{linear}}(\ell) = \ell - \ell^2 / (2\lambda)$ if $\ell < \lambda$, and $\lambda / 2$ otherwise (Meng et al., 2015, Liu et al., 2018). Thus, large-loss (outlier or noisy) samples contribute a vanishing gradient, imparting robustness and implicit outlier rejection (Meng et al., 2015, Fan et al., 2016).
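The saturation behavior can be checked numerically. The sketch below assumes the latent losses take the capped form $\min(\ell, \lambda)$ for hard SPL and $\ell - \ell^2/(2\lambda)$, capped at $\lambda/2$, for linear SPL; it then compares the numerical slope of the linear latent loss with the soft weight $\max(0, 1 - \ell/\lambda)$. Function names are illustrative.

```python
import numpy as np

def latent_hard(loss, lam):
    """Capped form: grows linearly with the loss, then saturates at lam."""
    return np.minimum(loss, lam)

def latent_linear(loss, lam):
    """MCP-like form: bends quadratically below lam, constant lam / 2 above."""
    loss = np.asarray(loss, dtype=float)
    return np.where(loss < lam, loss - loss**2 / (2.0 * lam), lam / 2.0)

lam = 2.0
grid = np.linspace(0.0, 5.0, 501)

# The slope of the latent loss should match the SPL weight at that loss value.
slope = np.gradient(latent_linear(grid, lam), grid)
weight = np.clip(1.0 - grid / lam, 0.0, 1.0)
max_gap = np.max(np.abs(slope - weight))
print(max_gap)          # small; limited only by finite-difference error
print(slope[-1])        # zero slope for large losses: outliers stop contributing
```

The zero slope beyond $\lambda$ is what makes a large-loss sample invisible to the model update.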

A central insight is the equivalence of SPL to a latent concave objective obtained via concave conjugacy theory: with $F_\lambda$ the concave latent per-sample loss, $F_\lambda(\ell) = \min_{v \in [0,1]} \{ v \ell - F_\lambda^*(v) \}$, where $F_\lambda^*(v) = \inf_{\ell \ge 0} \{ v \ell - F_\lambda(\ell) \}$ and $(\cdot)^*$ denotes concave conjugation, so that $-F_\lambda^*$ plays the role of the self-paced regularizer $g(\cdot;\lambda)$ up to an additive constant. This provides constructive means for designing new SPL regularizers either by specifying the sample-weight function $v^*(\ell,\lambda)$ or by convex conjugates (Liu et al., 2018, Fan et al., 2016).
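The link between the joint objective and the latent loss can be illustrated by brute force. The sketch below assumes the linear regularizer $g(v;\lambda) = \lambda(\tfrac{1}{2}v^2 - v)$, minimizes $v\ell + g(v;\lambda)$ over a grid of weights, and compares the result with the closed-form latent loss shifted by the constant $g(1;\lambda) = -\lambda/2$. All names are illustrative.

```python
import numpy as np

lam = 2.0
v_grid = np.linspace(0.0, 1.0, 2001)          # candidate weights in [0, 1]
losses = np.linspace(0.0, 5.0, 101)

def g_linear(v, lam):
    """Assumed linear self-paced regularizer for a single weight."""
    return lam * (0.5 * v**2 - v)

# Partial minimization over v for every loss value (rows: losses, columns: weights).
joint = losses[:, None] * v_grid[None, :] + g_linear(v_grid, lam)[None, :]
envelope = joint.min(axis=1)

# Closed-form latent loss, shifted by the constant that g takes at v = 1.
closed_form = np.where(losses < lam, losses - losses**2 / (2 * lam), lam / 2) - lam / 2
gap = np.max(np.abs(envelope - closed_form))
print(gap)
```

The additive shift does not depend on the model parameters, so it leaves the minimizer in $w$ unchanged.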

3. SPL Variants and Algorithmic Extensions

Distributed and Large-Scale SPL

On large data, SPL’s coupled instance-weight updates hinder parallelization. Distributed SPL (DSPL) reformulates the problem as consensus optimization using ADMM, enabling per-batch updates of local models and weights, followed by global parameter aggregation (Zhang et al., 2018). The DSPL algorithm exhibits linear scaling with batch number and outperforms standard SPL and robust regression baselines under high corruption.
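A minimal consensus-ADMM sketch of this idea follows. It is not the DSPL algorithm of Zhang et al. (2018) itself: it assumes a squared loss, linear soft weights, one weight update per ADMM sweep, and no global regularizer, and the names `dspl_consensus`, `rho`, and the toy data are hypothetical. Each batch solves a small weighted least-squares problem pulled toward the consensus vector `z`, and the batches only communicate through the averaging step.

```python
import numpy as np

def dspl_consensus(batches, lam=9.0, rho=1.0, n_iters=50):
    """Consensus-ADMM sketch: local weighted fits and local SPL weights, then averaging."""
    d = batches[0][0].shape[1]
    z = np.zeros(d)                                   # global (consensus) parameters
    w = [np.zeros(d) for _ in batches]                # local parameters, one per batch
    u = [np.zeros(d) for _ in batches]                # scaled dual variables
    v = [np.ones(len(y)) for _, y in batches]         # local self-paced weights
    for _ in range(n_iters):
        for b, (X, y) in enumerate(batches):          # independent across batches
            # Local w-step: weighted least squares with a proximal pull toward z.
            A = 2.0 * X.T @ (v[b][:, None] * X) + rho * np.eye(d)
            rhs = 2.0 * X.T @ (v[b] * y) + rho * (z - u[b])
            w[b] = np.linalg.solve(A, rhs)
            # Local v-step: linear soft weights from this batch's own losses.
            losses = (X @ w[b] - y) ** 2
            v[b] = np.clip(1.0 - losses / lam, 0.0, 1.0)
        # Global aggregation, then dual update on every batch.
        z = np.mean([w[b] + u[b] for b in range(len(batches))], axis=0)
        for b in range(len(batches)):
            u[b] = u[b] + w[b] - z
    return z, v

# Toy data: y = 1 + 2x with small noise; every tenth target is shifted by +10.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.05 * rng.standard_normal(100)
y[::10] += 10.0
batches = [(X[i::4], y[i::4]) for i in range(4)]

z, v = dspl_consensus(batches)
print(z)
```

The inner loop over batches has no cross-batch dependencies, which is the property that makes the per-batch updates parallelizable.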

Deep and Metric-Driven SPL

SPL readily integrates with deep neural networks and metric learning:

Deep SPL for person re-identification: Splitting the triplet loss across a polynomial self-paced regularizer, with symmetric gradient regularization for balanced metric learning (Zhou et al., 2017).
SPL-ADVisE: Merges SPL with deep metric clustering (e.g., Magnet Loss) to implement both easiness and diversity priors, selecting samples that are easy and diverse in learned embedding space (Thangarasa et al., 2018).

Fairness-Informed SPL

Classical SPL may be biased against underrepresented examples. Fairness-aware SPL variants (e.g., in deep regression forests) augment the selection score with predictive entropy, so that both easy and underrepresented (high-uncertainty) samples are included early. This improves accuracy and fairness in regression forests for imbalanced continuous tasks (Pan et al., 2020, Pan et al., 2021).
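To make the idea of an entropy-augmented selection score concrete, the sketch below ranks samples by a hypothetical additive score, negative loss plus `beta` times predictive entropy. This form is an assumption chosen for illustration and is not the score used by Pan et al.; the function names and the toy numbers are likewise invented.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of each row of a probability matrix."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=1)

def select_easy_and_uncertain(losses, probs, keep, beta=1.0):
    """Hypothetical ranking: low loss raises the score, and so does high entropy."""
    score = -np.asarray(losses, dtype=float) + beta * entropy(probs)
    order = np.argsort(-score)            # highest score first
    return np.sort(order[:keep])

losses = np.array([0.1, 0.2, 0.6, 1.0])
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.99, 0.01]])

loss_only = select_easy_and_uncertain(losses, probs, keep=2, beta=0.0)
augmented = select_easy_and_uncertain(losses, probs, keep=2, beta=2.0)
print(loss_only, augmented)
```

With `beta=0.0` only the two lowest-loss samples are kept; with the entropy term, the maximally uncertain third sample displaces the confident first one.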

Robust Hashing and Noisy Labels

SPL can be incorporated into robust hashing for cross-modal retrieval, using per-instance self-paced weights that are dynamically thresholded via the self-paced regularizer. The scheme filters out noisy-labeled pairs and gradually learns codes from clean to ambiguous data, significantly improving retrieval in noisy-label regimes (Pu et al., 2025).

Ensemble and Active Learning Regimes

Ensemble-based SPL (e.g., self-paced ensemble learning, SPEL) leverages ensemble confidence to construct more reliable pseudo-label curricula in unsupervised domain adaptation, outperforming single-model SPL baselines (Ristea et al., 2021). Active SPL frameworks alternate high-confidence pseudo-labeling (SPL) with selective active learning (AL) on ambiguous samples to minimize annotation cost and accelerate convergence, as demonstrated in progressive face identification (Lin et al., 2017).
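One plausible reading of an ensemble-driven pseudo-label curriculum is sketched below: member probabilities are averaged, and only samples whose averaged top probability reaches a threshold are pseudo-labeled, with the threshold lowered over stages. This is an illustrative assumption, not the procedure of Ristea et al. (2021); `ensemble_pseudo_labels` and the toy probabilities are hypothetical.

```python
import numpy as np

def ensemble_pseudo_labels(prob_list, threshold):
    """Average member probabilities; keep samples whose top averaged probability reaches the threshold."""
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    labels = mean_probs.argmax(axis=1)
    confidence = mean_probs.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]

# Two hypothetical ensemble members scoring three unlabeled samples.
model_a = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
model_b = np.array([[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])

strict_idx, strict_labels = ensemble_pseudo_labels([model_a, model_b], threshold=0.8)
loose_idx, loose_labels = ensemble_pseudo_labels([model_a, model_b], threshold=0.5)
print(strict_idx, strict_labels)
print(loose_idx, loose_labels)
```

The second sample, on which the members disagree, is only admitted once the threshold is relaxed, which mirrors the easy-to-hard ordering.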

4. SPL in Nonconvex and Structured Problems

SPL’s nonconvexity often leads to empirically favorable optimization properties. Starting with easy samples places the optimizer in a basin that models core high-confidence patterns before hard/noisy data can divert it to poor minima (Wu et al., 2023). For structured problems (e.g., multi-view clustering, SVM with neighborhood constraints), specialized SPL variants introduce spatial, group, or curriculum constraints into the regularizer or weighting process (Chen et al., 2019, Huang et al., 2021).

SPL also admits ODE-based path-following solutions (e.g., GAGA), which provide the entire spectrum of model solutions with respect to the age parameter, allowing more efficient model selection over $\lambda$ than brute-force grid search (Qu et al., 2022).

5. Connections to Robust Learning and Curriculum/Partial Order Priors

SPL unifies multiple robust and curriculum learning strategies:

Robustness: SPL’s penalty saturation parallels the capped, MCP, SCAD, log, and exponential loss forms in robust M-estimation, directly producing downweighting schemes that counteract outliers (Meng et al., 2015, Fan et al., 2016).
Curriculum and Prior Knowledge: Any known sample-order, smoothness, or group-prior can be encoded via additional constraints or regularization over the sample weights $v$, e.g., group-partial-order priors for weak-label scenarios, or spatial regularity for image data (Meng et al., 2015, Liu et al., 2018, Chen et al., 2019).

This flexibility enables SPL to serve as a generic substrate for integrating domain knowledge or problem structure into the training process.

6. Empirical Evidence and Applications

Extensive experiments across diverse domains support SPL’s empirical effectiveness:

Substantial accuracy improvements in PolSAR scene classification, facial age and pose estimation, cross-modal retrieval, speech/audio classification, and clustering on real and synthetic datasets (Chen et al., 2019, Pan et al., 2021, Pu et al., 2025, Ristea et al., 2021, Zhang et al., 2018).
Strong robustness to outlier/noisy samples, especially in high-corruption or weakly-annotated settings (Meng et al., 2015, Wang et al., 2017, Fan et al., 2016).
Mitigation of underrepresentation bias and improved fairness metrics in regression (Pan et al., 2021, Pan et al., 2020).
Efficient convergence even in nonconvex deep models due to the progressive easy-to-hard curriculum (Wu et al., 2023, Zhou et al., 2017, Thangarasa et al., 2018).

SPL continues to be actively extended, and the authors of fairness-aware SPL for deep regression forests provide an open-source implementation (Pan et al., 2021).

7. Practical Guidelines and Design Considerations

Regularizer design: Select the SPL regularizer (hard, linear, polynomial, mixture, or log) based on the required weighting smoothness and application.
Pace scheduling: Common strategies include starting $\lambda$ such that 50% of data are included, then increasing by 10% per stage, or adaptively setting class-wise thresholds for balanced curricula.
Integration with structured priors: Curriculum constraints, groupings, or fairness/entropy terms can be seamlessly incorporated through $f_\lambda(v)$ or auxiliary terms (Liu et al., 2018).
Optimization: Alternating minimization (or MM) remains the foundational approach, with each subproblem typically convex, though nonconvexities may arise in deep architectures.
Scalability: Distributed SPL (ADMM-based), ODE path-tracking, and block-wise algorithms enable efficient deployment on large-scale data (Zhang et al., 2018, Qu et al., 2022).
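The pace-scheduling guideline can be sketched with a quantile rule. The example assumes that "increasing by 10% per stage" means admitting ten more percentage points of the data at each stage, which is one possible interpretation; the helper names are hypothetical.

```python
import numpy as np

def pace_from_fraction(losses, fraction):
    """Return a threshold lam such that roughly `fraction` of the losses fall at or below it."""
    return float(np.quantile(np.asarray(losses, dtype=float), fraction))

def fraction_schedule(start=0.5, step=0.1, stop=1.0):
    """Fractions of data admitted per stage: start, start + step, ..., ending at stop."""
    fractions = []
    frac = start
    while frac < stop - 1e-9:
        fractions.append(round(frac, 10))
        frac += step
    fractions.append(stop)
    return fractions

losses = np.arange(1, 11, dtype=float)        # ten samples with losses 1, 2, ..., 10
for frac in fraction_schedule():
    lam = pace_from_fraction(losses, frac)
    included = int(np.sum(losses <= lam))     # hard selection at this pace
    print(frac, lam, included)
```

In practice the losses would be recomputed from the current model before each stage, so the threshold tracks the model as it improves.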

SPL’s theoretical guarantee of convergence to critical points of its latent robust objective, its algorithmic modularity, and its curriculum-encoded regularization position it as a central methodology for robust and interpretable machine learning across domains (Meng et al., 2015, Liu et al., 2018, Ma et al., 2017, Pan et al., 2021, Pu et al., 2025).