Self-Paced Learning with Diversity (SPLD)

Updated 9 November 2025
  • SPLD is a curriculum learning strategy that selects training samples based on both low loss (easiness) and diverse representation across semantic clusters or labels.
  • It augments traditional self-paced learning with a diversity-inducing regularizer, preventing overfitting and promoting balanced data coverage.
  • SPLD has demonstrated faster convergence and improved accuracy across domains such as image classification, cross-modal ranking, multi-label learning, and regression.

Self-Paced Learning with Diversity (SPLD) defines a class of curriculum-based optimization methods in which the model is exposed to training examples that are both “easy” (according to the current model’s loss) and “diverse” (spread across semantic groups, clusters, queries, or label classes). This approach enhances traditional Self-Paced Learning (SPL) by addressing its tendency to over-focus on a homogeneous set of easy examples. SPLD implementations are found across deep classification, cross-modal ranking, multi-label problems, and deep regression forests, demonstrating improvements in generalization, convergence speed, and robustness, especially in settings characterized by class imbalance or missing labels.

1. Theoretical Foundations of SPLD

SPLD extends classic Self-Paced Learning, which introduces samples into training in order of increasing difficulty based on model loss, but typically neglects diversity. In SPL, the key optimization is:

$$\min_{\theta,\,w}\ \sum_{i=1}^N w_i\,\ell(x_i, y_i;\theta) \;-\; \lambda \sum_{i=1}^N w_i, \quad \text{subject to } w_i\in\{0,1\}$$

where ℓ is the per-sample loss and λ (the “pace parameter”) increases during training.

SPLD augments this objective with a diversity-inducing regularizer constructed over partitions or groupings (clusters, queries, or label blocks):

$$E(\theta, W;\lambda, \gamma) = \sum_{i=1}^N W_i\,\ell(y_i, f(x_i;\theta)) \;-\; \lambda \sum_{i=1}^N W_i \;-\; \gamma\,\|W\|_{2,1}$$

Here, $\|W\|_{2,1}$ denotes the sum of group-wise (cluster or label block) $\ell_2$-norms, encouraging spread across clusters. Other domain-specific SPLD variants explore sample-level uncertainty bonuses (entropy) or block/group penalties.

The diversity term prevents overfitting to a narrow region of the data distribution and encourages broad coverage of the domain during curriculum progression, thereby mitigating collapse onto majority classes.
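For concreteness, the following sketch implements the closed-form weight update that this objective admits when weights are binary: within each group, samples are ranked by current loss, and the sample at rank r is admitted if its loss falls below λ + γ/(√r + √(r−1)), the rank-dependent threshold induced by the negative group-wise $\ell_2$ penalty (the same rule reappears in the SPL-ADVisE formulation below). The function name and array-based interface are illustrative, not taken from any particular paper's code.

```python
import numpy as np

def spld_select(losses, groups, lam, gamma):
    """Hard-weight SPLD selection (illustrative sketch).

    losses : per-sample losses under the current model
    groups : group ids (clusters, queries, or label blocks) per sample
    lam    : pace parameter (easiness threshold)
    gamma  : diversity weight
    Returns a binary inclusion vector w of the same length as `losses`.
    """
    losses = np.asarray(losses, dtype=float)
    groups = np.asarray(groups)
    w = np.zeros(len(losses), dtype=int)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        order = idx[np.argsort(losses[idx])]          # rank within the group by loss
        for r, i in enumerate(order, start=1):
            # Rank-dependent threshold induced by the -gamma * ||W||_{2,1} term:
            # each additional pick from the same group faces a stricter bar,
            # so selection spreads across groups instead of piling into one.
            if losses[i] < lam + gamma / (np.sqrt(r) + np.sqrt(r - 1)):
                w[i] = 1
            else:
                break                                  # later ranks only get harder
    return w
```

With γ = 0 the rule reduces to plain SPL thresholding; raising γ widens early coverage across groups, which is why γ is typically annealed toward zero as λ grows.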

2. SPLD Methodologies in Core Domains

SPLD frameworks are instantiated in multiple learning contexts:

Image Classification (SPL-ADVisE)

In “Self-Paced Learning with Adaptive Deep Visual Embeddings” (Thangarasa et al., 2018), SPLD is realized by combining self-paced loss-based easiness with diversity imposed via clustering in a neural embedding (learned by Magnet Loss). The student model selects mini-batches of currently “easy” samples that are distributed across clusters in the deep representation space. The SPLD instance-level weights W are defined as:

$$W_i^k = \begin{cases} 1 & \ell_{ce}\big(y_i^k, f(x_i^k;\theta)\big) < \lambda + \gamma\left[\dfrac{1}{\sqrt{r}+\sqrt{r-1}}\right] \\ 0 & \text{otherwise} \end{cases}$$

where r is the sample’s rank by loss within cluster k.
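A minimal sketch of how such cluster-diverse mini-batches could be assembled is given below. The helper name, the uniform per-cluster quota, and the assumption that cluster assignments come from periodic clustering of the learned embedding are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def diverse_minibatch(selected_idx, cluster_ids, batch_size, rng=np.random):
    """SPL-ADVisE-style batch assembly (sketch; details assumed).

    selected_idx : indices currently admitted by the SPLD weight rule
    cluster_ids  : per-sample cluster assignment in the deep embedding space
    Draws a roughly equal number of admitted samples from each cluster so a
    mini-batch is both easy (low loss) and diverse (spread over clusters).
    """
    selected_idx = np.asarray(selected_idx)
    cluster_ids = np.asarray(cluster_ids)
    clusters = np.unique(cluster_ids[selected_idx])
    per_cluster = max(1, batch_size // max(1, len(clusters)))
    batch = []
    for c in clusters:
        pool = selected_idx[cluster_ids[selected_idx] == c]
        take = min(per_cluster, len(pool))
        batch.extend(rng.choice(pool, size=take, replace=False))
    return np.asarray(batch[:batch_size])   # may be shorter if cluster pools are small
```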

Cross-Modal Learning to Rank

In “Simple to Complex Cross-modal Learning to Rank” (Luo et al., 2017), SPLD operates on pairwise ranking quadruplets $(x_k, z_k, z_j, y_{kj})$ with block-wise weights $v_{kj}$ across queries. The objective introduces a group-norm penalty on the per-query block $v^k$:

$$\phi(v;\lambda, \gamma) = -\lambda \sum_{k,\, j\neq k} v_{kj} \;-\; \gamma \sum_k \sqrt{\sum_{j\neq k} v_{kj}}$$

Optimization proceeds by alternating gradient steps for the embedding parameters W and threshold-based updates for v, ensuring that the rankings included in training are both low-loss (easy) and dispersed across queries (diverse).
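The block structure of the penalty is easy to make explicit in code; the sketch below evaluates φ for a matrix of per-query weights. The matrix layout and function name are assumptions for illustration.

```python
import numpy as np

def ranking_pace_penalty(V, lam, gamma):
    """phi(v; lambda, gamma) for query-wise block weights (illustrative layout).

    V : (K, K) array with V[k, j] = v_kj for query k and candidate j != k;
        the diagonal is ignored.
    """
    V = np.asarray(V, dtype=float).copy()
    np.fill_diagonal(V, 0.0)
    pace_term = -lam * V.sum()                              # -lambda * sum_{k,j} v_kj
    diversity_term = -gamma * np.sqrt(V.sum(axis=1)).sum()  # -gamma * sum_k sqrt(sum_j v_kj)
    return pace_term + diversity_term
```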

Multi-Label Learning

In “Self-Paced Multi-Label Learning with Diversity” (Seyedi et al., 2019), SPLD is applied to large-scale multi-label scenarios:

$$\min_{W,\, P \in [0,1]^{l \times n}} \sum_{i=1}^l \sum_{j=1}^n p_{ij}\,\ell(y_{ij}, w_i^\top x_j) + \Phi(W, C) - \lambda \sum_{i,j} p_{ij} + \gamma \sum_{i=1}^l \|p^{(i)}\|_2$$

where $p_{ij}$ are soft importance weights for label-instance pairs, and γ penalizes concentrated selection within label blocks.
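As a small illustration, the self-paced and diversity terms of this objective can be evaluated directly from a matrix of per-entry losses and soft weights; the sketch below omits the model regularizer Φ(W, C) and uses an illustrative function name.

```python
import numpy as np

def spml_selfpaced_terms(losses, P, lam, gamma):
    """Self-paced and diversity terms of the multi-label objective (sketch).

    losses : (l, n) array of per-entry losses ell(y_ij, w_i^T x_j)
    P      : (l, n) array of soft inclusion weights p_ij in [0, 1]
    The model regularizer Phi(W, C) is omitted here.
    """
    losses = np.asarray(losses, dtype=float)
    P = np.asarray(P, dtype=float)
    weighted_loss = np.sum(P * losses)                        # sum_ij p_ij * loss_ij
    pace_term = -lam * P.sum()                                # -lambda * sum_ij p_ij
    diversity_term = gamma * np.linalg.norm(P, axis=1).sum()  # gamma * sum_i ||p^(i)||_2
    return weighted_loss + pace_term + diversity_term
```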

Deep Regression Forests

In “Self-Paced Deep Regression Forests with Consideration on Underrepresented Examples” (Pan et al., 2020), SPLD is generalized to continuous-output regression by directly rewarding high predictive entropy in the selection mechanism:

$$\max_{\Theta,\, \Pi,\, v \in [0,1]^n} \sum_i v_i \big[\log p_F(y_i \mid x_i;\Theta,\Pi) + \gamma H_i \big] + \zeta \sum_i \log\!\big(v_i + \zeta/\lambda\big)$$

High-entropy samples, often from underrepresented regions, are explicitly favored during early curriculum stages.
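Because the v-subproblem is concave in each v_i, its solution follows directly from the objective as stated: v_i = 1 for sufficiently large log p_F(y_i|x_i) + γH_i, v_i = 0 once that quantity drops to −λ or below, and a soft value in between. The sketch below implements this clipped stationary point; it is derived from the formula above rather than quoted from the paper, so the published update may differ in detail.

```python
import numpy as np

def spudrf_soft_weights(log_lik, entropy, lam, gamma, zeta):
    """Per-sample v update implied by the objective above (derived, not quoted).

    log_lik : log p_F(y_i | x_i; Theta, Pi) per sample
    entropy : predictive entropy H_i per sample (uncertainty bonus)
    Maximizing  v * L + zeta * log(v + zeta / lam)  over v in [0, 1], with
    L = log_lik + gamma * entropy, is concave in v; the maximizer is the
    stationary point -zeta / L - zeta / lam clipped to [0, 1] when L < 0,
    and v = 1 when L >= 0.
    """
    L = np.asarray(log_lik, dtype=float) + gamma * np.asarray(entropy, dtype=float)
    safe_L = np.where(L < 0.0, L, -1.0)                 # dummy divisor where L >= 0
    v = np.where(L < 0.0, -zeta / safe_L - zeta / lam, 1.0)
    return np.clip(v, 0.0, 1.0)                         # v reaches 0 once L <= -lam
```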

3. Optimization Strategies and Algorithmic Structure

All SPLD methods follow a block-coordinate (alternating) optimization strategy, typically cycling between:

  • Model parameter update: Fix inclusion weights (W, v, or P) and minimize the training objective with weighted loss (standard SGD/backprop for deep models, closed-form for linear cases).
  • Weight update: For fixed model parameters, solve the subproblem in W/v/P, often decomposed into sorting sample losses within groups and thresholding by combined criteria on loss, diversity, and sometimes additional terms (e.g., entropy, fairness).

Closed-form or easily computable updates for W/v/P are available because the respective subproblems are convex or quasi-convex, even though the global objectives are non-convex owing to deep parametrizations or group-norm penalties.

Curriculum pace (λ) is increased and diversity weight (γ) is decreased according to pre-set schedules (e.g., geometric multipliers per epoch), ensuring the curriculum gradually includes harder and more diverse examples.
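The overall structure can be summarized in a short training-loop skeleton. It reuses the spld_select rule sketched in Section 1; the model hooks (per_sample_losses, fit_weighted), group_fn, and the geometric multipliers are hypothetical placeholders rather than any particular paper's API.

```python
def train_spld(model, data, group_fn, lam, gamma, epochs,
               mu_lam=1.1, mu_gamma=0.9):
    """Generic SPLD alternation (skeleton; all model hooks are placeholders).

    group_fn maps (data, model) to per-sample group ids, e.g. embedding
    clusters, query ids, or label blocks, and may recluster periodically.
    """
    for _ in range(epochs):
        losses = model.per_sample_losses(data)        # placeholder hook
        groups = group_fn(data, model)                # static or re-learned grouping
        w = spld_select(losses, groups, lam, gamma)   # weight update (Section 1 sketch)
        model.fit_weighted(data, w)                   # parameter update: weighted SGD / closed form
        lam *= mu_lam                                 # admit harder samples...
        gamma *= mu_gamma                             # ...and relax the diversity push
    return model
```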

4. Empirical Evaluation and Quantitative Impact

Extensive comparisons to SPL, random sampling, and domain-specific state-of-the-art baselines substantiate SPLD’s value:

  • Image classification (Thangarasa et al., 2018):
    • FashionMNIST: +0.95–1.35% test accuracy, faster convergence.
    • SVHN: ~20% fewer mini-batch steps to peak ~97.7% accuracy.
    • CIFAR-10: +1.17% over SPLD, +2.22% over random sampling.
    • CIFAR-100: +3.81% over SPLD, +4.75% over random.
  • Cross-modal ranking (Luo et al., 2017):
    • Pascal’07 image→text: mAP@5≈83.6% vs. next best 79.7%.
    • mAP gains of 3–5% on NUS-WIDE and Wiki.
  • Multi-label learning (Seyedi et al., 2019):
    • Outperforms LEML, ML-LRC, GLOCAL in 91.7% of 72 benchmark combinations.
    • Example (Business, 30% observed): Ranking Loss .044 (SPLD) vs .054, .061, .063 (baselines).
  • Deep regression forests (Pan et al., 2020):
    • Morph II (age): MAE 2.17 (DRFs) → 1.91 (SPUDRFs).
    • BIWI head pose: MAE 1.44 (DRFs) → 1.18 (SPUDRFs), plain self-paced (SP-DRFs) deteriorates to 2.08 without diversity bonus.

Ablation studies uniformly indicate that disabling the diversity component (γ = 0) significantly reduces performance improvements, slows convergence, and results in models converging to biased or over-concentrated solutions.

5. Practical Implementation and Hyperparameter Sensitivity

SPLD approaches require several hyperparameters:

  • Pace parameter (λ): dictates the maximal loss (or minimal likelihood, in regression) of included samples at each stage. Typically initialized low and increased per epoch.
  • Diversity weight (γ): determines the strength of the diversity regularizer. Initialized relatively high to maximize early exposure to different modes/labels, then decreased.
  • Learning rates, batch sizes, and, for embedding-based SPLD, clustering specifics (number of clusters K) and the embedding type (e.g., Magnet Loss for deep visual SPLD).

Grid searches or validation sweeps are reported for λ and γ, with a moderate initial λ and γ in [1, 10] yielding the best results; both parameters significantly affect sample coverage and final generalization.

Per-iteration complexity typically scales with the number of active (selected) samples; since early epochs include only the easiest, most diverse instances, SPLD remains computationally efficient for moderately large datasets.

6. Extensions, Limitations, and Generalization

SPLD frameworks are adaptable to diverse modeling setups wherever an “easiness” score and a sample grouping structure (clusters, labels, queries, predictive uncertainty) can be defined:

  • Direct application to continuous labels (regression) is feasible when predictive entropy can quantify diversity or rarity.
  • For multi-label or multi-task settings, per-block or per-label $\ell_2$-norm penalties ensure diversity at the output level.
  • Group assignment can be static (feature clustering, label categories) or dynamic (embedding learning with periodic reclustering).

Limitations include empirical tuning of hyperparameters (pace/diversity schedules, clustering counts/intervals), potential sensitivity to the initial grouping structure (for static clusters), and the lack of formalized bias or fairness metrics in some existing implementations (notably in (Pan et al., 2020)). No global optimality or convergence-rate guarantees beyond local stationarity are stated for deep non-convex instantiations; block-coordinate convergence follows standard schemes for bi-convex or difference-of-convex objectives.

A plausible implication is SPLD’s flexibility in accommodating alternative groupings (semantic, metric, or uncertainty-based) and selection mechanisms, suggesting applicability to a range of domains facing imbalanced data, missing labels, or the need for robustness in curriculum learning.
