Self-Paced Learning with Diversity (SPLD)

Updated 9 November 2025
  • SPLD is a curriculum learning strategy that selects training samples based on both low loss (easiness) and diverse representation across semantic clusters or labels.
  • It augments traditional self-paced learning with a diversity-inducing regularizer, preventing overfitting and promoting balanced data coverage.
  • SPLD has demonstrated faster convergence and improved accuracy across domains such as image classification, cross-modal ranking, multi-label learning, and regression.

Self-Paced Learning with Diversity (SPLD) defines a class of curriculum-based optimization methods in which the model is exposed to training examples that are both “easy” (according to the current model’s loss) and “diverse” (spread across semantic groups, clusters, queries, or label classes). This approach enhances traditional Self-Paced Learning (SPL) by addressing its tendency to over-focus on a homogeneous set of easy examples. SPLD implementations are found across deep classification, cross-modal ranking, multi-label problems, and deep regression forests, demonstrating improvements in generalization, convergence speed, and robustness, especially in settings characterized by class imbalance or missing labels.

1. Theoretical Foundations of SPLD

SPLD extends classic Self-Paced Learning, which introduces samples into training in order of increasing difficulty based on model loss, but typically neglects diversity. In SPL, the key optimization is:

$$\min_{\theta,\,w}\ \sum_{i=1}^N w_i\,\ell(x_i, y_i;\theta) \;-\; \lambda \sum_{i=1}^N w_i, \quad \text{subject to } w_i\in\{0,1\}$$

where ℓ is the per-sample loss and λ (the “pace parameter”) increases during training.

SPLD augments this objective with a diversity-inducing regularizer constructed over partitions or groupings (clusters, queries, or label blocks):

$$E(\theta, W;\lambda, \gamma) = \sum_{i=1}^N W_i\,\ell(y_i, f(x_i;\theta)) \;-\; \lambda \sum_{i=1}^N W_i \;-\; \gamma\,\|W\|_{2,1}$$

Here, $\|W\|_{2,1}$ denotes the sum of group-wise (cluster or label block) $\ell_2$-norms, encouraging spread across clusters. Other domain-specific SPLD variants explore sample-level uncertainty bonuses (entropy) or block/group penalties.

The diversity term prevents overfitting to a narrow region of the data distribution and encourages broad coverage of the domain during curriculum progression, thereby mitigating collapse onto majority classes.
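For concreteness, the following sketch implements the closed-form weight update that this objective admits when weights are binary: within each group, samples are ranked by current loss, and the sample at rank r is admitted if its loss falls below λ + γ/(√r + √(r−1)), the rank-dependent threshold induced by the negative group-wise $\ell_2$ penalty (the same rule reappears in the SPL-ADVisE formulation below). The function name and array-based interface are illustrative, not taken from any particular paper's code.

```python
import numpy as np

def spld_select(losses, groups, lam, gamma):
    """Hard-weight SPLD selection (illustrative sketch).

    losses : per-sample losses under the current model
    groups : group ids (clusters, queries, or label blocks) per sample
    lam    : pace parameter (easiness threshold)
    gamma  : diversity weight
    Returns a binary inclusion vector w of the same length as `losses`.
    """
    losses = np.asarray(losses, dtype=float)
    groups = np.asarray(groups)
    w = np.zeros(len(losses), dtype=int)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        order = idx[np.argsort(losses[idx])]          # rank within the group by loss
        for r, i in enumerate(order, start=1):
            # Rank-dependent threshold induced by the -gamma * ||W||_{2,1} term:
            # each additional pick from the same group faces a stricter bar,
            # so selection spreads across groups instead of piling into one.
            if losses[i] < lam + gamma / (np.sqrt(r) + np.sqrt(r - 1)):
                w[i] = 1
            else:
                break                                  # later ranks only get harder
    return w
```

With γ = 0 the rule reduces to plain SPL thresholding; raising γ widens early coverage across groups, which is why γ is typically annealed toward zero as λ grows.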

2. SPLD Methodologies in Core Domains

SPLD frameworks are instantiated in multiple learning contexts:

Image Classification (SPL-ADVisE)

In “Self-Paced Learning with Adaptive Deep Visual Embeddings” (Thangarasa et al., 2018), SPLD is realized by combining self-paced loss-based easiness with diversity imposed via clustering in a neural embedding (learned by Magnet Loss). The student model selects mini-batches of currently “easy” samples that are distributed across clusters in the deep representation space. The SPLD instance-level weights W are defined as:

$$W_i^k = \begin{cases} 1 & \ell_{ce}\big(y_i^k, f(x_i^k;\theta)\big) < \lambda + \gamma\left[\dfrac{1}{\sqrt{r}+\sqrt{r-1}}\right] \\ 0 & \text{otherwise} \end{cases}$$

where r is the sample’s rank by loss within cluster k.
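A minimal sketch of how such cluster-diverse mini-batches could be assembled is given below. The helper name, the uniform per-cluster quota, and the assumption that cluster assignments come from periodic clustering of the learned embedding are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def diverse_minibatch(selected_idx, cluster_ids, batch_size, rng=np.random):
    """SPL-ADVisE-style batch assembly (sketch; details assumed).

    selected_idx : indices currently admitted by the SPLD weight rule
    cluster_ids  : per-sample cluster assignment in the deep embedding space
    Draws a roughly equal number of admitted samples from each cluster so a
    mini-batch is both easy (low loss) and diverse (spread over clusters).
    """
    selected_idx = np.asarray(selected_idx)
    cluster_ids = np.asarray(cluster_ids)
    clusters = np.unique(cluster_ids[selected_idx])
    per_cluster = max(1, batch_size // max(1, len(clusters)))
    batch = []
    for c in clusters:
        pool = selected_idx[cluster_ids[selected_idx] == c]
        take = min(per_cluster, len(pool))
        batch.extend(rng.choice(pool, size=take, replace=False))
    return np.asarray(batch[:batch_size])   # may be shorter if cluster pools are small
```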

Cross-Modal Learning to Rank

In “Simple to Complex Cross-modal Learning to Rank” (Luo et al., 2017), SPLD operates on pairwise ranking quadruplets $(x_k, z_k, z_j, y_{kj})$ with block-wise weights $v_{kj}$ across queries. The objective introduces a group-norm penalty on the per-query block $v^k$:

$$\phi(v;\lambda, \gamma) = -\lambda \sum_{k,\, j\neq k} v_{kj} \;-\; \gamma \sum_k \sqrt{\sum_{j\neq k} v_{kj}}$$

Optimization proceeds by alternating gradient steps for the embedding parameters W and threshold-based updates for v, ensuring that the rankings included in training are both low-loss (easy) and dispersed across queries (diverse).
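The block structure of the penalty is easy to make explicit in code; the sketch below evaluates φ for a matrix of per-query weights. The matrix layout and function name are assumptions for illustration.

```python
import numpy as np

def ranking_pace_penalty(V, lam, gamma):
    """phi(v; lambda, gamma) for query-wise block weights (illustrative layout).

    V : (K, K) array with V[k, j] = v_kj for query k and candidate j != k;
        the diagonal is ignored.
    """
    V = np.asarray(V, dtype=float).copy()
    np.fill_diagonal(V, 0.0)
    pace_term = -lam * V.sum()                              # -lambda * sum_{k,j} v_kj
    diversity_term = -gamma * np.sqrt(V.sum(axis=1)).sum()  # -gamma * sum_k sqrt(sum_j v_kj)
    return pace_term + diversity_term
```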

Multi-Label Learning

In “Self-Paced Multi-Label Learning with Diversity” (Seyedi et al., 2019), SPLD is applied to large-scale multi-label scenarios:

$$\min_{W,\, P \in [0,1]^{l \times n}} \sum_{i=1}^l \sum_{j=1}^n p_{ij}\,\ell(y_{ij}, w_i^\top x_j) + \Phi(W, C) - \lambda \sum_{i,j} p_{ij} + \gamma \sum_{i=1}^l \|p^{(i)}\|_2$$

where $p_{ij}$ are soft importance weights for label-instance pairs, and γ penalizes concentrated selection within label blocks.
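As a small illustration, the self-paced and diversity terms of this objective can be evaluated directly from a matrix of per-entry losses and soft weights; the sketch below omits the model regularizer Φ(W, C) and uses an illustrative function name.

```python
import numpy as np

def spml_selfpaced_terms(losses, P, lam, gamma):
    """Self-paced and diversity terms of the multi-label objective (sketch).

    losses : (l, n) array of per-entry losses ell(y_ij, w_i^T x_j)
    P      : (l, n) array of soft inclusion weights p_ij in [0, 1]
    The model regularizer Phi(W, C) is omitted here.
    """
    losses = np.asarray(losses, dtype=float)
    P = np.asarray(P, dtype=float)
    weighted_loss = np.sum(P * losses)                        # sum_ij p_ij * loss_ij
    pace_term = -lam * P.sum()                                # -lambda * sum_ij p_ij
    diversity_term = gamma * np.linalg.norm(P, axis=1).sum()  # gamma * sum_i ||p^(i)||_2
    return weighted_loss + pace_term + diversity_term
```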

Deep Regression Forests

In “Self-Paced Deep Regression Forests with Consideration on Underrepresented Examples” (Pan et al., 2020), SPLD is generalized to continuous-output regression by directly rewarding high predictive entropy in the selection mechanism:

$$\max_{\Theta,\, \Pi,\, v \in [0,1]^n} \sum_i v_i \big[\log p_F(y_i \mid x_i;\Theta,\Pi) + \gamma H_i \big] + \zeta \sum_i \log\!\big(v_i + \zeta/\lambda\big)$$

High-entropy samples, often from underrepresented regions, are explicitly favored during early curriculum stages.
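Because the v-subproblem is concave in each v_i, its solution follows directly from the objective as stated: v_i = 1 for sufficiently large log p_F(y_i|x_i) + γH_i, v_i = 0 once that quantity drops to −λ or below, and a soft value in between. The sketch below implements this clipped stationary point; it is derived from the formula above rather than quoted from the paper, so the published update may differ in detail.

```python
import numpy as np

def spudrf_soft_weights(log_lik, entropy, lam, gamma, zeta):
    """Per-sample v update implied by the objective above (derived, not quoted).

    log_lik : log p_F(y_i | x_i; Theta, Pi) per sample
    entropy : predictive entropy H_i per sample (uncertainty bonus)
    Maximizing  v * L + zeta * log(v + zeta / lam)  over v in [0, 1], with
    L = log_lik + gamma * entropy, is concave in v; the maximizer is the
    stationary point -zeta / L - zeta / lam clipped to [0, 1] when L < 0,
    and v = 1 when L >= 0.
    """
    L = np.asarray(log_lik, dtype=float) + gamma * np.asarray(entropy, dtype=float)
    safe_L = np.where(L < 0.0, L, -1.0)                 # dummy divisor where L >= 0
    v = np.where(L < 0.0, -zeta / safe_L - zeta / lam, 1.0)
    return np.clip(v, 0.0, 1.0)                         # v reaches 0 once L <= -lam
```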

3. Optimization Strategies and Algorithmic Structure

All SPLD methods follow a block-coordinate (alternating) optimization strategy, typically cycling between:

  • Model parameter update: Fix inclusion weights (W, v, or P) and minimize the training objective with weighted loss (standard SGD/backprop for deep models, closed-form for linear cases).
  • Weight update: For fixed model parameters, solve the subproblem in W/v/P, often decomposed into sorting sample losses within groups and thresholding by combined criteria on loss, diversity, and sometimes additional terms (e.g., entropy, fairness).

Closed-form or easily computable updates for W/v/P are available because the respective subproblems are convex or quasi-convex, even though the global objectives are non-convex owing to deep parametrizations or group-norm penalties.

Curriculum pace (λ) is increased and diversity weight (γ) is decreased according to pre-set schedules (e.g., geometric multipliers per epoch), ensuring the curriculum gradually includes harder and more diverse examples.
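The overall structure can be summarized in a short training-loop skeleton. It reuses the spld_select rule sketched in Section 1; the model hooks (per_sample_losses, fit_weighted), group_fn, and the geometric multipliers are hypothetical placeholders rather than any particular paper's API.

```python
def train_spld(model, data, group_fn, lam, gamma, epochs,
               mu_lam=1.1, mu_gamma=0.9):
    """Generic SPLD alternation (skeleton; all model hooks are placeholders).

    group_fn maps (data, model) to per-sample group ids, e.g. embedding
    clusters, query ids, or label blocks, and may recluster periodically.
    """
    for _ in range(epochs):
        losses = model.per_sample_losses(data)        # placeholder hook
        groups = group_fn(data, model)                # static or re-learned grouping
        w = spld_select(losses, groups, lam, gamma)   # weight update (Section 1 sketch)
        model.fit_weighted(data, w)                   # parameter update: weighted SGD / closed form
        lam *= mu_lam                                 # admit harder samples...
        gamma *= mu_gamma                             # ...and relax the diversity push
    return model
```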

4. Empirical Evaluation and Quantitative Impact

Extensive comparisons to SPL, random sampling, and domain-specific state-of-the-art baselines substantiate SPLD’s value:

  • Image classification (Thangarasa et al., 2018):
    • FashionMNIST: +0.95–1.35% test accuracy, faster convergence.
    • SVHN: ~20% fewer mini-batch steps to peak ~97.7% accuracy.
    • CIFAR-10: +1.17% over SPLD, +2.22% over random sampling.
    • CIFAR-100: +3.81% over SPLD, +4.75% over random.
  • Cross-modal ranking (Luo et al., 2017):
    • Pascal’07 image→text: mAP@5≈83.6% vs. next best 79.7%.
    • mAP gains of 3–5% on NUS-WIDE and Wiki.
  • Multi-label learning (Seyedi et al., 2019):
    • Outperforms LEML, ML-LRC, GLOCAL in 91.7% of 72 benchmark combinations.
    • Example (Business, 30% observed): Ranking Loss .044 (SPLD) vs .054, .061, .063 (baselines).
  • Deep regression forests (Pan et al., 2020):
    • Morph II (age): MAE 2.17 (DRFs) → 1.91 (SPUDRFs).
    • BIWI head pose: MAE 1.44 (DRFs) → 1.18 (SPUDRFs), plain self-paced (SP-DRFs) deteriorates to 2.08 without diversity bonus.

Ablation studies uniformly indicate that disabling the diversity component (γ = 0) significantly reduces performance improvements, slows convergence, and results in models converging to biased or over-concentrated solutions.

5. Practical Implementation and Hyperparameter Sensitivity

SPLD approaches require several hyperparameters:

  • Pace parameter (λ): dictates the maximal loss (or minimal likelihood, in regression) of included samples at each stage. Typically initialized low and increased per epoch.
  • Diversity weight (γ): determines the strength of the diversity regularizer. Initialized relatively high to maximize early exposure to different modes/labels, then decreased.
  • Learning rates, batch sizes, and, for embedding-based SPLD, clustering specifics (number of clusters K) and the embedding type (e.g., Magnet Loss for deep visual SPLD).

Grid searches or validation sweeps are reported for λ and γ, with a moderate initial λ and γ in [1, 10] yielding the best results; both parameters significantly affect sample coverage and final generalization.

Per-iteration complexity typically scales with the number of active (selected) samples; since early epochs include only the easiest, most diverse instances, SPLD remains computationally efficient for moderately large datasets.

6. Extensions, Limitations, and Generalization

SPLD frameworks are adaptable to diverse modeling setups wherever an “easiness” score and a sample grouping structure (clusters, labels, queries, predictive uncertainty) can be defined:

  • Direct application to continuous labels (regression) is feasible when predictive entropy can quantify diversity or rarity.
  • For multi-label or multi-task settings, per-block or per-label $\ell_2$-norm penalties ensure diversity at the output level.
  • Group assignment can be static (feature clustering, label categories) or dynamic (embedding learning with periodic reclustering).

Limitations include empirical tuning of hyperparameters (pace/diversity schedules, clustering counts/intervals), potential sensitivity to the initial grouping structure (for static clusters), and the lack of formalized bias or fairness metrics in some existing implementations (notably in (Pan et al., 2020)). No global optimality or convergence-rate guarantees beyond local stationarity are stated for deep non-convex instantiations; block-coordinate convergence follows standard schemes for bi-convex or difference-of-convex objectives.

A plausible implication is SPLD’s flexibility in accommodating alternative groupings (semantic, metric, or uncertainty-based) and selection mechanisms, suggesting applicability to a range of domains facing imbalanced data, missing labels, or the need for robustness in curriculum learning.
