Incomplete Label Distribution Learning
- Incomplete Label Distribution Learning is a framework that recovers and predicts label distributions from partially observed signals using joint optimization and ADMM-based methods.
- It employs methodologies such as joint modeling, low-rank/sparse decomposition, and adaptive weighting to overcome label imbalance and missingness.
- Evaluated across domains such as affective computing and bioinformatics, IncomLDL methods demonstrate improved label recovery and prediction under high rates of missing labels.
Incomplete Label Distribution Learning (IncomLDL) encompasses a suite of methodologies for learning predictive models in settings where only partial or incomplete label distributions are available for training. Unlike standard label distribution learning (LDL), which assumes full real-valued distributions over a label set for each instance, IncomLDL must estimate those distributions and train predictors when only logical, incomplete, or partially observed label signals are provided. This scenario is of substantial practical importance, as full label distribution annotation is often prohibitively expensive or infeasible in real-world tasks such as affective computing, bioinformatics, and perception analysis.
1. Formal Problem Definition and Variants
Let $X \in \mathbb{R}^{n \times q}$ denote the feature matrix and $D \in \mathbb{R}^{n \times c}$ the true label distribution matrix, where $d_{x_i}^{y_j}$ represents the degree to which label $y_j$ describes instance $x_i$, with $d_{x_i}^{y_j} \ge 0$ and $\sum_{j=1}^{c} d_{x_i}^{y_j} = 1$. In standard LDL, $D$ is fully observed for all training samples. In IncomLDL, supervision is incomplete:
- Incomplete logical labeling: Only binary (applicable/non-applicable) logical labels are available (Jia et al., 2023).
- Partially observed distributions: Only a subset of entries in $D$ is observed, typically encoded by a binary mask (Kou et al., 17 Oct 2024, Li et al., 2023).
- Hidden-label scenarios: Observed entries are re-normalized after missing labels are omitted, so the observed degrees represent only the proportions of the remaining (visible) mass (Jiang et al., 16 Nov 2025).
The overarching goal is to (a) recover plausible label distributions for all training instances and (b) learn a predictive model capable of producing label distributions for new instances, with a focus on robustly handling missingness, inherent data imbalance, and structural label dependencies.
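As a concrete illustration, the following NumPy sketch constructs the three supervision variants from a full distribution matrix; the data, threshold, and variable names are made up for illustration and are not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full label distribution matrix D: n instances x c labels, rows sum to 1.
n, c = 5, 4
D = rng.random((n, c))
D /= D.sum(axis=1, keepdims=True)

# (a) Incomplete logical labeling: only binary applicability is observed
#     (the threshold here is purely illustrative).
logical = (D > 1.0 / (2 * c)).astype(int)

# (b) Partially observed distributions: a binary mask hides entries;
#     unobserved degrees are unknown and often imputed as zero.
omega = rng.random((n, c)) < 0.7          # True = observed
partial = np.where(omega, D, 0.0)

# (c) Hidden-label scenario: visible entries are re-normalized, so each
#     observed row only encodes proportions of the remaining (visible) mass.
hidden = np.where(omega, D, 0.0)
row_mass = hidden.sum(axis=1, keepdims=True)
hidden = np.divide(hidden, row_mass, out=np.zeros_like(hidden), where=row_mass > 0)

print(partial.round(3))
print(hidden.round(3))
```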
2. Principal Methodological Approaches
2.1. Jointly Modeling Enhancement and Prediction
Early pipelines used label enhancement (LE) to reconstruct label distributions from logical labels and then applied an off-the-shelf LDL algorithm. However, this two-stage approach overlooks the joint dependency between recovery and prediction. The DLDL model (Jia et al., 2023) unifies enhancement and LDL in a single biconvex formulation in which a softmax predictor is fit to the recovered distributions subject to support and simplex constraints. DLDL leverages ADMM-based alternating updates for the predictor parameters and the recovered distribution matrix $D$. The method enforces strict support on valid labels, incorporates graph-Laplacian (manifold) regularization, and is theoretically justified via generalization bounds and a recovery guarantee.
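The sketch below conveys the flavor of such joint alternation; it is not the authors' exact algorithm. A softmax predictor is updated against the current recovered distributions, and the distributions are then refit toward the predictions while respecting the logical-label support and the simplex. Step sizes, the blending coefficient, and all names are illustrative assumptions.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def alternating_updates(X, L_logical, n_iters=50, lr=0.1):
    """Schematic biconvex-style alternation: update a softmax predictor W
    against the current recovered distributions D, then refit D toward the
    predictions while keeping support only on logically applicable labels."""
    n, d = X.shape
    c = L_logical.shape[1]
    W = np.zeros((d, c))
    # Initialize D uniformly over the applicable labels.
    D = L_logical / np.clip(L_logical.sum(axis=1, keepdims=True), 1, None)
    for _ in range(n_iters):
        # W-step: one gradient step on the cross-entropy between softmax(XW) and D.
        P = softmax(X @ W)
        W -= lr * X.T @ (P - D) / n
        # D-step: move D toward the predictions, keep support, re-normalize.
        P = softmax(X @ W)
        D = 0.5 * D + 0.5 * P
        D *= L_logical                      # strict support on valid labels
        D /= np.clip(D.sum(axis=1, keepdims=True), 1e-12, None)
    return W, D
```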
2.2. Decomposition for Imbalance and Incompleteness
The I²LDL framework (Kou et al., 17 Oct 2024) targets both data imbalance and incomplete distributions by decomposing predicted label outputs into:
- A low-rank factorized component, effective for frequent ("head") labels.
- A sparse component for rare ("tail") labels.
The resulting problem is optimized via ADMM, alternating closed-form and projection steps. The model simultaneously enforces simplex normalization, nonnegativity, and structured regularization, with Rademacher-complexity-based generalization bounds.
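A minimal sketch of the head/tail idea follows, under the assumption that the prediction is parameterized by a low-rank factor plus a sparse matrix; the function names, the soft-thresholding helper, and the post-hoc simplex rescaling are illustrative rather than the exact I²LDL updates.

```python
import numpy as np

def head_tail_predict(X, U, V, S):
    """Schematic head/tail decomposition: a low-rank factor U @ V captures
    frequent ('head') labels while a sparse matrix S captures rare ('tail')
    labels.  Rows are rescaled to lie on the probability simplex."""
    Z = X @ (U @ V + S)
    Z = np.clip(Z, 0.0, None)               # nonnegativity
    return Z / np.clip(Z.sum(axis=1, keepdims=True), 1e-12, None)

def soft_threshold(S, tau):
    """Elementwise shrinkage, the usual proximal step for a sparse component
    inside an ADMM-style alternation."""
    return np.sign(S) * np.clip(np.abs(S) - tau, 0.0, None)
```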
2.3. Weighting Scheme without Explicit Regularization
The WInLDL approach (Li et al., 2023) introduces an adaptive weighting scheme in the empirical risk minimization, obviating the need for explicit regularization. The key weighted loss assigns larger weights to small observed degrees and gradually increasing weights to missing entries as optimization proceeds. This design imposes an implicit regularization effect, controlled by the weight schedule, and enables scalable closed-form ADMM-based optimization for the simplex-constrained output.
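The following sketch shows one plausible instantiation of such a weighting scheme; the inverse-degree weights for observed entries and the linearly growing weights for missing entries are assumptions for illustration, not the exact schedule from the paper.

```python
import numpy as np

def winldl_style_weights(D_obs, omega, t, t_max, eps=1e-3):
    """Illustrative weighting: observed entries get weights inversely related
    to their degree (small degrees are weighted up), while missing entries
    receive a weight that grows with the iteration counter t."""
    w_obs = 1.0 / (D_obs + eps)              # larger weight for small degrees
    w_miss = (t / t_max) * np.ones_like(D_obs)
    return np.where(omega, w_obs, w_miss)

def weighted_risk(P, D_obs, W):
    """Weighted empirical squared-error risk on the predicted distributions P."""
    return np.sum(W * (P - D_obs) ** 2) / P.shape[0]
```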
2.4. Proportional Information Constraints for Hidden Labels
Most prior methods assume missing entries have degree zero and freeze the observed values. The HidLDL paradigm (Jiang et al., 16 Nov 2025) addresses a more realistic scenario in which missing labels force re-normalization of the observed degrees. Proportional constraints enforce that the recovered distribution aligns proportionally with the observed entries, i.e., $d_{ij} = c_i\,\tilde{d}_{ij}$ for every observed position $(i,j)$ and some per-row scalar $c_i$, where $\tilde{d}_{ij}$ denotes the re-normalized observed degree. The objective couples these constraints with graph-Laplacian smoothness and a nuclear-norm (trace-norm) penalty promoting global low rank.
The associated ADMM algorithm alternates among gradient-projection, singular-value thresholding, and closed-form proportional updates.
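Two of these building blocks are sketched below: singular-value thresholding (the standard proximal operator of the nuclear norm) and a least-squares projection onto the proportional constraint. The proportional step is an illustrative interpretation of the constraint above, not necessarily the paper's closed-form update.

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal operator of tau * nuclear norm,
    used for the low-rank update in nuclear-norm-regularized ADMM."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.clip(s - tau, 0.0, None)
    return (U * s) @ Vt

def proportional_update(D, D_tilde, omega):
    """Illustrative proportional step: for each row, find the least-squares
    scale c_i and reset observed positions to c_i * re-normalized degrees."""
    num = np.sum(omega * D * D_tilde, axis=1, keepdims=True)
    den = np.clip(np.sum(omega * D_tilde ** 2, axis=1, keepdims=True), 1e-12, None)
    c = num / den                             # per-row scalar c_i
    return np.where(omega, c * D_tilde, D)
```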
3. Optimization Strategies and Theoretical Analysis
All state-of-the-art IncomLDL models rely on convex or biconvex objectives solvable via alternating minimization or ADMM:
- Subproblems in the model parameters, the recovered distribution matrix, or auxiliary latent variables often admit closed-form or efficiently solvable steps.
- Explicit constraints guarantee valid output distributions: simplex normalization, non-negativity, and support only on applicable labels or observed indices (a standard building block is Euclidean projection onto the probability simplex; see the sketch after this list).
- Generalization guarantees are typically established via Rademacher complexity analysis, with explicit risk bounds that depend on the regularization (implicit or explicit), hypothesis class norm, and the number of training samples (Jia et al., 2023, Kou et al., 17 Oct 2024, Li et al., 2023, Jiang et al., 16 Nov 2025).
- In the HidLDL framework, proportional error bounds depend on the total “missing mass” per instance and diminish as the fraction of observed labels increases, reducing to standard LDL in the limit (Jiang et al., 16 Nov 2025).
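The simplex-projection step referenced above is the standard sort-based Euclidean projection; the sketch below is a generic implementation, not code from any of the cited papers.

```python
import numpy as np

def project_rows_to_simplex(Z):
    """Euclidean projection of each row of Z onto the probability simplex
    (sort-based algorithm); a common building block for the D-subproblems."""
    n, c = Z.shape
    U = -np.sort(-Z, axis=1)                       # sort each row descending
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, c + 1)
    cond = U - css / idx > 0
    rho = cond.sum(axis=1)                         # largest index where cond holds
    theta = css[np.arange(n), rho - 1] / rho
    return np.clip(Z - theta[:, None], 0.0, None)
```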
4. Empirical Benchmarking and Comparative Results
Extensive comparison across 10–16 real-world and synthetic LDL datasets is standard practice. Datasets include facial expression (RAF-ML, SJAFFE), scene classification, sentiment (Twitter-LDL, Flickr-LDL), beauty assessment (FBP5500, SCUT), biological data (Yeast variants), and others. Common metrics are Chebyshev, Clark, Canberra, and KL-divergence distances (all lower is better), together with Cosine and Intersection similarities (both higher is better); these are used for both distribution recovery and prediction performance.
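For reference, the sketch below computes these six standard LDL measures between predicted and ground-truth distribution matrices, following their usual definitions in the LDL literature.

```python
import numpy as np

def ldl_metrics(P, D, eps=1e-12):
    """Standard LDL evaluation measures between predicted (P) and ground-truth
    (D) distributions: four distances (lower is better) and two similarities
    (higher is better), averaged over instances."""
    P = np.clip(P, eps, None)
    D = np.clip(D, eps, None)
    cheb = np.abs(P - D).max(axis=1)
    clark = np.sqrt(np.sum((P - D) ** 2 / (P + D) ** 2, axis=1))
    canberra = np.sum(np.abs(P - D) / (P + D), axis=1)
    kl = np.sum(D * np.log(D / P), axis=1)
    cosine = np.sum(P * D, axis=1) / (np.linalg.norm(P, axis=1) * np.linalg.norm(D, axis=1))
    intersection = np.minimum(P, D).sum(axis=1)
    return {name: float(val.mean()) for name, val in
            {"Chebyshev": cheb, "Clark": clark, "Canberra": canberra,
             "KL": kl, "Cosine": cosine, "Intersection": intersection}.items()}
```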
Key findings established in the literature include:
| Method | Recovery performance (reported) | Prediction performance (reported) | Optimization / complexity | Regularization |
|---|---|---|---|---|
| DLDL (Jia et al., 2023) | Avg. rank 1.00 (Chebyshev) | Avg. rank 1.00 (Intersection) | Biconvex, ADMM | KL fitting, manifold |
| I²LDL (Kou et al., 17 Oct 2024) | Best/tied best | Best/tied best | ADMM; highest cost among compared methods | Low-rank + sparse |
| WInLDL (Li et al., 2023) | Avg. rank 1.1–1.4 (Chebyshev) | Avg. rank 1.4 (Intersection, 95% win rate) | Closed-form updates, linear in the number of instances | Implicit (weights only) |
| HidLDL (Jiang et al., 16 Nov 2025) | Canberra 0.4718–0.3803 (wins 95.8%) | Canberra 0.6890–0.6464 (wins 91.6%) | ADMM with nuclear-norm step | Proportional, global/local |
All referenced models advance recovery and prediction over legacy methods and complete-data LDL baselines when label information is missing or hidden. Statistical tests (e.g., Friedman with Bonferroni-Dunn post-hoc analysis) typically establish significance at the 0.05 level.
5. Addressing Data Imbalance, Missingness, and Hidden Labels
A central challenge in IncomLDL is robustly estimating distributional targets despite label imbalance and partial observability:
- I²LDL demonstrates that a dual low-rank/sparse decomposition is essential for modeling long-tailed label frequency distributions and handling missingness in both “head” and “tail” labels (Kou et al., 17 Oct 2024).
- HidLDL corrects the bias introduced by zero-imputation by enforcing proportionality for observed entries, yielding more realistic recovery in settings where missing labels reflect annotation omissions rather than true zeros (Jiang et al., 16 Nov 2025).
- WInLDL establishes that weighted empirical risk, with thoughtful weight scheduling for observed and missing degrees, can replace explicit regularization and offers strong scalability properties (Li et al., 2023).
- Proportional constraints, local manifold regularization, and global low-rank priors are synergistically used to enhance resilience to high missing rates, though all approaches show some degradation as the proportion of missing labels increases or the number of observed labels per instance shrinks to one.
6. Limitations, Open Problems, and Future Directions
Scalability remains a challenge for models involving matrix inversions or large-scale quadratic programs (notably I²LDL and HidLDL). WInLDL, via closed-form updates with complexity linear in the number of instances, is better suited to large-scale data but may not capture rich label structure unless the weight schedule is carefully tuned.
The missing-at-random assumption may not hold in real annotation scenarios; extending these frameworks to handle non-random missingness and to semi-supervised or active-learning settings (where some full distributions are available) is plausible future work (Kou et al., 17 Oct 2024, Jia et al., 2023).
The integration of deeper feature representations and the learning of graph or label structure end-to-end represents an important potential advance (Jia et al., 2023, Jiang et al., 16 Nov 2025). In all cases, recovery and predictive performance are limited in extremely high-missingness regimes, especially when only one label remains visible per instance—a known challenge where only global structure can regularize otherwise degenerate solutions (Jiang et al., 16 Nov 2025).
In summary, Incomplete Label Distribution Learning is a mathematically principled, empirically validated framework for tackling the realities of incomplete and imbalanced label supervision. Ongoing research focuses on enhanced robustness, scalability, and the accommodation of more nuanced real-world labeling phenomena.