
KAML: Knowledge Transfer for Asymmetric Multi-Label Data

Updated 22 December 2025
  • The paper introduces KAML, a unified framework that leverages ADM, HKE, and RLU to address label incompleteness in multi-label conversion prediction.
  • KAML improves model performance by mitigating distribution mismatch and efficiently recovering signals from sparse, asymmetric data.
  • Extensive ablation studies demonstrate significant AUC gains, particularly for rare conversion tasks, validating its flexible masking approach.

The Knowledge transfer framework for Asymmetric Multi-Label data (KAML) is a method developed for unified learning in scenarios where multi-label data are conditionally missing, sparse, or highly skewed, as is characteristic in large-scale conversion rate prediction for online advertising. The framework overcomes the limitations of conventional multi-task learning (MTL) methods, which are challenged by distribution mismatch and label incompleteness due to diverse advertiser goals and selective label submission. KAML leverages a fine-grained knowledge transfer regime by integrating Attribution-Driven Masking (ADM), Hierarchical Knowledge Extraction (HKE), and Ranking-based Label Utilization (RLU), enabling robust learning from asymmetric, incomplete multi-label data and yielding significant empirical gains over prior approaches (Jia et al., 15 Dec 2025).

1. Problem Setting and Motivation

Real-world online advertising systems track multiple user conversion events (e.g., clicks, purchases), with different advertisers often optimizing for distinct or overlapping conversion types. However, label incompleteness arises because many advertisers choose to report only a subset—often a single—conversion type per sample, due to privacy, integration costs, or business strategy. This leads to:

  • Label sparsity: Many conversion labels are missing or “not observed.”
  • Distribution mismatch: Training data encompasses all available reported actions, but deployed models must infer on subsets corresponding to individual advertiser goals.
  • Asymmetry: Not all advertisers provide the same coverage across conversion types, making per-label coverage highly heterogeneous.

Conventional MTL methods—where a multi-gate Mixture-of-Experts or similar architecture trains using the cross-entropy loss on available labels—either conservatively ignore unlabeled elements (one-hot masking) or naively treat missing labels as negatives, resulting in suboptimal learning signal and bias.

KAML targets these limitations by systematizing label inclusion/exclusion using data-driven, attribution-informed masking, and by stratifying the learning process to mitigate sample distribution shift between “original” and “inferred” labels (Jia et al., 15 Dec 2025).

2. Attribution-Driven Masking (ADM): Formalism and Rationale

The central technical component of KAML is ADM—a relaxed, data-informed masking rule replacing the conventional “base mask.”

Definition:

Given dataset

$$D = \{(x_i, o_i, \langle y_i^1, \dots, y_i^N \rangle)\}_{i=1}^{|D|}$$

where $x_i$ is the input, $o_i$ the advertiser's chosen conversion action, and $y_i^j \in \{0,1\}$ the $j$-th label,

  • Base mask: $Mask_{ij}^{base} = \mathbf{1}(o_i = j)$ (sample $i$ only contributes to task $j$ if $j$ was explicitly chosen).
  • ADM mask: $Mask_{ij}^{ADM} = \mathbf{1}(c_{e_i}^j \geq \alpha_j)$, where $c_{e_i}^j$ is the count of conversions of type $j$ for advertising task $e_i$ (within a sliding temporal window), and $\alpha_j$ is a task-specific threshold hyperparameter.

Loss function: The binary cross-entropy loss is then computed over all $(i,j)$ with $Mask_{ij}^{ADM} = 1$:

$$\mathcal{L}_{BCE} = -\frac{1}{|D|}\sum_{i=1}^{|D|}\sum_{j=1}^N Mask_{ij}^{ADM} \left[ y_i^j \log \hat{y}_i^j + (1-y_i^j)\log(1-\hat{y}_i^j)\right].$$
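This masked loss can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function name `masked_bce` and the toy batch are assumptions.

```python
import numpy as np

def masked_bce(y_true, y_pred, mask, eps=1e-7):
    """Binary cross-entropy with per-(sample, task) masking, normalized by
    |D| as in the formula above. Unmasked entries contribute zero loss."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # numerical stability for log
    per_entry = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.sum(mask * per_entry) / y_true.shape[0]

# Toy batch: 2 samples, 3 tasks; task 3 is unmasked for both samples.
y = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
p = np.array([[0.9, 0.2, 0.5], [0.1, 0.8, 0.5]], dtype=float)
m = np.array([[1, 1, 0], [1, 1, 0]], dtype=float)
loss = masked_bce(y, p, m)  # only the four masked entries contribute
```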

Intuition: ADM admits a sample for a given conversion action if its source advertiser is known to submit that action with sufficient frequency to justify trust in the observed zeros. $\alpha_j$ calibrates signal versus noise: relaxing $\alpha_j$ increases data coverage (especially for rare conversions), but also risks label noise if negative labels are, in reality, unlabeled positives.

Algorithmic sketch:

  • For each advertiser $e$ and label $j$, count $c_e^j$ over window $T$.
  • For each sample $i$ and label $j$, include the pair if $c_{e_i}^j \geq \alpha_j$.
  • Train MTL model on masked entries.
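The mask-construction steps above can be sketched as follows; this is an illustrative stand-in (advertiser ids, counts, and thresholds are invented), assuming the per-advertiser window counts have already been aggregated.

```python
import numpy as np

def adm_mask(advertisers, counts, alphas):
    """Build the ADM mask: sample i is included for task j iff its advertiser
    e_i has at least alpha_j conversions of type j in the recent window T.

    advertisers: length-|D| list of advertiser ids (e_i per sample)
    counts: dict advertiser_id -> array of per-task window counts c_e^j
    alphas: array of per-task thresholds alpha_j
    """
    return np.stack([(counts[e] >= alphas).astype(float) for e in advertisers])

# Hypothetical window counts for two advertisers over three conversion types.
counts = {"adv1": np.array([50, 3, 0]), "adv2": np.array([10, 0, 7])}
alphas = np.array([5, 2, 5])  # per-task thresholds alpha_j
mask = adm_mask(["adv1", "adv2", "adv1"], counts, alphas)
# adv1 rows: [1, 1, 0] (50>=5, 3>=2, 0<5); adv2 row: [1, 0, 1]
```

Under the base mask, each sample would contribute to exactly one task; here a single sample can supervise every task its advertiser reports frequently enough.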

Empirical findings: Augmenting an MMoE baseline with ADM (“MMoE+ADM”) produces consistent AUC improvements, notably +0.0299 for sparse/deep conversion tasks (Action D) compared to near-zero difference for dense/conventional tasks (Action C), confirming that flexible masking recovers valuable supervision especially for marginal signal domains (Jia et al., 15 Dec 2025).

3. Hierarchical Knowledge Extraction (HKE) and Label Utilization in KAML

With the ADM mask, KAML distinguishes two sample strata:

  • Original samples ($o_i = j$): directly labeled by the advertiser for task $j$; high trust.
  • Extended samples ($o_i \neq j$, $Mask_{ij}^{ADM} = 1$): indirect labels validated per ADM criteria; higher coverage but noisier.

HKE: KAML implements a dual-tower architecture within each task unit. “Original” samples are routed to one sub-tower, “extended” to another, with hierarchical fusion downstream. This stratification mitigates label noise and distribution mismatch between sample types, ensuring each sub-tower learns from a more coherent, homogeneous distribution.
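The routing logic can be illustrated with a minimal NumPy sketch. The single-layer towers and the shared linear fusion head are simplifying assumptions; the paper's architecture is deeper and its fusion rule may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def tower(x, W, b):
    """One sub-tower: a single ReLU layer standing in for a deeper MLP."""
    return np.maximum(x @ W + b, 0.0)

def hke_task_unit(x, is_original, params):
    """Route 'original' samples (o_i = j) and 'extended' samples (ADM-admitted,
    o_i != j) through separate sub-towers, then fuse with a shared head."""
    h = np.empty((x.shape[0], params["W_orig"].shape[1]))
    h[is_original] = tower(x[is_original], params["W_orig"], params["b_orig"])
    h[~is_original] = tower(x[~is_original], params["W_ext"], params["b_ext"])
    return h @ params["W_fuse"]  # hierarchical fusion downstream

params = {
    "W_orig": rng.normal(size=(4, 8)), "b_orig": np.zeros(8),
    "W_ext": rng.normal(size=(4, 8)), "b_ext": np.zeros(8),
    "W_fuse": rng.normal(size=(8, 1)),
}
x = rng.normal(size=(5, 4))
is_original = np.array([True, False, True, False, False])
scores = hke_task_unit(x, is_original, params)  # shape (5, 1)
```

The design point is that each sub-tower only ever sees one stratum, so its parameters fit a coherent label distribution rather than a mixture of trusted and noisy labels.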

Ranking-based Label Utilization (RLU): ADM further enables definition of Type A (labeled positives) and Type C (“negative but possibly unlabeled positive”) samples. KAML introduces a pairwise ranking loss

$$\mathcal{L}_{Ranking} = \frac{1}{|D|^2} \sum_{i \neq j} \sum_{k=1}^N \mathbf{1}(y_i^k > y_j^k) \log \left(1 + \exp[-(s_i^k - s_j^k)]\right),$$

where $s_i^k$ is the predicted score for the $k$-th task, to pull apart the logits of positive and possibly-positive samples. The overall loss is a convex combination of the dynamically averaged $\overline{\mathcal{L}_{BCE}}$ and $\mathcal{L}_{Ranking}$, with mixing coefficient $\gamma$.
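A direct (quadratic-time) NumPy sketch of this pairwise loss follows; it is illustrative only, and the $\gamma$-mixing shown in the final comment is a schematic of the combination described above, not the paper's exact weighting scheme.

```python
import numpy as np

def ranking_loss(scores, labels):
    """Pairwise logistic loss over all ordered pairs (i, j) with
    y_i^k > y_j^k, i.e. labeled positives ranked against the rest, per task.
    scores, labels: arrays of shape (|D|, N). Normalized by |D|^2."""
    D, N = scores.shape
    total = 0.0
    for k in range(N):
        s, y = scores[:, k], labels[:, k]
        diff = s[:, None] - s[None, :]                 # s_i^k - s_j^k
        pair = (y[:, None] > y[None, :]).astype(float)  # 1(y_i^k > y_j^k)
        total += np.sum(pair * np.log1p(np.exp(-diff)))
    return total / (D * D)
    # overall objective (schematic): (1 - gamma) * bce_mean + gamma * ranking

scores = np.array([[2.0], [0.5], [-1.0]])
labels = np.array([[1.0], [0.0], [0.0]])
loss = ranking_loss(scores, labels)  # small when positives outrank the rest
```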

4. Signal-versus-Noise Tradeoffs in Masking and Hyperparameter Selection

ADM’s strength derives from converting previously discarded, incomplete labels into training signals, enabling significant knowledge transfer especially for tail/rare conversion types. However, this statistical relaxation introduces a tradeoff:

  • Benefit: Enhanced sample efficiency, broader cross-task learning, especially for underrepresented conversion paths.
  • Risk: Label noise if negative labels for rarely reported tasks are false (missing positives), or if specific advertisers systematically exclude certain conversions from reporting.

The $\alpha_j$ hyperparameters serve as the primary regularization “knob.” Lower thresholds maximize inclusivity (signal-rich but noisy), while higher thresholds are more conservative (precision-rich but sparse). Empirically, per-task validation sweeps are used to set the optimal $\alpha_j$ for each conversion target (Jia et al., 15 Dec 2025).
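Such a per-task sweep amounts to a simple grid search over candidate thresholds. The sketch below assumes a `fit_and_eval` callable that hides the full train/validate cycle; the candidate values and mock AUCs are invented to illustrate the expected peak at an intermediate threshold.

```python
import numpy as np

def sweep_alpha(candidate_alphas, fit_and_eval):
    """Per-task validation sweep: retrain with each threshold alpha_j and
    keep the one with the best validation AUC. `fit_and_eval` stands in for
    the full train/validate cycle and returns a validation AUC."""
    best_alpha, best_auc = None, -np.inf
    for alpha in candidate_alphas:
        auc = fit_and_eval(alpha)
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha, best_auc

# Illustrative stand-in: AUC peaks at an intermediate threshold, reflecting
# the tradeoff (low alpha = inclusive but noisy, high alpha = clean but sparse).
mock_auc = {1: 0.82, 5: 0.85, 20: 0.84, 100: 0.80}
alpha, auc = sweep_alpha([1, 5, 20, 100], lambda a: mock_auc[a])
```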

5. Quantitative Impact and Ablation Studies

Extensive empirical results support the efficacy of ADM within KAML:

| Task | Baseline AUC | MMoE+ADM AUC | AUC Gain |
|---|---|---|---|
| Action A | 0.8448 | 0.8473 | +0.0025 |
| Action B | 0.8921 | 0.8930 | +0.0009 |
| Action C | 0.9139 | 0.9138 | ~0 |
| Action D | 0.8300 | 0.8599 | +0.0299 |
| Action E | 0.9061 | 0.9118 | +0.0057 |
| Overall | 0.9108 | 0.9122 | +0.0014 |

In addition, log-loss generally decreases, confirming both improved calibration and discrimination. These gains are amplified for tasks with highly incomplete labels, substantiating that ADM—and by extension, KAML—successfully transfer knowledge across asymmetric task boundaries (Jia et al., 15 Dec 2025).

6. Integration with Broader Attribution-Driven Masking Paradigms

KAML’s ADM connects to a broader literature on attribution-driven or attention-driven masking in deep learning:

  • In medical imaging, attention-driven masking strategies (e.g., SAM in MSMAE) leverage supervised attention maps to focus reconstruction on clinically relevant regions, yielding large accuracy and efficiency gains (Mao et al., 2023).
  • Recursive input masking based on input gradients achieves near-perfect masked-input accuracy by iteratively filtering irrelevant features, subject to model linearization constraints (Lee et al., 2021).
  • Differentiable masking at the input or hidden layer interfaces, as in DiffMask, combines faithfulness and computational efficiency for model explanation and interpretability (Cao et al., 2020).
  • Sparse, entropy-guided dynamic neuron masking at the architectural level supports highly localized knowledge editing in LLMs, using attribution-defined binary neuron masks for minimal, task-specific parameter updates (Liu et al., 25 Oct 2025).

A key commonality is the use of attribution signals—historical conversion frequency, gradient saliency, self-attention, or functional effect measures—to define fine-grained, data-dependent masks, thereby rationalizing inclusivity in training under label or parameter sparsity.

7. Practical Implications, Applications, and Limitations

KAML advances the state-of-the-art for incomplete and skewed multi-label data in large-scale online learning platforms, particularly in ad conversion prediction. Key practical implications are:

  • Maximized data utility: By relaxing stringent mask constraints, KAML recovers substantial signal from otherwise wasted multi-label data, especially in rare outcome regimes.
  • Robust deployment: Hierarchical separation of sample strata and ranking-based loss ensure better generalization when distributions at training and inference diverge.
  • Scalability: Hyperparameter tuning for masking thresholds ($\alpha_j$) is systematic and per-task, facilitating deployment across complex, evolving business domains.

Limitations include reliance on appropriately specified historical attribution windows ($T$), sensitivity to extreme label skew (where threshold tuning is nontrivial), and the requirement for periodic evaluation of mask trustworthiness as data collection practices evolve.

A plausible implication is that further generalization of attribution-driven masking strategies could enable effective learning in other high-sparsity ML scenarios, such as medical diagnostics with missing labels, large-scale language understanding with incomplete annotations, or continual learning under resource-limited edit constraints.
