KAML: Knowledge Transfer for Asymmetric Multi-Label Data
- The paper introduces KAML, a unified framework that combines Attribution-Driven Masking (ADM), Hierarchical Knowledge Extraction (HKE), and Ranking-based Label Utilization (RLU) to address label incompleteness in multi-label conversion prediction.
- KAML improves model performance by mitigating distribution mismatch and efficiently recovering signals from sparse, asymmetric data.
- Extensive ablation studies demonstrate significant AUC gains, particularly for rare conversion tasks, validating its flexible masking approach.
The Knowledge transfer framework for Asymmetric Multi-Label data (KAML) is a method developed for unified learning in scenarios where multi-label data are conditionally missing, sparse, or highly skewed, as is characteristic in large-scale conversion rate prediction for online advertising. The framework overcomes the limitations of conventional multi-task learning (MTL) methods, which are challenged by distribution mismatch and label incompleteness due to diverse advertiser goals and selective label submission. KAML leverages a fine-grained knowledge transfer regime by integrating Attribution-Driven Masking (ADM), Hierarchical Knowledge Extraction (HKE), and Ranking-based Label Utilization (RLU), enabling robust learning from asymmetric, incomplete multi-label data and yielding significant empirical gains over prior approaches (Jia et al., 15 Dec 2025).
1. Problem Setting and Motivation
Real-world online advertising systems track multiple user conversion events (e.g., clicks, purchases), with different advertisers often optimizing for distinct or overlapping conversion types. However, label incompleteness arises because many advertisers choose to report only a subset—often a single—conversion type per sample, due to privacy, integration costs, or business strategy. This leads to:
- Label sparsity: Many conversion labels are missing or “not observed.”
- Distribution mismatch: Training data encompasses all available reported actions, but deployed models must infer on subsets corresponding to individual advertiser goals.
- Asymmetry: Not all advertisers provide the same coverage across conversion types, making per-label coverage highly heterogeneous.
Conventional MTL methods—where a multi-gate Mixture-of-Experts or similar architecture trains using the cross-entropy loss on available labels—either conservatively ignore unlabeled elements (one-hot masking) or naively treat missing labels as negatives, resulting in suboptimal learning signal and bias.
KAML targets these limitations by systematizing label inclusion/exclusion using data-driven, attribution-informed masking, and by stratifying the learning process to mitigate sample distribution shift between “original” and “inferred” labels (Jia et al., 15 Dec 2025).
2. Attribution-Driven Masking (ADM): Formalism and Rationale
The central technical component of KAML is ADM—a relaxed, data-informed masking rule replacing the conventional “base mask.”
Definition:
Given a dataset $\mathcal{D} = \{(x_i, a_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $x_i$ is the input, $a_i$ the advertiser's chosen conversion action, and $y_{i,j} \in \{0,1\}$ the label for the $j$-th conversion task,
- Base mask: $m_{i,j} = \mathbb{1}[a_i = j]$ (a sample contributes to task $j$ only if action $j$ was explicitly chosen).
- ADM mask: $m_{i,j} = \mathbb{1}[a_i = j \,\lor\, c_{u_i,j} \ge \tau_j]$, where $c_{u,j}$ is the count of conversions of type $j$ reported by advertiser $u$ within a sliding temporal window, and $\tau_j$ is a task-specific threshold hyperparameter.
Loss function: The binary cross-entropy loss is then computed over all entries $(i,j)$ with $m_{i,j} = 1$:
$$\mathcal{L}_{\mathrm{BCE}} = -\sum_{i,j} m_{i,j}\left[\,y_{i,j}\log \hat{y}_{i,j} + (1 - y_{i,j})\log(1 - \hat{y}_{i,j})\,\right]$$
Intuition: ADM admits a sample for a given conversion task if its source advertiser is known to submit that action with sufficient frequency to justify trust in the observed zeros. The threshold $\tau_j$ calibrates signal versus noise: relaxing $\tau_j$ increases data coverage (especially for rare conversions), but also risks label noise if negative labels are, in reality, unlabeled positives.
Algorithmic sketch:
- For each advertiser $u$ and label $j$, count conversions $c_{u,j}$ over the sliding window $W$.
- For each sample $i$ and label $j$, include the entry if $a_i = j$ or $c_{u_i,j} \ge \tau_j$.
- Train the MTL model on the masked entries.
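The masking step above can be sketched in a few lines of numpy. The array names (`chosen_action`, `advertiser_ids`, `conv_counts`) and the toy counts are illustrative assumptions, not from the paper:

```python
import numpy as np

def adm_mask(chosen_action, advertiser_ids, conv_counts, tau):
    """Attribution-Driven Mask (sketch).

    chosen_action  : (N,) int - conversion action each sample's advertiser reported
    advertiser_ids : (N,) int - advertiser index for each sample
    conv_counts    : (U, J)   - per-advertiser conversion counts over a sliding window
    tau            : (J,)     - per-task inclusion thresholds
    Returns a boolean (N, J) mask: True where the label may enter the loss.
    """
    N, J = chosen_action.shape[0], conv_counts.shape[1]
    base = np.zeros((N, J), dtype=bool)
    base[np.arange(N), chosen_action] = True        # base (one-hot) mask
    extended = conv_counts[advertiser_ids] >= tau   # trust frequent reporters' zeros
    return base | extended                          # ADM relaxes the base mask

# Toy example: 2 advertisers, 3 conversion tasks.
counts = np.array([[50, 0, 3],    # advertiser 0 reports task 0 often, task 2 rarely
                   [ 2, 40, 0]])  # advertiser 1 reports task 1 often
mask = adm_mask(np.array([0, 1]), np.array([0, 1]), counts, tau=np.array([10, 10, 10]))
# With tau = 10, no extra entries are admitted beyond the base mask here;
# lowering tau (e.g., to 1) would admit the rarely-reported labels as well.
```

Lowering `tau` trades precision for coverage, which is exactly the signal-versus-noise knob discussed in Section 4.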
Empirical findings: Augmenting an MMoE baseline with ADM (“MMoE+ADM”) produces consistent AUC improvements, notably +0.0299 for sparse/deep conversion tasks (Action D) compared to near-zero difference for dense/conventional tasks (Action C), confirming that flexible masking recovers valuable supervision especially for marginal signal domains (Jia et al., 15 Dec 2025).
3. Hierarchical Knowledge Extraction (HKE) and Label Utilization in KAML
With the ADM mask, KAML distinguishes two sample strata:
- Original samples ($a_i = j$): directly labeled by the advertiser for task $j$; high trust.
- Extended samples ($a_i \ne j$, $c_{u_i,j} \ge \tau_j$): indirect labels admitted under the ADM criterion; higher coverage but noisier.
HKE: KAML implements a dual-tower architecture within each task unit. “Original” samples are routed to one sub-tower, “extended” to another, with hierarchical fusion downstream. This stratification mitigates label noise and distribution mismatch between sample types, ensuring each sub-tower learns from a more coherent, homogeneous distribution.
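A minimal numpy sketch of the routing logic follows. This is not the paper's architecture: the towers are stubbed as single ReLU layers, the fusion head is a plain linear map, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 4                          # shared feature dim, tower hidden dim

# Two sub-towers per task unit: one for "original", one for "extended" samples.
W_orig = rng.normal(size=(D, H))
W_ext  = rng.normal(size=(D, H))
w_fuse = rng.normal(size=(2 * H,))   # downstream fusion head

def hke_forward(x, is_original):
    """Route each sample to its stratum's sub-tower, then fuse.

    x           : (N, D) shared-bottom features (e.g., MMoE expert output)
    is_original : (N,) bool - True if the advertiser labeled this task directly
    """
    h_orig = np.maximum(x @ W_orig, 0.0)   # ReLU tower for trusted labels
    h_ext  = np.maximum(x @ W_ext,  0.0)   # ReLU tower for ADM-extended labels
    # Each sample keeps only its own stratum's representation; the other slot
    # is zeroed, so gradients would flow only through the matching sub-tower.
    gate = is_original[:, None].astype(float)
    fused = np.concatenate([gate * h_orig, (1 - gate) * h_ext], axis=1)
    return fused @ w_fuse                  # per-sample task logit

x = rng.normal(size=(5, D))
logits = hke_forward(x, np.array([True, True, False, False, True]))
```

The hard gating means an "original" sample's logit is entirely independent of the extended tower's parameters, which is the isolation property the stratification is after.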
Ranking-based Label Utilization (RLU): ADM further enables definition of Type A (labeled positives) and Type C (“negative but possibly unlabeled positive”) samples. KAML introduces a pairwise ranking loss
$$\mathcal{L}_{\mathrm{rank}} = \sum_{j}\;\sum_{i_A \in A_j,\; i_C \in C_j} \log\!\left(1 + \exp\!\left(\hat{s}_{i_C,j} - \hat{s}_{i_A,j}\right)\right)$$
where $\hat{s}_{i,j}$ is the predicted score for the $j$-th task, to pull apart the logits of labeled-positive and possibly-positive samples. The overall loss is a convex combination of the dynamically averaged $\mathcal{L}_{\mathrm{BCE}}$ and $\mathcal{L}_{\mathrm{rank}}$, with mixing coefficient $\alpha$.
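The pairwise term can be sketched with a logistic surrogate, a common choice for ranking losses; the paper's exact surrogate, pair sampling, and weighting may differ:

```python
import numpy as np

def rlu_rank_loss(scores_A, scores_C):
    """Pairwise ranking loss for one task: Type A (labeled positives) vs
    Type C ("negative, possibly unlabeled positive") samples.
    Each (A, C) pair contributes log(1 + exp(s_C - s_A)), which penalizes
    Type C scores that are not well below the Type A scores."""
    diff = scores_C[None, :] - scores_A[:, None]   # (|A|, |C|) pairwise margins
    return np.mean(np.logaddexp(0.0, diff))        # numerically stable softplus

def total_loss(bce, scores_A, scores_C, alpha=0.1):
    """Convex combination of the masked BCE term and the ranking term."""
    return (1.0 - alpha) * bce + alpha * rlu_rank_loss(scores_A, scores_C)

# Well-separated scores yield a small ranking loss; inverted scores a large one.
good = rlu_rank_loss(np.array([3.0, 2.5]), np.array([-1.0, -2.0]))
bad  = rlu_rank_loss(np.array([-1.0]),     np.array([3.0]))
```

Because Type C labels may be unobserved positives, a ranking objective (order the scores) is a gentler use of them than a pointwise negative label.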
4. Signal-versus-Noise Tradeoffs in Masking and Hyperparameter Selection
ADM’s strength derives from converting previously discarded, incomplete labels into training signals, enabling significant knowledge transfer especially for tail/rare conversion types. However, this statistical relaxation introduces a tradeoff:
- Benefit: Enhanced sample efficiency, broader cross-task learning, especially for underrepresented conversion paths.
- Risk: Label noise if negative labels for rarely reported tasks are false (missing positives), or if specific advertisers systematically exclude certain conversions from reporting.
The $\tau_j$ hyperparameters serve as the primary regularization "knob." Lower thresholds maximize inclusivity (signal-rich, noisy), while higher thresholds are more conservative (precision-rich, sparse). Empirically, per-task validation sweeps are used to set the optimal $\tau_j$ for each conversion target (Jia et al., 15 Dec 2025).
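Such a per-task sweep might look like the following; the candidate grid, the tiny rank-based AUC helper, and the stand-in objective are all illustrative, not from the paper:

```python
import numpy as np

def auc(labels, scores):
    """Rank-based AUC: probability a random positive outranks a random negative."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def sweep_tau(candidates, train_and_eval):
    """Pick the threshold whose model scores best on validation AUC.
    `train_and_eval(tau) -> auc` stands in for a full train/evaluate cycle."""
    results = {tau: train_and_eval(tau) for tau in candidates}
    best = max(results, key=results.get)
    return best, results

# Stand-in objective: pretend validation AUC peaks at tau = 10.
best, results = sweep_tau([1, 5, 10, 50],
                          lambda tau: 1.0 - abs(np.log10(tau) - 1.0) * 0.1)
```

In practice each `train_and_eval(tau)` call is a full retraining run, so the candidate grid per task is kept small.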
5. Quantitative Impact and Ablation Studies
Extensive empirical results support the efficacy of ADM within KAML:
| Task | Baseline AUC | MMoE+ADM AUC | AUC Gain |
|---|---|---|---|
| Action A | 0.8448 | 0.8473 | +0.0025 |
| Action B | 0.8921 | 0.8930 | +0.0009 |
| Action C | 0.9139 | 0.9138 | ~0 |
| Action D | 0.8300 | 0.8599 | +0.0299 |
| Action E | 0.9061 | 0.9118 | +0.0057 |
| Overall | 0.9108 | 0.9122 | +0.0014 |
In addition, log-loss generally decreases, confirming both improved calibration and discrimination. These gains are amplified for tasks with highly incomplete labels, substantiating that ADM—and by extension, KAML—successfully transfers knowledge across asymmetric task boundaries (Jia et al., 15 Dec 2025).
6. Integration with Broader Attribution-Driven Masking Paradigms
KAML’s ADM connects to a broader literature on attribution-driven or attention-driven masking in deep learning:
- In medical imaging, attention-driven masking strategies (e.g., SAM in MSMAE) leverage supervised attention maps to focus reconstruction on clinically relevant regions, yielding large accuracy and efficiency gains (Mao et al., 2023).
- Recursive input masking based on input gradients achieves near-perfect masked-input accuracy by iteratively filtering irrelevant features, subject to model linearization constraints (Lee et al., 2021).
- Differentiable masking at the input or hidden layer interfaces, as in DiffMask, combines faithfulness and computational efficiency for model explanation and interpretability (Cao et al., 2020).
- Sparse, entropy-guided dynamic neuron masking at the architectural level supports highly localized knowledge editing in LLMs, using attribution-defined binary neuron masks for minimal, task-specific parameter updates (Liu et al., 25 Oct 2025).
A key commonality is the use of attribution signals—historical conversion frequency, gradient saliency, self-attention, or functional effect measures—to define fine-grained, data-dependent masks, thereby rationalizing inclusivity in training under label or parameter sparsity.
7. Practical Implications, Applications, and Limitations
KAML advances the state-of-the-art for incomplete and skewed multi-label data in large-scale online learning platforms, particularly in ad conversion prediction. Key practical implications are:
- Maximized data utility: By relaxing stringent mask constraints, KAML recovers substantial signal from otherwise wasted multi-label data, especially in rare outcome regimes.
- Robust deployment: Hierarchical separation of sample strata and ranking-based loss ensure better generalization when distributions at training and inference diverge.
- Scalability: Hyperparameter tuning for the masking thresholds $\tau_j$ is systematic and per-task, facilitating deployment across complex, evolving business domains.
Limitations include reliance on appropriately specified historical attribution windows, sensitivity to extreme label skew (where threshold tuning is nontrivial), and the requirement for periodic evaluation of mask trustworthiness as data collection practices evolve.
A plausible implication is that further generalization of attribution-driven masking strategies could enable effective learning in other high-sparsity ML scenarios, such as medical diagnostics with missing labels, large-scale language understanding with incomplete annotations, or continual learning under resource-limited edit constraints.
References
- (Jia et al., 15 Dec 2025) No One Left Behind: How to Exploit the Incomplete and Skewed Multi-Label Data for Conversion Rate Prediction
- (Mao et al., 2023; Lee et al., 2021; Cao et al., 2020; Liu et al., 25 Oct 2025) broader context on attribution-driven masking methods