Calibration-compatible Listwise Distillation of Privileged Features for CTR Prediction (2312.08727v1)
Abstract: In machine learning systems, privileged features refer to features that are available during offline training but inaccessible for online serving. Previous studies have recognized the importance of privileged features and explored ways to tackle the resulting online-offline discrepancies. A typical practice is privileged features distillation (PFD): train a teacher model using all features (including privileged ones), then distill the teacher's knowledge into a student model (excluding the privileged features), which is deployed for online serving. In practice, the pointwise cross-entropy loss is often adopted for PFD. However, this loss is insufficient to distill ranking ability for CTR prediction. First, it does not account for the non-i.i.d. nature of the data, i.e., other items on the same page significantly affect the click probability of the candidate item. Second, it ignores the relative order of items as ranked by the teacher model's predictions, which is essential for distilling ranking ability. To address these issues, we first extend pointwise PFD to listwise PFD. We then define the calibration-compatible property of distillation losses and show that commonly used listwise losses do not satisfy this property when employed as distillation losses, thus compromising the model's calibration ability, another important measure for CTR prediction. To resolve this dilemma, we propose Calibration-compatible LIstwise Distillation (CLID), which employs a carefully designed listwise distillation loss to achieve better ranking ability than pointwise PFD while preserving the model's calibration ability. We theoretically prove that it is calibration-compatible. Extensive experiments on public datasets and a production dataset collected from the display advertising system of Alibaba further demonstrate the effectiveness of CLID.
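To make the pointwise-versus-listwise distinction concrete, below is a minimal PyTorch-style sketch of the two distillation setups described above. The function names, the weighting parameter `alpha`, and the ListNet-style softmax-matching term are illustrative assumptions, not the paper's implementation; in particular, the calibration-compatible listwise loss that defines CLID is specified in the paper itself, whereas the vanilla listwise term shown here is exactly the kind of loss the abstract notes is *not* calibration-compatible.

```python
import torch
import torch.nn.functional as F


def pointwise_pfd_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Conventional pointwise PFD: cross-entropy against click labels plus a
    pointwise distillation term toward the teacher's predicted probabilities."""
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft = F.binary_cross_entropy_with_logits(student_logits,
                                              torch.sigmoid(teacher_logits))
    return (1.0 - alpha) * hard + alpha * soft


def listwise_distillation_loss(student_logits, teacher_logits):
    """Generic ListNet-style listwise distillation: match the student's softmax
    distribution over the items on the same page to the teacher's.
    Shapes: [num_pages, items_per_page]. Unlike CLID, this plain listwise term
    is not guaranteed to preserve calibration; it only illustrates where a
    listwise term replaces the pointwise one."""
    teacher_dist = F.softmax(teacher_logits, dim=-1)
    student_log_dist = F.log_softmax(student_logits, dim=-1)
    return -(teacher_dist * student_log_dist).sum(dim=-1).mean()


# Toy usage: 4 pages with 5 candidate items each.
student = torch.randn(4, 5, requires_grad=True)   # student scores (no privileged features)
teacher = torch.randn(4, 5)                       # teacher scores (with privileged features)
clicks = torch.randint(0, 2, (4, 5)).float()
loss = pointwise_pfd_loss(student.view(-1), teacher.view(-1), clicks.view(-1)) \
       + listwise_distillation_loss(student, teacher)
loss.backward()
```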