Ensemble Knowledge Distillation for CTR Prediction (2011.04106v2)
Abstract: Recently, deep learning-based models have been widely studied for click-through rate (CTR) prediction and have led to improved prediction accuracy in many industrial applications. However, current research focuses primarily on building complex network architectures to better capture sophisticated feature interactions and dynamic user behaviors. The increased model complexity may slow down online inference and hinder adoption in real-time applications. Instead, our work targets a new model training strategy based on knowledge distillation (KD), a teacher-student learning framework that transfers knowledge learned by a teacher model to a student model. The KD strategy not only allows us to simplify the student model to a vanilla DNN, but also achieves significant accuracy improvements over the state-of-the-art teacher models. These benefits motivate us to further explore a powerful ensemble of teachers for more accurate student model training. We also propose novel techniques to facilitate ensemble-based CTR prediction, including teacher gating and early stopping by distillation loss. We conduct comprehensive experiments against 12 existing models and across three industrial datasets. Both offline evaluation and online A/B testing results show the effectiveness of our KD-based training strategy.
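The training strategy described above can be sketched as a combined objective: a hard-label loss on the clicks plus a distillation loss against soft targets produced by a gated ensemble of teachers. The sketch below is a minimal, hypothetical illustration in NumPy; the paper's exact gating mechanism, loss weighting, and temperature handling may differ, and all names (`teacher_gate`, `kd_loss`, `lam`) are illustrative.

```python
import numpy as np

def teacher_gate(teacher_probs, labels, eps=1e-7):
    """Combine teacher predictions into soft targets.

    A hypothetical per-example gate: each teacher is weighted by its
    confidence in the true label, so better-calibrated teachers
    contribute more. teacher_probs has shape (n_teachers, batch).
    """
    conf = labels * teacher_probs + (1.0 - labels) * (1.0 - teacher_probs)
    weights = conf / (conf.sum(axis=0, keepdims=True) + eps)
    return (weights * teacher_probs).sum(axis=0)

def bce(target, pred, eps=1e-7):
    """Binary cross-entropy; target may be a soft probability."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def kd_loss(labels, student_probs, teacher_probs, lam=1.0):
    """Total loss = hard-label BCE + lam * BCE against gated soft targets."""
    soft_targets = teacher_gate(teacher_probs, labels)
    return bce(labels, student_probs) + lam * bce(soft_targets, student_probs)
```

Early stopping by distillation loss, as proposed in the paper, would then amount to monitoring the distillation term (the second summand in `kd_loss`) on held-out data and halting training when it stops improving, rather than watching the hard-label loss alone.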