Improving conversion rate prediction via self-supervised pre-training in online advertising (2401.16432v1)
Abstract: The task of predicting conversion rates (CVR) lies at the heart of online advertising systems that optimize bids to meet advertiser performance requirements. Even with the recent rise of deep neural networks, these predictions are often made by factorization machines (FM), especially in commercial settings where inference latency is key. These models are trained using the logistic regression framework on labeled tabular data formed from past user activity that is relevant to the task at hand. Many advertisers only care about click-attributed conversions. A major challenge in training models that predict conversions-given-clicks comes from data sparsity: clicks are rare, and conversions attributed to clicks are even rarer. However, mitigating sparsity by adding conversions that are not click-attributed to the training set impairs model calibration. Since calibration is critical to achieving advertiser goals, this approach is infeasible. In this work, we apply the well-known idea of self-supervised pre-training: an auxiliary auto-encoder model, trained on all conversion events, both click-attributed and not, serves as a feature extractor that enriches the main CVR prediction model. Since the main model does not train on non-click-attributed conversions, this does not impair calibration. We adapt the basic self-supervised pre-training idea to our online advertising setup by using a loss function designed for tabular data, facilitating continual learning by ensuring auto-encoder stability, and incorporating a neural network into a large-scale real-time ad auction that ranks tens of thousands of ads, under strict latency constraints, and without incurring a major engineering cost. We show improvements both offline, during training, and in an online A/B test. Following its success in A/B tests, our solution is now fully deployed in the Yahoo native advertising system.
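The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NumPy, synthetic Gaussian features, a linear auto-encoder with squared reconstruction error as a stand-in for the paper's tabular loss, and plain logistic regression as a stand-in for the FM-based CVR model. All function names (`train_autoencoder`, `train_logistic`, `run_sketch`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, dim, epochs=500, lr=0.05, seed=0):
    """Fit a linear auto-encoder by gradient descent on squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, dim))
    W_dec = rng.normal(scale=0.1, size=(dim, d))
    for _ in range(epochs):
        Z = X @ W_enc                      # encode
        err = Z @ W_dec - X                # reconstruction residual
        grad_dec = Z.T @ err / n
        grad_enc = X.T @ (err @ W_dec.T) / n
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec


def train_logistic(X, y, epochs=300, lr=0.5):
    """Logistic regression by gradient descent (stand-in for the main CVR model)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        g = sigmoid(X @ w + b) - y         # gradient of log loss w.r.t. the logits
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b


def run_sketch(seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: synthetic data. X_all mimics ALL conversion events
    # (click-attributed or not); X_click mimics clicked impressions only.
    X_all = rng.normal(size=(500, 8))
    X_click = rng.normal(size=(200, 8))
    y_click = (rng.random(200) < 0.3).astype(float)

    # Step 1: self-supervised pre-training on all conversion events (no labels).
    W_enc, W_dec = train_autoencoder(X_all, dim=3)

    # Step 2: the frozen encoder acts as a feature extractor for the main model,
    # which is trained only on click-attributed labels.
    enriched = np.hstack([X_click, X_click @ W_enc])
    w, b = train_logistic(enriched, y_click)

    return {
        "enriched": enriched,
        "p": sigmoid(enriched @ w + b),
        "y": y_click,
        "recon_error": np.mean((X_all @ W_enc @ W_dec - X_all) ** 2),
        "baseline_error": np.mean(X_all ** 2),  # error of reconstructing all zeros
    }


if __name__ == "__main__":
    out = run_sketch()
    print("reconstruction error:", out["recon_error"], "baseline:", out["baseline_error"])
    print("mean prediction:", out["p"].mean(), "mean label:", out["y"].mean())
```

The point of the sketch is the separation of training sets: the auto-encoder sees every conversion event, while the supervised model sees only click-attributed labels, so its average prediction can still track the click-attributed conversion rate. With a linear encoder the extra features add no expressive power to a linear model; a nonlinear encoder would be needed for the enrichment to matter in practice.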