Helen: Optimizing CTR Prediction Models with Frequency-wise Hessian Eigenvalue Regularization (2403.00798v1)
Abstract: Click-Through Rate (CTR) prediction holds paramount significance in online advertising and recommendation scenarios. Despite the proliferation of recent CTR prediction models, the improvements in performance have remained limited, as evidenced by open-source benchmark assessments. Current researchers tend to focus on developing new models for various datasets and settings, often neglecting a crucial question: what is the key challenge that truly makes CTR prediction so demanding? In this paper, we approach the problem of CTR prediction from an optimization perspective. We explore the typical data characteristics and optimization statistics of CTR prediction, revealing a strong positive correlation between the top Hessian eigenvalue and feature frequency. This correlation implies that frequently occurring features tend to converge towards sharp local minima, ultimately leading to suboptimal performance. Motivated by the recent advancements in sharpness-aware minimization (SAM), which considers the geometric aspects of the loss landscape during optimization, we present a dedicated optimizer crafted for CTR prediction, named Helen. Helen incorporates frequency-wise Hessian eigenvalue regularization, achieved through adaptive perturbations based on normalized feature frequencies. Empirical results under the open-source benchmark framework underscore Helen's effectiveness. It successfully constrains the top eigenvalue of the Hessian matrix and demonstrates a clear advantage over widely used optimization algorithms when applied to seven popular models across three public benchmark datasets on BARS. Our code is available at github.com/NUS-HPC-AI-Lab/Helen.
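The core idea described above — a SAM-style two-step update whose ascent perturbation is scaled per feature by normalized frequency — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `helen_style_step`, the per-feature-scalar parameterization, and the exact normalization (dividing counts by the maximum count) are assumptions for clarity; the paper's Helen optimizer operates on embedding tables inside full CTR models.

```python
import numpy as np

def helen_style_step(w, freq, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update with frequency-scaled perturbation (illustrative).

    w       : parameter vector (one scalar weight per feature, for simplicity)
    freq    : raw occurrence counts of each feature in the training data
    grad_fn : callable returning the loss gradient at a given w
    lr      : learning rate for the descent step
    rho     : base perturbation radius, as in SAM
    """
    g = grad_fn(w)
    # Normalize feature frequencies to [0, 1]: frequent features receive
    # larger perturbations, penalizing sharp minima more strongly for them.
    f = freq / freq.max()
    # SAM ascent direction (gradient normalized to radius rho),
    # rescaled element-wise by the normalized frequency.
    eps = rho * f * g / (np.linalg.norm(g) + 1e-12)
    # Descend using the gradient evaluated at the perturbed point.
    return w - lr * grad_fn(w + eps)
```

On a toy quadratic loss this behaves like SAM with a per-coordinate radius: coordinates tied to frequent features are perturbed by up to `rho`, while rare-feature coordinates are barely perturbed and follow near-vanilla SGD.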