Navigating the Evaluation Funnel to Optimize Iteration Speed for Recommender Systems (2404.08671v1)
Abstract: Over the last decades, a rich literature on the evaluation of recommendation systems has emerged. However, less has been written about how to efficiently combine different evaluation methods from this rich field into a single, efficient evaluation funnel. In this paper we aim to build intuition for how to choose evaluation methods by presenting a novel framework that simplifies reasoning about the evaluation funnel for a recommendation system. Our contribution is twofold. First, we present our framework for decomposing the definition of success in order to construct efficient evaluation funnels, focusing on how to identify and discard non-successful iterations quickly. We show that decomposing the definition of success into smaller necessary criteria enables early identification of non-successful ideas. Second, we give an overview of the most common and useful evaluation methods, discuss their pros and cons, and describe how they fit into, and complement each other within, the evaluation process. We cover so-called offline and online evaluation methods such as counterfactual logging, validation, verification, A/B testing, and interleaving. The paper concludes with general discussion and advice on how to design an efficient evaluation process for recommender systems.
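The core idea of the funnel — applying cheap necessary criteria first so that non-successful candidates are discarded before the expensive stages — can be illustrated with a minimal sketch. This is not code from the paper; the stage names, cost figures, and pass criteria below are hypothetical and only serve to show the ordering-by-cost logic.

```python
# Illustrative sketch (not from the paper): an evaluation funnel that
# applies cheap necessary-condition checks first, discarding failing
# candidates before the expensive online stages are ever run.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    cost: float                     # relative cost of evaluating one candidate
    passes: Callable[[str], bool]   # necessary criterion for success


def run_funnel(candidates: List[str], stages: List[Stage]) -> Tuple[List[str], float]:
    """Run candidates through stages ordered by increasing cost.

    Returns the surviving candidates and the total evaluation cost spent.
    """
    total_cost = 0.0
    survivors = list(candidates)
    for stage in sorted(stages, key=lambda s: s.cost):
        next_round = []
        for c in survivors:
            total_cost += stage.cost
            if stage.passes(c):
                next_round.append(c)
        survivors = next_round
    return survivors, total_cost


# Hypothetical stages: offline validation is cheap, an A/B test is costly.
stages = [
    Stage("offline validation", cost=1.0, passes=lambda c: "v" in c),
    Stage("interleaving", cost=10.0, passes=lambda c: "i" in c),
    Stage("A/B test", cost=100.0, passes=lambda c: "ab" in c),
]
survivors, cost = run_funnel(["vi-ab", "v-only", "nothing"], stages)
# Only "vi-ab" survives; the expensive A/B stage ran on a single candidate.
```

The design choice worth noting is that total cost depends heavily on stage ordering: filtering with the cheap offline stage first means the expensive A/B stage is only paid for candidates that have already satisfied every cheaper necessary criterion.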