AutoOffAB: Toward Automated Offline A/B Testing for Data-Driven Requirement Engineering (2312.10624v2)
Abstract: Software companies widely use online A/B testing to evaluate the impact of a new technology by offering it to groups of users and comparing it against the unmodified product. However, running an online A/B test requires not only effort in design, implementation, and obtaining stakeholders' approval to serve it in production, but also several weeks to collect data over iterations. To address these issues, a recently emerging topic called "Offline A/B Testing" is receiving increasing attention; it aims to evaluate new technologies offline by estimating from historical logged data. Although this approach is promising due to its lower implementation effort, faster turnaround time, and absence of potential user harm, several limitations must be addressed before it can be effectively prioritized as a requirement in practice, including its discrepancy with online A/B test results and the lack of systematic updates as data and parameters vary. In response, in this vision paper, I introduce AutoOffAB, an idea to automatically run variants of offline A/B testing against recent logs and update the offline evaluation results, which are then used to make decisions on requirements more reliably and systematically.