
Pareto optimal proxy metrics (2307.01000v2)

Published 3 Jul 2023 in stat.ME and cs.LG

Abstract: North star metrics and online experimentation play a central role in how technology companies improve their products. In many practical settings, however, evaluating experiments based on the north star metric directly can be difficult. The two most significant issues are 1) low sensitivity of the north star metric and 2) differences between the short-term and long-term impact on the north star metric. A common solution is to rely on proxy metrics rather than the north star in experiment evaluation and launch decisions. Existing literature on proxy metrics concentrates mainly on the estimation of the long-term impact from short-term experimental data. In this paper, instead, we focus on the trade-off between the estimation of the long-term impact and the sensitivity in the short term. In particular, we propose the Pareto optimal proxy metrics method, which simultaneously optimizes prediction accuracy and sensitivity. In addition, we give an efficient multi-objective optimization algorithm that outperforms standard methods. We applied our methodology to experiments from a large industrial recommendation system, and found proxy metrics that are eight times more sensitive than the north star and consistently moved in the same direction, increasing the velocity and the quality of the decisions to launch new features.
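As a rough illustration of the trade-off the abstract describes, the sketch below picks out the Pareto optimal candidates from a handful of proxy metrics that have each been scored on two objectives to be maximized: prediction accuracy and sensitivity. The scores, the `candidates` list, and the `pareto_front` helper are all hypothetical and exist only for this example. This is a brute-force dominance check, not the multi-objective optimization algorithm proposed in the paper.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of points that no other point dominates.

    Both coordinates are treated as objectives to maximize. A point is
    dominated if another point is at least as good on both objectives
    and strictly better on at least one.
    """
    front = []
    for i, (acc_i, sens_i) in enumerate(points):
        dominated = any(
            acc_j >= acc_i and sens_j >= sens_i
            and (acc_j > acc_i or sens_j > sens_i)
            for j, (acc_j, sens_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical (prediction accuracy, sensitivity) scores for five
# candidate proxy metrics; the numbers are made up for illustration.
candidates = [(0.9, 0.2), (0.7, 0.6), (0.5, 0.5), (0.3, 0.9), (0.2, 0.1)]
front = pareto_front(candidates)
print(front)
```

Under these made-up scores, the candidates at indices 0, 1, and 3 form the Pareto front: each gives up some accuracy for sensitivity or the reverse, while the other two are beaten on both objectives by some other candidate.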
