An Online Probabilistic Distributed Tracing System (2405.15645v1)
Abstract: Distributed tracing has become a fundamental tool for diagnosing performance issues in the cloud by recording causally ordered, end-to-end workflows of request executions. However, tracing production workloads can introduce significant overhead because of the extensive instrumentation needed to identify performance variations. This paper addresses the trade-off between the cost of tracing and the utility of the "spans" within a trace through Astraea, an online probabilistic distributed tracing system. Astraea is based on our technique that combines online Bayesian learning with a multi-armed bandit framework. This formulation enables Astraea to steer tracing toward the instrumentation that is useful for accurate performance diagnosis. Astraea localizes performance variations using only 10-28% of available instrumentation, markedly reducing tracing overhead, storage and compute costs, and trace analysis time.
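The abstract's combination of online Bayesian learning and a multi-armed bandit can be illustrated with a generic Beta-Bernoulli Thompson sampling loop. This is a minimal sketch under stated assumptions, not Astraea's actual algorithm. The span names, the binary "useful" reward, and the `SpanBandit` and `run_simulation` helpers are all hypothetical and do not come from the paper. Each span is treated as an arm, and the posterior over its utility is updated after every observed request.

```python
import random


class SpanBandit:
    """Beta-Bernoulli Thompson sampling over spans (illustrative sketch only)."""

    def __init__(self, span_names, seed=None):
        self.rng = random.Random(seed)
        # Assumption: a uniform Beta(1, 1) prior on every span's utility.
        self.alpha = {s: 1.0 for s in span_names}
        self.beta = {s: 1.0 for s in span_names}

    def choose(self):
        """Draw one sample per posterior and pick the span with the largest draw."""
        draws = {
            s: self.rng.betavariate(self.alpha[s], self.beta[s])
            for s in self.alpha
        }
        return max(draws, key=draws.get)

    def update(self, span, useful):
        """Bayesian update: a useful observation raises alpha, otherwise beta."""
        if useful:
            self.alpha[span] += 1.0
        else:
            self.beta[span] += 1.0

    def posterior_mean(self, span):
        return self.alpha[span] / (self.alpha[span] + self.beta[span])


def run_simulation(true_utility, rounds=2000, seed=0):
    """Simulate tracing decisions against hidden, made-up span utilities."""
    bandit = SpanBandit(list(true_utility), seed=seed)
    env = random.Random(seed + 1)  # separate RNG for the simulated environment
    counts = {s: 0 for s in true_utility}
    for _ in range(rounds):
        span = bandit.choose()
        counts[span] += 1
        # Hypothetical reward: 1 if the span helped explain a latency variation.
        useful = env.random() < true_utility[span]
        bandit.update(span, useful)
    return counts, bandit


if __name__ == "__main__":
    # Made-up utilities for three hypothetical spans.
    utilities = {"db_query": 0.9, "cache_lookup": 0.1, "auth_check": 0.1}
    counts, bandit = run_simulation(utilities)
    for name in utilities:
        print(name, counts[name], round(bandit.posterior_mean(name), 3))
```

In this toy setting the loop concentrates its choices on the span with the highest hidden utility while still occasionally sampling the others. That is the general exploration-exploitation behavior that lets a bandit-style tracer get by with a fraction of the available instrumentation.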