An Online Probabilistic Distributed Tracing System (2405.15645v1)

Published 24 May 2024 in cs.PF and cs.DC

Abstract: Distributed tracing has become a fundamental tool for diagnosing performance issues in the cloud by recording causally ordered, end-to-end workflows of request executions. However, tracing in production workloads can introduce significant overheads due to the extensive instrumentation needed for identifying performance variations. This paper addresses the trade-off between the cost of tracing and the utility of the "spans" within that trace through Astraea, an online probabilistic distributed tracing system. Astraea is based on our technique that combines online Bayesian learning and multi-armed bandit frameworks. This formulation enables Astraea to effectively steer tracing towards the useful instrumentation needed for accurate performance diagnosis. Astraea localizes performance variations using only 10-28% of available instrumentation, markedly reducing tracing overhead, storage, compute costs, and trace analysis time.

References (50)

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Tweets

https://twitter.com/HPCPapers/status/1794973515805819104

An Online Probabilistic Distributed Tracing System (2405.15645v1)

Summary

Related Papers

Tweets