
Informed and Assessable Observability Design Decisions in Cloud-native Microservice Applications (2403.00633v2)

Published 1 Mar 2024 in cs.SE

Abstract: Observability is important to ensure the reliability of microservice applications. These applications are often prone to failures, since they have many independent services deployed in heterogeneous environments. When employed "correctly", observability can help developers identify and troubleshoot faults quickly. However, instrumenting and configuring the observability of a microservice application is not trivial: it is tool-dependent and tied to costs. Architects need to understand observability-related trade-offs in order to weigh between different observability design alternatives. Still, these architectural design decisions are not supported by systematic methods and typically just rely on "professional intuition". In this paper, we argue for a systematic method to arrive at informed and continuously assessable observability design decisions. Specifically, we focus on fault observability of cloud-native microservice applications, and turn this into a testable and quantifiable property. Towards our goal, we first model the scale and scope of observability design decisions across the cloud-native stack. Then, we propose observability metrics which can be determined for any microservice application through so-called observability experiments. We present a proof-of-concept implementation of our experiment tool OXN. OXN is able to inject arbitrary faults into an application, similar to Chaos Engineering, but also possesses the unique capability to modify the observability configuration, allowing for the assessment of design decisions that were previously left unexplored. We demonstrate our approach using a popular open source microservice application and show the trade-offs involved in different observability design decisions.
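The abstract describes OXN-style observability experiments: apply an observability configuration, inject a fault, and then check whether the fault is actually visible in the collected telemetry. Below is a minimal sketch of that experiment loop, assuming a simple treatment/detector structure; all class names, fields, and the toy telemetry are hypothetical illustrations, not OXN's actual API.

```python
# Minimal sketch of an "observability experiment": pair a fault treatment
# (what we break) with an observability treatment (how the app is observed),
# then decide whether the fault shows up in the telemetry. Names are
# hypothetical, not the real OXN interface.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FaultTreatment:
    """A fault injected into the system under experiment (e.g., added latency)."""
    name: str
    inject: Callable[[], None]
    revert: Callable[[], None]


@dataclass
class ObservabilityTreatment:
    """A change to the observability configuration (e.g., scrape interval)."""
    name: str
    apply: Callable[[], None]


@dataclass
class Experiment:
    fault: FaultTreatment
    observability: ObservabilityTreatment
    collect_telemetry: Callable[[], List[float]]  # e.g., latency samples from a metrics backend
    detector: Callable[[List[float]], bool]       # decides whether the fault is visible

    def run(self) -> bool:
        """Apply the observability treatment, inject the fault, and report
        whether the detector flags the fault in the collected telemetry."""
        self.observability.apply()
        self.fault.inject()
        try:
            samples = self.collect_telemetry()
        finally:
            self.fault.revert()
        return self.detector(samples)


if __name__ == "__main__":
    # Toy stand-ins: a "fault" that raises simulated response times, telemetry
    # that reflects it, and a simple threshold detector.
    state = {"latency_ms": 20.0}
    fault = FaultTreatment(
        name="latency_spike",
        inject=lambda: state.update(latency_ms=250.0),
        revert=lambda: state.update(latency_ms=20.0),
    )
    obs = ObservabilityTreatment(name="1s_scrape_interval", apply=lambda: None)
    exp = Experiment(
        fault=fault,
        observability=obs,
        collect_telemetry=lambda: [state["latency_ms"]] * 10,
        detector=lambda samples: sum(samples) / len(samples) > 100.0,
    )
    print("fault observed:", exp.run())  # -> fault observed: True
```

Running the experiment once per observability configuration (e.g., different sampling rates or scrape intervals) is what lets the trade-offs between designs be quantified rather than judged by intuition.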


