Admission Control with Response Time Objectives for Low-latency Online Data Systems (2312.15123v1)
Abstract: To provide quick responses to users, Internet companies rely on online data systems that answer queries in milliseconds. These systems employ complementary overload management techniques to ensure they provide continued, acceptable service throughout traffic surges, where 'acceptable' partly means that serviced queries meet or closely track their response time objectives. In this paper we present Bouncer, an admission control policy that aims to keep admitted queries under or near their service level objectives (SLOs) on percentile response times. It computes inexpensive estimates of percentile response times for every incoming query and compares the estimates against the objective values to decide whether to accept or reject the query. Bouncer allows assigning separate SLOs to different classes of queries in the workload, implements early rejections to let clients react promptly and to help data systems avoid doing useless work, and complements other load shedding policies that guard systems from exceeding their capacity. Moreover, we propose two starvation avoidance strategies that supplement Bouncer's basic formulation and prevent query types from receiving no service (starving). We evaluate Bouncer and its starvation-avoiding variants against other policies in simulation and on a production-grade in-memory distributed graph database. Our results show that Bouncer and its variants allow admitted queries to meet or stay close to the SLOs when the other policies do not. Bouncer and its variants also produce fewer overall rejections and add a small overhead, and, with the given latency SLOs, they let the system reach high utilization. In addition, we observe that the proposed strategies can stop query starvation, but at the expense of a modest increase in overall rejections and of SLO violations for serviced requests.
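The accept-or-reject decision described in the abstract can be illustrated with a minimal sketch. This is not Bouncer's actual estimator, which the abstract does not specify. The sketch assumes a per-class sliding window of recently observed response times and a nearest-rank percentile. The class name `PercentileAdmission`, its parameters, and the query classes `read` and `scan` are hypothetical.

```python
from collections import defaultdict, deque


class PercentileAdmission:
    """Toy admission controller (illustrative assumption, not the paper's method)."""

    def __init__(self, slos_ms, percentile=90, window=1000, min_samples=5):
        self.slos_ms = dict(slos_ms)      # query class -> SLO in milliseconds
        self.percentile = percentile      # integer percent, e.g. 90 for p90
        self.min_samples = min_samples    # admit unconditionally while warming up
        # One bounded window of recent response times per query class.
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, query_class, response_ms):
        """Store the observed response time of a serviced query."""
        self.samples[query_class].append(response_ms)

    def estimate(self, query_class):
        """Nearest-rank percentile of the window, or None if too few samples."""
        window = self.samples[query_class]
        n = len(window)
        if n < self.min_samples:
            return None
        rank = (self.percentile * n + 99) // 100   # integer ceil(p * n / 100)
        return sorted(window)[rank - 1]

    def admit(self, query_class):
        """Early decision: reject when the estimate exceeds the class SLO."""
        est = self.estimate(query_class)
        return est is None or est <= self.slos_ms[query_class]


# Usage: separate SLOs per query class, as the abstract describes.
ctl = PercentileAdmission({"read": 10.0, "scan": 50.0})
for ms in [2, 3, 4, 5, 6, 7, 8, 9, 30, 40]:
    ctl.record("read", ms)
read_ok = ctl.admit("read")   # p90 estimate is 30 ms, above the 10 ms SLO
scan_ok = ctl.admit("scan")   # no samples yet, so the query is admitted
```

Sorting the window on every call keeps the sketch short. An inexpensive estimator, as the abstract requires, would more plausibly maintain a histogram or a streaming quantile summary.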