CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows (2403.13629v1)
Abstract: Over the last decade, stream processing has seen broad adoption in both commercial and research settings. One key element of this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the time of writing, virtually all stream processors that guarantee exactly-once processing implement a variant of Apache Flink's coordinated checkpoints, an extension of the original Chandy-Lamport checkpoints from 1985. However, the reasons behind this prevalence of the coordinated approach remain anecdotal, as reported by practitioners of the stream processing community. At the same time, common checkpointing approaches, such as the uncoordinated and the communication-induced ones, remain largely unexplored. This paper is the first to address this gap by i) shedding light on why practitioners have favored the coordinated approach and ii) investigating whether there are viable alternatives. To this end, we implement three checkpointing approaches that we surveyed and adapted for the distinct needs of streaming dataflows. Our analysis shows that the coordinated approach outperforms the uncoordinated and communication-induced protocols under uniformly distributed workloads. To our surprise, however, the uncoordinated approach is not only competitive with the coordinated one in uniformly distributed workloads, but it also outperforms the coordinated approach in skewed workloads. We conclude that rather than blindly employing coordinated checkpointing, research should focus on optimizing the very promising uncoordinated approach, as it can address issues with skew and support prevalent cyclic queries. We believe that our findings can trigger further research into checkpointing mechanisms.