Benchmarking Distributed Stream Data Processing Systems

Published 23 Feb 2018 in cs.DB (arXiv:1802.08496v2)

Abstract: The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear gap in detailed analyses of the systems' performance characteristics. In this paper, we propose a framework for benchmarking distributed stream processing engines. We use our suite to evaluate the performance of three widely used SDPSs in detail, namely Apache Storm, Apache Spark, and Apache Flink. Our evaluation focuses in particular on measuring the throughput and latency of windowed operations, which are the basic type of operations in stream analytics. For this benchmark, we design workloads based on real-life, industrial use-cases inspired by the online gaming industry. The contribution of our work is threefold. First, we give a definition of latency and throughput for stateful operators. Second, we carefully separate the system under test and driver, in order to correctly represent the open world model of typical stream processing deployments and, therefore, to measure system performance under realistic conditions. Third, we build the first benchmarking framework to define and test the sustainable performance of streaming systems. Our detailed evaluation highlights the individual characteristics and use-cases of each system.

Citations (163)

Summary

  • The paper presents a comprehensive framework for benchmarking distributed stream processing systems, defining precise metrics like event-time latency, processing-time latency, and sustainable throughput, measured externally to ensure accuracy.
  • Experimental evaluation comparing Apache Storm, Spark, and Flink on windowed operations using real-world workloads shows Flink generally achieves higher throughput and lower latency, while Spark demonstrates robustness in handling data skew.
  • The findings underscore the importance of choosing a streaming system based on specific workload characteristics and propose an extensible benchmarking strategy for future research into system performance trade-offs.

Benchmarking Distributed Stream Data Processing Systems

The paper "Benchmarking Distributed Stream Data Processing Systems" presents a comprehensive framework for evaluating the performance characteristics of distributed stream data processing systems (SDPSs), namely Apache Storm, Apache Spark, and Apache Flink. The focus lies on assessing throughput and latency during windowed operations—a fundamental component of stream analytics—and providing a detailed comparative analysis using industry-inspired real-life workloads.

Key Contributions

The paper makes several notable contributions to the field of distributed stream data processing:

  1. Definition and Measurement of Latency and Throughput: The authors introduce precise definitions of event-time latency (measured from the creation time of an event at the source to the emission of the corresponding result by the SDPS output operator) and processing-time latency (measured from the time the event is ingested by the SDPS to the emission of the result). They emphasize measuring these metrics externally to the system under test, since in-system measurements in earlier studies were prone to the coordinated omission problem; a minimal sketch of these metrics follows this list.
  2. Separation of Benchmark Driver and System Under Test: A significant methodological advancement is the complete separation of the benchmark driver—which includes data generation and queuing—from the system under test (SUT). This separation ensures that performance metrics reflect actual system capabilities without the confounding influence of integrated measurement systems.
  3. Sustainable Throughput: The concept of sustainable throughput is introduced, defined as the highest data traffic load a system can handle without exhibiting prolonged backpressure. This metric is crucial for understanding the real performance capabilities of SDPSs in production environments.
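
These definitions can be made concrete with a small sketch. The following plain-Python snippet (names and thresholds are illustrative assumptions, not the paper's actual harness or exact formulas) computes the two per-event latencies from externally recorded timestamps and applies a crude sustainability check driven by observed backpressure:

```python
from dataclasses import dataclass

@dataclass
class ObservedResult:
    """Timestamps recorded outside the system under test (SUT) for one emitted result."""
    event_time: float       # when the event was created at the source (event time)
    ingestion_time: float   # when the driver handed the event to the SDPS
    emission_time: float    # when the result was observed by the external driver

def event_time_latency(r: ObservedResult) -> float:
    # Event-time latency: from event creation to result emission.
    return r.emission_time - r.event_time

def processing_time_latency(r: ObservedResult) -> float:
    # Processing-time latency: from ingestion into the SDPS to result emission.
    return r.emission_time - r.ingestion_time

def is_sustainable(queue_lag_samples: list[float], max_lag: float, max_violations: int = 3) -> bool:
    """Treat a load level as sustainable if driver-side queue lag (a proxy for
    backpressure) exceeds the threshold only briefly rather than persistently."""
    violations = sum(1 for lag in queue_lag_samples if lag > max_lag)
    return violations <= max_violations
```

The highest load level for which such a sustainability check holds would then be reported as the system's sustainable throughput.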

Experimental Evaluation

The benchmarking framework is evaluated using two types of queries based on online video gaming scenarios: windowed aggregations and windowed joins. The results provide insight into system-specific characteristics:

  • Throughput and Latency: The experimental outcomes reveal that Flink generally achieves higher throughput and lower latency compared to Storm and Spark. Flink's non-blocking operators and efficient tuple-at-a-time semantics seem to contribute to its superior performance in aggregation workloads.
  • Data Skew Handling: Spark shows robust performance in handling skewed data, outperforming other systems due to its tree reduce and tree aggregate communication patterns, which minimize network bottlenecks.
  • Window Size Influence: As window size increases, Spark's throughput drops noticeably due to growing memory consumption; this can be mitigated by strategies such as an inverse reduce function that incrementally removes expired data from the window aggregate (see the sketch after this list).
  • System Responsiveness to Fluctuations: Flink demonstrates better stability than Spark when subjected to fluctuating workloads, attributed to its efficient backpressure mechanism.
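
To make the window-size point concrete, the following plain-Python sketch (a count-based window with invented parameters, not Spark's actual implementation) shows how an inverse reduce function lets a sliding-window aggregate be updated incrementally, subtracting expired values instead of recomputing the whole window on every slide:

```python
from collections import deque

def sliding_window_sums(stream, window_size, slide):
    """Illustrative incremental sliding-window aggregation over a stream of numbers."""
    window = deque()
    running_sum = 0
    results = []
    for i, value in enumerate(stream):
        window.append(value)
        running_sum += value                 # reduce: fold the new value into the aggregate
        while len(window) > window_size:
            running_sum -= window.popleft()  # inverse reduce: drop values that left the window
        if (i + 1) % slide == 0:
            results.append(running_sum)      # emit one aggregate per slide
    return results

# Toy example: per-slide sums over a stream of event values.
print(sliding_window_sums([3, 1, 4, 1, 5, 9, 2, 6], window_size=4, slide=2))  # [4, 9, 19, 22]
```

Keeping the aggregate incremental bounds the per-slide work, which is the effect the inverse reduce strategy exploits for large windows.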

Implications and Future Work

The findings highlight the importance of choosing an SDPS based on specific workload characteristics such as data skew, window size, or throughput requirements. The research underlines the need for further exploration of SDPS functionality trade-offs, such as exactly-once processing guarantees or the handling of late-arriving data, while maintaining high throughput and low latency.

The framework offers an extensible benchmarking strategy with potential applications across various SDPS architectures, paving the way for future work that incorporates a broader set of streaming engines such as Apache Samza and Heron. The authors suggest developing a generic interface to streamline the benchmarking process and to explore the trade-offs between system functionality and performance in greater depth.
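
The paper does not spell out this generic interface; the following is a purely hypothetical sketch (all names are invented for illustration) of the kind of per-engine adapter a shared benchmarking harness could standardize on:

```python
from abc import ABC, abstractmethod

class StreamingEngineAdapter(ABC):
    """Hypothetical adapter a benchmark driver could use to treat engines such as
    Flink, Spark, Storm, Samza, or Heron uniformly."""

    @abstractmethod
    def deploy(self, query: str) -> None:
        """Submit a benchmark query (e.g., a windowed aggregation or join) to the engine."""

    @abstractmethod
    def is_backpressured(self) -> bool:
        """Report whether the engine is currently signaling backpressure to the driver."""

    @abstractmethod
    def shutdown(self) -> None:
        """Tear the job down between throughput levels."""
```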

By providing a robust mechanism to accurately measure performance metrics, this paper serves as a crucial guide for researchers and practitioners aiming to comprehend the dynamics and limitations inherent in contemporary SDPSs.
