- The paper presents a comprehensive framework for benchmarking distributed stream processing systems, defining precise metrics like event-time latency, processing-time latency, and sustainable throughput, measured externally to ensure accuracy.
- Experimental evaluation comparing Apache Storm, Spark, and Flink on windowed operations using real-world workloads shows Flink generally achieves higher throughput and lower latency, while Spark demonstrates robustness in handling data skew.
- The findings underscore the importance of choosing a streaming system based on specific workload characteristics and propose an extensible benchmarking strategy for future research into system performance trade-offs.
Benchmarking Distributed Stream Data Processing Systems
The paper "Benchmarking Distributed Stream Data Processing Systems" presents a comprehensive framework for evaluating the performance characteristics of distributed stream data processing systems (SDPSs), namely Apache Storm, Apache Spark, and Apache Flink. The focus lies on assessing throughput and latency during windowed operations—a fundamental component of stream analytics—and providing a detailed comparative analysis using industry-inspired real-life workloads.
Key Contributions
The paper makes several notable contributions to the field of distributed stream data processing:
- Definition and Measurement of Latency and Throughput: The authors give precise definitions of event-time latency (the interval between an event's creation and the emission of the windowed result that contains it) and processing-time latency (the interval between the event's ingestion into the SDPS and that emission). They emphasize measuring both metrics externally to the system under test, since self-reported numbers in earlier studies suffered from inaccuracies such as the coordinated omission problem; a sketch of this calculation follows the list.
- Separation of Benchmark Driver and System Under Test: A key methodological advance is the complete separation of the benchmark driver (data generation and queuing) from the system under test (SUT). This separation ensures that the reported metrics reflect the system's actual capabilities rather than the overhead of measurement code running inside it.
- Sustainable Throughput: The concept of sustainable throughput is introduced, defined as the highest load a system can handle without exhibiting prolonged backpressure (short-lived spikes are tolerated). This metric is crucial for understanding the real performance capabilities of SDPSs in production environments; a sustainability check is sketched below.
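As a concrete illustration of the two latency notions, the following is a minimal Scala sketch of how they could be derived from timestamps observed outside the system under test. The `ObservedResult` record, its fields, and the choice of representative event timestamp are illustrative assumptions, not the paper's actual measurement harness.

```scala
// Hypothetical record for one windowed result, observed at an external sink.
// eventTime:     creation (event) time of the event taken as representative for the result
// ingestionTime: time the driver handed that event to the queue in front of the SUT
// emissionTime:  time the result was observed outside the SUT
case class ObservedResult(eventTime: Long, ingestionTime: Long, emissionTime: Long)

object LatencyMetrics {
  // Event-time latency: emission minus event creation; includes any queueing delay
  // in front of the system, so it reflects what a consumer of the results experiences.
  def eventTimeLatency(r: ObservedResult): Long = r.emissionTime - r.eventTime

  // Processing-time latency: emission minus ingestion; isolates time spent inside the SUT.
  def processingTimeLatency(r: ObservedResult): Long = r.emissionTime - r.ingestionTime

  // Simple percentile over a batch of observed latencies (e.g. p = 0.99).
  def percentile(latencies: Seq[Long], p: Double): Long = {
    val sorted = latencies.sorted
    sorted(math.min(sorted.size - 1, (p * sorted.size).toInt))
  }
}
```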
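Similarly, a sustainable-throughput check can be sketched as a predicate over the depth of the queue between the data generator and the SUT. The sampling approach and thresholds below are assumptions made for illustration, not the paper's procedure.

```scala
object SustainableThroughput {
  // A load level is treated as sustainable if the queue feeding the SUT does not stay
  // backed up: short spikes (transient backpressure) are tolerated, prolonged growth is not.
  def isSustainable(queueDepthSamples: Seq[Long],
                    maxToleratedDepth: Long,
                    toleratedSpikeFraction: Double = 0.05): Boolean = {
    val overloaded = queueDepthSamples.count(_ > maxToleratedDepth)
    overloaded.toDouble / queueDepthSamples.size <= toleratedSpikeFraction
  }
  // The benchmark would then report the highest generator rate for which
  // isSustainable(...) still holds as the system's sustainable throughput.
}
```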
Experimental Evaluation
The benchmarking framework is evaluated using two types of queries based on online video gaming scenarios: windowed aggregations and windowed joins. The results provide insight into system-specific characteristics:
- Throughput and Latency: The experiments show that Flink generally achieves higher throughput and lower latency than Storm and Spark. Flink's non-blocking operators and tuple-at-a-time processing model appear to drive its advantage in aggregation workloads (a windowing sketch follows this list).
- Data Skew Handling: Spark handles skewed data robustly, outperforming the other systems thanks to its tree reduce and tree aggregate communication patterns, which reduce network bottlenecks (illustrated after this list).
- Window Size Influence: As the window grows, Spark's throughput drops markedly because of memory pressure; this can be mitigated with the inverse reduce function, which incrementally subtracts data leaving the window instead of recomputing the whole window (see the sketch after this list).
- System Responsiveness to Fluctuations: Flink demonstrates better stability than Spark when subjected to fluctuating workloads, attributed to its efficient backpressure mechanism.
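To make the windowed-aggregation workload concrete, here is a minimal Flink sketch of an event-time tumbling-window aggregation. The `AdEvent` type, its field names, the window size, and the in-memory source are illustrative assumptions rather than the paper's actual queries, which are fed from an external driver.

```scala
import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical event type; in the benchmark the events come from the external driver.
case class AdEvent(userId: Long, price: Double, eventTime: Long)

object FlinkWindowedAggregation {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Small in-memory collection to keep the sketch self-contained.
    val events: DataStream[AdEvent] = env.fromElements(
      AdEvent(1L, 2.0, 1000L), AdEvent(1L, 3.0, 2500L), AdEvent(2L, 1.5, 3000L))

    // Assign event-time timestamps and bounded-out-of-orderness watermarks.
    val withTimestamps = events.assignTimestampsAndWatermarks(
      WatermarkStrategy
        .forBoundedOutOfOrderness[AdEvent](Duration.ofSeconds(5))
        .withTimestampAssigner(new SerializableTimestampAssigner[AdEvent] {
          override def extractTimestamp(e: AdEvent, recordTs: Long): Long = e.eventTime
        }))

    // Tumbling event-time window summing prices per user; Flink evaluates this
    // incrementally, tuple at a time, without blocking on whole-window materialization.
    withTimestamps
      .keyBy(_.userId)
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .sum("price")
      .print()

    env.execute("windowed-aggregation-sketch")
  }
}
```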
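The tree-shaped communication pattern mentioned in the data-skew bullet can be illustrated with Spark's `treeAggregate`, which merges partial results level by level instead of funnelling every partition's output to a single point. This is a generic sketch, not the benchmark's query implementation.

```scala
import org.apache.spark.sql.SparkSession

object TreeAggregateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tree-aggregate-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Arbitrary numeric data spread over several partitions.
    val prices = sc.parallelize(Seq.fill(100000)(1.0) ++ Seq(5.0, 7.5), numSlices = 8)

    // treeAggregate combines partial results in a multi-level tree rather than sending
    // every partition's partial aggregate straight to one node, avoiding a single hotspot.
    val total = prices.treeAggregate(0.0)(
      (acc, v) => acc + v, // merge a value into the partition-local accumulator
      (a, b) => a + b,     // merge partial accumulators, level by level
      depth = 3)

    println(s"sum = $total")
    spark.stop()
  }
}
```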
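The inverse reduce function referenced in the window-size bullet corresponds to Spark Streaming's `reduceByKeyAndWindow` variant that takes an "inverse" function. The sketch below shows the pattern under assumed input (a socket source of userId,revenue lines and an example checkpoint path), not the paper's actual workload.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object InverseReduceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("inverse-reduce-sketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Incremental window maintenance requires checkpointing; example path only.
    ssc.checkpoint("/tmp/inverse-reduce-checkpoint")

    // Assumed input: lines of the form "userId,revenue", parsed into pairs.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))

    // With the inverse function, Spark updates the previous window's result by adding
    // the slice that entered and subtracting the slice that left, instead of
    // recomputing (and retaining) the entire window's contents.
    val windowedRevenue = pairs.reduceByKeyAndWindow(
      (a: Double, b: Double) => a + b, // reduce: data entering the window
      (a: Double, b: Double) => a - b, // inverse reduce: data leaving the window
      Seconds(60),                     // window length
      Seconds(10))                     // slide interval

    windowedRevenue.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```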
Implications and Future Work
The findings highlight the importance of choosing an SDPS based on workload characteristics such as data skew, window size, and throughput requirements. The research also underlines the need to explore trade-offs between SDPS functionality, such as exactly-once processing guarantees and handling of late-arriving data, and the throughput and latency that can be sustained alongside it.
The authors propose an extensible benchmarking strategy applicable across SDPS architectures, paving the way for future work that covers a broader set of streaming engines such as Apache Samza and Heron. They also suggest developing a generic interface to streamline the benchmarking process and to examine in depth the trade-offs between system functionality and performance.
By providing a robust mechanism for accurately measuring performance metrics, the paper serves as a practical guide for researchers and practitioners seeking to understand the behavior and limitations of contemporary SDPSs.