SH-Bench: Distributed Stream Shuffling Benchmark
- SH-Bench is an open-source benchmark that isolates shuffling and routing challenges in distributed stream processing, targeting state-local aggregations.
- It employs a configurable MapReduce-style dataflow to precisely measure throughput, latency, and scalability across frameworks like Flink, Kafka Streams, Hazelcast, and Spark.
- Experimental results highlight distinctive trade-offs, with Flink excelling in throughput, Hazelcast offering lower latency, and Kafka Streams balancing performance metrics.
SH-Bench (ShuffleBench) is an open-source, domain-independent benchmark designed to evaluate and compare distributed stream processing frameworks under workloads where the primary function is large-scale data shuffling—specifically, the re-partitioning of data streams to support state-local aggregations across numerous independent “black-box” consumers. The benchmark emphasizes shuffling and routing, as opposed to domain-specific logic, and derives its design from the requirements of modern cloud observability platforms that must route massive volumes of high-velocity events to many concurrent stateful queries, each consuming a small relevant subset of the data. SH-Bench provides a configurable, reproducible method for assessing latency, throughput, and scalability of streaming systems, thus addressing a gap left by prior benchmarks that model more complex application logic or vertical-specific pipelines (Henning et al., 7 Mar 2024).
1. Motivation and Targeted Use Cases
SH-Bench was motivated by the operational needs seen in large cloud observability infrastructures, where millions of real-time queries, anomaly detectors, or alerts each maintain state and require timely delivery of matching events. In these settings, the dominant challenge is not the implementation of complex data analytics, but rather the efficient, reliable, and elastic redistribution of stream records to the appropriate state-local consumers. Existing benchmarks tend to focus on tightly-coupled analytics or domain-specific topologies, and do not adequately stress-test the essential shuffling abstraction in isolation.
By providing a minimalist, highly configurable workload centered on routing (flatMap/matcher), re-partitioning (groupBy/keyBy), and per-key aggregation with minimal “black-box” logic, SH-Bench isolates the central scaling and coordination bottlenecks of modern distributed stream processing, making it applicable to a wide range of industrial and academic settings beyond observability.
2. Benchmark Architecture and Workload Specification
SH-Bench models its workload as a continuous MapReduce-style dataflow, realized as follows:
- Source ingestion: Records are ingested from a partitioned Kafka topic.
- Matching: Each input record passes through a matcher (flatMap), which applies a user-configurable ruleset and may emit zero or more duplicates of the record, each tagged with a specific query key.
- Shuffling: Records are re-partitioned by key using groupBy/keyBy, ensuring all events for a given query are routed to the same processing instance.
- Stateful aggregation: A consumer function maintains configurable per-query state (e.g., counters, timestamps, checksums) and can emit alert events upon meeting application-defined conditions.
- Output: Processed records are emitted to a Kafka output topic.
A separate distributed load generator produces records of configurable size and rate, writing to the Kafka input. The pipeline and all configuration parameters are designed for replication across multiple application instances (Kubernetes pods), providing systematic control over variables such as parallelism, number of query keys, record sizes, state cardinality, and selectivity distributions.
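To make the load-generation step concrete, the following is a minimal Java sketch of a rate-controlled Kafka producer writing fixed-size records to the input topic; the topic name, rate, and record size are illustrative assumptions rather than ShuffleBench's actual generator code.

```java
import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Minimal sketch of a rate-controlled load generator writing fixed-size
// records to the benchmark's Kafka input topic (names/values illustrative).
public class LoadGeneratorSketch {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        int recordSizeBytes = 1024;      // ~1 KB records, as in the reference setup
        long recordsPerSecond = 100_000; // target generation rate (illustrative)
        byte[] payload = new byte[recordSizeBytes];
        new Random().nextBytes(payload);

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            while (true) {
                long windowStart = System.nanoTime();
                for (long i = 0; i < recordsPerSecond; i++) {
                    producer.send(new ProducerRecord<>("shufflebench-input", payload));
                }
                // Sleep for the remainder of the one-second window, if any.
                long elapsedMs = (System.nanoTime() - windowStart) / 1_000_000;
                if (elapsedMs < 1000) {
                    Thread.sleep(1000 - elapsedMs);
                }
            }
        }
    }
}
```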
Textual schematic of a single pipeline instance:
Kafka(input) → flatMap(matcher) → groupBy(key) → aggregate(consumer, state) → Kafka(output)
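The two pieces of "black-box" logic in this schematic can be pictured with the following plain-Java sketch; the interface names and signatures are assumptions for illustration, not the actual shared-library API.

```java
import java.util.List;

// Illustrative interfaces for the benchmark's black-box logic; names and
// signatures are assumptions, not the actual ShuffleBench library API.
interface Matcher {
    // Returns the keys of all queries whose rules match the record
    // (possibly empty), so the record is duplicated once per matching key.
    List<String> matchingKeys(byte[] record);
}

interface StatefulConsumer {
    // Folds a matched record into the per-key state (e.g., counter,
    // latest timestamp, running checksum) and returns an output record
    // to emit, or null if nothing should be emitted yet.
    byte[] accept(String key, byte[] record, ConsumerState state);
}

// Small mutable per-key state as described in the workload specification.
class ConsumerState {
    long count;
    long lastTimestampMs;
    long checksum;
}
```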
3. Metrics and Evaluation Methodology
SH-Bench evaluates frameworks along three primary axes: throughput, latency, and scalability, using definitions informed by performance engineering best practices.
- Throughput: Measured in records/sec. Ad-hoc throughput is the observed rate at which records are read from Kafka under a fixed generation rate. Sustainable throughput is the maximum input rate at which the system maintains a stable backlog (i.e., the input offset delta does not trend upward over time); it is determined by progressively increasing the generator rate until Kafka consumption lags (see the lag-trend sketch after this list).
- End-to-End Latency: Calculated as t_out − t_in, where t_in is the record's input timestamp and t_out is its timestamp upon egress at the sink. Histograms and percentiles are reported.
- Scalability: The benchmark employs the two-dimensional search from Theodolite to quantify both resource demand for a given load and maximum load under fixed resources, producing standard scaling curves.
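The sustainability criterion from the throughput definition above (a backlog that does not trend upward) can be checked with a simple least-squares trend test on sampled consumer lag, as in the following sketch; the sampling format and threshold are illustrative assumptions, not SH-Bench's exact procedure.

```java
import java.util.List;

// Sketch of the lag-trend check behind the sustainable-throughput metric:
// a generation rate is considered sustainable if the consumer lag on the
// Kafka input topic does not trend upward over the measurement window.
public class LagTrendCheck {

    /** Least-squares slope of lag (records) over time (seconds). */
    static double lagSlope(List<double[]> samples) { // each sample: {timeSec, lag}
        int n = samples.size();
        double sumT = 0, sumL = 0, sumTT = 0, sumTL = 0;
        for (double[] s : samples) {
            sumT += s[0];
            sumL += s[1];
            sumTT += s[0] * s[0];
            sumTL += s[0] * s[1];
        }
        return (n * sumTL - sumT * sumL) / (n * sumTT - sumT * sumT);
    }

    /** Sustainable if lag grows more slowly than the threshold (records/s). */
    static boolean isSustainable(List<double[]> samples, double maxSlope) {
        return lagSlope(samples) <= maxSlope;
    }
}
```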
Configurable parameters are grouped as follows:
| Domain Parameters | System Parameters |
|---|---|
| Record rate, size | Application instances (pods) |
| Number of consumers | Cores/memory per instance |
| Selectivity | Kafka partitions |
| State size | Framework-specific tuning |
| Consumer output frequency | |
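For concreteness, these parameter groups could be represented as a configuration object along the lines of the sketch below; the field names are hypothetical and the example values loosely follow the reference setup of Section 5, not an actual configuration schema.

```java
// Illustrative grouping of the benchmark's configuration parameters;
// field names and example values are assumptions, not the real schema.
record DomainParameters(
        long recordsPerSecond,   // record generation rate
        int recordSizeBytes,     // e.g., 1024 (1 KB)
        long numConsumers,       // query keys, e.g., 1_000_000
        double totalSelectivity, // e.g., 0.20 (20%)
        int stateSizeBytes,      // per-key state footprint
        int outputEveryNRecords  // consumer output frequency, e.g., 10
) {}

record SystemParameters(
        int applicationInstances,  // Kubernetes pods, e.g., 9
        double coresPerInstance,
        int memoryMbPerInstance,
        int kafkaPartitions,
        java.util.Map<String, String> frameworkTuning // framework-specific knobs
) {}
```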
4. Implementation and Framework Integration
ShuffleBench is implemented as open-source software and utilizes the following components:
- Kubernetes for orchestration and resource management.
- Theodolite for declarative benchmark specification, deployment automation, and result collection.
- A shared Java library implements matcher and consumer logic, ensuring functional equivalence across framework adapters.
- Framework adapters for:
- Apache Flink (DataStream API)
- Hazelcast Jet (pipeline API)
- Kafka Streams (Processor API)
- Spark Structured Streaming (Scala/Java API)
- A distributed load generator (Java) and a latency exporter that aggregates metrics from the Kafka output topics.
This architecture supports fair comparisons, as all frameworks execute identical operator logic and state handling.
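As an illustration of how an adapter maps this shared logic onto a framework API, the following is a condensed Flink DataStream sketch of the dataflow from Section 2; the topic names, record types, and the simple counting consumer are assumptions for illustration rather than the adapter's actual implementation.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Minimal Flink DataStream sketch of the ShuffleBench-style dataflow:
// Kafka -> flatMap (matcher) -> keyBy -> stateful aggregation -> sink.
public class FlinkAdapterSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("shufflebench-input") // illustrative topic name
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> records =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "input");

        records
            // Matcher: emit one (queryKey, record) pair per matching rule.
            .flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
                @Override
                public void flatMap(String record, Collector<Tuple2<String, String>> out) {
                    for (String key : matchingKeys(record)) {
                        out.collect(Tuple2.of(key, record));
                    }
                }
            })
            // Shuffle: all records of one query key go to the same instance.
            .keyBy(t -> t.f0)
            // Stateful consumer: count matched records per key, emit every 10th.
            .process(new KeyedProcessFunction<String, Tuple2<String, String>, String>() {
                private transient ValueState<Long> count;

                @Override
                public void open(Configuration parameters) {
                    count = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("count", Long.class));
                }

                @Override
                public void processElement(Tuple2<String, String> value, Context ctx,
                                           Collector<String> out) throws Exception {
                    long c = count.value() == null ? 1 : count.value() + 1;
                    count.update(c);
                    if (c % 10 == 0) { // consumer output frequency
                        out.collect(ctx.getCurrentKey() + ":" + c);
                    }
                }
            })
            .print(); // stand-in for the Kafka output sink

        env.execute("shufflebench-flink-sketch");
    }

    // Hypothetical matcher stub; the real rules are user-configurable.
    private static java.util.List<String> matchingKeys(String record) {
        return java.util.List.of();
    }
}
```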
5. Experimental Results and Comparative Analysis
A reference evaluation was conducted on AWS EKS with a default configuration of 9 application instances (parallelism = 9), 1 million query keys, 20% total selectivity, 1 KB records, and a consumer output frequency of one emission per 10 records.
Key empirical findings include:
- Sustainable throughput: Flink (~950k rec/s) > Kafka Streams (~650k rec/s) > Hazelcast (~200k rec/s). Spark achieves ~270k rec/s only with large micro-batches (≥1M records), incurring significant latency.
- Latency (95th percentile, 90k rec/s): Hazelcast (≈8 ms) < Flink (≈88 ms) < Kafka Streams (≈183 ms) << Spark (>10 s).
- Impact of Configuration:
- Increasing per-pod cores marginally benefits Flink, Hazelcast, Kafka Streams; Spark performance degrades.
- Smaller record sizes increase throughput, particularly for Hazelcast.
- Fewer consumers yield higher throughput in Flink/Kafka Streams, not Hazelcast.
- Selectivity variation affects Flink/Kafka Streams throughput but not Hazelcast; latency non-monotonic due to batching artifacts.
Ad-hoc throughput was found to overestimate sustainable throughput (by up to 20% for Flink and Kafka Streams), highlighting the necessity of sustainable metrics for production relevance.
6. Limitations and Planned Extensions
Current limitations of SH-Bench include its focus on throughput and latency, absence of explicit reliability/fault-tolerance metrics, and evaluation within a single cloud provider environment. Planned extensions encompass:
- Benchmarking of recovery time and state consistency under failure scenarios.
- Exhaustive exploration of framework-specific tuning parameters (e.g., Flink buffer sizes, Kafka Streams commit intervals, Spark continuous processing modes).
- Cross-cloud and bare-metal studies for robust generalization.
- Introduction of skewed/variable selectivity and record sizes to simulate realistic hotspotting and consumer diversity.
7. Significance and Applications
SH-Bench enables rigorous, reproducible comparison of distributed stream processing frameworks—facilitating both industrial selection for high-scale routing requirements and academic research into novel shuffling and aggregation strategies. Empirical data demonstrates that while Flink offers superior raw throughput, Hazelcast provides the lowest end-to-end latency, and Kafka Streams occupies a middle ground; Spark, operating on large micro-batches, incurs much higher latency for comparable throughput (Henning et al., 7 Mar 2024). This evidence directly supports both engineering decisions in production deployments and future research into architectural and algorithmic optimizations for scalable, low-latency stream processing.