SH-Bench: Distributed Stream Shuffling Benchmark
- SH-Bench is an open-source benchmark that isolates shuffling and routing challenges in distributed stream processing, targeting state-local aggregations.
- It employs a configurable MapReduce-style dataflow to precisely measure throughput, latency, and scalability across frameworks like Flink, Kafka Streams, Hazelcast, and Spark.
- Experimental results highlight distinctive trade-offs, with Flink excelling in throughput, Hazelcast offering lower latency, and Kafka Streams balancing performance metrics.
SH-Bench (ShuffleBench) is an open-source, domain-independent benchmark designed to evaluate and compare distributed stream processing frameworks under workloads where the primary function is large-scale data shuffling—specifically, the re-partitioning of data streams to support state-local aggregations across numerous independent “black-box” consumers. The benchmark emphasizes shuffling and routing, as opposed to domain-specific logic, and derives its design from the requirements of modern cloud observability platforms that must route massive volumes of high-velocity events to many concurrent stateful queries, each consuming a small relevant subset of the data. SH-Bench provides a configurable, reproducible method for assessing latency, throughput, and scalability of streaming systems, thus addressing a gap left by prior benchmarks that model more complex application logic or vertical-specific pipelines (Henning et al., 7 Mar 2024).
1. Motivation and Targeted Use Cases
SH-Bench was motivated by the operational needs seen in large cloud observability infrastructures, where millions of real-time queries, anomaly detectors, or alerts each maintain state and require timely delivery of matching events. In these settings, the dominant challenge is not the implementation of complex data analytics, but rather the efficient, reliable, and elastic redistribution of stream records to the appropriate state-local consumers. Existing benchmarks tend to focus on tightly-coupled analytics or domain-specific topologies, and do not adequately stress-test the essential shuffling abstraction in isolation.
By providing a minimalist, highly configurable workload centered on routing (flatMap/matcher), re-partitioning (groupBy/keyBy), and per-key aggregation with minimal “black-box” logic, SH-Bench isolates the central scaling and coordination bottlenecks of modern distributed stream processing, making it applicable to a wide range of industrial and academic settings beyond observability.
2. Benchmark Architecture and Workload Specification
SH-Bench models its workload as a continuous MapReduce-style dataflow, realized as follows:
- Source ingestion: Records are ingested from a partitioned Kafka topic.
- Matching: Each input record passes through a matcher (flatMap), which applies a user-configurable ruleset and may emit zero or more duplicates of the record, each tagged with a specific query key.
- Shuffling: Records are re-partitioned by key using groupBy/keyBy, ensuring all events for a given query are routed to the same processing instance.
- Stateful aggregation: A consumer function maintains configurable per-query state (e.g., counters, timestamps, checksums) and can emit alert events upon meeting application-defined conditions.
- Output: Processed records are emitted to a Kafka output topic.
A separate distributed load generator produces records of configurable size and rate, writing to the Kafka input. The pipeline and all configuration parameters are designed for replication across multiple application instances (Kubernetes pods), providing systematic control over variables such as parallelism, number of query keys, record sizes, state cardinality, and selectivity distributions.
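To make the load-generation step concrete, the following is a minimal Java sketch of a rate-controlled Kafka producer writing fixed-size records to the input topic; the topic name, rate, and record size are illustrative assumptions rather than ShuffleBench's actual generator code.

```java
import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Minimal sketch of a rate-controlled load generator writing fixed-size
// records to the benchmark's Kafka input topic (names/values illustrative).
public class LoadGeneratorSketch {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        int recordSizeBytes = 1024;      // ~1 KB records, as in the reference setup
        long recordsPerSecond = 100_000; // target generation rate (illustrative)
        byte[] payload = new byte[recordSizeBytes];
        new Random().nextBytes(payload);

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            while (true) {
                long windowStart = System.nanoTime();
                for (long i = 0; i < recordsPerSecond; i++) {
                    producer.send(new ProducerRecord<>("shufflebench-input", payload));
                }
                // Sleep for the remainder of the one-second window, if any.
                long elapsedMs = (System.nanoTime() - windowStart) / 1_000_000;
                if (elapsedMs < 1000) {
                    Thread.sleep(1000 - elapsedMs);
                }
            }
        }
    }
}
```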
Textual schematic of a single pipeline instance:
Kafka(input) → flatMap(matcher) → groupBy(key) → aggregate(consumer, state) → Kafka(output)
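The two pieces of "black-box" logic in this schematic can be pictured with the following plain-Java sketch; the interface names and signatures are assumptions for illustration, not the actual shared-library API.

```java
import java.util.List;

// Illustrative interfaces for the benchmark's black-box logic; names and
// signatures are assumptions, not the actual ShuffleBench library API.
interface Matcher {
    // Returns the keys of all queries whose rules match the record
    // (possibly empty), so the record is duplicated once per matching key.
    List<String> matchingKeys(byte[] record);
}

interface StatefulConsumer {
    // Folds a matched record into the per-key state (e.g., counter,
    // latest timestamp, running checksum) and returns an output record
    // to emit, or null if nothing should be emitted yet.
    byte[] accept(String key, byte[] record, ConsumerState state);
}

// Small mutable per-key state as described in the workload specification.
class ConsumerState {
    long count;
    long lastTimestampMs;
    long checksum;
}
```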
3. Metrics and Evaluation Methodology
SH-Bench evaluates frameworks along three primary axes: throughput, latency, and scalability, using definitions informed by performance engineering best practices.
- Throughput: Measured in records/sec. Ad-hoc throughput is the observed rate at which records are read from Kafka under a fixed generation rate. Sustainable throughput is the maximum input rate at which the system maintains a stable backlog (i.e., the input offset delta does not trend upward over time); it is determined by progressively increasing the generator rate until Kafka consumption lags (see the lag-trend sketch after this list).
- End-to-End Latency: Calculated as t_out − t_in, where t_in is the record's input timestamp and t_out is its timestamp upon egress at the sink. Histograms and percentiles are reported.
- Scalability: The benchmark employs the two-dimensional search from Theodolite to quantify both resource demand for a given load and maximum load under fixed resources, producing standard scaling curves.
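The sustainability criterion from the throughput definition above (a backlog that does not trend upward) can be checked with a simple least-squares trend test on sampled consumer lag, as in the following sketch; the sampling format and threshold are illustrative assumptions, not SH-Bench's exact procedure.

```java
import java.util.List;

// Sketch of the lag-trend check behind the sustainable-throughput metric:
// a generation rate is considered sustainable if the consumer lag on the
// Kafka input topic does not trend upward over the measurement window.
public class LagTrendCheck {

    /** Least-squares slope of lag (records) over time (seconds). */
    static double lagSlope(List<double[]> samples) { // each sample: {timeSec, lag}
        int n = samples.size();
        double sumT = 0, sumL = 0, sumTT = 0, sumTL = 0;
        for (double[] s : samples) {
            sumT += s[0];
            sumL += s[1];
            sumTT += s[0] * s[0];
            sumTL += s[0] * s[1];
        }
        return (n * sumTL - sumT * sumL) / (n * sumTT - sumT * sumT);
    }

    /** Sustainable if lag grows more slowly than the threshold (records/s). */
    static boolean isSustainable(List<double[]> samples, double maxSlope) {
        return lagSlope(samples) <= maxSlope;
    }
}
```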
Configurable parameters are grouped as follows:
| Domain Parameters | System Parameters |
|---|---|
| Record rate, size | Application instances (pods) |
| Number of consumers | Cores/memory per instance |
| Selectivity | Kafka partitions |
| State size | Framework-specific tuning |
| Consumer output frequency | |
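For concreteness, these parameter groups could be represented as a configuration object along the lines of the sketch below; the field names are hypothetical and the example values loosely follow the reference setup of Section 5, not an actual configuration schema.

```java
// Illustrative grouping of the benchmark's configuration parameters;
// field names and example values are assumptions, not the real schema.
record DomainParameters(
        long recordsPerSecond,   // record generation rate
        int recordSizeBytes,     // e.g., 1024 (1 KB)
        long numConsumers,       // query keys, e.g., 1_000_000
        double totalSelectivity, // e.g., 0.20 (20%)
        int stateSizeBytes,      // per-key state footprint
        int outputEveryNRecords  // consumer output frequency, e.g., 10
) {}

record SystemParameters(
        int applicationInstances,  // Kubernetes pods, e.g., 9
        double coresPerInstance,
        int memoryMbPerInstance,
        int kafkaPartitions,
        java.util.Map<String, String> frameworkTuning // framework-specific knobs
) {}
```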
4. Implementation and Framework Integration
ShuffleBench is implemented as open-source software and utilizes the following components:
- Kubernetes for orchestration and resource management.
- Theodolite for declarative benchmark specification, deployment automation, and result collection.
- A shared Java library implements matcher and consumer logic, ensuring functional equivalence across framework adapters.
- Framework adapters for:
- Apache Flink (DataStream API)
- Hazelcast Jet (pipeline API)
- Kafka Streams (Processor API)
- Spark Structured Streaming (Scala/Java API)
- A distributed load generator (Java) and a latency exporter that aggregates metrics from the Kafka output topics.
This architecture supports fair comparisons, as all frameworks execute identical operator logic and state handling.
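As an illustration of how an adapter maps this shared logic onto a framework API, the following is a condensed Flink DataStream sketch of the dataflow from Section 2; the topic names, record types, and the simple counting consumer are assumptions for illustration rather than the adapter's actual implementation.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Minimal Flink DataStream sketch of the ShuffleBench-style dataflow:
// Kafka -> flatMap (matcher) -> keyBy -> stateful aggregation -> sink.
public class FlinkAdapterSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("shufflebench-input") // illustrative topic name
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> records =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "input");

        records
            // Matcher: emit one (queryKey, record) pair per matching rule.
            .flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
                @Override
                public void flatMap(String record, Collector<Tuple2<String, String>> out) {
                    for (String key : matchingKeys(record)) {
                        out.collect(Tuple2.of(key, record));
                    }
                }
            })
            // Shuffle: all records of one query key go to the same instance.
            .keyBy(t -> t.f0)
            // Stateful consumer: count matched records per key, emit every 10th.
            .process(new KeyedProcessFunction<String, Tuple2<String, String>, String>() {
                private transient ValueState<Long> count;

                @Override
                public void open(Configuration parameters) {
                    count = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("count", Long.class));
                }

                @Override
                public void processElement(Tuple2<String, String> value, Context ctx,
                                           Collector<String> out) throws Exception {
                    long c = count.value() == null ? 1 : count.value() + 1;
                    count.update(c);
                    if (c % 10 == 0) { // consumer output frequency
                        out.collect(ctx.getCurrentKey() + ":" + c);
                    }
                }
            })
            .print(); // stand-in for the Kafka output sink

        env.execute("shufflebench-flink-sketch");
    }

    // Hypothetical matcher stub; the real rules are user-configurable.
    private static java.util.List<String> matchingKeys(String record) {
        return java.util.List.of();
    }
}
```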
5. Experimental Results and Comparative Analysis
A reference evaluation was conducted on AWS EKS with a default configuration of 9 application instances (parallelism = 9), 1 million query keys, 20% total selectivity, 1 KB records, and a consumer output frequency of one emission per 10 records.
Key empirical findings include:
- Sustainable throughput: Flink (~950k rec/s) > Kafka Streams (~650k rec/s) > Hazelcast (~200k rec/s). Spark achieves ~270k rec/s only with large micro-batches (≥1M records), incurring significant latency.
- Latency (95th percentile, 90k rec/s): Hazelcast (≈8 ms) < Flink (≈88 ms) < Kafka Streams (≈183 ms) << Spark (>10 s).
- Impact of Configuration:
- Increasing per-pod cores marginally benefits Flink, Hazelcast, Kafka Streams; Spark performance degrades.
- Smaller record sizes increase throughput, particularly for Hazelcast.
- Fewer consumers yield higher throughput in Flink/Kafka Streams, not Hazelcast.
- Selectivity variation affects Flink/Kafka Streams throughput but not Hazelcast; latency non-monotonic due to batching artifacts.
Ad-hoc throughput was found to overestimate sustainable throughput (by up to 20% for Flink and Kafka Streams), highlighting the necessity of sustainable metrics for production relevance.
6. Limitations and Planned Extensions
Current limitations of SH-Bench include its focus on throughput and latency, absence of explicit reliability/fault-tolerance metrics, and evaluation within a single cloud provider environment. Planned extensions encompass:
- Benchmarking of recovery time and state consistency under failure scenarios.
- Exhaustive exploration of framework-specific tuning parameters (e.g., Flink buffer sizes, Kafka Streams commit intervals, Spark continuous processing modes).
- Cross-cloud and bare-metal studies for robust generalization.
- Introduction of skewed/variable selectivity and record sizes to simulate realistic hotspotting and consumer diversity.
7. Significance and Applications
SH-Bench enables rigorous, reproducible comparison of distributed stream processing frameworks—facilitating both industrial selection for high-scale routing requirements and academic research into novel shuffling and aggregation strategies. Empirical data demonstrates that while Flink offers superior raw throughput, Hazelcast provides the lowest end-to-end latency, and Kafka Streams occupies a middle ground; Spark, operating on large micro-batches, incurs much higher latency for comparable throughput (Henning et al., 7 Mar 2024). This evidence directly supports both engineering decisions in production deployments and future research into architectural and algorithmic optimizations for scalable, low-latency stream processing.