CodedTeraSort: Optimizing Distributed Sorting
- CodedTeraSort is a distributed sorting algorithm that utilizes structured coding and redundancy to trade increased computation for reduced communication during the shuffle phase.
- It employs CDC frameworks, structured file placement, and coded multicasting to achieve up to 4.7× speedup over vanilla TeraSort in large-scale clusters.
- The design integrates combinatorial techniques that optimize subpacketization and multicast group management, making it scalable for real-world MapReduce implementations.
CodedTeraSort is a distributed sorting algorithm that augments the canonical TeraSort benchmark by incorporating structured coding and redundancy techniques to optimally trade increased computation for dramatically reduced communication during the shuffle phase. Built on the Coded Distributed Computing (CDC) framework, CodedTeraSort leverages controlled replication of input data, coded multicasting, and in-network coding to achieve significant speedups (empirically up to 4.7× over vanilla TeraSort) on practical clusters. The scheme is theoretically grounded in an explicit computation-communication tradeoff, achieving near-optimal reduction in shuffle traffic while controlling subpacketization and system complexity through recent advances involving combinatorial designs.
1. System Model and Core Principles
CodedTeraSort operates within a distributed MapReduce architecture targeting the global sorting of $N$ key–value (KV) pairs using $K$ worker nodes. The sort is realized in three pipeline stages: Map, Shuffle, and Reduce. In TeraSort, the Shuffle phase is frequently the dominant bottleneck, consuming the large majority of total job time due to heavy inter-node data exchanges.
To alleviate this, CodedTeraSort introduces a novel coding-based mechanism grounded in the CDC framework (Li et al., 2016). The fundamental insight is that by increasing the computation load $r$, defined as the average number of times each input record is mapped across the system, one can construct multicast opportunities and reduce the normalized communication load $L$. The optimal theoretical tradeoff is established as

$$L^*(r) = \frac{1}{r}\left(1 - \frac{r}{K}\right), \qquad r \in \{1, \dots, K\},$$

with local computation and coded multicasts orchestrated to achieve this bound.
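As a quick sanity check, the CDC tradeoff $L^*(r) = \frac{1}{r}\left(1 - \frac{r}{K}\right)$ can be tabulated for a small cluster (a minimal Python sketch; the cluster size $K = 16$ is purely illustrative):

```python
def cdc_load(K: int, r: int) -> float:
    """Optimal normalized communication load L*(r) = (1/r) * (1 - r/K)."""
    return (1.0 / r) * (1.0 - r / K)

# Illustrative K = 16: increasing r sharply reduces the shuffle load.
for r in (1, 2, 4, 8):
    print(f"r = {r}: L* = {cdc_load(16, r):.4f}")
```

Note that the reduction is slightly better than $1/r$ because the $\left(1 - \frac{r}{K}\right)$ factor also shrinks as $r$ grows.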
2. Algorithmic Workflow and Coding Mechanisms
The CodedTeraSort algorithm consists of six primary stages:
- Structured File Placement: The input is divided into $\binom{K}{r}$ batches (or, when using resolvable designs, a number of batches equal to the number of blocks of the design), each assigned to exactly $r$ workers. In the classic CDC approach, this assignment is indexed by the $r$-subsets of $\{1, \dots, K\}$; in resolvable design-based variants, the batches correspond to blocks of the design (Li et al., 2017; Konstantinidis et al., 2018).
- Map Phase: Each worker maps all locally held records by partitioning KV pairs into buckets keyed by range partitioning. For each batch, only the intermediate buckets needed locally or by remote workers are retained.
- Encoding of Coded Multicast Packets: For every $(r+1)$-subset of workers, each member constructs an XOR of intermediate packets, designed so that every recipient, using local side information, can decode its required packet.
- Multicast Shuffle: Within each multicast group, every node broadcasts its coded packet to the other group members (typically via MPI_Bcast or custom multicast primitives).
- Decoding: Upon receipt of coded packets, each node XORs out the locally known components to isolate and recover the one necessary intermediate.
- Reduce Phase: With all intermediates for their partition now available, workers locally sort and output the final segment of the global sorted list.
This process requires careful group formation and indexing of batches, either via enumeration of $r$- and $(r+1)$-subsets or using the structure imposed by resolvable designs to ensure balanced subpacketization (Konstantinidis et al., 2018).
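The six stages can be exercised end to end on a toy instance. The sketch below (a hypothetical instance with $K = 3$, $r = 2$, using random byte strings in place of mapped KV segments) shows the XOR encode/decode step for the single multicast group $\{0, 1, 2\}$:

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

SEG = 8  # bytes per segment half (illustrative)

# p[d] is the intermediate segment needed by worker d's reducer; with r = 2
# it was computed by both workers that hold the corresponding input batch.
p = {d: os.urandom(2 * SEG) for d in range(3)}
half = {d: (p[d][:SEG], p[d][SEG:]) for d in range(3)}

# Side information: worker w holds every segment except the one it needs.
known = {w: {d: p[d] for d in range(3) if d != w} for w in range(3)}

# Multicast encoding for the (r+1)-subset {0, 1, 2}: each worker XORs one
# half of each of the two segments the other group members are missing.
coded = {
    0: xor(half[1][0], half[2][0]),
    1: xor(half[0][0], half[2][1]),
    2: xor(half[0][1], half[1][1]),
}

# Decoding at worker 0: cancel locally known halves to recover p[0].
recovered = (xor(coded[1], known[0][2][SEG:]) +
             xor(coded[2], known[0][1][SEG:]))
assert recovered == p[0]
```

Each coded transmission is half the size of an uncoded segment, so the three multicasts move $3 \cdot \text{SEG}$ bytes where uncoded unicast would move $6 \cdot \text{SEG}$, matching the promised $r = 2$ reduction.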
3. Communication–Computation Tradeoff and Theoretical Analysis
Let $L(r)$ denote the normalized communication load with $r$-fold uncoded redundancy; uncoded shuffling incurs $L_{\text{uncoded}}(r) = 1 - \frac{r}{K}$. Implementing CDC-based coded multicasting achieves

$$L_{\text{coded}}(r) = \frac{1}{r}\left(1 - \frac{r}{K}\right).$$

This is information-theoretically optimal for the class of schemes considered (Li et al., 2016, Theorem 1).
The Map phase incurs an $r$-fold computation cost, but yields an $r$-fold reduction in Shuffle traffic, which, when the network is the bottleneck, results in substantial total wall-time speedup.
In design-based variants, the combinatorial structure is exploited to dramatically reduce the required number of subfiles to a quantity polynomial in $K$ rather than the $\binom{K}{r}$ of classical CDC (which becomes prohibitive as $K$ increases). This construction also keeps the number of multicast groups (and associated communicator splits) polynomial in $K$ (Konstantinidis et al., 2018).
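The combinatorial blow-up that motivates the design-based variants is easy to see numerically (a small sketch; the parameter choices are illustrative):

```python
from math import comb

def classic_counts(K: int, r: int) -> tuple[int, int]:
    """Classical CDC: batches are indexed by r-subsets of the K workers,
    multicast groups by (r+1)-subsets."""
    return comb(K, r), comb(K, r + 1)

# Subfile and group counts explode combinatorially as K grows at fixed r.
for K in (16, 32, 64):
    subfiles, groups = classic_counts(K, 4)
    print(f"K = {K}: {subfiles} subfiles, {groups} multicast groups")
```

Already at $K = 64$ the classical scheme needs hundreds of thousands of subfiles, which is exactly the regime where polynomial subpacketization from resolvable designs pays off.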
4. Empirical Performance and Practical Guidelines
Experiments on an Amazon EC2 cluster yielded the following representative results (all times in seconds):
| Configuration | Total Time | Shuffle Time | Speedup (vs TeraSort) |
|---|---|---|---|
| TeraSort (uncoded) | 961 | 946 | 1.0× |
| CodedTeraSort (lower $r$) | 446 | 412 | ≈2.2× |
| CodedTeraSort (higher $r$) | 283 | 223 | ≈3.4× |
Using design-based subpacketization, further speedups over the classical coded scheme have been demonstrated (Konstantinidis et al., 2018).
Practical configuration guidelines are:
- Choose $r$ to balance compute and network costs.
- Avoid large $r$ where subpacketization or memory overhead becomes prohibitive.
- For typical cluster sizes, the best-performing $r$ is usually a small constant.
Empirical results confirm that CodedTeraSort's performance closely tracks the predicted communication law, with optimal performance realized where shuffle time dominates total runtime.
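These guidelines can be made concrete with a toy cost model (a sketch with entirely hypothetical constants: `t_map` is the per-replica Map time, `t_net` the time to shuffle the full uncoded load):

```python
def total_time(r: int, K: int, t_map: float, t_net: float) -> float:
    """Toy model: Map cost grows r-fold, Shuffle cost scales with the
    coded load L(r) = (1/r) * (1 - r/K)."""
    load = (1.0 / r) * (1.0 - r / K)
    return r * t_map + load * t_net

def best_r(K: int, t_map: float, t_net: float) -> int:
    """Pick the redundancy minimizing modeled wall time."""
    return min(range(1, K), key=lambda r: total_time(r, K, t_map, t_net))

# Network-bound job: moderate r pays off; compute-bound job: keep r = 1.
print(best_r(16, t_map=5.0, t_net=100.0))
print(best_r(16, t_map=50.0, t_net=100.0))
```

The model reproduces the qualitative guidance: coding only helps while the shuffle term dominates, and pushing $r$ past the crossover point wastes Map time for negligible network savings.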
5. Combinatorial Designs and Advances in Subpacketization
A key advancement in recent CodedTeraSort variants is the use of resolvable designs derived from single-parity-check (SPC) codes (Konstantinidis et al., 2018). In these schemes:
- Data is partitioned via blocks indexed by the codewords of an SPC code over an alphabet of size $q$.
- The design's parallel classes allow for the systematic allocation of Map tasks and the scheduling of multicast groups with efficient incidence properties.
- Subpacketization is polynomial in $K$, permitting practical implementation on large clusters.
- Theoretical analysis shows that the design-based approach achieves (near-)optimal reduction in communication, with per-worker shuffle load close to the CDC bound $\frac{1}{r}\left(1 - \frac{r}{K}\right)$.
For large $K$ and moderate $r$, this matches or improves upon the performance of prior approaches while dramatically lowering subfile and multicast group counts.
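A minimal sketch of the SPC construction (assuming a $(k+1, k)$ single-parity-check code over $\mathbb{Z}_q$, with blocks identified as fixed-coordinate codeword sets in the style of the resolvable-design recipe; parameters here are purely illustrative):

```python
from itertools import product

def spc_codewords(q: int, k: int):
    """All length-(k+1) vectors over Z_q whose symbols sum to 0 mod q."""
    return [msg + ((-sum(msg)) % q,) for msg in product(range(q), repeat=k)]

def parallel_class(words, i: int, q: int):
    """One parallel class per coordinate: partition codewords by symbol i."""
    return [[w for w in words if w[i] == a] for a in range(q)]

words = spc_codewords(3, 2)   # q**k = 9 codewords, not a binomial count
classes = [parallel_class(words, i, 3) for i in range(3)]

# Resolvability: every class partitions the full codeword set.
assert all(sum(len(block) for block in cls) == len(words) for cls in classes)
```

The point count grows as $q^k$, polynomial for fixed $k$, which is the source of the scheme's tractable subpacketization compared with $\binom{K}{r}$ enumeration.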
6. Implementation Considerations and Complexity
At practical scales:
- The cost of group/table generation (CodeGen) grows combinatorially (on the order of $\binom{K}{r+1}$ multicast groups) for classical CDC, but only polynomially for design-based variants.
- Map phase computation scales linearly with $r$.
- Encoding and decoding involve cheap XOR operations, with complexity linear in the packet size per multicast group.
- Memory requirements reflect the $r$-fold data replication plus the coded/intermediate buffers retained per worker.
- In Hadoop/Spark ecosystems, key integration steps include redefining input formats for structured redundancy, modifying partitioners for group-aware multicasting, and adapting shuffle implementations to support coded communication (Li et al., 2016; Li et al., 2017).
CodedTeraSort's approach is most beneficial when network bandwidth is much more limiting than local compute or I/O, and subpacketization is managed to avoid resource exhaustion.
7. Limitations, Extensions, and Further Directions
The primary constraint in early CodedTeraSort schemes is exponential growth of required subfiles and multicast groups as $K$ and $r$ increase, which imposes practical limits on system scale and parameter selection. Resolvable design-based schemes address this by controlling subpacketization, making deployments on real-world clusters with larger $K$ and moderate $r$ feasible (Konstantinidis et al., 2018).
A plausible implication is that further improvements may leverage alternative combinatorial or algebraic constructions offering similar coding and multicast incidence properties but with even better scalability or adaptability to heterogeneous cluster environments.
CodedTeraSort formalizes and empirically validates the optimal tradeoff between computation and communication in distributed sorting, establishing a benchmark for design and analysis of coded distributed algorithms across broader problem classes.
References:
- "A Fundamental Tradeoff between Computation and Communication in Distributed Computing" (Li et al., 2016)
- "Coded TeraSort" (Li et al., 2017)
- "Leveraging Coding Techniques for Speeding up Distributed Computing" (Konstantinidis et al., 2018)