
CodedTeraSort: Optimizing Distributed Sorting

Updated 3 February 2026
  • CodedTeraSort is a distributed sorting algorithm that utilizes structured coding and redundancy to trade increased computation for reduced communication during the shuffle phase.
  • It builds on the Coded Distributed Computing (CDC) framework, using structured file placement and coded multicasting to achieve up to 4.7× speedup over vanilla TeraSort in large-scale clusters.
  • The design integrates combinatorial techniques that optimize subpacketization and multicast group management, making it scalable for real-world MapReduce implementations.

CodedTeraSort is a distributed sorting algorithm that augments the canonical TeraSort benchmark with structured coding and redundancy, trading increased computation for substantially reduced communication during the shuffle phase. Built on the Coded Distributed Computing (CDC) framework, it uses controlled replication of input data, coded multicasting, and in-network coding to achieve significant speedups on practical clusters, empirically up to $4.7\times$ over vanilla TeraSort. The scheme is grounded in an explicit computation-communication tradeoff: it achieves a near-optimal reduction in shuffle traffic, while recent work on combinatorial designs keeps subpacketization and system complexity under control.

1. System Model and Core Principles

CodedTeraSort operates within a distributed MapReduce architecture and targets the global sorting of key–value (KV) pairs across $K$ worker nodes. The sort runs in three pipeline stages: Map, Shuffle, and Reduce. In TeraSort, the Shuffle phase is frequently the dominant bottleneck, consuming $70\%$ to $98\%$ of total job time because of heavy inter-node data exchange.

To alleviate this, CodedTeraSort introduces a coding-based mechanism grounded in the CDC framework (Li et al., 2016). The fundamental insight is that increasing the computation load $r$, defined as the average number of times each input record is mapped across the system, creates multicast opportunities that reduce the normalized communication load $L(r)$. The optimal theoretical tradeoff is established as

$$L^*(r) = \frac{1}{r}\left(1 - \frac{r}{K}\right),$$

with local computation and coded multicasts orchestrated to achieve this bound.

2. Algorithmic Workflow and Coding Mechanisms

The CodedTeraSort algorithm consists of six primary stages:

  1. Structured File Placement: The input is divided into $\binom{K}{r}$ (or, when using designs, $q^{k-1}$) batches, each assigned to exactly $r$ workers. In the classic CDC approach, this assignment is indexed by $r$-subsets of $[K]$; in resolvable design-based variants, the batches correspond to blocks of the design (Li et al., 2017, Konstantinidis et al., 2018).
  2. Map Phase: Each worker maps all locally held records, range-partitioning the KV pairs into $K$ buckets. For each batch, only the intermediate buckets needed locally or by remote workers are retained.
  3. Encoding of Coded Multicast Packets: For every subset of $r+1$ workers, each member constructs an XOR of $r$ intermediate packets, designed so that every recipient, using local side information, can decode its required packet.
  4. Multicast Shuffle: Within each worker group, every node multicasts its coded packet to the $r$ peers in its group (typically via MPI_Bcast or custom multicast primitives).
  5. Decoding: Upon receipt of coded packets, each node XORs out the locally known components to isolate and recover the one necessary intermediate.
  6. Reduce Phase: With all intermediates for their partition now available, workers locally sort and output the final segment of the global sorted list.
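The range partitioning used in the Map step can be illustrated with a minimal Python sketch. It assumes integer keys drawn from a known key space and evenly spaced split points; the helper names `make_boundaries` and `partition` are hypothetical and not taken from any CodedTeraSort implementation.

```python
from bisect import bisect_right

def make_boundaries(key_space: int, K: int) -> list[int]:
    """Return K-1 split points dividing [0, key_space) into K near-equal ranges."""
    return [(i * key_space) // K for i in range(1, K)]

def partition(records, boundaries):
    """Range-partition (key, value) pairs: bucket j receives keys in the j-th range."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        # bisect_right sends a key equal to a split point to the upper range.
        buckets[bisect_right(boundaries, key)].append((key, value))
    return buckets

# Hypothetical toy input: keys in [0, 100), K = 4 workers.
records = [(5, "a"), (97, "b"), (42, "c"), (63, "d"), (25, "e")]
boundaries = make_boundaries(100, 4)
buckets = partition(records, boundaries)

# Reduce-side view: sorting each bucket and concatenating gives a global sort.
globally_sorted = [kv for bucket in buckets for kv in sorted(bucket)]
```

Because bucket $j$ only holds keys below every key in bucket $j+1$, each worker can sort its own bucket independently and the concatenation is already globally ordered.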

This process requires careful group formation and indexing of batches, either via enumeration of $r$-subsets or using the structure imposed by resolvable designs to ensure balanced subpacketization (Konstantinidis et al., 2018).
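The placement, encoding, and decoding steps can be traced on a toy instance. The following sketch follows the classic $r$-subset indexing and assumes that each intermediate value is split into $r$ equal segments, one per worker holding the batch. The sizes `K`, `r`, `SEG`, the synthetic `intermediate` bytes, and all function names are illustrative assumptions, not the reference implementation.

```python
from itertools import combinations

K, r = 4, 2        # toy cluster size and computation load (assumed)
SEG = 4            # bytes per segment (assumed)

workers = range(K)
batches = list(combinations(workers, r))   # batch S is stored on the workers in S

def intermediate(k, S):
    """Stand-in for 'bucket k of batch S' after Map: r*SEG deterministic bytes."""
    seed = 31 * k + 7 * batches.index(S)
    return bytes((seed + 13 * i) % 256 for i in range(r * SEG))

def segment(k, S, s):
    """Piece of intermediate(k, S) that worker s (a member of S) is responsible for."""
    j = S.index(s)
    return intermediate(k, S)[j * SEG:(j + 1) * SEG]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(s, M):
    """Coded packet multicast by worker s within group M: XOR of r segments."""
    packet = bytes(SEG)
    for k in M:
        if k != s:
            S = tuple(w for w in M if w != k)    # batch k is missing; s holds it
            packet = xor(packet, segment(k, S, s))
    return packet

def decode(k, s, M, packet):
    """Worker k XORs out the segments it can compute locally from s's packet."""
    for k2 in M:
        if k2 not in (s, k):
            S2 = tuple(w for w in M if w != k2)  # contains k, so k mapped this batch
            packet = xor(packet, segment(k2, S2, s))
    return packet

recovered = {k: {} for k in workers}
multicast_bytes = 0
for M in combinations(workers, r + 1):           # every group of r+1 workers
    for s in M:
        packet = encode(s, M)
        multicast_bytes += len(packet)
        for k in M:
            if k != s:
                S = tuple(w for w in M if w != k)
                recovered[k].setdefault(S, {})[s] = decode(k, s, M, packet)

# Every worker must end up with the full intermediate of each batch it did not map.
uncoded_bytes = 0
for k in workers:
    for S in batches:
        if k not in S:
            rebuilt = b"".join(recovered[k][S][s] for s in S)
            assert rebuilt == intermediate(k, S)
            uncoded_bytes += len(rebuilt)        # cost of sending it uncoded instead

print(uncoded_bytes, multicast_bytes)
```

In this toy setting the coded shuffle moves half as many bytes as sending each missing intermediate directly, matching the factor-of-$r$ saving for $r=2$.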

3. Communication–Computation Tradeoff and Theoretical Analysis

Let $L_{\mathrm{uncoded}}(r) = 1 - r/K$ denote the normalized communication load with $r$-fold uncoded redundancy. Implementing CDC-based coded multicasting achieves

$$L_{\mathrm{coded}}(r) = \frac{1}{r}\left(1-\frac{r}{K}\right).$$

This is information-theoretically optimal for the class of schemes considered (Li et al., 2016, Theorem 1).

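The Map phase incurs an $r$-fold computation cost but yields an $r$-fold reduction in Shuffle traffic relative to uncoded shuffling at the same redundancy, since $L_{\mathrm{coded}}(r) = L_{\mathrm{uncoded}}(r)/r$. When the network is the bottleneck, this translates into a substantial reduction in total wall-clock time.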
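As a quick numerical check of the two load formulas, the sketch below evaluates them exactly with rational arithmetic. The choice $K=16$ and $r \in \{1, 3, 5\}$ is illustrative only, and the function names are hypothetical.

```python
from fractions import Fraction

def uncoded_load(K: int, r: int) -> Fraction:
    """Normalized shuffle load with r-fold replication and no coding: 1 - r/K."""
    return Fraction(K - r, K)

def coded_load(K: int, r: int) -> Fraction:
    """Normalized shuffle load with coded multicasting: (1/r)(1 - r/K)."""
    return uncoded_load(K, r) / r

K = 16
for r in (1, 3, 5):
    u, c = uncoded_load(K, r), coded_load(K, r)
    print(f"r={r}: uncoded={float(u):.4f}  coded={float(c):.4f}  ratio={u / c}")
```

For $r=1$ the two loads coincide, since there is no replicated side information to exploit; for larger $r$ the coded load is exactly $1/r$ of the uncoded one.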

In design-based variants, the combinatorial structure is exploited to dramatically reduce the required number of subfiles to $N = q^{k-1}$ rather than $\binom{K}{r}$ (which becomes prohibitive as $K$ increases). This construction also keeps the number of multicast groups (and associated communicator splits) polynomial in $K$ (Konstantinidis et al., 2018).
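The gap between the two subfile counts can be made concrete with a short calculation. The sketch below assumes the parameter mapping $K = kq$ and $r = k$, which is inferred here from the formulas in this article rather than stated in it; the specific values of $q$ and $k$ are hypothetical.

```python
from math import comb

def classic_subfiles(K: int, r: int) -> int:
    """Batches required when indexing by r-subsets of the K workers."""
    return comb(K, r)

def design_subfiles(q: int, k: int) -> int:
    """Batches required by the design-based scheme: q^(k-1)."""
    return q ** (k - 1)

# Assumption: K = k*q workers and computation load r = k.
for q, k in [(4, 5), (20, 5)]:
    K, r = k * q, k
    print(f"K={K}, r={r}: classic={classic_subfiles(K, r)}, "
          f"design={design_subfiles(q, k)}")
```

With $r$ held fixed, the design-based count grows like $(K/r)^{r-1}$, roughly one power of $K$ below $\binom{K}{r}$ and smaller by a large constant factor.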

4. Empirical Performance and Practical Guidelines

Experiments on Amazon EC2 with $K=16$ and $K=20$ workers and $N\approx 12$ GB of input yielded the following (all times in seconds):

| Configuration | Total Time (s) | Shuffle Time (s) | Speedup (vs. TeraSort) |
|---|---|---|---|
| TeraSort ($r=1$) | 961 | 946 | $1\times$ |
| CodedTeraSort ($r=3$) | 446 | 412 | $2.16\times$ |
| CodedTeraSort ($r=5$) | 283 | 223 | $3.39\times$ |

Using design-based subpacketization, speedups of up to $4.7\times$ have been demonstrated (Konstantinidis et al., 2018).

Practical configuration guidelines are:

  • Choose $r \approx \lceil \sqrt{T_{\text{shuffle}}/T_{\text{map}}} \rceil$ to balance compute and network costs.
  • Avoid large $r$ where $\binom{K}{r+1}$ or memory overhead becomes prohibitive.
  • For clusters with $K\leq 20$, the optimal $r$ is typically in $[3,5]$.
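The square-root rule in the first guideline is what one obtains from a simplified cost model in which total time is $r\,T_{\text{map}} + T_{\text{shuffle}}/r$, where $T_{\text{map}}$ and $T_{\text{shuffle}}$ are measured on an uncoded run. This model is an assumption made here for illustration. A minimal sketch of the rule, with a hypothetical helper name and hypothetical timings:

```python
import math

def suggest_r(t_map: float, t_shuffle: float, K: int) -> int:
    """Suggest a computation load r = ceil(sqrt(t_shuffle / t_map)), clamped to [1, K].

    Assumes total time ~ r * t_map + t_shuffle / r, which is minimized at
    r = sqrt(t_shuffle / t_map). Subpacketization and memory limits are not
    modeled and may call for a smaller r in practice.
    """
    if t_map <= 0 or t_shuffle < 0:
        raise ValueError("t_map must be positive and t_shuffle non-negative")
    r = math.ceil(math.sqrt(t_shuffle / t_map))
    return max(1, min(r, K))

# Hypothetical timings in seconds, not measurements from the experiments above.
print(suggest_r(t_map=10.0, t_shuffle=250.0, K=16))
```

When the shuffle is cheaper than the map, the rule falls back to $r=1$, which is plain TeraSort.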

Empirical results confirm that CodedTeraSort's performance closely tracks the predicted $1/r$ communication law, with the largest gains realized where shuffle time dominates total runtime.

5. Combinatorial Designs and Advances in Subpacketization

A key advancement in recent CodedTeraSort variants is the use of resolvable designs derived from single-parity-check (SPC) codes (Konstantinidis et al., 2018). In these schemes:

  • Data is partitioned via blocks indexed by codewords in a $(k, k-1)$ SPC code over $\mathbb{Z}_q$.
  • The design's parallel classes allow for the systematic allocation of Map tasks and the scheduling of multicast groups with efficient incidence properties.
  • Subpacketization $N=q^{k-1}$ is polynomial in $K$, permitting practical implementation on large clusters.
  • Theoretical analysis shows that the design-based approach achieves (near-)optimal reduction in communication, with the shuffle load per worker

$$L_{\mathrm{design}}(r) = \frac{1}{r-1} \left(1 - \frac{r}{K}\right).$$

For large $K$ and moderate $r$, this matches or improves upon the performance of prior approaches while dramatically lowering subfile and multicast group counts.
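One way to see the resolvable structure is to enumerate the SPC codewords and group them by the value of each coordinate. The sketch below builds such a design for small hypothetical parameters and checks its parallel classes. Treating each block as the set of batches held by one worker is an assumption made for illustration, as are all function names.

```python
from itertools import product

def spc_codewords(q: int, k: int) -> list[tuple[int, ...]]:
    """All length-k vectors over Z_q whose coordinates sum to 0 mod q."""
    words = []
    for msg in product(range(q), repeat=k - 1):
        parity = (-sum(msg)) % q          # single parity-check symbol
        words.append(msg + (parity,))
    return words

def design_blocks(q: int, k: int) -> dict[tuple[int, int], set[int]]:
    """Block (i, v) holds the indices of codewords whose i-th symbol equals v."""
    words = spc_codewords(q, k)
    return {
        (i, v): {j for j, c in enumerate(words) if c[i] == v}
        for i in range(k)
        for v in range(q)
    }

q, k = 3, 3                               # hypothetical small parameters
N = q ** (k - 1)                          # number of points (batches)
B = design_blocks(q, k)

# Parallel class i = blocks (i, 0), ..., (i, q-1); each class partitions the points.
for i in range(k):
    cls = [B[(i, v)] for v in range(q)]
    assert set().union(*cls) == set(range(N))
    assert sum(len(b) for b in cls) == N

# Each point lies in exactly k blocks, one per parallel class.
replication = [sum(j in b for b in B.values()) for j in range(N)]
print(len(B), N, replication)
```

Under that reading there are $kq$ blocks (workers), each batch is replicated $k$ times, and every block has size $q^{k-2}$.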

6. Implementation Considerations and Complexity

At practical scales:

  • The cost of group/table generation (CodeGen) grows as $O\left(\binom{K}{r+1}\right)$ for classical CDC, but only polynomially for design-based variants.
  • Map phase computation scales linearly with $r$.
  • Encoding and decoding involve cheap XOR operations with $O(r)$ complexity per multicast group.
  • Memory requirements reflect the $r$-fold data replication plus the coded and intermediate buffers retained per worker.
  • In Hadoop/Spark ecosystems, key integration steps include redefining input formats for structured redundancy, modifying partitioners for group-aware multicasting, and adapting shuffle implementations to support coded communication (Li et al., 2016, Li et al., 2017).
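The first and fourth points above can be turned into a rough sizing estimate. The sketch below counts classical multicast groups and the input held per worker under $r$-fold replication; it ignores coded and intermediate buffers, and the 12 GB and $K=16$ figures are reused from the experiments purely as example inputs. Function names are hypothetical.

```python
from math import comb

def multicast_groups(K: int, r: int) -> int:
    """Number of (r+1)-worker groups that classical CodeGen must enumerate."""
    return comb(K, r + 1)

def per_worker_input_gb(total_gb: float, K: int, r: int) -> float:
    """Input stored per worker when every record is replicated r times."""
    return total_gb * r / K

K, total_gb = 16, 12.0
for r in (1, 3, 5):
    print(f"r={r}: groups={multicast_groups(K, r)}, "
          f"input per worker={per_worker_input_gb(total_gb, K, r):.2f} GB")
```

The group count grows much faster than the per-worker storage, which is why CodeGen cost rather than memory tends to cap $r$ first in the classical scheme.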

CodedTeraSort's approach is most beneficial when network bandwidth is much more limiting than local compute or I/O, and subpacketization is managed to avoid resource exhaustion.

7. Limitations, Extensions, and Further Directions

The primary constraint in early CodedTeraSort schemes is the exponential growth of required subfiles and multicast groups as $K$ and $r$ increase, which imposes practical limits on system scale and parameter selection. Resolvable design-based schemes address this by controlling subpacketization, making deployments on real-world clusters with larger $K$ and moderate $r$ feasible (Konstantinidis et al., 2018).

A plausible implication is that further improvements may leverage alternative combinatorial or algebraic constructions offering similar coding and multicast incidence properties but with even better scalability or adaptability to heterogeneous cluster environments.

CodedTeraSort formalizes and empirically validates the optimal tradeoff between computation and communication in distributed sorting, establishing a benchmark for design and analysis of coded distributed algorithms across broader problem classes.

