Coded Distributed Computing Optimization
- Coded Distributed Computing (CDC) is a framework that reduces shuffle load by introducing structured computation redundancy.
- It optimizes heterogeneous systems via joint file placement and nested coded shuffling schemes under nonuniform file popularity.
- Low-complexity methods like the two-file-group heuristic and C-CDC substantially reduce the communication load, achieving near-optimal performance.
Coded Distributed Computing (CDC) is a framework for minimizing the communication bottleneck in distributed MapReduce systems by intentionally introducing structured computation redundancy, thereby maximizing multicasting opportunities and reducing the amount of data exchanged in the shuffle phase. Recent developments have extended CDC to heterogeneous environments with nonuniform file popularity, flexible worker storage and computation capacities, and application-driven optimizations. This article focuses on the design, analysis, and optimization of heterogeneous CDC systems under nonuniform file popularity, with emphasis on the nested coded shuffling paradigm, optimization formulations, low-complexity approximation methods, and shuffling load reduction through data aggregation (Deng et al., 2023).
1. System Model and Heterogeneity
CDC is considered in a setting with $K$ workers, where worker $k$ stores up to $m_k$ files and is responsible for a fraction of the reduce functions, with the fractions summing to one across workers. The input file library contains $N$ files with popularity profile $\mathbf{p} = (p_1, \ldots, p_N)$, typically Zipf-like, imposing nonuniform file request probabilities. Each job requests a subset $\mathcal{F}$ of fixed size $F$, drawn with induced product probability
$$\Pr(\mathcal{F}) \;\propto\; \prod_{n \in \mathcal{F}} p_n,$$
to model skewed access patterns and probabilistic locality.
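As a concrete illustration, the popularity profile and job sampling above can be sketched in Python. The function names are illustrative, and the simple redraw loop only approximately realizes the product-form distribution for subsets drawn without replacement:

```python
import random

def zipf_popularity(n_files: int, theta: float) -> list[float]:
    """Zipf-like profile: p_n proportional to n^(-theta), normalized."""
    weights = [n ** (-theta) for n in range(1, n_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_job(pop: list[float], job_size: int, rng: random.Random) -> set[int]:
    """Draw a job's file subset of fixed size without replacement;
    duplicates are simply redrawn, approximating the product probability."""
    files = list(range(len(pop)))
    chosen: set[int] = set()
    while len(chosen) < job_size:
        f = rng.choices(files, weights=pop, k=1)[0]
        chosen.add(f)  # a duplicate draw leaves the set unchanged
    return chosen
```

Popular (low-index) files appear in jobs far more often than unpopular ones, which is exactly the skew the placement design exploits.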
Each worker $k$ stores up to $m_k$ files and is assigned a subset of the reduce functions. File placement is encoded using binary variables $x_{n,\mathcal{S}} \in \{0,1\}$, indicating storage of file $n$ on exactly the subset $\mathcal{S}$ of workers. The assignment must satisfy that every file is placed ($\sum_{\mathcal{S}} x_{n,\mathcal{S}} = 1$ for all $n$) and that each worker's storage is not exceeded ($\sum_{n} \sum_{\mathcal{S} \ni k} x_{n,\mathcal{S}} \le m_k$ for each $k$).
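A minimal sketch of these two placement constraints, assuming a placement is represented as a mapping from file index to the worker subset that stores it (the representation and function name are hypothetical):

```python
def placement_feasible(placement: dict[int, frozenset[int]],
                       n_files: int, capacity: list[int]) -> bool:
    """Check the two constraints from the text: every file sits on exactly
    one nonempty worker subset, and no worker's storage limit is exceeded."""
    # every file placed on exactly one nonempty subset
    if set(placement) != set(range(n_files)):
        return False
    if any(len(subset) == 0 for subset in placement.values()):
        return False
    # worker storage: worker k holds every file whose subset contains k
    load = [0] * len(capacity)
    for subset in placement.values():
        for k in subset:
            load[k] += 1
    return all(load[k] <= capacity[k] for k in range(len(capacity)))
```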
2. File Placement under Nonuniform Popularity
Optimal file placement jointly considers the workers' heterogeneity and the nonuniform popularity distribution. Each file $n$ is assigned to a specific nonempty worker subset $\mathcal{S}$ via the variable $x_{n,\mathcal{S}}$. The resulting placement determines, for the current job, how many requested files reside exactly on each subset $\mathcal{S}$, which in turn fixes the available multicasting opportunities.
Due to the combinatorial explosion in the number of worker subsets ($2^K - 1$ nonempty subsets), directly solving for the best placement is intractable even for moderate $K$. To make this practical, a two-file-group heuristic groups the files into "popular" and "unpopular" classes by selecting a threshold $n_0$. The $n_0$ most popular files are stored greedily to maximize redundancy (filling remaining storage), while the less popular files are placed minimally (each mapped to only one worker), reducing mapping redundancy and storage cost.
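The heuristic placement can be sketched as follows, under the assumption that total storage is at least $N$ so every file fits; the function name and the cyclic replication order are illustrative choices, not the paper's exact procedure:

```python
def two_group_placement(n_files: int, capacity: list[int],
                        n0: int) -> dict[int, set[int]]:
    """Files 0..n0-1 (most popular) are replicated greedily into spare
    storage; files n0..n_files-1 each get a single copy, round-robin.
    Assumes sum(capacity) >= n_files so every file can be placed."""
    K = len(capacity)
    remaining = list(capacity)
    placement: dict[int, set[int]] = {f: set() for f in range(n_files)}
    # unpopular files: one copy each, round-robin over workers with room
    k = 0
    for f in range(n0, n_files):
        while remaining[k % K] == 0:
            k += 1
        placement[f].add(k % K)
        remaining[k % K] -= 1
        k += 1
    # popular files: keep adding copies, most popular first, while any
    # worker still has spare storage and lacks the file
    progress = True
    while progress:
        progress = False
        for f in range(n0):
            for w in range(K):
                if remaining[w] > 0 and w not in placement[f]:
                    placement[f].add(w)
                    remaining[w] -= 1
                    progress = True
                    break
    return placement
```

Popular files end up on many workers (large multicast groups), while each unpopular file occupies a single storage slot.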
3. Nested Coded Shuffling Scheme
The coded shuffling stage delivers, for each worker $k$, all intermediate values (IVs) required for its assigned reduce tasks that come from files not locally stored. The nested shuffling is a recursive procedure defined for each worker subset $\mathcal{S}$ with $|\mathcal{S}| \ge 2$, proceeding from the largest subsets downward:
- Step 1: For each $k \in \mathcal{S}$, identify the residual IVs needed by worker $k$ that are available at all workers in $\mathcal{S} \setminus \{k\}$ and have not already been coded in larger supersets.
- Step 2: Within $\mathcal{S}$, split each residual into a segment of equal normalized size for all $k \in \mathcal{S}$, to be coded at this level. The remainder is carried down to be resolved at subsets of lower cardinality.
- Step 3: Each worker $k \in \mathcal{S}$ forms the coded packet
$$X_{k,\mathcal{S}} \;=\; \bigoplus_{j \in \mathcal{S} \setminus \{k\}} V_{j,\mathcal{S}},$$
the XOR of the segments $V_{j,\mathcal{S}}$ needed by the other workers $j \in \mathcal{S} \setminus \{k\}$, and multicasts it to $\mathcal{S} \setminus \{k\}$.
This process recursively continues down to subsets of size two, where any remaining residuals are delivered by unicast. This scheme generalizes the coded multicasting gain to settings with nonuniform file popularity, variable IV sizes, and arbitrary file placements.
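The XOR multicast of Step 3 and its decoding can be sketched for a single subset $\mathcal{S}$. The dictionary-based interface and the representation of segments as equal-length byte strings are illustrative assumptions:

```python
def xor_bytes(blocks: list[bytes]) -> bytes:
    """Bitwise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def multicast_round(subset: list[int], needed: dict[int, bytes]) -> dict[int, bytes]:
    """One nested-shuffling round for subset S: worker k multicasts the XOR
    of the equal-size segments needed by every other worker in S (each such
    segment is assumed stored by all workers in S except its requester)."""
    return {k: xor_bytes([needed[j] for j in subset if j != k]) for k in subset}

def decode(receiver: int, sender: int, subset: list[int],
           packet: bytes, needed: dict[int, bytes]) -> bytes:
    """The receiver XORs out the segments it already stores locally and
    recovers its own missing segment from the sender's packet."""
    known = [needed[j] for j in subset if j not in (receiver, sender)]
    return xor_bytes([packet] + known)
```

One packet per worker serves all other members of the subset simultaneously, which is the source of the multicasting gain.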
The total (expected) shuffle load is expressed as
$$\bar{L} \;=\; \mathbb{E}_{\mathcal{F}} \Bigg[ \sum_{\mathcal{S} \subseteq [K] \,:\, |\mathcal{S}| \ge 2} \; \sum_{k \in \mathcal{S}} \big| X_{k,\mathcal{S}} \big| \Bigg],$$
the expected total size of all coded packets, subject to the placement and nesting constraints.
4. Joint Placement and Shuffle Optimization
The joint design seeks the placement $x_{n,\mathcal{S}}$, the reduce-function assignment, and the nested residual variables that minimize the expected shuffle load $\bar{L}$, subject to capacity and recursion constraints. The resulting problem is a large-scale mixed-integer linear program (MILP): binary placement variables coupled with linear flow-conservation constraints for the nesting at every subset $\mathcal{S}$. The problem is NP-hard, and its size grows exponentially with $K$.
5. Two-File-Group Low-Complexity Method
For scalability, a two-file-group approach reduces search complexity by fixing the placement as follows: unpopular files are round-robin assigned to minimize redundancy, and the most popular files fill remaining storage greedily. For each threshold $n_0$, the remaining subproblem with fixed placement becomes an LP in the shuffling design variables, solved efficiently for all $N$ candidate thresholds.
The best split $n_0^\star$ is selected to minimize the resulting expected load. In practice, sweeping all $N$ thresholds suffices to ensure tractability, and numerical results indicate this heuristic achieves within a few percent of the full MILP solution and within 10–15% of a fundamental lower bound. For highly skewed popularity (large Zipf exponent), the two-group solution is nearly optimal: within 3% of the bound.
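The threshold sweep can be sketched generically; `evaluate_load` stands in for solving the fixed-placement LP at a given split (both names are hypothetical):

```python
from typing import Callable

def best_threshold(n_files: int,
                   evaluate_load: Callable[[int], float]) -> tuple[int, float]:
    """Solve the fixed-placement shuffling subproblem for every candidate
    split n0 (abstracted as evaluate_load, e.g., one LP solve per n0)
    and return the minimizing threshold and its expected load."""
    best_n0, best_load = 0, float('inf')
    for n0 in range(n_files + 1):
        load = evaluate_load(n0)
        if load < best_load:
            best_n0, best_load = n0, load
    return best_n0, best_load
```

The overall cost is thus $N+1$ LP solves rather than one exponential-size MILP.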
| Approach | Complexity | Performance gap |
|---|---|---|
| MILP (branch-and-cut) | Exponential in $K$ | Baseline (optimal) |
| Two-file-group + LP | $N$ LPs | Within 3–15% |
6. Aggregate Jobs and Heterogeneous Compressed CDC (C-CDC)
For applications where the reduce functions are aggregate (e.g., mini-batch gradients, sums), the framework introduces a compressed CDC (C-CDC) variant that leverages local aggregation to reduce shuffle load. For aggregates of the form
$$u \;=\; \sum_{n \in \mathcal{F}} v_n,$$
where $v_n$ is the IV computed from file $n$, each worker pre-aggregates the IVs of the files it holds, so only a single $T$-bit partial aggregate needs to be delivered per group, replacing individual IVs with one sum.
The compressed shuffle follows the same nested approach as in standard CDC, but the design variables are now associated with aggregate packets and obey adjusted flow constraints. The result is a further 20–30% reduction in shuffling load over the standard (uncompressed) CDC scheme.
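The saving from local aggregation can be illustrated with a toy sketch (function names are illustrative): a worker holding several files ships one partial sum instead of one IV per file, and partial sums compose exactly at the reducer:

```python
def payload_uncompressed(local_ivs: list[float]) -> list[float]:
    """Standard CDC: one intermediate value shipped per locally mapped file."""
    return list(local_ivs)

def payload_compressed(local_ivs: list[float]) -> list[float]:
    """C-CDC for aggregate reduces u = sum_n v_n: the worker pre-aggregates
    its IVs and ships a single partial sum."""
    return [sum(local_ivs)]

def reduce_aggregate(partial_sums: list[float]) -> float:
    """Partial aggregates compose exactly, so the reducer just adds them."""
    return sum(partial_sums)
```

The shuffle payload shrinks from one value per locally held file to one value per worker per group, which is where the additional load reduction comes from.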
7. Empirical Results and Insights
Empirical studies using the above framework yield several key results:
- For moderate system sizes, the full MILP is already computationally infeasible, while the two-file-group method solves in seconds.
- Across realistic ranges of Zipf popularity (exponents up to $1.2$), the two-file-group solution is within a few percent of the MILP and within 10–15% of a relaxation-based lower bound.
- Compared to an uncoded round-robin baseline, the proposed heterogeneous CDC achieves a substantial reduction in shuffle load, with aggregation-based compression (C-CDC) delivering an additional 20–30% saving.
- As popularity skew increases, redundancy is concentrated on a few highly requested files, and the two-group heuristic approaches the theoretical optimum.
Summary Table: CDC Scheme Capabilities
| CDC Scheme | Heterogeneous Storage/Computation | Nonuniform Popularity | Compressed/Aggregate IVs | Low-Complexity Approximation | Multicasting Method |
|---|---|---|---|---|---|
| (Deng et al., 2023) | Yes | Yes | Yes (C-CDC) | Two-file-group | Nested coded shuffling |
The flexible heterogeneous CDC paradigm (Deng et al., 2023)—which jointly optimizes worker-aware file placement and generalized coded shuffling, incorporates file popularity, and exploits data aggregation—substantially improves shuffle communication efficiency in practical distributed systems. This framework extends classical CDC to heterogeneous, data-skewed, and ML-driven settings with scalable solution methods and analytically quantifiable performance.