Coded Distributed Computing Optimization

Updated 3 February 2026
  • Coded Distributed Computing (CDC) is a framework that reduces shuffle load by introducing structured computation redundancy.
  • It optimizes heterogeneous systems via joint file placement and nested coded shuffling schemes under nonuniform file popularity.
  • Low-complexity methods such as the two-file-group heuristic and compressed CDC (C-CDC) substantially reduce shuffle communication, achieving near-optimal load reductions.

Coded Distributed Computing (CDC) is a framework for minimizing the communication bottleneck in distributed MapReduce systems by intentionally introducing structured computation redundancy, thereby maximizing multicasting opportunities and reducing the amount of data exchanged in the shuffle phase. Recent developments have extended CDC to heterogeneous environments with nonuniform file popularity, flexible worker storage and computation capacities, and application-driven optimizations. This article focuses on the design, analysis, and optimization of heterogeneous CDC systems under nonuniform file popularity, with emphasis on the nested coded shuffling paradigm, optimization formulations, low-complexity approximation methods, and shuffling load reduction through data aggregation (Deng et al., 2023).

1. System Model and Heterogeneity

CDC is considered in a setting where $K$ workers each store up to $M_k$ files and are responsible for a fraction $W_k$ of the $Q$ reduce functions, satisfying $\sum_k W_k = 1$. The input file library contains $N$ files with popularity profile $p=(p_1,\dots,p_N)$, typically Zipf-like, imposing nonuniform file request probabilities $p_n$. Job requests draw subsets $\mathcal W \subseteq \{1,\dots,N\}$ of fixed size $D=|\mathcal W|$ with induced product probability

$$p_{\mathcal W} \propto \prod_{n\in\mathcal W} p_n \prod_{n\notin\mathcal W} (1-p_n)$$

to model skewed access patterns and probabilistic locality.
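As a concrete illustration of this request model, the following sketch computes a Zipf-like popularity profile and the normalized product probability over all size-$D$ request sets; the parameter values are hypothetical and chosen only to keep the enumeration small.

```python
from itertools import combinations
from math import prod

def zipf_popularity(N, theta):
    """Zipf-like file popularity: p_n proportional to 1 / n^theta."""
    w = [1.0 / (n ** theta) for n in range(1, N + 1)]
    s = sum(w)
    return [x / s for x in w]

def job_request_probs(p, D):
    """Probability of each size-D request set W, proportional to
    prod_{n in W} p_n * prod_{n not in W} (1 - p_n), normalized over all W."""
    N = len(p)
    probs = {}
    for W in combinations(range(N), D):
        in_W = set(W)
        probs[W] = prod(p[n] if n in in_W else 1 - p[n] for n in range(N))
    Z = sum(probs.values())
    return {W: v / Z for W, v in probs.items()}

p = zipf_popularity(N=4, theta=0.8)   # toy library of 4 files
pw = job_request_probs(p, D=2)        # distribution over all C(4,2)=6 request sets
```

Since the popularity profile is decreasing in the file index, the request set containing the two most popular files receives the largest probability, which is the skew the placement design exploits.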

Each worker $k$ stores up to $M_k$ files and is assigned a subset $\mathcal Q_k$ of the $Q$ reduce functions. File placement is encoded using binary variables $t_{n,S} \in \{0,1\}$, indicating storage of file $n$ on exactly the subset $S$ of workers. The assignment must satisfy that every file is placed ($\sum_S t_{n,S}=1$ for all $n$) and each worker's storage is not exceeded ($\sum_{n=1}^N \sum_{S\ni k} t_{n,S} \le M_k$ for each $k$).
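The two placement constraints can be checked mechanically. The sketch below uses a toy dictionary encoding of $t_{n,S}$ (an assumption for illustration, not the paper's data structure) and verifies both the exactly-one-subset and the storage-capacity conditions.

```python
from itertools import combinations

def nonempty_subsets(K):
    """All nonempty subsets of workers {0, ..., K-1}, as frozensets."""
    return [frozenset(c) for r in range(1, K + 1)
            for c in combinations(range(K), r)]

def placement_feasible(t, N, K, M):
    """Check the two placement constraints: each file n is stored on exactly
    one subset S (sum_S t[n,S] = 1), and each worker k stores at most M[k]
    files (sum over n and S containing k of t[n,S] <= M[k])."""
    subsets = nonempty_subsets(K)
    for n in range(N):
        if sum(t.get((n, S), 0) for S in subsets) != 1:
            return False
    for k in range(K):
        load = sum(t.get((n, S), 0)
                   for n in range(N) for S in subsets if k in S)
        if load > M[k]:
            return False
    return True

# toy placement: file 0 replicated on workers {0,1}; files 1 and 2 single-copy
t = {(0, frozenset({0, 1})): 1,
     (1, frozenset({0})): 1,
     (2, frozenset({2})): 1}
ok = placement_feasible(t, N=3, K=3, M=[2, 1, 1])   # True
```

Shrinking worker 0's capacity to one file makes the same placement infeasible, since it holds copies of both file 0 and file 1.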

2. File Placement under Nonuniform Popularity

Optimal file placement jointly considers the workers' heterogeneity and the nonuniform popularity distribution. Each file is assigned to a specific nonempty worker subset $S$ via the variable $t_{n,S}$. The resulting $a_S=\sum_{n \in \mathcal W} t_{n,S}$ counts the requested files from $\mathcal W$ located on $S$ for the current job.

Due to the combinatorial explosion in the number of subsets $S$, directly solving for the best $t_{n,S}$ is intractable for moderate $K$. To make this practical, a two-file-group heuristic splits the files into "popular" and "unpopular" classes by selecting a threshold $N_1$. The $N_1$ most popular files are greedily stored to maximize redundancy (filling remaining storage), while the less popular files are placed minimally (each mapped to only one worker), reducing mapping redundancy and storage cost.
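A minimal sketch of this heuristic placement, assuming files are indexed in decreasing popularity order and that total storage suffices for one copy of every unpopular file (both assumptions are for illustration only):

```python
def two_file_group_placement(N, N1, K, M):
    """Two-file-group heuristic sketch: files N1..N-1 ("unpopular") each get
    a single copy, assigned round-robin; files 0..N1-1 ("popular") then fill
    all remaining storage slots to maximize replication.
    Returns a worker -> set-of-stored-files map."""
    store = {k: set() for k in range(K)}
    # unpopular files: one copy each, round-robin over workers with room
    k = 0
    for n in range(N1, N):
        while len(store[k]) >= M[k]:
            k = (k + 1) % K
        store[k].add(n)
        k = (k + 1) % K
    # popular files: greedily fill leftover capacity on every worker
    for worker in range(K):
        for n in range(N1):
            if len(store[worker]) < M[worker]:
                store[worker].add(n)
    return store

store = two_file_group_placement(N=6, N1=2, K=3, M=[3, 3, 3])
```

In this toy run every unpopular file ends up with exactly one copy, while the most popular file is replicated on all three workers, exactly the redundancy pattern the nested shuffle exploits.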

3. Nested Coded Shuffling Scheme

The coded shuffling stage delivers, for each worker $k$, all intermediate values (IVs) $V_{q,n}$ required for the assigned reduce tasks $q \in \mathcal Q_k$, covering files $n \in \mathcal W$ not locally stored. Nested shuffling is a recursive procedure defined for each worker subset $S \subseteq \{1, \dots, K\}$ with $|S| \ge 2$:

  • Step 1: For each $k \in S$, identify the residual IVs $\mathcal R_{k,S}^{(0)}$ needed from $S \setminus \{k\}$ and not previously coded in larger supersets.
  • Step 2: Within $S$, for each pair $(j,k)$, $j\ne k$, select a segment $\mathcal M_{k \leftarrow j}(S) \subseteq \mathcal R_{k,S}^{(0)}$ of equal normalized size $L_{j,S}$ for all $k\in S \setminus\{j\}$. The remainder, $\mathcal R_{k,S}^{(1)}$, is carried down to be resolved at lower cardinality.
  • Step 3: Each worker $j$ forms the coded packet

$$X_{S \leftarrow j} = \bigoplus_{k \in S \setminus \{j\}} \mathcal M_{k \leftarrow j}(S)$$

and multicasts it to $S$.

The recursion continues until $|S|=2$, where remaining residuals are unicasted. This scheme generalizes the coded multicasting gain to settings with nonuniform file popularity, variable IV sizes, and arbitrary file placements.
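The decoding side of Step 3 hinges on the XOR cancellation: each receiver already maps (locally computes) every segment in the packet except its own. A minimal byte-level sketch, with hypothetical two-byte segments for a toy subset $S=\{0,1,2\}$:

```python
def xor_bytes(chunks):
    """Bitwise XOR of equal-length byte strings (the XOR in the coded packet)."""
    out = bytes(len(chunks[0]))
    for c in chunks:
        out = bytes(a ^ b for a, b in zip(out, c))
    return out

# Worker j=0 holds the segment worker 1 needs and the segment worker 2 needs,
# and multicasts their XOR as a single packet.
seg_for_1 = b"\x0f\x0f"   # M_{1<-0}(S), hypothetical content
seg_for_2 = b"\xf0\x01"   # M_{2<-0}(S), hypothetical content
packet = xor_bytes([seg_for_1, seg_for_2])

# Worker 1 computes seg_for_2 locally (it maps that file), so it cancels it:
decoded = xor_bytes([packet, seg_for_2])   # recovers seg_for_1
```

One multicast transmission thus serves both receivers, which is precisely where the shuffle-load savings of coded multicasting come from.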

The total (expected) shuffle load is expressed as

$$\bar R = \sum_{\mathcal W \ne \emptyset} p_{\mathcal W} \sum_{S \subseteq [K],\, |S| \ge 2} \sum_{j \in S} L_{j,S}$$

subject to the placement and nesting constraints.
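The expected-load expression is a straightforward probability-weighted sum once the per-job segment sizes are known. A toy evaluation, with hypothetical request sets and $L_{j,S}$ values:

```python
def expected_shuffle_load(p_W, L):
    """Expected shuffle load: sum over request sets W of
    p_W * sum_{S, |S|>=2} sum_{j in S} L_{j,S}.
    p_W maps request set -> probability; L maps request set -> {(j, S): size}."""
    total = 0.0
    for W, prob in p_W.items():
        total += prob * sum(L.get(W, {}).values())
    return total

# two candidate jobs with hypothetical per-subset segment sizes
p_W = {("a",): 0.7, ("b",): 0.3}
L = {("a",): {(0, frozenset({0, 1})): 2.0},
     ("b",): {(1, frozenset({0, 1})): 4.0}}
load = expected_shuffle_load(p_W, L)   # 0.7*2.0 + 0.3*4.0 = 2.6
```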

4. Joint Placement and Shuffle Optimization

The joint design seeks $t_{n,S}$, $L_{j,S}$, and nested residual variables $r_{k,S}^{(i)}$ that minimize the expected shuffle load $\bar R$, subject to capacity and recursion constraints. The resulting problem is a large-scale mixed-integer linear program (MILP):

$$\min_{t,L,r} \bar R \qquad \text{subject to} \qquad \sum_{S\ni k} t_{n,S} \le M_k, \quad \sum_S t_{n,S}=1, \quad t_{n,S}\in\{0,1\},$$

plus linear constraints enforcing nested flow conservation at every $(\mathcal W, S, k)$. The problem is NP-hard, and its size grows exponentially with $K$.

5. Two-File-Group Low-Complexity Method

For scalability, a two-file-group approach reduces search complexity by fixing the placement as follows: unpopular files are round-robin assigned to minimize redundancy, and the $N_1$ most popular files fill the remaining storage greedily. For each threshold $N_1$, the remaining subproblem with fixed placement becomes an LP in the shuffling design variables, solved efficiently for each candidate $N_1$.

The best split $N_1^*$ is selected to minimize $\bar R(N_1)$. In practice, $K \le 8$ suffices to ensure tractability, and numerical results indicate this heuristic achieves within a few percent of the full MILP solution and within 10–15% of a fundamental lower bound. For highly skewed popularity ($\theta \gg 1$), the two-group solution is nearly optimal, within 3% of the bound.
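The outer search over $N_1$ is a one-dimensional sweep. In the sketch below, the inner LP solve is replaced by a mock load function (an assumption for illustration; the actual evaluation requires solving the fixed-placement LP):

```python
def best_threshold(N, load_of):
    """Sweep the popular/unpopular split N1 in {0, ..., N} and return the
    minimizer of the expected shuffle load.  `load_of(N1)` stands in for
    solving the inner LP with the placement fixed by threshold N1."""
    best_N1, best_load = None, float("inf")
    for N1 in range(N + 1):
        r = load_of(N1)
        if r < best_load:
            best_N1, best_load = N1, r
    return best_N1, best_load

# mock load curve with a minimum at N1 = 3 (hypothetical shape)
N1_star, R_star = best_threshold(8, lambda n1: (n1 - 3) ** 2 + 1.0)
```

The sweep costs one LP per candidate threshold, i.e., $N+1$ LP solves in total, which is the source of the "$N$ LPs" complexity figure below.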

| Approach | Complexity | Performance gap |
| --- | --- | --- |
| MILP (branch-and-cut) | Exponential in $K$ | Baseline (optimal) |
| Two-file-group + LP | $N$ LPs, $O(e^{3K})$ | Within 3–15% |

6. Aggregate Jobs and Heterogeneous Compressed CDC (C-CDC)

For applications where the reduce functions are aggregate (e.g., mini-batch gradients, sums), the framework introduces a compressed CDC (C-CDC) variant that leverages local aggregation to reduce shuffle load. For aggregates of the form

ϕq(W)=nWVq,n\phi_q(\mathcal W) = \sum_{n\in \mathcal W} V_{q,n}

each worker pre-aggregates IVs for files it holds, so only a single $T$-bit aggregate needs to be delivered per group, replacing $|S\setminus\{k\}|$ individual IVs with one sum.
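The saving is easy to see in miniature: for an additive reduce function, locally summing before shuffling collapses several IVs into one value of the same size. A toy sketch with hypothetical scalar IVs:

```python
def uncompressed_payload(ivs):
    """Standard CDC: every needed intermediate value is shuffled individually."""
    return list(ivs)

def compressed_payload(ivs):
    """C-CDC idea: for an additive reduce function, a worker pre-aggregates
    the IVs it holds, so one sum replaces several individual values."""
    return [sum(ivs)]

ivs = [3.0, 1.5, 0.5]              # hypothetical IVs V_{q,n} held locally
plain = uncompressed_payload(ivs)  # 3 values to shuffle
packed = compressed_payload(ivs)   # 1 aggregate value to shuffle
```

The reduce function receives the same contribution either way; only the number of shuffled values changes, which is what drives the load reduction.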

The compressed shuffle follows the same nested approach as in standard CDC, but the variables $L_{j,S}$ are now associated with aggregate packets and obey adjusted flow constraints. The result is a further 20–30% reduction in shuffle load over the standard (uncompressed) CDC scheme.

7. Empirical Results and Insights

Empirical studies using the above framework yield several key results:

  • Even for moderate system sizes $(K,N)\approx(4,8)$, the full MILP is computationally prohibitive, while the two-file-group method solves in seconds.
  • Across realistic ranges of Zipf popularity ($\theta=0.3$–$1.2$), the two-file-group solution is within a few percent of the MILP and within 10–15% of an information-theoretic lower bound.
  • Compared to an uncoded round-robin baseline, the proposed heterogeneous CDC achieves up to a $4\times$ reduction in shuffle load, with aggregation-based compression (C-CDC) delivering an additional 20–30% saving.
  • As popularity skew increases, redundancy is concentrated on a few highly requested files, and the two-group heuristic approaches the theoretical optimum.

Summary Table: CDC Scheme Capabilities

| CDC Scheme | Heterogeneous $M_k$, $W_k$ | Nonuniform Popularity ($p_n$) | Compressed/Aggregate-IV | Low-Complexity Approximation | Multicasting Method |
| --- | --- | --- | --- | --- | --- |
| (Deng et al., 2023) | Yes | Yes | Yes (C-CDC) | Two-file-group | Nested coded shuffling |

The flexible heterogeneous CDC paradigm (Deng et al., 2023)—which jointly optimizes worker-aware file placement and generalized coded shuffling, incorporates file popularity, and exploits data aggregation—substantially improves shuffle communication efficiency in practical distributed systems. This framework extends classical CDC to heterogeneous, data-skewed, and ML-driven settings with scalable solution methods and analytically quantifiable performance.
