Coded Distributed Computing Optimization
- Coded Distributed Computing (CDC) is a framework that reduces shuffle load by introducing structured computation redundancy.
- It optimizes heterogeneous systems via joint file placement and nested coded shuffling schemes under nonuniform file popularity.
- Low-complexity methods like the two-file-group heuristic and C-CDC substantially reduce the communication load, achieving near-optimal performance.
Coded Distributed Computing (CDC) is a framework for minimizing the communication bottleneck in distributed MapReduce systems by intentionally introducing structured computation redundancy, thereby maximizing multicasting opportunities and reducing the amount of data exchanged in the shuffle phase. Recent developments have extended CDC to heterogeneous environments with nonuniform file popularity, flexible worker storage and computation capacities, and application-driven optimizations. This article focuses on the design, analysis, and optimization of heterogeneous CDC systems under nonuniform file popularity, with emphasis on the nested coded shuffling paradigm, optimization formulations, low-complexity approximation methods, and shuffling load reduction through data aggregation (Deng et al., 2023).
1. System Model and Heterogeneity
CDC is considered in a setting with $K$ workers, where worker $k$ stores up to $m_k$ files and is responsible for a fraction of the reduce functions, with the fractions summing to one across workers. The input file library contains $N$ files with popularity profile $\mathbf{p} = (p_1, \ldots, p_N)$, typically Zipf-like, imposing nonuniform file request probabilities. Each job requests a subset $\mathcal{F}$ of fixed size $F$, drawn with induced product probability
$$\Pr(\mathcal{F}) \;\propto\; \prod_{n \in \mathcal{F}} p_n,$$
to model skewed access patterns and probabilistic locality.
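As a concrete illustration, the popularity profile and job sampling above can be sketched in Python. The function names are illustrative, and the simple redraw loop only approximately realizes the product-form distribution for subsets drawn without replacement:

```python
import random

def zipf_popularity(n_files: int, theta: float) -> list[float]:
    """Zipf-like profile: p_n proportional to n^(-theta), normalized."""
    weights = [n ** (-theta) for n in range(1, n_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_job(pop: list[float], job_size: int, rng: random.Random) -> set[int]:
    """Draw a job's file subset of fixed size without replacement;
    duplicates are simply redrawn, approximating the product probability."""
    files = list(range(len(pop)))
    chosen: set[int] = set()
    while len(chosen) < job_size:
        f = rng.choices(files, weights=pop, k=1)[0]
        chosen.add(f)  # a duplicate draw leaves the set unchanged
    return chosen
```

Popular (low-index) files appear in jobs far more often than unpopular ones, which is exactly the skew the placement design exploits.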
Each worker $k$ stores up to $m_k$ files and is assigned a subset of the reduce functions. File placement is encoded using binary variables $x_{n,\mathcal{S}} \in \{0,1\}$, indicating storage of file $n$ on exactly the subset $\mathcal{S}$ of workers. The assignment must satisfy that every file is placed ($\sum_{\mathcal{S}} x_{n,\mathcal{S}} = 1$ for all $n$) and that each worker's storage is not exceeded ($\sum_{n} \sum_{\mathcal{S} \ni k} x_{n,\mathcal{S}} \le m_k$ for each $k$).
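A minimal sketch of these two placement constraints, assuming a placement is represented as a mapping from file index to the worker subset that stores it (the representation and function name are hypothetical):

```python
def placement_feasible(placement: dict[int, frozenset[int]],
                       n_files: int, capacity: list[int]) -> bool:
    """Check the two constraints from the text: every file sits on exactly
    one nonempty worker subset, and no worker's storage limit is exceeded."""
    # every file placed on exactly one nonempty subset
    if set(placement) != set(range(n_files)):
        return False
    if any(len(subset) == 0 for subset in placement.values()):
        return False
    # worker storage: worker k holds every file whose subset contains k
    load = [0] * len(capacity)
    for subset in placement.values():
        for k in subset:
            load[k] += 1
    return all(load[k] <= capacity[k] for k in range(len(capacity)))
```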
2. File Placement under Nonuniform Popularity
Optimal file placement jointly considers the workers' heterogeneity and the nonuniform popularity distribution. Each file $n$ is assigned to a specific nonempty worker subset $\mathcal{S}$ via the variable $x_{n,\mathcal{S}}$. The resulting placement determines, for the current job, how many requested files reside exactly on each subset $\mathcal{S}$, which in turn fixes the available multicasting opportunities.
Due to the combinatorial explosion in the number of worker subsets ($2^K - 1$ nonempty subsets), directly solving for the best placement is intractable even for moderate $K$. To make this practical, a two-file-group heuristic groups the files into "popular" and "unpopular" classes by selecting a threshold $n_0$. The $n_0$ most popular files are stored greedily to maximize redundancy (filling remaining storage), while the less popular files are placed minimally (each mapped to only one worker), reducing mapping redundancy and storage cost.
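The heuristic placement can be sketched as follows, under the assumption that total storage is at least $N$ so every file fits; the function name and the cyclic replication order are illustrative choices, not the paper's exact procedure:

```python
def two_group_placement(n_files: int, capacity: list[int],
                        n0: int) -> dict[int, set[int]]:
    """Files 0..n0-1 (most popular) are replicated greedily into spare
    storage; files n0..n_files-1 each get a single copy, round-robin.
    Assumes sum(capacity) >= n_files so every file can be placed."""
    K = len(capacity)
    remaining = list(capacity)
    placement: dict[int, set[int]] = {f: set() for f in range(n_files)}
    # unpopular files: one copy each, round-robin over workers with room
    k = 0
    for f in range(n0, n_files):
        while remaining[k % K] == 0:
            k += 1
        placement[f].add(k % K)
        remaining[k % K] -= 1
        k += 1
    # popular files: keep adding copies, most popular first, while any
    # worker still has spare storage and lacks the file
    progress = True
    while progress:
        progress = False
        for f in range(n0):
            for w in range(K):
                if remaining[w] > 0 and w not in placement[f]:
                    placement[f].add(w)
                    remaining[w] -= 1
                    progress = True
                    break
    return placement
```

Popular files end up on many workers (large multicast groups), while each unpopular file occupies a single storage slot.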
3. Nested Coded Shuffling Scheme
The coded shuffling stage delivers, for each worker $k$, all intermediate values (IVs) required for its assigned reduce tasks that come from files not locally stored. The nested shuffling is a recursive procedure defined for each worker subset $\mathcal{S}$ with $|\mathcal{S}| \ge 2$, proceeding from the largest subsets downward:
- Step 1: For each $k \in \mathcal{S}$, identify the residual IVs needed by worker $k$ that are available at all workers in $\mathcal{S} \setminus \{k\}$ and have not already been coded in larger supersets.
- Step 2: Within $\mathcal{S}$, split each residual into a segment of equal normalized size for all $k \in \mathcal{S}$, to be coded at this level. The remainder is carried down to be resolved at subsets of lower cardinality.
- Step 3: Each worker $k \in \mathcal{S}$ forms the coded packet
$$X_{k,\mathcal{S}} \;=\; \bigoplus_{j \in \mathcal{S} \setminus \{k\}} V_{j,\mathcal{S}},$$
the XOR of the segments $V_{j,\mathcal{S}}$ needed by the other workers $j \in \mathcal{S} \setminus \{k\}$, and multicasts it to $\mathcal{S} \setminus \{k\}$.
This process recursively continues down to subsets of size two, where any remaining residuals are delivered by unicast. This scheme generalizes the coded multicasting gain to settings with nonuniform file popularity, variable IV sizes, and arbitrary file placements.
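The XOR multicast of Step 3 and its decoding can be sketched for a single subset $\mathcal{S}$. The dictionary-based interface and the representation of segments as equal-length byte strings are illustrative assumptions:

```python
def xor_bytes(blocks: list[bytes]) -> bytes:
    """Bitwise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def multicast_round(subset: list[int], needed: dict[int, bytes]) -> dict[int, bytes]:
    """One nested-shuffling round for subset S: worker k multicasts the XOR
    of the equal-size segments needed by every other worker in S (each such
    segment is assumed stored by all workers in S except its requester)."""
    return {k: xor_bytes([needed[j] for j in subset if j != k]) for k in subset}

def decode(receiver: int, sender: int, subset: list[int],
           packet: bytes, needed: dict[int, bytes]) -> bytes:
    """The receiver XORs out the segments it already stores locally and
    recovers its own missing segment from the sender's packet."""
    known = [needed[j] for j in subset if j not in (receiver, sender)]
    return xor_bytes([packet] + known)
```

One packet per worker serves all other members of the subset simultaneously, which is the source of the multicasting gain.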
The total (expected) shuffle load is expressed as
$$\bar{L} \;=\; \mathbb{E}_{\mathcal{F}} \Bigg[ \sum_{\mathcal{S} \subseteq [K] \,:\, |\mathcal{S}| \ge 2} \; \sum_{k \in \mathcal{S}} \big| X_{k,\mathcal{S}} \big| \Bigg],$$
the expected total size of all coded packets, subject to the placement and nesting constraints.
4. Joint Placement and Shuffle Optimization
The joint design seeks the placement $x_{n,\mathcal{S}}$, the reduce-function assignment, and the nested residual variables that minimize the expected shuffle load $\bar{L}$, subject to capacity and recursion constraints. The resulting problem is a large-scale mixed-integer linear program (MILP): binary placement variables coupled with linear flow-conservation constraints for the nesting at every subset $\mathcal{S}$. The problem is NP-hard, and its size grows exponentially with $K$.
5. Two-File-Group Low-Complexity Method
For scalability, a two-file-group approach reduces search complexity by fixing the placement as follows: unpopular files are round-robin assigned to minimize redundancy, and the most popular files fill remaining storage greedily. For each threshold $n_0$, the remaining subproblem with fixed placement becomes an LP in the shuffling design variables, solved efficiently for all $N$ candidate thresholds.
The best split $n_0^\star$ is selected to minimize the resulting expected load. In practice, sweeping all $N$ thresholds suffices to ensure tractability, and numerical results indicate this heuristic achieves within a few percent of the full MILP solution and within 10–15% of a fundamental lower bound. For highly skewed popularity (large Zipf exponent), the two-group solution is nearly optimal: within 3% of the bound.
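The threshold sweep can be sketched generically; `evaluate_load` stands in for solving the fixed-placement LP at a given split (both names are hypothetical):

```python
from typing import Callable

def best_threshold(n_files: int,
                   evaluate_load: Callable[[int], float]) -> tuple[int, float]:
    """Solve the fixed-placement shuffling subproblem for every candidate
    split n0 (abstracted as evaluate_load, e.g., one LP solve per n0)
    and return the minimizing threshold and its expected load."""
    best_n0, best_load = 0, float('inf')
    for n0 in range(n_files + 1):
        load = evaluate_load(n0)
        if load < best_load:
            best_n0, best_load = n0, load
    return best_n0, best_load
```

The overall cost is thus $N+1$ LP solves rather than one exponential-size MILP.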
| Approach | Complexity | Performance gap |
|---|---|---|
| MILP (branch-and-cut) | Exponential in $K$ | Baseline (optimal) |
| Two-file-group + LP | $N$ LPs | Within 3–15% |
6. Aggregate Jobs and Heterogeneous Compressed CDC (C-CDC)
For applications where the reduce functions are aggregate (e.g., mini-batch gradients, sums), the framework introduces a compressed CDC (C-CDC) variant that leverages local aggregation to reduce shuffle load. For aggregates of the form
$$u \;=\; \sum_{n \in \mathcal{F}} v_n,$$
where $v_n$ is the IV computed from file $n$, each worker pre-aggregates the IVs of the files it holds, so only a single $T$-bit partial aggregate needs to be delivered per group, replacing individual IVs with one sum.
The compressed shuffle follows the same nested approach as in standard CDC, but the design variables are now associated with aggregate packets and obey adjusted flow constraints. The result is a further 20–30% reduction in shuffling load over the standard (uncompressed) CDC scheme.
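The saving from local aggregation can be illustrated with a toy sketch (function names are illustrative): a worker holding several files ships one partial sum instead of one IV per file, and partial sums compose exactly at the reducer:

```python
def payload_uncompressed(local_ivs: list[float]) -> list[float]:
    """Standard CDC: one intermediate value shipped per locally mapped file."""
    return list(local_ivs)

def payload_compressed(local_ivs: list[float]) -> list[float]:
    """C-CDC for aggregate reduces u = sum_n v_n: the worker pre-aggregates
    its IVs and ships a single partial sum."""
    return [sum(local_ivs)]

def reduce_aggregate(partial_sums: list[float]) -> float:
    """Partial aggregates compose exactly, so the reducer just adds them."""
    return sum(partial_sums)
```

The shuffle payload shrinks from one value per locally held file to one value per worker per group, which is where the additional load reduction comes from.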
7. Empirical Results and Insights
Empirical studies using the above framework yield several key results:
- For moderate system sizes, the full MILP is already computationally infeasible, while the two-file-group method solves in seconds.
- Across realistic ranges of Zipf popularity (exponents up to $1.2$), the two-file-group solution is within a few percent of the MILP and within 10–15% of a relaxation-based lower bound.
- Compared to an uncoded round-robin baseline, the proposed heterogeneous CDC achieves a substantial reduction in shuffle load, with aggregation-based compression (C-CDC) delivering an additional 20–30% saving.
- As popularity skew increases, redundancy is concentrated on a few highly requested files, and the two-group heuristic approaches the theoretical optimum.
Summary Table: CDC Scheme Capabilities
| CDC Scheme | Heterogeneous Storage/Computation | Nonuniform Popularity | Compressed/Aggregate IVs | Low-Complexity Approximation | Multicasting Method |
|---|---|---|---|---|---|
| (Deng et al., 2023) | Yes | Yes | Yes (C-CDC) | Two-file-group | Nested coded shuffling |
The flexible heterogeneous CDC paradigm (Deng et al., 2023)—which jointly optimizes worker-aware file placement and generalized coded shuffling, incorporates file popularity, and exploits data aggregation—substantially improves shuffle communication efficiency in practical distributed systems. This framework extends classical CDC to heterogeneous, data-skewed, and ML-driven settings with scalable solution methods and analytically quantifiable performance.