
K-Dimensional Tensor Tiling

Updated 30 December 2025
  • K-dimensional tensor tiling is a method that partitions multidimensional tensors across devices to minimize communication overhead in parallel deep learning.
  • It unifies data, model, and hybrid parallelism by formulating the tiling decision as a combinatorial optimization problem with explicit cost modeling.
  • The SoyBean system operationalizes these algorithms, demonstrating significant empirical speedups by automatically transforming serial graphs into parallelized execution.

K-dimensional tensor tiling is a systematic method for splitting multidimensional tensors across a set of computational devices such that the overall communication overhead is minimized. Central to parallel deep learning, this framework subsumes data parallelism, model parallelism, and their hybrids by representing parallel strategies as tilings of tensor dimensions. Rigorous treatment of the tensor tiling problem involves combinatorial search over partitionings, explicit modeling of communication costs induced by tiling decisions, and algorithmic solutions that are optimal under practical deep neural network (DNN) architectures. The SoyBean system exemplifies the operationalization of these concepts, transforming serial computational graphs into automatically parallelized forms through optimal tensor tiling (Wang et al., 2018).

1. Formal Definition and Canonical Problem Structure

Let $T$ be a $K$th-order tensor of shape $(n_1, n_2, \ldots, n_K)$. A $K$-dimensional tiling partitions each tensor mode $k$ into $p_k$ equal, disjoint shards, with $p_k \in \mathbb{N}^+$ and each tile of length $n_k/p_k$ (assuming $n_k$ is divisible by $p_k$). The full tiling is specified by the tuple $p = (p_1, p_2, \ldots, p_K)$. For $P$ devices, the partitioning must satisfy $\prod_{k=1}^K p_k = P$, with devices indexed by $K$-vectors $(i_1, \ldots, i_K)$, $0 \leq i_k < p_k$.

The search space of tensor tilings encompasses all $K$-tuples of integers $(p_1, \ldots, p_K)$ with $\prod_{k=1}^K p_k = P$. As the device count increases, the number of feasible splits grows combinatorially. The search space can be expanded to $\prod_{k} p_k \geq P$ by replicating tiles, though the fundamental challenge remains selecting the optimal split configuration.
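
To make the search space concrete, the following sketch (a hypothetical helper, not part of the paper's tooling) enumerates every tile-count tuple $p$ with $\prod_k p_k = P$ for a given tensor order $K$:

    from itertools import product
    from math import prod

    def enumerate_tilings(K, P):
        """Yield every K-tuple p = (p_1, ..., p_K) of positive integers
        whose product equals the device count P."""
        divisors = [d for d in range(1, P + 1) if P % d == 0]
        for p in product(divisors, repeat=K):
            if prod(p) == P:
                yield p

    # Example: a 4th-order tensor on 8 devices. (8,1,1,1) is pure data
    # parallelism, (1,8,1,1) pure model parallelism along mode 2, and
    # (2,4,1,1) one of many hybrid tilings.
    tilings = list(enumerate_tilings(K=4, P=8))
    print(len(tilings))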

2. Communication-Cost Modeling for Operator Graphs

A neural network dataflow graph $G$ consists of tensors connected by tensor operations (the edges $E$). Each operator $e \in E$ reads input tensors, applies a function, and produces output tensors. Once a global tiling $p$ is assigned, the resulting tensor partitions induce “halo” exchanges or reductions when tiles cross device boundaries.

The total communication volume under a given tiling is formulated as

$C(p) = \sum_{e \in E} w_e \cdot f_e(p)$

where $w_e$ denotes the per-element data size (in bytes) or an operation-specific weight, and $f_e(p)$ computes the number of elements communicated across devices for operator $e$ under tiling $p$.
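
A minimal sketch of this cost model, with a hypothetical Operator container and a caller-supplied per-operator count standing in for $f_e(p)$ (illustrative only, not the paper's data structures):

    from dataclasses import dataclass
    from typing import Callable, Iterable, Tuple

    Tiling = Tuple[int, ...]   # the global tiling decision p = (p_1, ..., p_K)

    @dataclass
    class Operator:
        name: str
        weight: float          # w_e: per-element data size in bytes (or an op-specific weight)

    def total_comm_cost(ops: Iterable[Operator],
                        tiling: Tiling,
                        comm_elements: Callable[[Operator, Tiling], float]) -> float:
        """C(p) = sum over operators e of w_e * f_e(p); comm_elements plays the role of f_e."""
        return sum(op.weight * comm_elements(op, tiling) for op in ops)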

For matrix-matrix multiplication, each possible tiling (an assignment of splits or replications to the operands and the result) leads to a distinct communication pattern, and the possibilities fall into a small set of “aligned” cases: those incurring zero communication and those requiring a reduction of partial results. The cost function

$c(t_X, t_Y, t_Z) = \min\left\{\, c(t_X \to R) + c(t_Y \to r) + c(t_Z \to R),\; c(t_X \to r) + c(t_Y \to C) + c(t_Z \to C),\; c(t_X \to C) + c(t_Y \to R) + c(t_Z \to \text{red}) \,\right\}$

governs the conversion and reduction costs between tiling states, where $R$, $C$, and $r$ denote row-partitioned, column-partitioned, and replicated layouts, and $\text{red}$ denotes a reduction of partial results.
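
As an illustration, the sketch below evaluates the three aligned cases for $Z = XY$ under the deliberately crude assumption that converting a tensor to a different layout costs its full size; the layout codes and cost model are simplifications for exposition, not the paper's exact formulation:

    def matmul_tiling_cost(t_x, t_y, t_z, size_x, size_y, size_z):
        """Cheapest aligned strategy for Z = X @ Y.

        Layouts: 'R' = row-partitioned, 'C' = column-partitioned,
        'r' = fully replicated. Conversion cost is modeled as the full
        tensor size when the current layout differs from the required
        one, and zero otherwise."""
        def conv(current, required, size):
            return 0 if current == required else size

        cases = [
            # X row-split, Y replicated -> Z row-split, no reduction
            conv(t_x, 'R', size_x) + conv(t_y, 'r', size_y) + conv(t_z, 'R', size_z),
            # X replicated, Y column-split -> Z column-split, no reduction
            conv(t_x, 'r', size_x) + conv(t_y, 'C', size_y) + conv(t_z, 'C', size_z),
            # X column-split, Y row-split -> partial results of Z must be reduced
            conv(t_x, 'C', size_x) + conv(t_y, 'R', size_y) + size_z,
        ]
        return min(cases)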

Special parallelization cases correspond to:

  • Data parallelism: $p=(P,1,1,\ldots,1)$ (batch split, other modes replicated).
  • Model parallelism: $p_k=P$ for some mode $k$, all others $p_j=1$ (e.g., channel split).
  • Hybrid parallelism: hierarchical multi-mode cuts, e.g., $p=(2,2,1,\ldots)$, which reflects splitting one mode, then another within each subgroup.

3. Globally Optimal Tiling Algorithms

The optimization problem requires joint selection of splits across all tensors due to dependency couplings in $G$. Independent tiling per tensor is suboptimal. The solution involves a multistage recursion:

a) One-cut (2-way) tiling:

  • Transform $G$ to an undirected variant $G'$ and unroll it via BFS into levels $L_0, L_1, \ldots$.
  • Dynamic programming computes $g_\ell(\tau)$, the minimal communication cost up to level $\ell$ with the boundary tensors in tiling $\tau$:

$g_\ell(\tau_\ell) = \min_{\tau_{\ell-1}} \left[ g_{\ell-1}(\tau_{\ell-1}) + \text{level\_cost}_\ell(\tau_{\ell-1}, \tau_\ell) \right]$

  • For chain-like DNNs, the DP has $O(N \cdot |T^1|^c)$ complexity, where $|T^1| = K+1$ is the number of candidate one-cut tiling states per tensor and $c$ is a small constant; a minimal sketch of this DP follows below.
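
A minimal sketch of the level-by-level dynamic program, assuming a chain-like graph presented as a list of BFS levels and a caller-supplied level_cost function (the interface is illustrative, not SoyBean's):

    def one_cut_dp(levels, tiling_states, level_cost):
        """Minimize communication for a single 2-way cut.

        levels        : BFS level indices 1..N of the unrolled graph G'
        tiling_states : hashable boundary tilings tau (|T^1| = K+1 states:
                        split along one of the K modes, or replicate)
        level_cost    : callable (level, tau_prev, tau) -> cost incurred at
                        that level for the given pair of boundary tilings
        Returns (minimal total cost, per-level tiling choices)."""
        g = {tau: 0.0 for tau in tiling_states}        # g_0(tau)
        back = []                                      # backpointers per level
        for lvl in levels:
            g_next, choice = {}, {}
            for tau in tiling_states:
                best_prev = min(tiling_states,
                                key=lambda prev: g[prev] + level_cost(lvl, prev, tau))
                g_next[tau] = g[best_prev] + level_cost(lvl, best_prev, tau)
                choice[tau] = best_prev
            g = g_next
            back.append(choice)
        best_final = min(g, key=g.get)                 # cheapest final boundary tiling
        plan, tau = [best_final], best_final
        for choice in reversed(back):                  # walk backpointers to recover the plan
            tau = choice[tau]
            plan.append(tau)
        return g[best_final], list(reversed(plan))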

b) Recursive $k$-cut for $P=2^k$ devices:

  • Recursively apply one-cut tiling to split into two groups, then solve for $k-1$ further cuts within each group.
  • Let $(\mathcal{P}_k, \delta_k) = \text{OneCut}(G)$. Define recursively:

    Algorithm kCut(G, k):
      if k = 0: return (all-replicated, cost = 0)
      (P_k, δ_k) = OneCutTiling(G)
      G' = rebuild-graph(G, P_k)
      (T_{k-1}, c_{k-1}) = kCut(G', k-1)
      T_k = P_k ∘ T_{k-1}
      c_k = δ_k + 2·c_{k-1}
      return (T_k, c_k)
  • The total communication cost is $c_k = \sum_{i=1}^k 2^{k-i}\,\delta_i$. The recursion is globally optimal in polynomial time due to cut commutativity (the “flattening theorem”) and the greedy property $\delta_i \leq 2\delta_{i-1}$. Overall, the complexity is $O(k \cdot N \cdot (K+1)^c)$; a runnable sketch of the recursion follows below.
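
A runnable rendering of the same recursion, with the one-cut solver and graph-rebuild step passed in as callables (interfaces mirror the pseudocode above, not SoyBean's actual code):

    def k_cut(graph, k, one_cut, rebuild_graph):
        """Recursive k-cut tiling for P = 2^k devices.

        one_cut(graph)            -> (cut, delta): optimal single 2-way cut and its cost.
        rebuild_graph(graph, cut) -> the graph each half sees after the cut
                                     (tensor shapes halved along the cut mode).
        Returns (plan, cost): the cuts from outermost to innermost and
        cost = sum_i 2^(k-i) * delta_i."""
        if k == 0:
            return [], 0.0                     # all tensors replicated, no communication
        cut, delta = one_cut(graph)            # (P_k, delta_k)
        sub_plan, sub_cost = k_cut(rebuild_graph(graph, cut), k - 1,
                                   one_cut, rebuild_graph)
        # Both halves pay sub_cost independently, hence the factor of 2.
        return [cut] + sub_plan, delta + 2.0 * sub_cost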

4. Canonical Example: 4D Convolution Tiling

Consider a convolutional layer with activation tensor $A \in \mathbb{R}^{B \times C \times H \times W}$ and filter tensor $F \in \mathbb{R}^{O \times C \times R \times S}$. The convolution output $O = \text{conv}(A, F)$ has shape $B \times O \times H' \times W'$; thus $K=4$ for $A$.

Three canonical parallelization/tiling schemes:

  • Data parallel ($p=(P,1,1,1)$): batch dimension split; no forward communication; backward all-reduce on $F$ ($\text{comm} \approx 2\,\text{size}(F)$).
  • Model parallel ($p=(1,P,1,1)$): split the in-channel mode $C$; requires a reduce-sum on $O$ and analogous backward exchanges ($\text{comm} \approx 2\,\text{size}(A)\,H\,W/P$).
  • Hybrid ($p=(P_1,P_2,1,1)$ with $P_1 P_2 = P$): first split the batch, then split channels within each group; the costs are $\delta_1 = 2\,\text{size}(F)$ (all-reduce) and $\delta_2 = 2\,\text{size}(A)/P_1\,H\,W$ (per group), for a total of $\delta_1 + P_1 \delta_2$.

Numerical example:

  • $B=256$, $C=128$, $H=W=28$, $O=256$, $R=S=3$
  • $\text{size}(A) \approx 25$M, $\text{size}(F) \approx 0.3$M bytes
  • Data-parallel: $0.6$M bytes/iteration; model-parallel: $50$M; hybrid varies (e.g., $0.6$M $+\, 4 \times 12.5$M $= 50.6$M for $P_1=P_2=4$), with further trade-offs depending on $P_1$ and $P_2$ (see the arithmetic check below).
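
As an arithmetic check of the quoted figures (reading size(·) as an element count, which is an assumption; the script is purely illustrative):

    # Worked check of the stated tensor sizes and communication figures.
    B, C, H, W, O, R, S = 256, 128, 28, 28, 256, 3, 3

    size_A = B * C * H * W      # 25,690,112  (~25M)
    size_F = O * C * R * S      # 294,912     (~0.3M)

    # Data parallelism p = (P,1,1,1): backward all-reduce on F,
    # comm ~ 2 * size(F) per iteration.
    comm_data_parallel = 2 * size_F            # ~0.59M, i.e. the ~0.6M figure

    # Hybrid p = (4,4,1,1): delta_1 + P_1 * delta_2 with the stated
    # delta_1 ~ 0.6M and delta_2 ~ 12.5M per group.
    comm_hybrid = 0.6e6 + 4 * 12.5e6           # 50.6M, matching the text

    print(size_A, size_F, comm_data_parallel, comm_hybrid)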

The dynamic-programming/$k$-cut search automatically explores these hybrid strategies and selects the minimum-cost configuration.

5. System Integration: SoyBean Architecture and Empirical Results

SoyBean processes a serial dataflow graph (e.g., from MXNet or TensorFlow) and performs:

  • Optimal Tiling: Executes the $k$-cut algorithm to assign tensor-specific tile vectors $T(p)$.
  • Device Placement: Maps $K$-index blocks to physical devices, prioritizing the hardware hierarchy (slow links first, then internal cuts).
  • Graph Rewriting: Expands operators into $P$ sub-operators over the corresponding shards and inserts the halo-exchange or all-reduce operations required to handle tiling conversions.
  • Execution: Dispatches the partitioned graph to the standard dataflow runtime (a schematic sketch of the pipeline follows).
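
A schematic sketch of this pipeline with each stage passed in as a callable; all names and interfaces here are hypothetical placeholders, not SoyBean's actual API:

    def parallelize(serial_graph, k, tile, place, rewrite, execute):
        """Hypothetical four-stage pipeline mirroring the list above."""
        plan = tile(serial_graph, k)                  # 1. optimal k-cut tiling
        placement = place(plan)                       # 2. map shards to physical devices
        parallel_graph = rewrite(serial_graph,        # 3. expand operators and insert
                                 plan, placement)     #    halo-exchange / all-reduce ops
        return execute(parallel_graph)                # 4. dispatch to the dataflow runtime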

Empirical speedups on 8-GPU hardware:

  • AlexNet (batch 256): SoyBean achieves $\approx 7\times$ speedup over a single GPU, whereas data parallelism requires batch 1024 for comparable scaling.
  • VGG (batch 256): SoyBean achieves $5$–$6\times$ at 8 GPUs; data parallelism peaks at $\approx 4\times$ unless the batch size is $\gg 256$.
  • General result: Across AlexNet/VGG and batch sizes, SoyBean is $1.5$–$4\times$ faster than pure data parallelism, as it identifies hybrid splits minimizing $C(p)$ (Wang et al., 2018).

6. Implications and Generalization of K-Dimensional Tensor Tiling

  • Expressiveness: Any parallelization choice, including data, model, and mixed strategies, can be posed as a $K$-vector $p$ of splits.
  • Optimality: For chain-structured DNN graphs, the $k$-cut algorithm is provably globally optimal in polynomial time.
  • Extendability: The tiling set $T^1 = \{\text{partition along any mode } k,\ \text{replicate}\}$ can be expanded to support more advanced splits (e.g., group-convolution partitions), with the dynamic-programming/$k$-cut machinery still directly applicable.
  • Systems Integration: Elevating tensor tiling to a primary systems abstraction lets a single framework unify, and outperform, hand-tuned parallelism strategies, serving as a functional backend for any dataflow-based deep learning system.

7. Table: Parallelism Schemes under K-Dimensional Tensor Tiling

Parallelism Type     | Tiling tuple $p$         | Communication Pattern
Data Parallelism     | $(P,1,1,\ldots,1)$       | Batch split, all-reduce on weights
Model Parallelism    | $(1,P,1,\ldots,1)$       | Split one model axis, reduce on output
Hybrid Parallelism   | $(P_1,P_2,\ldots)$       | Hierarchical, mixes splits and reductions

These canonical strategies exemplify how the tensor tiling framework encapsulates parallelism choices and highlights the trade-offs in communication cost and empirical efficiency (Wang et al., 2018).
