
K-Dimensional Tensor Tiling

Updated 30 December 2025
  • K-dimensional tensor tiling is a method that partitions multidimensional tensors across devices to minimize communication overhead in parallel deep learning.
  • It unifies data, model, and hybrid parallelism by formulating the tiling decision as a combinatorial optimization problem with explicit cost modeling.
  • The SoyBean system operationalizes these algorithms, demonstrating significant empirical speedups by automatically transforming serial graphs into parallelized execution.

K-dimensional tensor tiling is a systematic method for splitting multidimensional tensors across a set of computational devices such that the overall communication overhead is minimized. Central to parallel deep learning, this framework subsumes data parallelism, model parallelism, and their hybrids by representing parallel strategies as tilings of tensor dimensions. Rigorous treatment of the tensor tiling problem involves combinatorial search over partitionings, explicit modeling of communication costs induced by tiling decisions, and algorithmic solutions that are optimal under practical deep neural network (DNN) architectures. The SoyBean system exemplifies the operationalization of these concepts, transforming serial computational graphs into automatically parallelized forms through optimal tensor tiling (Wang et al., 2018).

1. Formal Definition and Canonical Problem Structure

Let $T$ be a $K$th-order tensor of shape $(n_1, n_2, \ldots, n_K)$. A $K$-dimensional tiling partitions each tensor mode $k$ into $p_k$ equal, disjoint shards, with $p_k \in \mathbb{N}^+$ and each tile of length $n_k/p_k$ (assuming $n_k$ is divisible by $p_k$). The full tiling is specified by the tuple $p = (p_1, p_2, \ldots, p_K)$. For $P$ devices, the partitioning must satisfy $\prod_{k=1}^K p_k = P$, with devices indexed by $K$-vectors $(i_1, \ldots, i_K)$, $0 \leq i_k < p_k$.

The search space of tensor tilings encompasses all $K$-tuples of integers $(p_1, \ldots, p_K)$ with $\prod_{k=1}^K p_k = P$. As the device count increases, the number of feasible splits grows combinatorially. The search space can be expanded to $\prod_{k} p_k \geq P$ by replicating tiles, though the fundamental challenge remains selecting the optimal split configuration.
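
To make the search space concrete, the following sketch (a hypothetical helper, not part of the paper's tooling) enumerates every tile-count tuple $p$ with $\prod_k p_k = P$ for a given tensor order $K$:

    from itertools import product
    from math import prod

    def enumerate_tilings(K, P):
        """Yield every K-tuple p = (p_1, ..., p_K) of positive integers
        whose product equals the device count P."""
        divisors = [d for d in range(1, P + 1) if P % d == 0]
        for p in product(divisors, repeat=K):
            if prod(p) == P:
                yield p

    # Example: a 4th-order tensor on 8 devices. (8,1,1,1) is pure data
    # parallelism, (1,8,1,1) pure model parallelism along mode 2, and
    # (2,4,1,1) one of many hybrid tilings.
    tilings = list(enumerate_tilings(K=4, P=8))
    print(len(tilings))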

2. Communication-Cost Modeling for Operator Graphs

A neural network dataflow graph $G$ consists of tensors connected by tensor operations (the edges $E$). Each operator $e \in E$ reads input tensors, applies a function, and produces output tensors. Once a global tiling $p$ is assigned, the resulting tensor partitions induce “halo” exchanges or reductions when tiles cross device boundaries.

The total communication volume under a given tiling is formulated as

$C(p) = \sum_{e \in E} w_e \cdot f_e(p)$

where $w_e$ denotes the per-element data size (in bytes) or an operation-specific weight, and $f_e(p)$ computes the number of elements communicated across devices for operator $e$ under tiling $p$.
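
A minimal sketch of this cost model, with a hypothetical Operator container and a caller-supplied per-operator count standing in for $f_e(p)$ (illustrative only, not the paper's data structures):

    from dataclasses import dataclass
    from typing import Callable, Iterable, Tuple

    Tiling = Tuple[int, ...]   # the global tiling decision p = (p_1, ..., p_K)

    @dataclass
    class Operator:
        name: str
        weight: float          # w_e: per-element data size in bytes (or an op-specific weight)

    def total_comm_cost(ops: Iterable[Operator],
                        tiling: Tiling,
                        comm_elements: Callable[[Operator, Tiling], float]) -> float:
        """C(p) = sum over operators e of w_e * f_e(p); comm_elements plays the role of f_e."""
        return sum(op.weight * comm_elements(op, tiling) for op in ops)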

For matrix-matrix multiplication, each possible tiling (an assignment of splits or replications to the operands and the result) leads to a distinct communication pattern, and the possibilities fall into a small set of “aligned” cases: those incurring zero communication and those requiring a reduction of partial results. The cost function

$c(t_X, t_Y, t_Z) = \min\left\{\, c(t_X \to R) + c(t_Y \to r) + c(t_Z \to R),\; c(t_X \to r) + c(t_Y \to C) + c(t_Z \to C),\; c(t_X \to C) + c(t_Y \to R) + c(t_Z \to \text{red}) \,\right\}$

governs the conversion and reduction costs between tiling states, where $R$, $C$, and $r$ denote row-partitioned, column-partitioned, and replicated layouts, and $\text{red}$ denotes a reduction of partial results.
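
As an illustration, the sketch below evaluates the three aligned cases for $Z = XY$ under the deliberately crude assumption that converting a tensor to a different layout costs its full size; the layout codes and cost model are simplifications for exposition, not the paper's exact formulation:

    def matmul_tiling_cost(t_x, t_y, t_z, size_x, size_y, size_z):
        """Cheapest aligned strategy for Z = X @ Y.

        Layouts: 'R' = row-partitioned, 'C' = column-partitioned,
        'r' = fully replicated. Conversion cost is modeled as the full
        tensor size when the current layout differs from the required
        one, and zero otherwise."""
        def conv(current, required, size):
            return 0 if current == required else size

        cases = [
            # X row-split, Y replicated -> Z row-split, no reduction
            conv(t_x, 'R', size_x) + conv(t_y, 'r', size_y) + conv(t_z, 'R', size_z),
            # X replicated, Y column-split -> Z column-split, no reduction
            conv(t_x, 'r', size_x) + conv(t_y, 'C', size_y) + conv(t_z, 'C', size_z),
            # X column-split, Y row-split -> partial results of Z must be reduced
            conv(t_x, 'C', size_x) + conv(t_y, 'R', size_y) + size_z,
        ]
        return min(cases)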

Special parallelization cases correspond to:

  • Data parallelism: $p=(P,1,1,\ldots,1)$ (batch split, other modes replicated).
  • Model parallelism: $p_k=P$ for some mode $k$, all others $p_j=1$ (e.g., channel split).
  • Hybrid parallelism: hierarchical multi-mode cuts, e.g., $p=(2,2,1,\ldots)$, which reflects splitting one mode, then another within each subgroup.

3. Globally Optimal Tiling Algorithms

The optimization problem requires joint selection of splits across all tensors due to dependency couplings in $G$. Independent tiling per tensor is suboptimal. The solution involves a multistage recursion:

a) One-cut (2-way) tiling:

  • Transform $G$ to an undirected variant $G'$ and unroll it via BFS into levels $L_0, L_1, \ldots$.
  • Dynamic programming computes $g_\ell(\tau)$, the minimal communication cost up to level $\ell$ with the boundary tensors in tiling $\tau$:

$g_\ell(\tau_\ell) = \min_{\tau_{\ell-1}} \left[ g_{\ell-1}(\tau_{\ell-1}) + \text{level\_cost}_\ell(\tau_{\ell-1}, \tau_\ell) \right]$

  • For chain-like DNNs, the DP has $O(N \cdot |T^1|^c)$ complexity, where $|T^1| = K+1$ is the number of candidate one-cut tiling states per tensor and $c$ is a small constant; a minimal sketch of this DP follows below.
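
A minimal sketch of the level-by-level dynamic program, assuming a chain-like graph presented as a list of BFS levels and a caller-supplied level_cost function (the interface is illustrative, not SoyBean's):

    def one_cut_dp(levels, tiling_states, level_cost):
        """Minimize communication for a single 2-way cut.

        levels        : BFS level indices 1..N of the unrolled graph G'
        tiling_states : hashable boundary tilings tau (|T^1| = K+1 states:
                        split along one of the K modes, or replicate)
        level_cost    : callable (level, tau_prev, tau) -> cost incurred at
                        that level for the given pair of boundary tilings
        Returns (minimal total cost, per-level tiling choices)."""
        g = {tau: 0.0 for tau in tiling_states}        # g_0(tau)
        back = []                                      # backpointers per level
        for lvl in levels:
            g_next, choice = {}, {}
            for tau in tiling_states:
                best_prev = min(tiling_states,
                                key=lambda prev: g[prev] + level_cost(lvl, prev, tau))
                g_next[tau] = g[best_prev] + level_cost(lvl, best_prev, tau)
                choice[tau] = best_prev
            g = g_next
            back.append(choice)
        best_final = min(g, key=g.get)                 # cheapest final boundary tiling
        plan, tau = [best_final], best_final
        for choice in reversed(back):                  # walk backpointers to recover the plan
            tau = choice[tau]
            plan.append(tau)
        return g[best_final], list(reversed(plan))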

b) Recursive $k$-cut for $P=2^k$ devices:

  • Recursively apply one-cut tiling to split into two groups, then solve for $k-1$ further cuts within each group.
  • Let $(\mathcal{P}_k, \delta_k) = \text{OneCut}(G)$. Define recursively:

    Algorithm kCut(G, k):
      if k = 0: return (all-replicated, cost = 0)
      (P_k, δ_k) = OneCutTiling(G)
      G' = rebuild-graph(G, P_k)
      (T_{k-1}, c_{k-1}) = kCut(G', k-1)
      T_k = P_k ∘ T_{k-1}
      c_k = δ_k + 2·c_{k-1}
      return (T_k, c_k)
  • The total communication cost is $c_k = \sum_{i=1}^k 2^{k-i}\,\delta_i$. The recursion is globally optimal in polynomial time due to cut commutativity (the “flattening theorem”) and the greedy property $\delta_i \leq 2\delta_{i-1}$. Overall, the complexity is $O(k \cdot N \cdot (K+1)^c)$; a runnable sketch of the recursion follows below.
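
A runnable rendering of the same recursion, with the one-cut solver and graph-rebuild step passed in as callables (interfaces mirror the pseudocode above, not SoyBean's actual code):

    def k_cut(graph, k, one_cut, rebuild_graph):
        """Recursive k-cut tiling for P = 2^k devices.

        one_cut(graph)            -> (cut, delta): optimal single 2-way cut and its cost.
        rebuild_graph(graph, cut) -> the graph each half sees after the cut
                                     (tensor shapes halved along the cut mode).
        Returns (plan, cost): the cuts from outermost to innermost and
        cost = sum_i 2^(k-i) * delta_i."""
        if k == 0:
            return [], 0.0                     # all tensors replicated, no communication
        cut, delta = one_cut(graph)            # (P_k, delta_k)
        sub_plan, sub_cost = k_cut(rebuild_graph(graph, cut), k - 1,
                                   one_cut, rebuild_graph)
        # Both halves pay sub_cost independently, hence the factor of 2.
        return [cut] + sub_plan, delta + 2.0 * sub_cost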

4. Canonical Example: 4D Convolution Tiling

Consider a convolutional layer with activation tensor $A \in \mathbb{R}^{B \times C \times H \times W}$ and filter tensor $F \in \mathbb{R}^{O \times C \times R \times S}$. The convolution output $O = \text{conv}(A, F)$ has shape $B \times O \times H' \times W'$; thus $K=4$ for $A$.

Three canonical parallelization/tiling schemes:

  • Data parallel ($p=(P,1,1,1)$): batch dimension split; no forward communication; backward all-reduce on $F$ ($\text{comm} \approx 2\,\text{size}(F)$).
  • Model parallel ($p=(1,P,1,1)$): split the in-channel mode $C$; requires a reduce-sum on $O$ and analogous backward exchanges ($\text{comm} \approx 2\,\text{size}(A)\,H\,W/P$).
  • Hybrid ($p=(P_1,P_2,1,1)$ with $P_1 P_2 = P$): first split the batch, then split channels within each group; the costs are $\delta_1 = 2\,\text{size}(F)$ (all-reduce) and $\delta_2 = 2\,\text{size}(A)/P_1\,H\,W$ (per group), for a total of $\delta_1 + P_1 \delta_2$.

Numerical example:

  • $B=256$, $C=128$, $H=W=28$, $O=256$, $R=S=3$
  • $\text{size}(A) \approx 25$M, $\text{size}(F) \approx 0.3$M bytes
  • Data-parallel: $0.6$M bytes/iteration; model-parallel: $50$M; hybrid varies (e.g., $0.6$M $+\, 4 \times 12.5$M $= 50.6$M for $P_1=P_2=4$), with further trade-offs depending on $P_1$ and $P_2$ (see the arithmetic check below).
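
As an arithmetic check of the quoted figures (reading size(·) as an element count, which is an assumption; the script is purely illustrative):

    # Worked check of the stated tensor sizes and communication figures.
    B, C, H, W, O, R, S = 256, 128, 28, 28, 256, 3, 3

    size_A = B * C * H * W      # 25,690,112  (~25M)
    size_F = O * C * R * S      # 294,912     (~0.3M)

    # Data parallelism p = (P,1,1,1): backward all-reduce on F,
    # comm ~ 2 * size(F) per iteration.
    comm_data_parallel = 2 * size_F            # ~0.59M, i.e. the ~0.6M figure

    # Hybrid p = (4,4,1,1): delta_1 + P_1 * delta_2 with the stated
    # delta_1 ~ 0.6M and delta_2 ~ 12.5M per group.
    comm_hybrid = 0.6e6 + 4 * 12.5e6           # 50.6M, matching the text

    print(size_A, size_F, comm_data_parallel, comm_hybrid)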

The dynamic-programming/$k$-cut search automatically explores these hybrid strategies and selects the minimum-cost configuration.

5. System Integration: SoyBean Architecture and Empirical Results

SoyBean processes a serial dataflow graph (e.g., from MXNet or TensorFlow) and performs:

  • Optimal Tiling: Executes the $k$-cut algorithm to assign tensor-specific tile vectors $T(p)$.
  • Device Placement: Maps $K$-index blocks to physical devices, prioritizing the hardware hierarchy (slow links first, then internal cuts).
  • Graph Rewriting: Expands operators into $P$ sub-operators over the corresponding shards and inserts the halo-exchange or all-reduce operations required to handle tiling conversions.
  • Execution: Dispatches the partitioned graph to the standard dataflow runtime (a schematic sketch of the pipeline follows).
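
A schematic sketch of this pipeline with each stage passed in as a callable; all names and interfaces here are hypothetical placeholders, not SoyBean's actual API:

    def parallelize(serial_graph, k, tile, place, rewrite, execute):
        """Hypothetical four-stage pipeline mirroring the list above."""
        plan = tile(serial_graph, k)                  # 1. optimal k-cut tiling
        placement = place(plan)                       # 2. map shards to physical devices
        parallel_graph = rewrite(serial_graph,        # 3. expand operators and insert
                                 plan, placement)     #    halo-exchange / all-reduce ops
        return execute(parallel_graph)                # 4. dispatch to the dataflow runtime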

Empirical speedups on 8-GPU hardware:

  • AlexNet (batch 256): SoyBean achieves $\approx 7\times$ speedup over a single GPU, whereas data parallelism requires batch 1024 for comparable scaling.
  • VGG (batch 256): SoyBean achieves $5$–$6\times$ at 8 GPUs; data parallelism peaks at $\approx 4\times$ unless the batch size is $\gg 256$.
  • General result: Across AlexNet/VGG and batch sizes, SoyBean is $1.5$–$4\times$ faster than pure data parallelism, as it identifies hybrid splits minimizing $C(p)$ (Wang et al., 2018).

6. Implications and Generalization of K-Dimensional Tensor Tiling

  • Expressiveness: Any parallelization choice, including data, model, and mixed strategies, can be posed as a $K$-vector $p$ of splits.
  • Optimality: For chain-structured DNN graphs, the $k$-cut algorithm is provably globally optimal in polynomial time.
  • Extendability: The tiling set $T^1 = \{\text{partition along any mode } k,\ \text{replicate}\}$ can be expanded to support more advanced splits (e.g., group-convolution partitions), with the dynamic-programming/$k$-cut machinery still directly applicable.
  • Systems Integration: Elevating tensor tiling to a primary systems abstraction lets a single framework unify, and outperform, hand-tuned parallelism strategies, serving as a functional backend for any dataflow-based deep learning system.

7. Table: Parallelism Schemes under K-Dimensional Tensor Tiling

Parallelism Type     | Tiling tuple $p$         | Communication Pattern
Data Parallelism     | $(P,1,1,\ldots,1)$       | Batch split, all-reduce on weights
Model Parallelism    | $(1,P,1,\ldots,1)$       | Split one model axis, reduce on output
Hybrid Parallelism   | $(P_1,P_2,\ldots)$       | Hierarchical, mixes splits and reductions

These canonical strategies exemplify how the tensor tiling framework encapsulates parallelism choices and highlights the trade-offs in communication cost and empirical efficiency (Wang et al., 2018).
