K-Dimensional Tensor Tiling
- K-dimensional tensor tiling is a method that partitions multidimensional tensors across devices to minimize communication overhead in parallel deep learning.
- It unifies data, model, and hybrid parallelism by formulating the tiling decision as a combinatorial optimization problem with explicit cost modeling.
- The SoyBean system operationalizes these algorithms, demonstrating significant empirical speedups by automatically transforming serial graphs into parallelized execution.
K-dimensional tensor tiling is a systematic method for splitting multidimensional tensors across a set of computational devices such that the overall communication overhead is minimized. Central to parallel deep learning, this framework subsumes data parallelism, model parallelism, and their hybrids by representing parallel strategies as tilings of tensor dimensions. Rigorous treatment of the tensor tiling problem involves combinatorial search over partitionings, explicit modeling of communication costs induced by tiling decisions, and algorithmic solutions that are optimal under practical deep neural network (DNN) architectures. The SoyBean system exemplifies the operationalization of these concepts, transforming serial computational graphs into automatically parallelized forms through optimal tensor tiling (Wang et al., 2018).
1. Formal Definition and Canonical Problem Structure
Let $T$ be a $K$-th-order tensor of shape $n_1 \times n_2 \times \cdots \times n_K$. A $K$-dimensional tiling partitions each tensor mode $i$ into $s_i$ equal, disjoint shards, with $s_i \ge 1$ and each tile of length $n_i / s_i$ (assuming $n_i$ is divisible by $s_i$). The full tiling is specified by the tuple $(s_1, \ldots, s_K)$. For $P$ devices, the partitioning must satisfy $\prod_{i=1}^{K} s_i = P$, with devices indexed by $K$-vectors $(j_1, \ldots, j_K)$, $0 \le j_i < s_i$.
The search space of tensor tilings encompasses all $K$-tuples of positive integers $(s_1, \ldots, s_K)$ with $\prod_i s_i = P$. With increasing device count, the number of possible splits grows combinatorially. Expanding the search space further by replicating tiles across the remaining devices is possible, though the fundamental challenge remains finding the optimal split configuration.
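A minimal sketch of this definition (helper and variable names are hypothetical, not from SoyBean) maps a split tuple $(s_1, \ldots, s_K)$ and a device index $(j_1, \ldots, j_K)$ to the slice of the tensor owned by that device, assuming each $n_i$ divides evenly:

```python
import numpy as np

def shard_slices(shape, splits, device_index):
    """Return the per-mode slices owned by one device under a K-dim tiling.

    shape        : (n_1, ..., n_K) full tensor shape
    splits       : (s_1, ..., s_K) number of shards per mode
    device_index : (j_1, ..., j_K) with 0 <= j_i < s_i
    """
    slices = []
    for n, s, j in zip(shape, splits, device_index):
        assert n % s == 0, "each mode must divide evenly"
        tile = n // s
        slices.append(slice(j * tile, (j + 1) * tile))
    return tuple(slices)

# Example: a 4-D activation tensor split 2-way on batch and 4-way on channels
# across P = 2 * 4 = 8 devices.
T = np.zeros((256, 64, 32, 32))      # (batch, channels, height, width)
splits = (2, 4, 1, 1)
local = T[shard_slices(T.shape, splits, (1, 2, 0, 0))]
print(local.shape)                   # (128, 16, 32, 32)
```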
2. Communication-Cost Modeling for Operator Graphs
A neural network dataflow graph $G = (V, E)$ consists of tensor operations (nodes) connected by tensors (edges). Each operator reads input tensors, applies functions, and produces output tensors. Once a global tiling is assigned, tensor partitions induce “halo” exchanges or reductions when tiles cross device boundaries.
The total communication volume under a given tiling $T$ is formulated as
$$C(T) = \sum_{v \in V} w_v \cdot \mathrm{comm}_v(T),$$
where $w_v$ denotes the per-element data size (in bytes) or an operation-specific weight, and $\mathrm{comm}_v(T)$ computes the number of elements communicated across devices for operator $v$ under tiling $T$.
Examining matrix-matrix multiplication $C = A \cdot B$, each possible tiling (assignment of splits or replications to operands and result) leads to distinct communication patterns, often falling into a small set of cases: “aligned” ones incurring zero communication, and ones requiring a reduction of partial results. The cost function $\delta(t', t)$, defined over pairs of tiling states, governs the conversion and reduction costs between tiling states.
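As a concrete illustration of the aligned versus reduction cases, the following sketch (hypothetical function and state names, not from the SoyBean implementation) enumerates basic tiling states for the two operands of a matrix product and reports the elements communicated for each combination:

```python
def matmul_comm_elems(tile_A, tile_B, m, k, n):
    """Elements communicated for C = A @ B (A is m x k, B is k x n)
    under one combination of operand tilings.  Only the canonical
    cases are modeled; anything else is treated as re-tiling the
    larger operand before multiplying (a crude upper bound)."""
    if tile_A == "row" and tile_B == "rep":
        return 0                    # each device owns whole rows of C
    if tile_A == "rep" and tile_B == "col":
        return 0                    # each device owns whole columns of C
    if tile_A == "rep" and tile_B == "rep":
        return 0                    # fully replicated: redundant compute, no comm
    if tile_A == "col" and tile_B == "row":
        return m * n                # partial C everywhere -> reduce |C| elements
    return max(m * k, k * n)        # fallback: re-tile the larger operand

# Enumerate the tiling combinations for 1024 x 1024 matrices.
for ta in ("row", "col", "rep"):
    for tb in ("row", "col", "rep"):
        elems = matmul_comm_elems(ta, tb, 1024, 1024, 1024)
        print(f"A:{ta:>3}  B:{tb:>3}  comm = {elems:>9,d} elements")
```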
Special parallelization cases correspond to:
- Data parallelism: $s_{\text{batch}} = P$, all other modes $s_i = 1$ (batch split, other modes replicated).
- Model parallelism: $s_m = P$ for some mode $m$, all others $s_i = 1$ (e.g., channel split).
- Hybrid parallelism: hierarchical multi-mode cuts, e.g., $s_{\text{batch}} = P_1$ and $s_m = P_2$ with $P_1 P_2 = P$, which reflects splitting one mode, then another within each subgroup; the sketch after this list instantiates these three tuples.
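As a small illustration (tensor shape and device count are placeholder values, not drawn from the source), the three canonical strategies are simply different split tuples over the same 4-D activation tensor:

```python
import math

P = 8                                 # number of devices
shape = (256, 64, 32, 32)             # (batch, channels, height, width) -- illustrative

strategies = {
    "data parallel":  (P, 1, 1, 1),       # split batch only
    "model parallel": (1, P, 1, 1),       # split channels only
    "hybrid":         (2, P // 2, 1, 1),  # split batch 2-way, channels 4-way
}

for name, splits in strategies.items():
    assert math.prod(splits) == P         # every strategy uses all P devices
    shard = tuple(n // s for n, s in zip(shape, splits))
    print(f"{name:>14}: splits={splits}, per-device shard={shard}")
```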
3. Globally Optimal Tiling Algorithms
The optimization problem requires joint selection of splits across all tensors due to dependency couplings in $G$. Independent tiling per tensor is suboptimal. The solution involves a multistage recursion:
a) One-cut (2-way) tiling:
- Transform $G$ to an undirected variant, then unroll it via BFS into levels $L_1, \ldots, L_m$.
- Dynamic programming computes $D[i][t]$, the minimal communication up to level $i$ with that level's boundary tensors in tiling state $t$:
$$D[i][t] = \min_{t'} \big( D[i-1][t'] + \delta(t', t) \big).$$
- For chain-like DNNs, the DP runs in $O(m \cdot |S|^2)$ time, where $m$ is the number of levels and $|S|$ (the number of candidate tiling states per boundary) is small. A minimal sketch of this DP follows.
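The sketch below assumes the transition costs $\delta(t', t)$ are supplied by the operator-level cost model above; state names and costs are toy placeholders:

```python
def one_cut_dp(num_levels, states, trans_cost):
    """Minimal one-cut DP over BFS levels.

    num_levels : number of BFS levels m
    states     : candidate tiling states for the boundary tensors of a level
    trans_cost : trans_cost(prev_state, state) -> communication cost incurred
                 between consecutive levels (conversion / reduction)
    Returns the minimal total communication cost over all state sequences.
    """
    D = {t: 0 for t in states}                 # first level: no incoming cost
    for _ in range(1, num_levels):
        D = {t: min(D[p] + trans_cost(p, t) for p in states) for t in states}
    return min(D.values())

# Toy usage: 4 levels, two states; switching between batch- and channel-split
# costs 5, staying channel-split costs 1 per boundary, staying batch-split 0.
costs = {("batch", "batch"): 0, ("channel", "channel"): 1,
         ("batch", "channel"): 5, ("channel", "batch"): 5}
print(one_cut_dp(4, ["batch", "channel"], lambda p, t: costs[(p, t)]))   # 0
```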
b) Recursive $k$-cut for $2^k$ devices:
- Recursively apply one-cut tiling to split the device set into two groups, then within each group solve for the remaining cuts.
- Let $c_k$ denote the minimal communication cost of a $k$-cut tiling and $\delta_k$ the cost of the $k$-th one-cut. Define recursively:
```
Algorithm kCut(G, k):
    if k = 0: return (all-replicated, cost = 0)
    (P_k, δ_k) = OneCutTiling(G)
    G' = rebuild-graph(G, P_k)
    (T_{k-1}, c_{k-1}) = kCut(G', k-1)
    T_k = P_k ∘ T_{k-1}
    c_k = δ_k + 2·c_{k-1}
    return (T_k, c_k)
```
- The total communication cost unrolls to $c_k = \sum_{i=1}^{k} 2^{k-i}\,\delta_i$. The recursion is globally optimal in polynomial time due to cut commutativity (the “flattening theorem”) and the greedy property of the one-cut step. Overall, the complexity is $k$ times that of the one-cut DP (a runnable transcription follows).
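The recursion can be transcribed directly into Python (the one-cut step is stubbed out here; in practice it is the DP above). The toy check confirms how the per-cut costs accumulate into $c_k = \sum_{i=1}^{k} 2^{k-i}\delta_i$:

```python
def k_cut(graph, k, one_cut):
    """Recursive k-cut: returns (list of per-level cuts, total cost c_k).

    one_cut(graph) must return (cut, delta, coarsened_graph), where `cut`
    is the chosen one-cut tiling, `delta` its communication cost, and
    `coarsened_graph` the rebuilt graph seen by the remaining recursion.
    """
    if k == 0:
        return [], 0                          # all-replicated, zero cost
    cut, delta, coarsened = one_cut(graph)
    rest, c_prev = k_cut(coarsened, k - 1, one_cut)
    return [cut] + rest, delta + 2 * c_prev   # c_k = delta_k + 2 * c_{k-1}

# Toy check of the cost accumulation with a constant per-cut cost delta = 3:
cuts, c = k_cut(graph=None, k=3,
                one_cut=lambda g: ("some-cut", 3, g))
print(c)   # 3 + 2*3 + 4*3 = 21  ==  sum_i 2^{k-i} * delta_i
```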
4. Canonical Example: 4D Convolution Tiling
Consider a convolutional layer where the activation tensor is $A \in \mathbb{R}^{B \times C_{\text{in}} \times H \times W}$ and the filter tensor is $F \in \mathbb{R}^{C_{\text{out}} \times C_{\text{in}} \times K_h \times K_w}$. The convolution output has shape $B \times C_{\text{out}} \times H' \times W'$, for output spatial extents $H'$ and $W'$ determined by the kernel size, stride, and padding.
Three canonical parallelization/tiling schemes:
- Data parallel ($s_B = P$): batch dimension split; no forward communication; backward all-reduce on the filter gradient ($\approx |F|$ elements per iteration).
- Model parallel ($s_{C_{\text{in}}} = P$): split the input-channel dimension $C_{\text{in}}$; requires a reduce-sum on the partial outputs and analogous backward exchanges ($\approx B \cdot C_{\text{out}} \cdot H' \cdot W'$ elements).
- Hybrid ($s_B = P_1$, $s_{C_{\text{in}}} = P_2$, $P_1 P_2 = P$): first split the batch, then split channels within groups; the total communication is the filter all-reduce (across batch groups) plus the partial-output reduction (per group), summed.
Numerical example:
- For the layer dimensions used in the analysis, the filter tensor occupies well under a megabyte while the output activations occupy tens of megabytes.
- Data-parallel: roughly 0.6 MB/iteration (filter all-reduce); model-parallel: roughly 50 MB (partial-output reduction); hybrid configurations vary (e.g., 0.6 + 50 = 50.6 MB when both exchange types are incurred), with further trade-offs depending on the batch and channel dimensions. A sketch computing such figures under a crude cost model follows.
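The following sketch computes such figures under a deliberately simplified cost model; the layer dimensions are placeholders, not those of the cited analysis:

```python
def conv_comm_bytes(B, C_in, C_out, H, W, Kh, Kw, p_batch, p_cin,
                    bytes_per_elem=4):
    """Crude per-iteration communication estimate for one conv layer.

    Splitting the batch p_batch ways costs an all-reduce of the filter
    gradient; splitting the input channels p_cin ways costs a reduce-sum
    of the partial output activations.  Constant factors of the collectives
    and per-group scaling are deliberately ignored.
    """
    filter_bytes = C_out * C_in * Kh * Kw * bytes_per_elem
    output_bytes = B * C_out * H * W * bytes_per_elem   # 'same' spatial output assumed
    cost = 0
    if p_batch > 1:
        cost += filter_bytes        # filter-gradient all-reduce across batch groups
    if p_cin > 1:
        cost += output_bytes        # reduce-sum of partial outputs (aggregate volume)
    return cost

# Illustrative layer; these dimensions are placeholders.
layer = dict(B=128, C_in=64, C_out=128, H=56, W=56, Kh=3, Kw=3)
for name, (pb, pc) in [("data-parallel", (8, 1)),
                       ("model-parallel", (1, 8)),
                       ("hybrid", (2, 4))]:
    mb = conv_comm_bytes(**layer, p_batch=pb, p_cin=pc) / 1e6
    print(f"{name:>14}: {mb:6.1f} MB / iteration")
```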
The dynamic programming/$k$-cut search automatically explores these hybrid strategies and selects the minimum-cost configuration.
5. System Integration: SoyBean Architecture and Empirical Results
SoyBean processes a serial dataflow graph (e.g., from MXNet or TensorFlow) and performs:
- Optimal Tiling: Executes the $k$-cut algorithm to assign each tensor a tile vector $(s_1, \ldots, s_K)$.
- Device Placement: Maps $K$-index tile blocks to physical devices, prioritizing the hardware hierarchy (slow links first, then internal cuts).
- Graph Rewriting: Expands operators into sub-operators for the corresponding shards and inserts the required halo-exchange or all-reduce operations to handle tiling conversions (a rough sketch follows this list).
- Execution: Dispatches the partitioned graph to the standard dataflow runtime.
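As a rough illustration of the graph-rewriting step (node and helper names are hypothetical, not SoyBean's API), the reduction case of a matrix product is expanded into per-device sub-operators plus an inserted all-reduce:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                   # e.g. "matmul", "slice", "all_reduce"
    inputs: list = field(default_factory=list)
    device: int | None = None

def rewrite_matmul_ksplit(A, B, P):
    """Rewrite C = A @ B when both operands are split along the contraction
    dimension across P devices: each device multiplies its local shards to
    produce a partial, full-size C, and an all_reduce node is inserted to
    sum the partials -- the 'reduction' tiling-conversion case."""
    partials = []
    for d in range(P):
        a_cols = Node("slice", inputs=[A], device=d)   # device d's columns of A
        b_rows = Node("slice", inputs=[B], device=d)   # the matching rows of B
        partials.append(Node("matmul", inputs=[a_cols, b_rows], device=d))
    return Node("all_reduce", inputs=partials)          # inserted collective

A, B = Node("param"), Node("param")
out = rewrite_matmul_ksplit(A, B, P=8)
print(out.op, len(out.inputs))    # all_reduce 8
```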
Empirical speedups on 8-GPU hardware:
- AlexNet (batch 256): SoyBean achieves substantial speedup over a single GPU, whereas data parallelism requires a batch size of 1024 for comparable scaling.
- VGG (batch 256): SoyBean achieves a 5–6× speedup at 8 GPUs; data parallelism peaks lower unless the batch size grows beyond 256.
- General result: Across AlexNet/VGG and batch sizes, SoyBean is 1.5–4× faster than pure data parallelism, as it identifies hybrid splits minimizing the total communication cost $C(T)$ (Wang et al., 2018).
6. Implications and Generalization of K-Dimensional Tensor Tiling
- Expressiveness: Any parallelization choice, including data, model, and mixed strategies, can be posed as a $K$-vector of splits per tensor.
- Optimality: For chain-structured DNN graphs, the $k$-cut algorithm is provably globally optimal in polynomial time.
- Extendability: The tiling set {partition along any mode $i$, replicate} can be expanded to support more advanced splits (e.g., group-convolution partitions), with the dynamic programming/$k$-cut machinery still directly applicable.
- Systems Integration: Elevating tensor tiling to a first-class systems abstraction unifies hand-tuned parallelism strategies, can outperform them, and serves as a functional backend for any dataflow-based deep learning system.
7. Table: Parallelism Schemes under K-Dimensional Tensor Tiling
| Parallelism Type | Tiling Tuple | Communication Pattern |
|---|---|---|
| Data Parallelism | $s_{\text{batch}} = P$, all other $s_i = 1$ | Batch split, all-reduce on weights |
| Model Parallelism | $s_m = P$ for one model mode $m$, others $1$ | Split one model axis, reduce on output |
| Hybrid Parallelism | e.g., $s_{\text{batch}} = P_1$, $s_m = P_2$, $P_1 P_2 = P$ | Hierarchical, mixes splits and reduces |
These canonical strategies exemplify how the tensor tiling framework encapsulates parallelism choices and highlight the trade-offs in communication cost and empirical efficiency (Wang et al., 2018).