SpDISTAL: Distributed Sparse Tensor Compiler
- SpDISTAL is a compiler framework that unifies expressive sparse tensor algebra with efficient distributed execution on both CPUs and GPUs.
- It abstracts tensor computations, data formats, and scheduling using dedicated DSLs to enable composable partitioning and optimal load balancing.
- Performance evaluations report substantial speedups over traditional libraries and interpretation-based frameworks on key benchmarks, reaching 100× or more in some cases.
SpDISTAL is a compiler framework for distributed execution of sparse tensor algebra expressions, designed to address the needs of large-scale scientific simulation, graph analytics, and sparse machine learning workloads. SpDISTAL separates the specification of tensor computations, data structures, data distribution, and computation scheduling, and compiles these to efficient distributed code targeted at both CPU and GPU resources. The system combines high-level expressiveness with performance competitive with domain-specific hand-written kernels, offering support for composable distribution schemes, a variety of tensor formats, and sophisticated scheduling idioms (Yadav et al., 2022).
1. Motivation and Context
Sparse tensor algebra is foundational to many large-scale computational domains, including scientific applications and machine learning. Existing approaches for distributed sparse computation fall into two classes: library-based systems (e.g., PETSc, Trilinos) and interpretation-based frameworks (e.g., Cyclops Tensor Framework, CTF). The former trade generality for performance, restricting users to a narrow set of formats and operations, while the latter accept arbitrary tensor expressions but incur severe performance penalties due to interpretive overhead and inefficient data movement. The central motivation for SpDISTAL is to unify the generality and programmability of interpretation-based frameworks with the performance of hand-specialized kernels (Yadav et al., 2022).
A key challenge arises from the need to represent and efficiently distribute computations that involve arbitrary tensor expressions, multiple sparse data formats, complex distribution strategies (including balancing nonzero elements across nodes), and disparate hardware architectures.
2. Compiler Architecture and Abstractions
SpDISTAL introduces a front-end accepting three cooperative domain-specific languages (DSLs):
- Tensor Index Notation (TIN): Specifies the algebraic computation, supporting sums, products, reductions, and higher-order contractions. For example, sparse matrix-vector multiplication is written as
  ```
  a(i) = B(i, j) * c(j)
  ```
- Format Language: Assigns each tensor dimension a level format, such as "Dense" or "Compressed". The compressed format utilizes coordinate (crd[]) and positional (pos[]) arrays, supporting layouts like CSR, CSC, and hybrids.
- Tensor-Distribution Notation (TDN): Specifies how tensor indices are distributed across an abstract machine grid. Mechanisms include universe partitions for index ranges, nonzero partitions for balancing nonzero elements across processors, and coordinate fusion for collapsing dimensions.
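To make the compressed level format concrete, the following is a minimal Python sketch of a (Dense, Compressed) matrix, i.e. CSR, together with the SpMV statement a(i) = B(i, j) * c(j) evaluated over it. The function names `dense_to_csr` and `spmv` are hypothetical and chosen for illustration; this is not code that SpDISTAL generates. It also uses the conventional layout in which row i owns positions pos[i] up to pos[i+1], an assumption that may differ from the bounds representation used by the compiler.

```python
# Illustrative sketch only: a (Dense, Compressed) matrix stored as
# pos/crd/vals arrays, and SpMV computed over the stored nonzeros.

def dense_to_csr(rows):
    """Build pos, crd, vals arrays from a list of dense rows."""
    pos, crd, vals = [0], [], []
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                crd.append(j)    # column coordinate of this nonzero
                vals.append(v)   # value of this nonzero
        pos.append(len(crd))     # row i owns positions pos[i] .. pos[i+1]-1
    return pos, crd, vals


def spmv(pos, crd, vals, c):
    """a(i) = sum over j of B(i, j) * c(j), visiting only stored nonzeros."""
    a = [0] * (len(pos) - 1)
    for i in range(len(pos) - 1):
        for p in range(pos[i], pos[i + 1]):
            a[i] += vals[p] * c[crd[p]]
    return a


B = [[5, 0, 0],
     [0, 0, 2],
     [1, 3, 0]]
pos, crd, vals = dense_to_csr(B)
a = spmv(pos, crd, vals, [1, 2, 3])
```

Swapping the level formats (for example, making the column dimension the outer, dense level) would describe CSC with the same two arrays, which is the sense in which the format language composes per-dimension choices.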
Alongside these, a scheduling language lets users compose transformations (such as .divide, .distribute, .communicate, and .parallelize), layering them over both dense and sparse iterations. These abstractions are inherited and extended from TACO and DISTAL but specifically adapted for distributed, sparse settings (Yadav et al., 2022).
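As a rough intuition for what a divide-style transformation does to a loop, the sketch below models splitting an iteration over i into an outer block index io and an inner offset ii. The Python function `divide` is a hypothetical stand-in, not the scheduling API itself; the ceiling-based block size and the guard on the final block are assumptions made so the example handles sizes that do not divide evenly.

```python
# Hypothetical model of a "divide" transformation: the loop over
# i in range(n) becomes an outer loop io over `pieces` blocks and an
# inner loop ii over the elements of one block.

def divide(n, pieces):
    """Yield (io, ii, i) triples; each block has size ceil(n / pieces)."""
    block = (n + pieces - 1) // pieces
    for io in range(pieces):        # outer loop: a candidate to distribute
        for ii in range(block):     # inner loop: stays local to one piece
            i = io * block + ii
            if i < n:               # guard the ragged final block
                yield io, ii, i


original = list(range(10))
divided = [i for _, _, i in divide(10, 4)]

# Group the original iterations by the outer index that would own them.
owners = {}
for io, _, i in divide(10, 4):
    owners.setdefault(io, []).append(i)
```

The transformation preserves the set and order of iterations; what changes is that the outer index can then be mapped onto processors while the inner loop remains an ordinary single-node loop.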
3. Scheduling, Partitioning, and Execution
SpDISTAL compiles the user’s tensor computation specification into a scheduled abstract syntax tree (AST). The generated code executes in distributed environments, partitioning tensor coordinate spaces according to user-specified schedules and formats.
- Partitioning Model: Tensors are represented via coordinate trees, with levels corresponding to dimensions. Universe partitioning divides the index space, while nonzero partitioning splits the stored nonzero positions and distributes them for load balancing. The system supports dependent partitioning through image (propagation from parent to child) and preimage (child to parent) operations for complex pointer relationships among tensor indices.
- Execution Workflow:
- Partition each tensor’s coordinate tree for the given processor grid.
- Emit distributed for-loops to spawn per-block tasks across the cluster.
- Inside each task, run single-node sparse kernels either on CPU or GPU, using TACO’s code generation facilities.
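The difference between the two partitioning strategies, and the role of image and preimage, can be illustrated on a single CSR pos array. The sketch below is a simplified model under stated assumptions: partitions are contiguous half-open ranges, and the helper names (`universe_partition`, `nonzero_partition`, `image`, `preimage`, `work`) are hypothetical rather than part of SpDISTAL or Legion.

```python
# Simplified model of universe vs. nonzero partitioning on a CSR pos array.
# All ranges are half-open (lo, hi) pairs.

def universe_partition(n_rows, pieces):
    """Split the row index space [0, n_rows) into equal contiguous ranges."""
    block = (n_rows + pieces - 1) // pieces
    return [(min(k * block, n_rows), min((k + 1) * block, n_rows))
            for k in range(pieces)]


def nonzero_partition(nnz, pieces):
    """Split the position space [0, nnz) into equal contiguous ranges."""
    block = (nnz + pieces - 1) // pieces
    return [(min(k * block, nnz), min((k + 1) * block, nnz))
            for k in range(pieces)]


def image(row_ranges, pos):
    """Parent to child: positions reachable from each range of rows."""
    return [(pos[lo], pos[hi]) for lo, hi in row_ranges]


def preimage(pos_ranges, pos):
    """Child to parent: rows owning at least one position in each range."""
    out = []
    for lo, hi in pos_ranges:
        rows = [i for i in range(len(pos) - 1)
                if pos[i] < hi and pos[i + 1] > lo]
        out.append((rows[0], rows[-1] + 1) if rows else (0, 0))
    return out


def work(parts):
    """Number of nonzeros assigned to each piece."""
    return [hi - lo for lo, hi in parts]


# Skewed matrix with 4 rows: row 0 holds 9 of the 12 nonzeros.
pos = [0, 9, 10, 11, 12]

by_rows = image(universe_partition(4, 2), pos)   # partition rows, derive positions
by_nnz = nonzero_partition(12, 2)                # partition positions directly
rows_touched = preimage(by_nnz, pos)             # derive rows from positions
```

In this example the row-based partition assigns 10 and 2 nonzeros to the two pieces, while the nonzero-based partition assigns 6 and 6. The cost is visible in `rows_touched`: row 0 is needed by both pieces, so the balanced schedule has to combine partial results for that row.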
The runtime targets Legion, a task-based distributed system capable of efficiently managing index spaces, memory regions, and data movement across heterogeneous resources. Auto-generated code may invoke libraries such as cuSPARSE or cuBLAS on GPU for the computation-intensive innermost loops.
Example pseudo-code for row-partitioned SpMV demonstrates how partitioning logic and kernel generation are cleanly separated:
```
void SpMV(Tensor a, Tensor B, Tensor c, int pieces) {
  // Universe partition: split B's row index space into `pieces` blocks.
  initUniversePartition(B.dom);
  for (int io = 0; io < pieces; io++) {
    int lo = io*(n/pieces), hi = (io+1)*(n/pieces);
    createUniversePartitionEntry(B.dom, io, {lo, hi});
  }
  auto BdomPart = finalizeUniversePartition(B.dom);
  // Propagate the row partition down the coordinate tree.
  auto BposPart = copyPartition(BdomPart, B.pos);
  auto BcrdPart = image(BposPart, B.crd);
  distributed for io in 0…pieces {
    auto Bsub = B.subregion({BdomPart[io], BposPart[io], BcrdPart[io]});
    for ii in 0…(blocksize-1) {
      int i = blockOffset(io, ii);
      for (int p = B.pos[i].lo; p < B.pos[i].hi; p++) {
        int j = B.crd[p];
        a.vals[i] += B.vals[p]*c.vals[j];
      }
    }
  }
}
```
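For readers who want to run the idea, the following Python sketch emulates the structure of the listing above on one machine: each "task" receives only the slices of crd and vals that the image of its row block selects, and writes a disjoint block of the output. It is an approximation under stated assumptions, not the generated code: the sequential loop stands in for the distributed for, pos uses the conventional pos[i] to pos[i+1] layout, and `spmv_piece` and `distributed_spmv` are hypothetical names.

```python
# Single-process emulation of row-partitioned SpMV. Each piece sees only
# its own sub-ranges of crd and vals, mimicking per-task subregions.

def spmv_piece(row_lo, row_hi, pos, crd, vals, c):
    """One task: compute output rows [row_lo, row_hi) from local slices."""
    p_lo, p_hi = pos[row_lo], pos[row_hi]     # image of the row range
    crd_sub, vals_sub = crd[p_lo:p_hi], vals[p_lo:p_hi]
    out = []
    for i in range(row_lo, row_hi):
        acc = 0
        # Shift global positions to offsets within the local slices.
        for p in range(pos[i] - p_lo, pos[i + 1] - p_lo):
            acc += vals_sub[p] * c[crd_sub[p]]
        out.append(acc)
    return out


def distributed_spmv(pos, crd, vals, c, pieces):
    """Partition rows into `pieces` blocks and run one task per block."""
    n = len(pos) - 1
    block = (n + pieces - 1) // pieces
    a = []
    for io in range(pieces):                  # stands in for "distributed for"
        lo, hi = min(io * block, n), min((io + 1) * block, n)
        a.extend(spmv_piece(lo, hi, pos, crd, vals, c))
    return a


# 3x3 matrix [[5, 0, 0], [0, 0, 2], [1, 3, 0]] in CSR form.
pos, crd, vals = [0, 1, 2, 4], [0, 2, 0, 1], [5, 2, 1, 3]
result = distributed_spmv(pos, crd, vals, [1, 2, 3], pieces=2)
```

Because the row blocks are disjoint, no reduction across pieces is needed here; the dense vector c is the only operand every task may read in full.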
4. Performance Results and Empirical Findings
Extensive evaluation was performed on Lassen (IBM Power9 nodes with multi-core CPUs, NVIDIA V100 GPUs, Infiniband interconnect), using 14 real-world sparse tensors/matrices (SuiteSparse, FROSTT, Freebase, – nonzeros) and multiple kernels (SpMV, SpMM, SpAdd3, SDDMM, SpTTV, SpMTTKRP).
Summary of results:
| Benchmark | SpDISTAL vs PETSc & Trilinos (CPU, GPU) | SpDISTAL vs CTF (Interpretation) | Highlights |
|---|---|---|---|
| SpMV/SpMM | Median 1.8×/2.0× speedup | 100–300× speedup | Performance gains from deferred execution, load balancing |
| SpAdd3 (fusion) | 10–40× speedup | N/A | Fused kernel avoids temporaries required by PETSc/Trilinos |
| SpAdd3 (GPU) | 20–100× speedup over Trilinos | N/A | Efficient fused GPU kernel |
| SDDMM, SpTTV, SpMTTKRP (GPU) | 2–5× speedup over CPU kernels | N/A | Schedules leverage nonzero-balance and high-performance GPU kernels |
| Weak-scaling | CPU: 90–92% of PETSc; GPU: up to 1.3× PETSc | N/A | Overlapping communication and compute in Legion runtime |
SpDISTAL achieves performance on par with or better than domain-optimized libraries for conventional kernels, and significantly outperforms interpretation-based frameworks due to compiled specialization and efficient data partitioning (Yadav et al., 2022).
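The SpAdd3 rows in the table attribute the gain to fusion. A small Python sketch can show what that means structurally: adding three sparse matrices pairwise materializes an intermediate matrix, whereas a fused kernel produces each output row in one pass. Rows are modeled as {column: value} dictionaries purely for brevity; this representation and the names `merge_rows`, `add_pairwise`, and `add_fused` are assumptions for illustration and say nothing about how the compared libraries are implemented internally.

```python
# Structural comparison of pairwise vs. fused three-way sparse addition.
# Each matrix is a list of rows; each row is a {column: value} dict.

def merge_rows(*rows):
    """Sum any number of sparse rows, returning columns in sorted order."""
    out = {}
    for row in rows:
        for j, v in row.items():
            out[j] = out.get(j, 0) + v
    return dict(sorted(out.items()))


def add_pairwise(B, C, D):
    """Unfused: materializes the temporary T = B + C before adding D."""
    T = [merge_rows(b, c) for b, c in zip(B, C)]   # full temporary matrix
    return [merge_rows(t, d) for t, d in zip(T, D)]


def add_fused(B, C, D):
    """Fused: each output row is built in one pass over all three inputs."""
    return [merge_rows(b, c, d) for b, c, d in zip(B, C, D)]


B = [{0: 1}, {2: 4}]
C = [{0: 2, 1: 3}, {}]
D = [{1: 5}, {2: 1, 0: 7}]
fused = add_fused(B, C, D)
pairwise = add_pairwise(B, C, D)
```

Both variants compute the same result; the fused one simply never allocates or traverses the temporary T, which is the saving the table refers to.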
5. Design Principles and Supported Features
SpDISTAL’s architecture is guided by the following principles:
- Separation of Concerns: Algebra (TIN), format (TACO DSL), data distribution (TDN), and scheduling are specified independently, preventing combinatorial kernel generation and allowing problem-specific specialization via composition.
- Composable Scheduling Language: Supports loop- and task-level transformations as well as sparse iteration-space manipulation for dense and sparse dimensions.
- Modular Partitioning Abstraction: Encapsulates universe/nonzero partitioning and parent/child partition propagation through a set of well-defined API routines (initUniversePartition, partitionFromParent, etc.).
Supported features include:
- Universe and nonzero partitioning for load balancing
- Dependent partitioning (image/preimage) for multi-level tensor trees
- Sparse and dense tensor formats (e.g., Dense, Compressed) with extensible format interface
- Heterogeneous targeting (multi-core CPUs, GPUs, or a mix) from a single binary via Legion
- Distributed symbolic assembly for unpredictable output sparsity patterns (current implementation uses a two-phase assembly) (Yadav et al., 2022)
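The two-phase assembly mentioned in the last item can be sketched for the simplest case, the sum of two CSR matrices whose output pattern is not known in advance. A symbolic phase counts the nonzeros of each output row to build pos, and a numeric phase then fills crd and vals arrays of exactly the right size. This is a single-node Python illustration with hypothetical function names (`row_cols`, `symbolic`, `numeric`); it assumes no values cancel to zero and does not reproduce the distributed implementation.

```python
# Two-phase assembly of A = B + C for CSR operands given as (pos, crd, vals).

def row_cols(pos, crd, i):
    """Column coordinates stored in row i."""
    return crd[pos[i]:pos[i + 1]]


def symbolic(Bpos, Bcrd, Cpos, Ccrd):
    """Phase 1: count output nonzeros per row and build the output pos array."""
    pos = [0]
    for i in range(len(Bpos) - 1):
        cols = set(row_cols(Bpos, Bcrd, i)) | set(row_cols(Cpos, Ccrd, i))
        pos.append(pos[-1] + len(cols))
    return pos


def numeric(Apos, B, C):
    """Phase 2: fill exactly-sized crd/vals arrays using the precomputed pos."""
    crd, vals = [0] * Apos[-1], [0] * Apos[-1]
    for i in range(len(Apos) - 1):
        acc = {}
        for in_pos, in_crd, in_vals in (B, C):
            for p in range(in_pos[i], in_pos[i + 1]):
                acc[in_crd[p]] = acc.get(in_crd[p], 0) + in_vals[p]
        for k, j in enumerate(sorted(acc)):
            crd[Apos[i] + k], vals[Apos[i] + k] = j, acc[j]
    return crd, vals


B = ([0, 2, 2, 3], [0, 2, 1], [1, 2, 3])      # rows: {0:1, 2:2}, {}, {1:3}
C = ([0, 1, 2, 3], [2, 0, 1], [10, 20, 30])   # rows: {2:10}, {0:20}, {1:30}
Apos = symbolic(B[0], B[1], C[0], C[1])
Acrd, Avals = numeric(Apos, B, C)
```

The inputs are traversed twice, which is the overhead the two-phase approach pays in exchange for allocating the output only once at its final size.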
6. Limitations and Future Directions
Current limitations include:
- Output Sparsity Handling: The compiler currently uses a symbolic pass and extra allocation for results with unpredicted sparsity patterns; fully distributed symbolic assembly with minimal overhead remains a target for future work.
- Additional Formats: Only Dense and Compressed formats are fully supported; extensions to other formats (ELLPACK, DIA, hash-maps, custom block-sparse) are straightforward but not implemented.
- Autotuning: Schedule choices (block size, fusion points, device mapping) are user-driven; integrating an autotuner (akin to Halide or TVM) is a logical direction to further improve performance portability.
- Alternate Runtimes: While the present backend targets Legion, the partitioning API is designed for retargetability (e.g., MPI+RDMA), facilitating future support for diverse distributed execution frameworks.
A plausible implication is that SpDISTAL’s approach—combining DSL-based separation of algebraic, distributional, and scheduling concerns with a flexible, partition-oriented compiler backend—may inform the design of future high-level distributed computing systems beyond sparse tensor algebra (Yadav et al., 2022).