Sparse Tensor Benchmarking
- Sparse tensor benchmarking is a systematic evaluation of algorithms, kernels, data structures, and hardware for high-dimensional sparse data arrays with precise, reproducible metrics.
- It encompasses canonical operations like TEW, TTV, TTM, and MTTKRP, using both real-world datasets and synthetic generators to capture realistic sparsity and workload variability.
- Advanced optimizations in data formats and parallel implementations, supported by roofline modeling, drive innovation in software and hardware performance for sparse computations.
Sparse tensor benchmarking addresses the systematic evaluation of algorithms, kernels, data structures, and hardware/software platforms for operations on high-dimensional, sparse data arrays. Given the algorithmic complexity, memory and parallelization challenges, and rapidly increasing diversity of use cases—ranging from scientific computing and quantum chemistry to machine learning and graph analytics—comprehensive benchmarking has become foundational for both software and hardware innovation. Benchmarking methodologies have evolved to cover kernels, formats, accelerators, and workloads with precision and reproducibility.
1. Benchmarking Kernels, Workloads, and Methodologies
Sparse tensor benchmarks typically encompass canonical kernels such as element-wise operations (TEW/TEW-eq), tensor-scalar operations (TS), tensor-times-vector (TTV), tensor-times-matrix (TTM), and the Matricized Tensor-Times-Khatri-Rao Product (MTTKRP). The PASTA suite, for example, formalizes these as core computational primitives, exposing both their arithmetic intensity and strong-scaling behavior (Li et al., 2019). Hierarchical suites on CPUs and GPUs (e.g., the Parallel Sparse Tensor Benchmark Suite) extend to arbitrary tensor order, span both coordinate-based (COO) and block/hierarchical formats (HiCOO), and are implemented for OpenMP and CUDA (Li et al., 2020).
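To make two of these kernels concrete, the following is a minimal NumPy sketch of TTV and MTTKRP over a COO tensor. The function names `ttv_coo` and `mttkrp_coo` and the array layout (an `nnz × order` index array plus a value array) are illustrative assumptions, not the API of PASTA or any other suite.

```python
import numpy as np

def ttv_coo(indices, values, vec, mode):
    """Tensor-times-vector: contract `mode` of a COO tensor with `vec`.

    indices: (nnz, order) integer array, values: (nnz,) array.
    Returns the result in COO form, with one entry per nonempty fiber.
    """
    indices = np.asarray(indices)
    values = np.asarray(values, dtype=float)
    keep = [m for m in range(indices.shape[1]) if m != mode]
    scaled = values * vec[indices[:, mode]]
    # Nonzeros sharing the remaining coordinates lie on the same fiber.
    out_idx, inverse = np.unique(indices[:, keep], axis=0, return_inverse=True)
    out_val = np.zeros(len(out_idx))
    np.add.at(out_val, inverse.ravel(), scaled)  # unbuffered scatter-add
    return out_idx, out_val

def mttkrp_coo(indices, values, factors, mode):
    """MTTKRP for `mode`: each nonzero scales the elementwise product of
    the matching factor rows of all other modes and adds it to one output row."""
    indices = np.asarray(indices)
    values = np.asarray(values, dtype=float)
    rank = factors[0].shape[1]
    prod = np.repeat(values[:, None], rank, axis=1)
    for m, factor in enumerate(factors):
        if m != mode:
            prod = prod * factor[indices[:, m], :]
    out = np.zeros((factors[mode].shape[0], rank))
    np.add.at(out, indices[:, mode], prod)
    return out
```

Both kernels perform only a few floating-point operations per nonzero relative to the index and value bytes they read, which is why they tend to be memory-bound.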
Benchmarks must select datasets that exhibit realistic structural sparsity and order, including both real-world tensors (FROSTT, SuiteSparse, etc.) and tensors produced by well-characterized synthetic generators. The reproducibility and relevance of results depend heavily on stress-testing across a range of densities (from 10⁻³ down to 10⁻¹⁵), tensor orders (3–5), and workload types (memory-bound vs. compute-bound).
2. Data Structures and Format Implications
Sparse tensor performance is highly sensitive to storage format. Baseline COO, the enhanced global LCO (Linearised Coordinate Order) format, and hierarchical or block-encoded variants (HiCOO) are used depending on the kernel and target architecture (Li et al., 2020, Harrison et al., 2018). The LCO format encodes the coordinates of each nonzero into a single integer, which admits fast lexicographic comparisons and efficient radix-based permutations; RP kernels accelerate lex-order rearrangement (up to 2× faster than classical separate-coordinate formats) and reduce sort/permute overhead for expression-tree traversal (Harrison et al., 2018).
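The idea of packing coordinates into one integer can be sketched with NumPy's mixed-radix index helpers. This is only an illustration of the principle, assuming row-major packing with mode 0 as the most significant digit; the exact bit layout used by LCO in Harrison et al. may differ, and `linearize` and `delinearize` are hypothetical names.

```python
import numpy as np

def linearize(indices, shape):
    """Pack each coordinate tuple into a single integer key.

    Mode 0 is the most significant digit, so sorting the keys sorts
    the nonzeros lexicographically by (i0, i1, ..., i_{N-1}).
    """
    return np.ravel_multi_index(np.asarray(indices).T, shape)

def delinearize(keys, shape):
    """Recover the (nnz, order) coordinate array from integer keys."""
    return np.stack(np.unravel_index(keys, shape), axis=1)

# Sorting nonzeros reduces to a single integer sort on the keys.
coords = np.array([[1, 0, 2], [0, 3, 1], [0, 1, 4]])
shape = (2, 4, 5)
keys = linearize(coords, shape)
order = np.argsort(keys, kind="stable")
sorted_coords = delinearize(keys[order], shape)
```

A comparison between two nonzeros becomes one integer comparison rather than a loop over modes, and a mode permutation amounts to re-linearizing with the permuted shape. The key must fit in the integer type, so very large index spaces would need wider keys or a different encoding.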
On GPUs, the “flagged” F-COO format integrates mode-flagging to enable one-shot, coalesced segmented-scan style reductions, radically reducing global memory and atomic-update costs for contraction and factorization kernels (Liu et al., 2017).
Efficient format selection, investigated with extracted features (FeaTen), is essential for storage, partitioning, and decomposition autotuning. Benchmarks must profile format performance with high-dimensional feature vectors, including counts, densities, coefficients of variation, and imbalance metrics across slices/fibers, all of which can be extracted efficiently using grouping- or hybrid-based methods (Torun et al., 2024).
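As a rough illustration of such features, the sketch below computes a handful of per-mode slice statistics from a COO index array. The feature names and definitions (for example, imbalance as maximum over mean slice count) are assumptions chosen for illustration and are not taken from FeaTen.

```python
import numpy as np

def slice_features(indices, shape, mode):
    """Summary statistics of nonzeros per slice along `mode`.

    indices: (nnz, order) integer array of COO coordinates.
    """
    indices = np.asarray(indices)
    counts = np.bincount(indices[:, mode], minlength=shape[mode])
    nonempty = counts[counts > 0]
    mean = nonempty.mean()
    return {
        "nnz": int(counts.sum()),
        "density": float(counts.sum() / np.prod(shape, dtype=float)),
        "nonempty_slices": int(len(nonempty)),
        # Coefficient of variation of nonzeros over nonempty slices.
        "cv": float(nonempty.std() / mean),
        # Load-imbalance proxy: heaviest slice relative to the average.
        "imbalance": float(nonempty.max() / mean),
    }
```

Calling this once per mode gives a small feature vector; fiber-level statistics could be obtained the same way by grouping on all modes but one.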
3. Parallelization, Hardware, and Roofline Modeling
The interplay between algorithmic structure and hardware is systematically characterized via hierarchical strong-scaling studies (e.g., 1–32 threads on dual-socket Xeon, 1–64 threads for NUMA or Threadripper, CUDA/Tesla V100 GPUs), and by empirical roofline modeling (Li et al., 2019, Li et al., 2020, Liu et al., 2017). A roofline model, parameterized by arithmetic intensity (flops per byte moved), bounds attainable kernel performance by the lesser of peak compute throughput and the product of arithmetic intensity and peak memory bandwidth. Achieved GFLOP/s and GB/s are then measured empirically per kernel, format, platform, and workload.
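The roofline bound itself is a one-line formula. The sketch below evaluates it for a kernel with a given flop and byte count; the peak numbers in the example are placeholders, not measurements of any platform named above, and `roofline_gflops` is a hypothetical helper name.

```python
def roofline_gflops(peak_gflops, peak_gbs, flops, bytes_moved):
    """Attainable GFLOP/s under the roofline model.

    peak_gflops: peak compute throughput (GFLOP/s)
    peak_gbs:    peak memory bandwidth (GB/s)
    flops, bytes_moved: operation and traffic counts for the kernel
    """
    arithmetic_intensity = flops / bytes_moved  # flops per byte
    return min(peak_gflops, arithmetic_intensity * peak_gbs)

# Placeholder machine: 1000 GFLOP/s peak compute, 100 GB/s bandwidth.
# A kernel doing 2 flops per 20 bytes moved (intensity 0.1) is memory-bound.
bound = roofline_gflops(1000.0, 100.0, flops=2, bytes_moved=20)
```

Kernels whose intensity falls below the ratio of peak compute to peak bandwidth sit on the sloped, bandwidth-limited part of the roofline, which is the usual regime for low-intensity sparse kernels.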
Parallel implementations must attend to fiber-level or slice-level load balancing: TTV, TTM, and MTTKRP are parallelized over mode-n fibers or rely on privatized accumulations. Bottlenecks include memory bandwidth for low-op-intensity kernels, reduction/atomic update costs for output-dense contractions, and NUMA-induced bandwidth starvation. Highly optimized codegen frameworks (e.g., SparseLNR) apply loop-fusion and tiling principles to reduce working-set sizes, improve locality, and enable robust scaling in both single- and multi-threaded regimes (Dias et al., 2022).
Accelerator evaluation using high-level declarative DSLs such as TeAAL enables cycle-accurate, data-driven modeling of the performance, memory behavior, and energy of specialized hardware. By expressing a computation as a cascade of Einsums arranged in a DAG and specifying its dataflow, partitioning, and iteration strategies, one can both reproduce published throughput (to single-digit percent error) and expose new hardware/software tradeoffs in parallelization, buffering, and intersection/merge units (Nayak et al., 2023).
4. Algorithmic Innovations and Kernel-Specific Benchmarks
Kernel-specific algorithmic advances are central to benchmarking. For sparse tensor contraction, Swift achieves up to 20× speedup over classic Sparta-style sort-then-contract pipelines by employing linear-time grouping, efficient pointer-indexed contiguous layouts, and linear-probe hash tables for output accumulation—this is especially impactful for high sparsity and operand-imbalance regimes (Ensinger et al., 2024).
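The group-then-accumulate pattern can be sketched in a few lines. This is not Swift's implementation: Python dictionaries stand in for both the grouped layout and the output hash table, and the function name `contract` and the `(index_tuple, value)` input representation are illustrative assumptions.

```python
from collections import defaultdict

def contract(x_nz, y_nz, x_cmodes, y_cmodes):
    """Contract two sparse tensors given as lists of (index_tuple, value).

    x_cmodes / y_cmodes: positions of the contracted modes in X and Y.
    Returns {free_indices_of_X + free_indices_of_Y: value}.
    """
    # One linear pass groups Y's nonzeros by their contraction indices,
    # replacing a sort of Y with hashing.
    groups = defaultdict(list)
    for idx, val in y_nz:
        key = tuple(idx[m] for m in y_cmodes)
        free = tuple(i for m, i in enumerate(idx) if m not in y_cmodes)
        groups[key].append((free, val))

    # Each X nonzero looks up its matching group and accumulates
    # partial products into a hash table keyed by output coordinates.
    out = defaultdict(float)
    for idx, val in x_nz:
        key = tuple(idx[m] for m in x_cmodes)
        free = tuple(i for m, i in enumerate(idx) if m not in x_cmodes)
        for y_free, y_val in groups.get(key, ()):
            out[free + y_free] += val * y_val
    return dict(out)
```

With two-mode operands this reduces to sparse matrix multiplication, which makes it easy to validate against a dense reference before benchmarking higher-order cases.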
For sparse tensor arithmetic, poly-algorithms that dispatch among CSC/CSR, DCSC/DCSR, and special outer-product strategies depending on hyper-sparsity and row/col-dominance enable up to 30× speedup for index-sparse products. Benchmarks must rigorously vary the relative shape and sparsity to map out these algorithmic “phase diagrams” (Harrison et al., 2018).
Sparse tensor algebra compilers (TACO, SparseLNR, workspace-augmented index notation) demonstrate the impact of code generation and scheduling on practical performance: workspace insertion strategies yield speedups of up to 2×–6× for multi-operand addition or MTTKRP and narrow the gap between compiler-generated and hand-optimized library code (Kjolstad et al., 2018, Dias et al., 2022).
Specialized transposition routines (Quesadilla) reduce the number of histogram or radix passes to the theoretical minimum required by the desired coordinate permutation, outperforming prior art such as SPLATT and Top-K-sadilla in both serial (median speedup ≈1.19×, 58% best-of cases) and parallel settings (Mueller et al., 2020).
5. Dataset and Feature Suite Generation
Reproducibility and realism in benchmarking require datasets that embody the structural features of real-world applications. Synthetic generators (GenTen) create tensors that replicate the statistical distribution of fiber, slice, and nonzero counts found in application datasets, parameterized by user-controlled density and coefficient of variation. Feature extractors (FeaTen) enable large-scale, parallel computation of per-mode, per-slice/fiber, and global statistics, facilitating both downstream format selection (e.g., via SpTFS) and detailed performance diagnosis (Torun et al., 2024).
A benchmarking pipeline involves: extracting feature vectors from real tensors; sampling or sweeping over seed/variation space in GenTen to synthesize realistic but scalable tensors; profiling algorithmic/format/runtime behavior using feature-guided analysis; and iteratively updating the suite to cover novel structures or extreme regimes.
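The synthesis step of such a pipeline might look like the sketch below, which draws mode-0 slice weights from a log-normal distribution with a requested coefficient of variation. This is a deliberately simplified stand-in and not GenTen's algorithm; `synth_tensor` and its parameters are hypothetical, and only the slice distribution along mode 0 is controlled.

```python
import numpy as np

def synth_tensor(shape, nnz, cv, seed=0):
    """Generate a random COO tensor with skewed mode-0 slice sizes.

    cv: target coefficient of variation of the slice weights.
    Duplicate coordinates are dropped, so the result may hold
    slightly fewer than `nnz` nonzeros.
    """
    rng = np.random.default_rng(seed)
    # For a log-normal variable, CV^2 = exp(sigma^2) - 1.
    sigma = np.sqrt(np.log1p(cv ** 2))
    weights = rng.lognormal(mean=0.0, sigma=sigma, size=shape[0])
    mode0 = rng.choice(shape[0], size=nnz, p=weights / weights.sum())
    rest = [rng.integers(0, n, size=nnz) for n in shape[1:]]
    indices = np.unique(np.column_stack([mode0, *rest]), axis=0)
    values = rng.standard_normal(len(indices))
    return indices, values
```

Fixing the seed makes the generated tensor reproducible, so a sweep over `cv` and `nnz` can be rerun exactly when comparing formats or kernels.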
6. Reference Metrics, Methodological Best Practices, and Analysis
Performance metrics include wall-clock time, strong-scaling speedup (Sₚ = T₁/Tₚ), achieved GFLOP/s and GB/s, memory footprint (heap usage or bytes per nonzero), arithmetic intensity (flops/byte), and compute or bandwidth efficiency relative to the hardware roofline. Comparisons are typically made against multiple baselines: legacy libraries (e.g., the MATLAB Tensor Toolbox, MTT), prior high-performance C/C++ implementations, and modern frameworks/libraries on state-of-the-art hardware (Li et al., 2019, Harrison et al., 2018, Liu et al., 2017, Li et al., 2020).
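These derived metrics follow directly from timings and operation counts. A minimal sketch, assuming the flop and byte counts are supplied by the benchmark author (the helper names are illustrative):

```python
def strong_scaling(times):
    """times: {thread_count: seconds}, must include the 1-thread run.

    Returns {thread_count: (speedup, parallel_efficiency)} with
    speedup S_p = T_1 / T_p and efficiency S_p / p.
    """
    t1 = times[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in sorted(times.items())}

def achieved_rates(flops, bytes_moved, seconds):
    """Achieved (GFLOP/s, GB/s) for one timed kernel run."""
    return flops / seconds / 1e9, bytes_moved / seconds / 1e9

# Example with made-up timings for 1, 2, and 4 threads.
scaling = strong_scaling({1: 8.0, 2: 4.0, 4: 2.5})
```

Dividing the achieved GFLOP/s by the roofline bound for the same arithmetic intensity gives the efficiency relative to the hardware limit.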
Key best practices include: partitioning by fibers or slices for scalable computation; use of output pre-allocation and thread-private buffers for concurrent updates; explicit separation of index construction from numeric kernels to disentangle data-locality effects from arithmetic intensity; strong-scaling plots over relevant thread/core ranges; thorough reporting of both synthetic and real-world workloads and all architectural configurations; and publication of source code, scripts, and datasets for full reproducibility (Li et al., 2019, Li et al., 2020, Torun et al., 2024, Dias et al., 2022).
Controversies or pitfalls include: naive generalization of matrix-centric algorithms to high-order cases (leading to exponential fill-in or memory blowup), lack of memory-efficient support for hyper-sparse or imbalanced workloads, and underreporting of platform-specific bottlenecks (cache sizes, atomic bandwidth, NUMA placement). Stabilizing benchmarks on irregular data distributions and extreme sparsity regimes remains a persistent challenge.
In summary, sparse tensor benchmarking is a technically intricate and rapidly maturing discipline, anchored in robust workload design, kernel and format diversity, precision metrics, and co-development of software and hardware abstractions. The canonical literature (Li et al., 2019, Harrison et al., 2018, Dias et al., 2022, Liu et al., 2017, Torun et al., 2024, Li et al., 2020, Ensinger et al., 2024, Nayak et al., 2023, Mueller et al., 2020) has established rigorous methodological principles and produced open suites that enable direct, quantitative comparison of new algorithms, data structures, compilers, and accelerators across the expanding landscape of sparse tensor computation.