SDDMM: Sampled Dense-Dense Matrix Multiplication

Updated 5 May 2026

SDDMM is a sparse linear algebra primitive that computes edge-specific dense products using a prescribed mask, critical in graph analytics and deep learning.
Efficient implementations leverage hardware/software co-design, distributed memory strategies, and optimized vectorized kernels on GPUs and other accelerators.
Fusing SDDMM with SpMM reduces memory bandwidth requirements and DRAM pressure, enabling significant runtime speedups in large-scale data-parallel applications.

Sampled Dense-Dense Matrix Multiplication (SDDMM) is a fundamental sparse linear algebra primitive critical for high-performance graph analytics, recommendation systems, document clustering, and deep learning workloads—particularly within the computational graphs of graph neural networks (GNNs) and collaborative filtering tasks. SDDMM computes edge-specific (i.e., nonzero-pattern-preserving) products between dense matrices, resulting in a sparse output aligned with a prescribed mask or pattern. Efficient SDDMM implementations are essential due to its recurring role as a performance bottleneck in real-world data-parallel and accelerator-based ML systems. Recent advancements have focused on hardware/software co-design, distributed memory scaling, communication minimization, and fusion with SpMM (Sparse × Dense Matrix Multiplication) for further efficiency.

1. Mathematical Formalism

Let $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{n \times r}$ be input dense matrices, and let $S \in \{0,1\}^{m \times n}$ (or, more generally, a sparse $S$) denote a binary/sparse mask encoding the desired nonzero locations. SDDMM computes $R \in \mathbb{R}^{m \times n}$ such that $R_{ij} = S_{ij} \cdot \langle A_{i,:}, B_{j,:} \rangle = S_{ij} \cdot \sum_{k=1}^r A_{ik} B_{jk}$ for all $(i,j) \in \mathrm{nnz}(S)$, and $R_{ij}=0$ elsewhere. In matrix notation, $R = S \odot (A B^\top)$, where $\odot$ is the Hadamard (element-wise) product. In practical SDDMM implementations, explicit computation is restricted to the $(i,j)$ pairs where $S_{ij} \neq 0$, avoiding unnecessary dense operations. The operator generalizes to semiring-based and user-defined “edge functions,” encompassing inner products, attention kernels, and more (Bharadwaj et al., 2022, Rahman et al., 2020, Abubaker et al., 2024).
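The definition above can be illustrated with a minimal NumPy sketch. It assumes the mask is given in coordinate (COO) form as hypothetical arrays `rows`, `cols`, and `vals`; the function name `sddmm_coo` is illustrative and does not come from any of the cited libraries.

```python
import numpy as np

def sddmm_coo(rows, cols, vals, A, B):
    """Values of R = S * (A @ B.T) at the nonzeros of S, in COO order."""
    # Gather the rows of A and B touched by each nonzero, then take one
    # length-r dot product per nonzero; the dense m x n product is never formed.
    return vals * np.einsum("ek,ek->e", A[rows], B[cols])

rng = np.random.default_rng(0)
m, n, r = 4, 5, 3
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))
rows = np.array([0, 0, 2, 3])   # row index of each nonzero of S
cols = np.array([1, 4, 2, 0])   # column index of each nonzero of S
vals = np.ones(4)               # binary mask values

R_vals = sddmm_coo(rows, cols, vals, A, B)

# Dense reference, built here only to check the sketch on a tiny input.
S = np.zeros((m, n))
S[rows, cols] = vals
R_dense = S * (A @ B.T)
```

The output keeps the sparsity pattern of $S$: one value per nonzero, stored in the same order as the input coordinates.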

2. Distributed-Memory and 3D Parallel Implementations

Modern SDDMM demands strong scalability across distributed-memory systems. Canonical layouts are adapted directly from communication-optimal SpMM schemes: 1.5D and 2.5D data distributions admit straightforward role reversal to SDDMM, preserving communication cost and data layout.

Processor Grid and Data Partitioning: Using a three-dimensional logical processor grid, $S$ is block-partitioned into two-dimensional subblocks over the first two grid dimensions, and the nonzeros of each subblock are further divided among the partitions along the third dimension. $A$ and $B$ are likewise decomposed and distributed.
Communication Scheme: Sparsity-aware approaches (e.g., SpComm3D) transmit only the required $A$-rows and $B$-rows to each processor, point-to-point along the corresponding grid dimensions, using directed zero-copy or bufferless strategies (MPI_Type_indexed). Dense 3D (sparsity-agnostic) approaches, by contrast, waste bandwidth and memory by broadcasting all partitions regardless of utility.
Computation: Each processor computes its assigned block by evaluating the dot products only for its local set of nonzeros, sharing minimal results where output reduction is necessary.
Performance: SpComm3D reports a geometric-mean speedup in SDDMM runtime over dense 3D across various graph matrices, together with reduced network volume and lower memory per rank. Near-ideal strong scaling is observed, exceeding prior layouts (Abubaker et al., 2024, Bharadwaj et al., 2022).
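The difference between the two communication policies can be sketched by counting how many dense rows a single rank needs. The following is a simplified model, not SpComm3D's actual protocol: it ignores rows the rank already owns, and the block size, nonzero count, and function names are all hypothetical.

```python
import numpy as np

def rows_needed(rows, cols):
    """Distinct rows of A and B referenced by a rank's local nonzeros."""
    return np.unique(rows), np.unique(cols)

def comm_words(rows, cols, block_rows, block_cols, r):
    """Dense words one rank receives under each policy.

    sparsity-agnostic: every A-row and B-row of the block is received.
    sparsity-aware:    only rows touched by a local nonzero are received.
    """
    a_idx, b_idx = rows_needed(rows, cols)
    agnostic = (block_rows + block_cols) * r
    aware = (a_idx.size + b_idx.size) * r
    return agnostic, aware

# Assumed example: a 1000 x 1000 block with 50 local nonzeros and r = 64.
rng = np.random.default_rng(1)
rows = rng.integers(0, 1000, size=50)
cols = rng.integers(0, 1000, size=50)
agnostic, aware = comm_words(rows, cols, 1000, 1000, r=64)
print(agnostic, aware)
```

In this model the sparsity-aware volume is bounded by twice the local nonzero count times $r$, whereas the sparsity-agnostic volume depends only on the block dimensions.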

3. Hardware-Accelerated and Vectorized Implementations

Efficient SDDMM execution requires tightly optimized kernels that match hardware capabilities:

FlashSparse on NVIDIA Tensor Cores: Uses a swap-and-transpose MMA strategy in which, by algebraic manipulation, the vector length on the sparse side is halved (from 16 to 8), reducing zero padding and computation by 50%. With ME-BCRS storage (memory-efficient Blocked CSR), redundant memory and compute patterns are minimized. On H100 and RTX 4090 GPUs, FlashSparse reports higher SDDMM throughput than the prior state of the art, along with measured end-to-end application speedups (Shi et al., 2024).
ESIMD/SYCL on Intel GPUs: By explicitly programming vector operations and prefetching for gather-scatter patterns, SDDMM kernels are reported to approach the theoretical peak of the device and to outperform CUDA V100 implementations on select benchmarks. Critical optimizations include register blocking, chunked SIMD, occupancy balancing, and indirect load prefetching (Zubair et al., 2023).
Cerebras CS-3 Accelerator: Uses a 2D PE grid with dataflow routes; SDDMM is mapped as a streaming operation where each worker PE computes only the needed nonzero entries, with host-to-device and device-to-host bandwidth directly proportional to the dense and sparse blocks, respectively. For sufficiently dense inputs, this approach yields a speedup over CPU execution; at extreme sparsity, bandwidth overhead from padding becomes dominant (Shah et al., 30 Apr 2026).
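The effect of vector granularity on zero padding can be estimated with a small counting sketch. It assumes a simplified storage model (not the exact ME-BCRS layout): rows are cut into windows of height `v`, and every (window, column) pair holding at least one nonzero is stored as a full `v x 1` vector. The matrix size and nonzero count are hypothetical.

```python
import numpy as np

def padded_elements(rows, cols, v):
    """Elements stored when nonzeros are grouped into v x 1 column vectors."""
    windows = rows // v
    pairs = np.stack([windows, cols], axis=1)
    n_vectors = np.unique(pairs, axis=0).shape[0]
    return n_vectors * v

rng = np.random.default_rng(2)
coords = np.stack([rng.integers(0, 4096, size=2000),
                   rng.integers(0, 4096, size=2000)], axis=1)
coords = np.unique(coords, axis=0)        # keep distinct coordinates only
rows, cols = coords[:, 0], coords[:, 1]

stored_16 = padded_elements(rows, cols, 16)
stored_8 = padded_elements(rows, cols, 8)
print("v=16:", stored_16, "padding:", stored_16 - rows.size)
print("v=8: ", stored_8, "padding:", stored_8 - rows.size)
```

Under this model the length-8 layout never stores more elements than the length-16 layout, and for highly scattered patterns, where most vectors hold a single nonzero, the stored count approaches half.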

4. Algorithmic Fusion: SDDMM+SpMM (FusedMM)

SDDMM often directly feeds into a SpMM, e.g., in message passing for GNNs or factor updates in collaborative filtering. Materializing the SDDMM output before SpMM incurs avoidable DRAM pressure and communication. The “FusedMM” kernel performs both without intermediate writes:

Design: FusedMM unifies per-edge SDDMM computation and per-vertex SpMM reduction inside a single kernel, using user-supplied vector/scalar/message operations matching the embedding or neural model semantics (Rahman et al., 2020).
Benefits: Eliminates the large temporary buffer that would otherwise hold the intermediate SDDMM output, dramatically reduces memory traffic, and enables larger mini-batches and embedding sizes.
Distributed Fusion: In large-scale distributed-memory settings, communication-eliding strategies (either reusing block replication across both stages or fusing the local cyclic-shift loops into one) reduce communication time relative to naive kernel sequencing, improve total runtime, and are reported to outperform PETSc SpMM in real-world scenarios (Bharadwaj et al., 2022).
Portability and Platform-Specifics: FusedMM approaches generalize efficiently across Intel, AMD, and ARM, and can be fused with sparse-communication strategies (SpComm3D) in distributed 3D layouts (Abubaker et al., 2024, Rahman et al., 2020).
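The fusion idea can be sketched in a few lines of Python. This is a simplified illustration, not the FusedMM API: it assumes a CSR mask given as hypothetical arrays `indptr` and `indices`, a plain dot product as the edge operation, and a sum as the reduction.

```python
import numpy as np

def unfused(indptr, indices, A, B):
    """SDDMM followed by SpMM, materializing one value per nonzero."""
    m = indptr.size - 1
    edge_vals = np.empty(indices.size)            # intermediate buffer
    for i in range(m):
        for e in range(indptr[i], indptr[i + 1]):
            edge_vals[e] = A[i] @ B[indices[e]]   # SDDMM
    out = np.zeros_like(A)
    for i in range(m):
        for e in range(indptr[i], indptr[i + 1]):
            out[i] += edge_vals[e] * B[indices[e]]  # SpMM
    return out

def fused(indptr, indices, A, B):
    """Same result; each edge value is consumed as soon as it is produced."""
    m = indptr.size - 1
    out = np.zeros_like(A)
    for i in range(m):
        for e in range(indptr[i], indptr[i + 1]):
            j = indices[e]
            out[i] += (A[i] @ B[j]) * B[j]        # no per-edge buffer
    return out

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((5, 3))
indptr = np.array([0, 2, 3, 3, 5])    # 4 rows, 5 nonzeros
indices = np.array([1, 4, 2, 0, 3])

out_unfused = unfused(indptr, indices, A, B)
out_fused = fused(indptr, indices, A, B)
```

The fused variant never allocates the per-edge array, which is the buffer whose elimination the benefits above refer to.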

5. Complexity, Communication, and Memory Footprint

Key performance metrics and formulas include:

Model / Method	Comm Volume per Rank	Memory per Rank	Peak Speedup
Dense3D	all partitions broadcast, regardless of use	see Abubaker et al., 2024	(Baseline)
SpComm3D (SA)	only the required rows of $A$ and $B$	lower than Dense3D (Abubaker et al., 2024)	geometric-mean runtime speedup over Dense3D (Abubaker et al., 2024)
1.5D/2.5D RepReuse	see Bharadwaj et al., 2022	block-cyclic by design	comm. reduction over naive sequencing (Bharadwaj et al., 2022)
FlashSparse (GPU)	N/A (device-level)	lower index/value storage (Shi et al., 2024)	see Shi et al., 2024

All numerical figures appear directly in the referenced papers (Abubaker et al., 2024, Bharadwaj et al., 2022, Shi et al., 2024, Zubair et al., 2023). Work per nonzero is typically $2r$ flops (dot product: $r$ multiply–add pairs).
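As a worked instance of the per-nonzero cost, count each length-$r$ dot product as $r$ multiply–add pairs, i.e. $2r$ flops. The values below are assumed purely for illustration and are not taken from the cited papers: $m = n = 10^{6}$, $\mathrm{nnz}(S) = 10^{8}$, and $r = 128$.

$$
\text{flops}_{\text{SDDMM}} = 2r \cdot \mathrm{nnz}(S) = 2 \cdot 128 \cdot 10^{8} = 2.56 \times 10^{10},
\qquad
\text{flops}_{\text{dense}} = 2r \cdot mn = 2 \cdot 128 \cdot 10^{12} = 2.56 \times 10^{14}.
$$

The ratio between the two is $mn / \mathrm{nnz}(S) = 10^{4}$, which is the arithmetic saved by evaluating only the masked entries rather than forming $A B^\top$ densely.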

6. Workload Integration and Empirical Impact

SDDMM is a linchpin operation in GNN pipelines (attention/inference), graph embedding (ForceAtlas2, VERSE, Force2Vec), and collaborative filtering (alternating least squares).

Graph Processing: Enables edge-wise attention and message construction in GATs, AGNN, and classical layouts (Rahman et al., 2020, Bharadwaj et al., 2022).
Performance: On large matrices (hundreds of millions to billions of edges), state-of-the-art distributed SDDMM is reported to outperform established CPU and library baselines at both the kernel and end-to-end application level.
Bandwidth and Memory Pressure: Sparsity-aware and fused approaches alleviate I/O and DRAM bottlenecks, enabling real-time, large-scale training that is otherwise infeasible with dense or naive sparse kernels (Abubaker et al., 2024, Bharadwaj et al., 2022, Rahman et al., 2020).
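The role of SDDMM in edge-wise attention can be sketched as follows. This is a simplified stand-in that uses plain dot-product scores followed by a per-row softmax; GAT and AGNN define their own scoring functions, and the names `edge_softmax_scores`, `Q`, and `K` are hypothetical.

```python
import numpy as np

def edge_softmax_scores(indptr, indices, Q, K):
    """Dot-product scores on edges (an SDDMM), normalized per source row."""
    scores = np.empty(indices.size)
    for i in range(indptr.size - 1):
        lo, hi = indptr[i], indptr[i + 1]
        if lo == hi:
            continue                      # row with no edges
        s = K[indices[lo:hi]] @ Q[i]      # SDDMM restricted to row i's edges
        s = np.exp(s - s.max())           # numerically stable softmax
        scores[lo:hi] = s / s.sum()
    return scores

rng = np.random.default_rng(4)
Q = rng.standard_normal((3, 8))           # one query vector per source node
K = rng.standard_normal((4, 8))           # one key vector per target node
indptr = np.array([0, 2, 2, 5])           # CSR adjacency: 3 rows, 5 edges
indices = np.array([1, 3, 0, 2, 3])

att = edge_softmax_scores(indptr, indices, Q, K)
```

Only the five edge scores are ever computed; the full 3 x 4 score matrix that a dense attention layer would build is never formed.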

7. Implementation Guidelines and Future Considerations

Implementation and adoption of SDDMM must consider:

Data Decomposition: 3D sparsity-aware partitioning is advised for maximal scalability; choose the processor grid dimensions and owner assignment to minimize overlapping communication.
Kernel Design: Explicit vectorization, prefetching for index indirection, and chunk-based register management are essential for leveraging full hardware throughput for both sparse and dense operands (Zubair et al., 2023, Shi et al., 2024).
Pattern Regularity: The benefit of sparsity-aware methods grows as sparsity and pattern irregularity increase; at extreme sparsity, padding and underutilization become limiting factors (Shah et al., 30 Apr 2026).
Fused Pipelines: Whenever SDDMM output directly feeds into SpMM or similar operations, fusion is optimal to reduce memory and orchestration overhead.
Trade-offs: Minimum block size (e.g., the length-8 sparse-side vectors used by FlashSparse), static pattern constraints, and preprocessing/storage format transformations trade maximum theoretical throughput against practical complexity and resource limits (Shi et al., 2024).

SDDMM remains an area of active development, with ongoing research pursuing finer-grained fusion, sparsity-friendly dataflow orchestration on novel accelerators, and portable high-level APIs matched to modern hardware (Bharadwaj et al., 2022, Abubaker et al., 2024, Rahman et al., 2020, Shi et al., 2024, Zubair et al., 2023, Shah et al., 30 Apr 2026).