
Single-DSP Multiple Multiplication (SDMM)

Updated 27 November 2025
  • Single-DSP–Multiple-Multiplication is a technique that enables multiple low-precision multiplications on a single DSP through efficient hardware packing and coding methods.
  • It employs strategies like INT4 packing and overpacking to boost DSP utilization while controlling error metrics such as mean absolute error.
  • In the distributed setting, SDMM protocols secure outsourced matrix multiplication using polynomial and algebraic geometry codes, reducing communication overhead with cooperative decoding.

Single-DSP–Multiple-Multiplication (SDMM) refers to two related classes of methodology sharing the acronym: hardware-level packing techniques that let a single Digital Signal Processor (DSP) block perform multiple low-precision multiplications per cycle, maximizing arithmetic throughput on FPGAs; and algorithmic and coding strategies for secure distributed matrix multiplication, in which matrix products are partitioned among a set of semi-trusted servers under constraints of privacy, security, and communication efficiency.

1. Hardware-Efficient DSP Packing for Multiple Multiplications

The motivation for SDMM in hardware arises from the underutilization of bit width in modern FPGA DSP blocks, such as the Xilinx DSP48E1 or DSP48E2, when handling quantized/low-precision workloads typical in machine learning. Traditional multiply-accumulate (MAC) mapping assigns one MAC per DSP, leading to resource inefficiency for 4–8 bit data.

Packing strategies re-encode multiple low-bit operands into larger bit-width vectors, leveraging positional shifts and guard bits to compute several independent multiplications in one DSP block per clock cycle. For instance, four 4×4-bit multiplies can be performed in one DSP (INT4 packing) by packing operands as $X = \sum_i x_i 2^{\alpha_i}$ and $Y = \sum_j y_j 2^{\beta_j}$, so that each product $x_i y_j$ appears at offset $\gamma_{i,j} = \alpha_i + \beta_j$ in $P = X \cdot Y$ and can be extracted by slicing the output after appropriate shifts and masking (Sommer et al., 2022).
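The slicing arithmetic can be sketched in software with an unsigned toy model. The offsets below ($\alpha = 16$, $\beta = 8$) space each 8-bit product into its own slice, so no correction logic is needed; the published packings are denser and handle signed operands, which is where the nonzero MAE arises:

```python
def pack_mul4(x0, x1, y0, y1):
    """Four 4x4-bit unsigned products from one wide multiplication.

    X = x0 + x1 * 2^16 and Y = y0 + y1 * 2^8, so each product x_i * y_j
    lands at offset gamma_{i,j} in {0, 8, 16, 24} of P = X * Y.
    """
    assert all(0 <= v < 16 for v in (x0, x1, y0, y1))
    X = x0 | (x1 << 16)                 # packed operand X
    Y = y0 | (y1 << 8)                  # packed operand Y
    P = X * Y                           # the single (DSP-sized) multiply
    # Each 4x4 product fits in 8 bits (max 15 * 15 = 225), so plain
    # shift-and-mask extraction recovers all four exactly.
    return [(P >> s) & 0xFF for s in (0, 8, 16, 24)]
```

The 20×12-bit operand pair fits comfortably in a 27×18 DSP48E2 multiplier; tighter offset choices trade this slack for more products per DSP, at the cost of overlap bias that correction logic (or an accepted MAE) absorbs.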

Packing density $\rho$ (sum of output bits versus DSP width) and mean absolute error (MAE) are key metrics. With correction logic (LUTs and flip-flops), bias from bit overlap or floor division can be eliminated (MAE = 0), while an "overpacking" strategy squeezes up to six 4-bit multiplies with slight accuracy loss (MAE ≈ 0.47). These techniques generalize to arbitrary bit widths and also allow for multi-addition packing in the DSP accumulator for use in applications such as spiking neural networks.

| Packing Method | Multiplies/DSP | MAE | Packing Density ($\rho$) |
|---|---|---|---|
| INT4, no correction | 4 | 0.37 | 0.67 |
| Overpacking | 6 | 0.47 | 1.13 |
| Addition packing | 5 (9-bit adds) | 0.51 | 0.94 |

2. Near-Precise Fixed-Point Parameter Approximation

Single-DSP–Multiple-Multiplication has also been developed as a near-precise algorithmic technique for separating multiplication and accumulation within the DSP block, thus enabling simultaneous computation of multiple fixed-point products and reducing the need for DSP resources (Kalali et al., 2021).

This method employs algebraic manipulation of weights $W$ into the form $W = 2^s(1 + 2^n\,\mathrm{MW})$, with MW restricted after approximation to a small set (e.g., five 3-bit values). The DSP pre-adder encodes multiple MW values, combined with the activation in the multiplier input. Parallel look-up tables perform the post-processing, and a small ROM stores parameter tuples for off-chip compression. Compression rates of up to 33% are achieved without hardware overhead, or up to 97% with additional pruning and Huffman encoding. Empirical evaluation on CNN benchmarks (AlexNet, VGG-16, Tiny ImageNet) demonstrates negligible accuracy loss ($\leq 0.4\%$) and substantial DSP savings (66–83%).
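A software sketch of the approximation search; the residual set `MW_SET` and the exponent ranges below are hypothetical placeholders, not the paper's exact parameterization:

```python
MW_SET = (0, 1, 2, 3, 5)  # hypothetical five-value 3-bit residual set

def approx_weight(w):
    """Approximate w as 2^s * (1 + 2^n * mw), minimizing absolute error."""
    candidates = ((s, n, mw)
                  for s in range(-8, 1)     # illustrative exponent ranges
                  for n in range(-4, 1)
                  for mw in MW_SET)
    s, n, mw = min(candidates,
                   key=lambda t: abs(w - 2 ** t[0] * (1 + 2 ** t[1] * t[2])))
    return s, n, mw, 2 ** s * (1 + 2 ** n * mw)
```

Storing only the tuple $(s, n, \mathrm{mw})$ per weight, rather than a full fixed-point value, is what yields the compression; in hardware the pre-adder forms $1 + 2^n\,\mathrm{mw}$ and the shift by $s$ is free wiring.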

3. Secure Distributed Matrix Multiplication (SDMM) Protocols

From a systems perspective, SDMM protocols enable a user to outsource the matrix product $AB$ to multiple servers while keeping the contents of $A$ and $B$ private against any coalition of up to $T$ colluding servers. A central performance metric is the download rate $R$, the number of desired symbols divided by the number of downloaded symbols.

Core SDMM protocols introduce polynomial codes, such as the GASP (Gap Additive Secure Polynomial) family, which implement matrix block partitioning and random masking to maximize rate and guarantee information-theoretic TT-security (D'Oliveira et al., 2018). GASP arranges masking blocks to induce collisions in the degree table, minimizing the number of distinct evaluation points required for decoding and improving rate over classical (Yu–Maddah–Ali–Avestimehr) codes.
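The basic masking-and-interpolation flow can be sketched for $T = 1$ (plain polynomial secret sharing; the GASP degree-table machinery that improves the rate for partitioned matrices is not shown):

```python
import random

P = 2_147_483_647  # prime field modulus for the demo

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) % P
             for col in zip(*B)] for row in A]

def secure_matmul(A, B):
    """Recover AB from three servers, none of which learns A or B.

    f(x) = A + R_A x and g(x) = B + R_B x with uniform random masks:
    any single server sees one-time-padded shares.  f(x) g(x) has
    degree 2, so three evaluations determine (fg)(0) = AB.
    """
    m, k, n = len(A), len(B), len(B[0])
    RA = [[random.randrange(P) for _ in range(k)] for _ in range(m)]
    RB = [[random.randrange(P) for _ in range(n)] for _ in range(k)]
    answers = []
    for x in (1, 2, 3):  # server i receives only (f(x_i), g(x_i))
        fA = [[(a + r * x) % P for a, r in zip(ra, rr)] for ra, rr in zip(A, RA)]
        gB = [[(b + r * x) % P for b, r in zip(rb, rr)] for rb, rr in zip(B, RB)]
        answers.append(matmul(fA, gB))
    # Lagrange coefficients for x = 0 on points (1, 2, 3): 3, -3, 1
    lam = (3, P - 3, 1)
    return [[sum(l * h[r][c] for l, h in zip(lam, answers)) % P
             for c in range(n)] for r in range(m)]
```

Here the rate is 1/3; GASP's contribution is arranging the masking terms so colliding degrees shrink the number of distinct evaluation points needed, pushing the rate higher.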

Recent advances extend single-variable codes to the outer product partitioning (OPP) regime, formalizing degree tables and introducing new families (CAT, DOG, GASPrs) that optimize the number of required workers through cyclic addition and roots-of-unity evaluation. CAT codes exploit evaluation over roots of unity, reducing the number of workers $N$ in low-privacy regimes and enabling modular arithmetic in the degree table for further rate improvements (Hofmeister et al., 21 Jan 2025).

| SDMM Code Family | Partitioning | Evaluation Points | Optimal Regime |
|---|---|---|---|
| GASP | General (poly) | integer progression | all $K, L, T$; large $K$ |
| CAT | Outer-product | cyclic (roots of unity) | small $T$, large $K, L$ |
| DOG, GASPrs | General | integer progression | moderate/high privacy |

4. Algebraic-Geometric and Root-of-Unity Code Constructions

The HerA (Hermitian Algebraic) scheme further generalizes SDMM by encoding input matrices in Hermitian AG codes (and their duals), facilitating inner-product partitioning while operating over significantly smaller fields than univariate polynomial codes (Machado et al., 2023). HerA leverages the large number of rational points on the Hermitian curve ($q^3$ points over $\mathbb{F}_{q^2}$) to support $N \gg q^2$ servers and dramatically expands security capacity ($T \approx q^3/2 - L$).

Root-of-unity-based constructions for grid-partition cases interpolate between inner- and outer-product regimes, matching best-known communication loads in both settings and offering a continuous bandwidth/latency trade-off by selecting grid parameters $(t, s, d)$ for block partitioning (Machado et al., 2022). Decoding exploits the discrete-Fourier property for recovery, and security derives from Vandermonde-submatrix invertibility.
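A minimal finite-field sketch of the discrete-Fourier decoding step (field and root choices here are illustrative): servers return evaluations of the product polynomial at the $N$-th roots of unity, and the user inverts them with an inverse DFT:

```python
P17 = 17       # prime with N dividing P17 - 1
N, W = 4, 4    # 4 is a primitive 4th root of unity mod 17 (4^2 = -1)

def eval_poly(coeffs, x, p):
    return sum(c * pow(x, k, p) for k, c in enumerate(coeffs)) % p

def idft(values, w, p):
    """Coefficients of a degree < len(values) polynomial from its
    evaluations at the powers of the root of unity w."""
    n_inv = pow(len(values), -1, p)
    return [n_inv * sum(v * pow(w, -j * k, p) for j, v in enumerate(values)) % p
            for k in range(len(values))]

# Product of two linear encodings, decoded from N evaluations:
roots = [pow(W, j, P17) for j in range(N)]
h_vals = [eval_poly([2, 3], x, P17) * eval_poly([5, 7], x, P17) % P17
          for x in roots]
coeffs = idft(h_vals, W, P17)  # coefficients of (2 + 3x)(5 + 7x) mod 17
```

Because the evaluation points are the powers of one root of unity, the Vandermonde inversion collapses to this DFT sum, which is what makes the root-of-unity schemes cheap to decode.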

5. Cooperative and Communication-Efficient SDMM Protocols

Classical SDMM protocols treat server collusion solely as a threat, leading to increased download thresholds. Cooperative SDMM (C-SDMM) schemes exploit inter-server links for collaborative decoding, enabling clusters of up to $X$ colluding servers to aggregate and deliver recovery results, reducing user-side download costs by a factor $\sim X$ at the expense of increased server-to-server cooperation (Li et al., 2021).

The protocol partitions upload, compute/cooperate, and download/decode phases, using Reed–Solomon codes for masking and Lagrange interpolation for result aggregation. Information-theoretic security remains intact provided random mask terms are present. Parameter selection allows trading off minimal download ($D \rightarrow 2tr$) against cooperation costs for practical deployment in proximity-enabled architectures.

6. Communication, Security, and Decoding Analysis

Across all SDMM approaches, encoding polynomials carry both data and random masking blocks. Security is formalized as $I(A, B;\ \text{all data seen by any } T \text{ servers}) = 0$, guaranteed by MDS or Vandermonde submatrix properties. Decoding typically requires inverting Vandermonde systems, or FFT-based interpolation in root-of-unity codes.
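The decodability half of this condition is concrete: a Vandermonde matrix on the chosen evaluation points is invertible exactly when the points are distinct, since its determinant is $\prod_{i<j}(x_j - x_i)$. A short check (field modulus here is arbitrary):

```python
from itertools import combinations

def vandermonde_det(xs, p):
    """Determinant of the Vandermonde matrix on points xs, reduced mod p.

    Uses det V = prod_{i<j} (x_j - x_i); nonzero iff all points differ
    (and no pairwise difference vanishes mod p).
    """
    d = 1
    for i, j in combinations(range(len(xs)), 2):
        d = d * (xs[j] - xs[i]) % p
    return d
```

A scheme designer checks this once per choice of evaluation points; over small fields (as in AG-code constructions) the constraint that differences stay nonzero mod $p$ is what limits how many points, hence servers, a univariate code can support.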

Rate formulas, communication complexity, and server-side compute requirements are tightly coupled to code choice, partitioning regime, and field-size constraints. Algebraic geometry codes (HerA) and root-of-unity schemes provide improved scalability in both field size and security allowance.

7. Practical Implications, Extensions, and Comparative Summary

Single-DSP–Multiple-Multiplication encompasses a broad spectrum of methodologies, from hardware-centric packing and compression for quantized ML inference to sophisticated polynomial and AG-code-based distributed protocols for secure outsourced matrix multiplication in private learning and privacy-preserving data analysis.

Key performance factors include packing density, DSP savings, compression rates, rate improvement per worker, and decodability thresholds. Algorithmic choices affect hardware requirements (LUTs/FFs), error tolerance, and precision. Polynomial-code protocols and their extensions (GASP, CAT, DOG) deliver information-theoretic privacy and flexible communication cost, with numerical results indicating strict improvements over prior art in selected parameter regimes (Hofmeister et al., 21 Jan 2025).

Research directions are oriented toward dynamic packing/adaptive partitioning, further bandwidth minimization, mixed-precision arithmetic, and integration with large-scale secure ML or private outsourcing frameworks.
