Parallel Small Polynomial Multiplication with TEE

Updated 13 April 2026

The paper introduces PSPM-TEE, a framework that reduces the cost of sparse polynomial multiplication by exploiting structural sparsity and parallelism.
It leverages packed accumulators and SIMD instructions (AVX2/AVX-512) to achieve 1.5×–2× speedups over classical NTT-based methods.
Tailored Early Evaluation interleaves coefficient computation with on-the-fly validity checks, aborting early to optimize the signing process.

Parallel Small Polynomial Multiplication with Tailored Early Evaluation (PSPM-TEE) is an algorithmic framework for accelerating the multiplication of sparse challenge polynomials with vectors of small secret or error polynomials, specifically within the signing procedure of the lattice-based post-quantum signature scheme CRYSTALS-Dilithium. PSPM-TEE leverages both the structural sparsity involved in the polynomial product and architectural features of modern SIMD vector extensions—most notably AVX2 and AVX-512—to achieve significant performance gains relative to number-theoretic transform (NTT)-based approaches (Zheng et al., 2023).

1. Formal Framework and Mathematical Description

Let $R_q = \mathbb{Z}_q[x]/(x^n+1)$, with typical parameters $n=256$ and $q=8380417$ as in Dilithium. The challenge polynomial $c(x) = \sum_{i=0}^{n-1}c_i x^i \in B_\tau \subset R_q$ is characterized by exactly $\tau$ nonzero coefficients $c_i\in\{\pm1\}$, while all other $c_i=0$. The secret or error polynomials are represented as a vector $(a^{(0)}(x), \dots, a^{(r-1)}(x))$ of $r$ ring elements.

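The classical ring convolution computes coefficient $k$ of each product $c(x)\cdot a^{(j)}(x)$ as: $(c\cdot a^{(j)})_k = \sum_{i=0}^{k} c_i\,a^{(j)}_{k-i} - \sum_{i=k+1}^{n-1} c_i\,a^{(j)}_{n+k-i} \pmod q$, where the minus sign comes from $x^n = -1$. This is often accelerated by NTT-based multiplication, giving $O(n\log n)$ cost in terms of modular multiplies.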
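As a point of reference for the sparse approach, the ring convolution can be written directly as a loop that skips zero challenge coefficients. This is a minimal sketch of the schoolbook product in $\mathbb{Z}_q[x]/(x^n+1)$, not the paper's implementation; the function name `negacyclic_mul_sparse` is a hypothetical choice for illustration.

```python
def negacyclic_mul_sparse(c, a, q=8380417):
    """Return c*a in Z_q[x]/(x^n + 1), visiting only the nonzero c_i."""
    n = len(a)
    out = [0] * n
    for i, ci in enumerate(c):
        if ci == 0:
            continue  # sparsity: only tau of the n challenge coefficients are nonzero
        for j, aj in enumerate(a):
            k = i + j
            if k < n:
                out[k] += ci * aj
            else:
                out[k - n] -= ci * aj  # x^n = -1, so the wrapped term flips sign
    return [v % q for v in out]
```

For example, with $n=4$, multiplying $a = 1 + 2x + 3x^2 + 4x^3$ by $c = x$ gives $-4 + x + 2x^2 + 3x^3$, so the first output coefficient is $q-4$.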

In contrast, PSPM exploits sparsity and parallelism. The key elements are:

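Choose an integer base $M$ (a power of two, so that decoding uses shifts), with $M > 2\tau\eta$, where $\eta$ bounds the absolute value of every coefficient of the $a^{(j)}$.
For $i = 0, \dots, n-1$, compute packed accumulators:

$v_i = \sum_{j=0}^{r-1} \left(\eta + a^{(j)}_i\right) M^j$

Precompute the negated counterparts $\bar v_i = \sum_{j=0}^{r-1} \left(\eta - a^{(j)}_i\right) M^j$.
For each coefficient slot $k = 0, \dots, n-1$:

$w_k = \sum_{i:\,c_i\neq 0} u_{i,k}, \qquad u_{i,k} = \begin{cases} v_{(k-i) \bmod n} & \text{if } \sigma_{i,k} = +1 \\ \bar v_{(k-i) \bmod n} & \text{if } \sigma_{i,k} = -1 \end{cases}, \qquad \sigma_{i,k} = \begin{cases} c_i & \text{if } i \le k \\ -c_i & \text{if } i > k \end{cases}$

Decode the $r$ products in parallel by recursive division/modulo by $M$ and subtraction of $\tau\eta$, correspondingly retrieving each $(c\cdot a^{(j)})_k$ from $w_k = \sum_{j=0}^{r-1}\left(\tau\eta + (c\cdot a^{(j)})_k\right) M^j$. This scheme is $O(\tau n)$ in 32-bit adds and shifts.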
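The packing, accumulation, and decoding steps can be sketched in plain Python. This is an illustrative model under stated assumptions, not the paper's code: it assumes every coefficient of the small polynomials lies in $[-\eta, \eta]$, uses the offset $\eta$ per addend and a power-of-two base $M > 2\tau\eta$, and relies on Python's unbounded integers instead of 32-bit words. The name `pspm_multiply` is hypothetical.

```python
def pspm_multiply(c, polys, eta):
    """Multiply sparse c by every polynomial in polys at once, in Z[x]/(x^n + 1).

    Returns signed integer coefficients (no reduction mod q).
    """
    n, r = len(c), len(polys)
    tau = sum(1 for ci in c if ci != 0)
    M = 1
    while M <= 2 * tau * eta:  # smallest power of two with M > 2*tau*eta
        M <<= 1
    # Digit j of pos[i] is eta + a_j[i]; digit j of neg[i] is eta - a_j[i].
    pos = [sum((eta + a[i]) * M**j for j, a in enumerate(polys)) for i in range(n)]
    neg = [sum((eta - a[i]) * M**j for j, a in enumerate(polys)) for i in range(n)]
    w = [0] * n
    for i, ci in enumerate(c):
        if ci == 0:
            continue
        for k in range(n):
            sign = ci if k >= i else -ci  # wrapping past x^n flips the sign
            w[k] += pos[(k - i) % n] if sign == 1 else neg[(k - i) % n]
    # Each digit now holds tau*eta + (c * a_j)_k; peel digits off base M.
    out = [[0] * n for _ in range(r)]
    for k in range(n):
        word = w[k]
        for j in range(r):
            out[j][k] = (word % M) - tau * eta
            word //= M
    return out
```

Because every digit stays in $[0, 2\tau\eta]$ and $M > 2\tau\eta$, no carries cross between digits, which is what lets one addition per nonzero $c_i$ update all $r$ products at once. In a fixed-width implementation, $M^r$ would additionally have to fit in the word.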

2. Tailored Early Evaluation in the Dilithium Signing Procedure

Tailored Early Evaluation (TEE) augments PSPM by interleaving the computation of output coefficients with validity checks needed in Dilithium signing. Specifically, after forming

$\mathbf{z} = \mathbf{y} + c\,\mathbf{s}_1, \qquad \mathbf{r}_0 = \mathrm{LowBits}_q(\mathbf{w} - c\,\mathbf{s}_2,\ 2\gamma_2)$

it must be checked that $\|\mathbf{z}\|_\infty < \gamma_1 - \beta$ and $\|\mathbf{r}_0\|_\infty < \gamma_2 - \beta$. As each check can fail with non-negligible probability, the TEE method computes and checks each coefficient on the fly: if any coefficient violates the bounds, the algorithm aborts early, saving further computation. The test order is optimized per Dilithium parameter set by ordering checks according to descending failure probability, further improving expected efficiency.
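The early-abort idea can be illustrated for the check on $\mathbf{z} = \mathbf{y} + c\,\mathbf{s}_1$ alone. The sketch below assumes coefficients are kept as centered signed integers, uses the standard Dilithium bound names `gamma1` and `beta`, and evaluates one polynomial in natural coefficient order; the function name `early_eval_z` is hypothetical, and the per-parameter-set ordering of checks described above is not modeled.

```python
def early_eval_z(y, c, s1, gamma1, beta):
    """Compute z = y + c*s1 in Z[x]/(x^n + 1) one coefficient at a time.

    Returns (z, n) on success, or (None, evaluated) as soon as a
    coefficient violates |z_k| < gamma1 - beta.
    """
    n = len(y)
    support = [(i, ci) for i, ci in enumerate(c) if ci != 0]
    z = [0] * n
    for k in range(n):
        acc = y[k]
        for i, ci in support:
            if k >= i:
                acc += ci * s1[k - i]
            else:
                acc -= ci * s1[n + k - i]  # wrapped term, sign flipped by x^n = -1
        if abs(acc) >= gamma1 - beta:
            return None, k + 1  # abort: remaining coefficients are never computed
        z[k] = acc
    return z, n
```

The second return value records how many coefficients were evaluated before the attempt was accepted or rejected, which is the quantity early evaluation tries to keep small on rejected attempts.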

3. Algorithmic Structure and Practical Details

The overall PSPM-TEE process involves:

Precomputing the packed ($M$-encoded) lookup arrays for $\mathbf{s}_1$ and $\mathbf{s}_2$.
For each output slot, accumulating the packed sum over the nonzero challenge coefficients as explained above.
Decoding the relevant coefficients by recursive division/modulo by the base $M$.
Applying $\mathrm{LowBits}$ and the $\ell_\infty$-norm bounds as dictated.
Upon the first failed coefficient check, the algorithm aborts and restarts the signing attempt, so the expected saving grows with the rejection rate.

The method requires modest extra memory for M-encoded lookup tables (e.g., about 8 KiB for Dilithium III/V and 4 KiB for Dilithium II), and is particularly effective for parallel SIMD implementations due to its reliance on additions and shifts rather than modular multiplications.
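The quoted table sizes can be reproduced with a short calculation under a few assumptions that the text above does not confirm: the base is the smallest power of two $M > 2\tau\eta$, each 32-bit word packs as many base-$M$ digits as fit, $\mathbf{s}_1$ (length $\ell$) and $\mathbf{s}_2$ (length $k$) are packed separately, and each packed word is stored for both the positive and the negated encoding ($2n$ entries). The function name `table_kib` is hypothetical; the $(\tau, \eta, k, \ell)$ values are the standard Dilithium parameter sets.

```python
def table_kib(tau, eta, k, l, n=256, word_bits=32):
    """Estimated lookup-table size in KiB under the assumptions stated above."""
    digit_bits = (2 * tau * eta).bit_length()  # smallest b with 2**b > 2*tau*eta
    per_word = word_bits // digit_bits         # polynomials packed into one word
    words = -(-l // per_word) + -(-k // per_word)  # ceil(l/per_word) + ceil(k/per_word)
    return 2 * n * words * (word_bits // 8) / 1024

print(table_kib(39, 2, 4, 4))  # Dilithium II  -> 4.0
print(table_kib(49, 4, 6, 5))  # Dilithium III -> 8.0
print(table_kib(60, 2, 8, 7))  # Dilithium V   -> 8.0
```

That these assumptions land on 4 KiB and 8 KiB is consistent with the figures above, but it does not establish that the reference implementation uses exactly this layout.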

4. Comparative Analysis: PSPM-TEE versus NTT

The NTT-based approach for challenge polynomial multiplication is characterized by a forward transform, pointwise multiplication, and inverse transform for each operand, totaling $O(n\log n)$ butterfly operations (e.g., $\tfrac{n}{2}\log_2 n = 1024$ per transform for $n = 256$), each requiring modular multiplications and additions.

In contrast, PSPM-TEE for typical Dilithium parameter sets requires only $O(\tau n)$ 32-bit adds and shifts, operations that are more efficiently vectorized and less computationally intensive on modern hardware. For Dilithium III, the relevant parameters are $\tau = 49$ and $n = 256$ (so $\tau n = 12{,}544$ scalar operations dominated by adds and shifts).
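The two operation counts can be made concrete with a small calculation. This is a back-of-the-envelope sketch, assuming a full radix-2 transform for the NTT and one packed-word addition per pair of nonzero challenge coefficient and output slot; the function names are hypothetical. The counts are not directly comparable: each butterfly includes a modular multiplication, while each packed addition updates several polynomials at once.

```python
def ntt_butterflies(n):
    """Butterflies in one full radix-2 transform of length n (n a power of two)."""
    return (n // 2) * (n.bit_length() - 1)

def pspm_word_additions(tau, n):
    """Packed-word additions for one sparse product over all n output slots."""
    return tau * n

print(ntt_butterflies(256))          # 1024
print(pspm_word_additions(49, 256))  # 12544
```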

Empirical benchmarks at 2.5 GHz on Intel i7-11700F demonstrate the following (full details in Table VII of (Zheng et al., 2023)):

Operation	PSPM AVX2 (cycles)	PSPM AVX-512 (cycles)	NTT AVX2 (cycles)	NTT AVX-512 (cycles)	Cycle reduction (AVX2/AVX-512)
$c\,\mathbf{s}_1$ (Dil3)	6,748	2,556	14,636	6,740	54%/62%
$c\,\mathbf{s}_2$ (Dil3)	8,358	2,560	16,010	7,480	48%/66%

This illustrates a 48%–66% reduction in measured cycle counts for PSPM-TEE relative to NTT for challenge multiplications.
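The percentage column can be recomputed from the cycle counts in the table. The sketch below treats the percentage as the relative reduction in cycles, $1 - \text{PSPM}/\text{NTT}$, which is an interpretation inferred from the numbers; it also prints the equivalent ratio $\text{NTT}/\text{PSPM}$.

```python
# (PSPM AVX2, PSPM AVX-512, NTT AVX2, NTT AVX-512) cycles for the two table rows
ROWS = [(6748, 2556, 14636, 6740), (8358, 2560, 16010, 7480)]

def cycle_reduction_percent(pspm, ntt):
    """Percentage of NTT cycles saved by PSPM."""
    return round(100 * (1 - pspm / ntt))

def speedup_factor(pspm, ntt):
    """How many times faster PSPM is than NTT."""
    return round(ntt / pspm, 1)

for p2, p512, n2, n512 in ROWS:
    print(cycle_reduction_percent(p2, n2), cycle_reduction_percent(p512, n512),
          speedup_factor(p2, n2), speedup_factor(p512, n512))
# 54 62 2.2 2.6
# 48 66 1.9 2.9
```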

5. AVX2 / AVX-512 Vectorization Strategies

SIMD vectorization is central to the performance of PSPM-TEE. The structure allows packing one M-encoded accumulator per 32-bit lane, giving:

8-way parallelism on AVX2,
16-way parallelism on AVX-512.

Key instructions and considerations:

vpsrlq for logical shifts to implement division by a power of two, such as the base $M$.
vpmadd52luq (AVX-512 IFMA) for fused integer multiply-add (the low 52 bits of a 52-bit $\times$ 52-bit product are added to a 64-bit accumulator lane).
Masking and shuffle operations (e.g., vpermq, vpblendmd, vshufi32x4, vpunpcklqdq, vpunpckhqdq) are used to implement butterfly stages.
Tailored reduction modulo $q$ is realized in two SIMD instructions: logical shift and multiply-add, minimizing memory traffic and instruction count.
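A shift followed by a multiply-and-subtract is enough to reduce modulo $q = 2^{23} - 2^{13} + 1$ because $2^{23} \equiv 2^{13} - 1 \pmod q$. The scalar sketch below follows the `reduce32` routine of the Dilithium reference implementation; whether the paper's tailored SIMD reduction has exactly this form is an assumption. The result is a small signed representative congruent to the input, not a value normalized to $[0, q)$.

```python
Q = 8380417  # 2**23 - 2**13 + 1

def reduce32(a):
    """Partially reduce a 32-bit signed value modulo Q using one shift and one multiply-subtract."""
    t = (a + (1 << 22)) >> 23  # rounded quotient estimate a / 2**23 (arithmetic shift)
    return a - t * Q           # congruent to a mod Q, with magnitude below 2**23

print(reduce32(Q + 5))  # 5
```

In a vectorized version, the same two steps would be applied to every lane at once, which is why the reduction costs only a couple of instructions per vector.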

With careful data layout and instruction-level scheduling, AVX-512 implementations show substantial measured improvements in key generation, signing, and verification for Dilithium 2/3/5 over highly optimized AVX2 baselines; the specific percentages are reported in (Zheng et al., 2023).

6. Applicability, Limitations, and Observed Performance

PSPM-TEE is applicable to sparse challenge polynomial multiplications as found in CRYSTALS-Dilithium’s signing but does not supersede NTT for generic dense×dense polynomial products. Its efficiency gains are tightly coupled to the challenge polynomial's sparsity and the size of $\tau$, which ranges from $39$ to $60$ in deployed parameters.

Limitations include:

Gains are proportional to early exit probability, which depends on the probability of violating norm bounds on $\mathbf{z}$ or $\mathbf{r}_0$; absolute savings may be lower when these probabilities are small.
Slightly increased memory usage for tables.
Maximum benefit is realized on platforms with wide SIMD (AVX-512); on lesser hardware improvements are still present but somewhat reduced.

Across full CRYSTALS-Dilithium signing, end-to-end cycle counts for Dilithium III are reported for a C baseline, an AVX2 baseline, and the AVX-512 implementation, with the AVX-512 implementation faster than both baselines; the exact cycle counts and speedup factors are given in (Zheng et al., 2023).

7. Significance in Post-Quantum Cryptography

By providing a highly optimized, vectorized approach for a critical operation in one of the NIST-standardized post-quantum signature schemes, PSPM-TEE has significant implications for implementation efficiency, particularly on x86-64 platforms supporting AVX-512. Its methodology—combining algebraic structure, sparsity, parallelization, and early abort—offers a template for further improvements in lattice-based cryptographic protocols, though its direct scope is constrained to operations involving sparse challenge polynomials (Zheng et al., 2023).

Key empirical, algorithmic, and architectural details, along with complete performance tables, are provided in the foundational reference (Zheng et al., 2023).

References

Zheng et al. (2023). Optimized Vectorization Implementation of CRYSTALS-Dilithium.
