
Parallel Small Polynomial Multiplication with TEE

Updated 13 April 2026
  • The paper introduces PSPM-TEE, a framework that reduces the cost of sparse polynomial multiplication by exploiting structural sparsity and parallelism.
  • It leverages packed accumulators and SIMD instructions (AVX2/AVX-512) to achieve 1.5×–2× speedups over classical NTT-based methods.
  • Tailored Early Evaluation interleaves coefficient computation with on-the-fly validity checks, aborting early to optimize the signing process.

Parallel Small Polynomial Multiplication with Tailored Early Evaluation (PSPM-TEE) is an algorithmic framework for accelerating the multiplication of sparse challenge polynomials with vectors of small secret or error polynomials, specifically within the signing procedure of the lattice-based post-quantum signature scheme CRYSTALS-Dilithium. PSPM-TEE leverages both the structural sparsity involved in the polynomial product and architectural features of modern SIMD vector extensions—most notably AVX2 and AVX-512—to achieve significant performance gains relative to number-theoretic transform (NTT)-based approaches (Zheng et al., 2023).

1. Formal Framework and Mathematical Description

Let $R_q = \mathbb{Z}_q[x]/(x^n+1)$, with typical parameters $n = 256$ and $q = 8380417$ as in Dilithium. The challenge polynomial $c(x) = \sum_{i=0}^{n-1} c_i x^i \in B_\tau \subset R_q$ has exactly $\tau$ nonzero coefficients $c_i \in \{\pm 1\}$; all other $c_i = 0$. The secret or error polynomials are represented as a vector $(a^{(0)}(x), \ldots, a^{(r-1)}(x))$ of $r$ ring elements.

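The classical ring convolution that gives the $k$-th coefficient of each product $c(x) \cdot a^{(j)}(x)$ is:

$$\left(c \cdot a^{(j)}\right)_k = \sum_{i=0}^{k} c_i\, a^{(j)}_{k-i} \;-\; \sum_{i=k+1}^{n-1} c_i\, a^{(j)}_{n+k-i} \pmod q, \qquad 0 \le k < n.$$

This is often accelerated by NTT-based multiplication, giving $O(n \log n)$ cost in terms of modular multiplies.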
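The wrap-around rule $x^n = -1$ in this convolution can be illustrated with a minimal Python sketch. The function name and the toy parameters ($n = 4$, $q = 17$) are assumptions chosen for readability, not values from the paper.

```python
def negacyclic_mul_sparse(c, a, q):
    """Schoolbook product c(x) * a(x) in Z_q[x]/(x^n + 1), skipping zero c_i."""
    n = len(a)
    out = [0] * n
    for i, ci in enumerate(c):
        if ci == 0:
            continue  # sparsity: only the nonzero c_i contribute
        for j, aj in enumerate(a):
            k = i + j
            if k < n:
                out[k] += ci * aj
            else:
                out[k - n] -= ci * aj  # x^n = -1 wraps with a sign flip
    return [v % q for v in out]


# Toy ring with n = 4, q = 17: c = 1 - x^3, a = 1 + 2x + 3x^2 + 4x^3
print(negacyclic_mul_sparse([1, 0, 0, -1], [1, 2, 3, 4], 17))  # [3, 5, 7, 3]
```

Because $c$ has only $\tau$ nonzero entries, the outer loop does useful work $\tau$ times, which is where the $n\tau$ operation count of a sparse schoolbook product comes from.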

In contrast, PSPM exploits sparsity and parallelism. The key elements are:

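  • Choose integer base $M$, with $M > 2\tau U$, where $U$ bounds $|a^{(j)}_i|$ for all $i$ and $j$.
  • For $i = 0, \ldots, n-1$, compute packed accumulators:

$$v_i = \sum_{j=0}^{r-1} \left(U + a^{(j)}_i\right) M^j$$

  • Precompute the matching table for negated inputs, $\bar v_i = \sum_{j=0}^{r-1} \left(U - a^{(j)}_i\right) M^j$.
  • For each coefficient slot $k = 0, \ldots, n-1$:

$$w_k = \sum_{\substack{i \le k \\ c_i = 1}} v_{k-i} \;+\; \sum_{\substack{i \le k \\ c_i = -1}} \bar v_{k-i} \;+\; \sum_{\substack{i > k \\ c_i = 1}} \bar v_{n+k-i} \;+\; \sum_{\substack{i > k \\ c_i = -1}} v_{n+k-i}$$

  • Decode the $r$ products in parallel by recursive division/modulo by $M$ and subtraction of $\tau U$, correspondingly retrieving each $(c \cdot a^{(j)})_k$. This scheme is $O(n\tau)$ in 32-bit adds and shifts.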
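The pack, accumulate, and decode steps above can be sketched in Python as follows. This is an illustrative reading of the scheme, not the paper's code: the function name `pspm_multiply`, the per-field offset `bound`, the negated table `neg`, and the choice of the base as the smallest power of two above $2\tau \cdot$`bound` are all assumptions, and unbounded Python integers stand in for the fixed-width machine words a real implementation would use.

```python
def pspm_multiply(c, polys, bound):
    """Multiply sparse c (entries in {-1, 0, 1}) by every polynomial in
    `polys` at once in Z[x]/(x^n + 1). Assumes |coefficient| <= bound."""
    n, r = len(c), len(polys)
    tau = sum(1 for ci in c if ci != 0)
    # Assumed base: a power of two above 2*tau*bound, so no field overflows.
    M = 1 << (2 * tau * bound).bit_length()
    # Packed tables: field j holds bound + a_i^(j) (pos) or bound - a_i^(j) (neg).
    pos = [sum((bound + p[i]) * M**j for j, p in enumerate(polys)) for i in range(n)]
    neg = [sum((bound - p[i]) * M**j for j, p in enumerate(polys)) for i in range(n)]
    w = [0] * n
    for i, ci in enumerate(c):
        if ci == 0:
            continue
        for j in range(n):
            k, sign = i + j, ci
            if k >= n:  # wrap around using x^n = -1
                k, sign = k - n, -sign
            w[k] += pos[j] if sign == 1 else neg[j]  # one add serves all r products
    # Decode: peel off base-M digits and remove the accumulated offset tau*bound.
    out = [[0] * n for _ in range(r)]
    for k in range(n):
        t = w[k]
        for j in range(r):
            out[j][k] = t % M - tau * bound
            t //= M
    return out


# Toy example with n = 4: c = 1 - x^3 against two small polynomials.
print(pspm_multiply([1, 0, 0, -1], [[1, 2, 3, 4], [-1, 0, 1, -2]], 4))
# [[3, 5, 7, 3], [-1, 1, -1, -1]]
```

Every slot receives exactly $\tau$ table entries, so each field ends up holding the true product coefficient plus $\tau \cdot$`bound`, which is why a single subtraction per field recovers the signed result.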

2. Tailored Early Evaluation in the Dilithium Signing Procedure

Tailored Early Evaluation (TEE) augments PSPM by interleaving the computation of output coefficients with validity checks needed in Dilithium signing. Specifically, after forming

$$\mathbf{z} = \mathbf{y} + c\,\mathbf{s}_1 \quad\text{and}\quad \mathbf{r}_0 = \mathsf{LowBits}(\mathbf{w} - c\,\mathbf{s}_2),$$

it must be checked that $\|\mathbf{z}\|_\infty < \gamma_1 - \beta$ and $\|\mathbf{r}_0\|_\infty < \gamma_2 - \beta$. As each check can fail with non-negligible probability, the TEE method computes and checks each coefficient on the fly: if any coefficient violates the bounds, the algorithm aborts early, saving further computation. The test order is optimized per Dilithium parameter set by ordering checks according to descending failure probability, further improving expected efficiency.
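A minimal sketch of the compute-then-check loop, under simplifying assumptions: a single polynomial rather than a vector, plain integers rather than centered residues modulo $q$, and one generic bound `limit`. The function name and return convention are hypothetical.

```python
def response_with_early_abort(y, c, s, limit):
    """Compute z = y + c*s in Z[x]/(x^n + 1) one coefficient at a time.

    Returns (z, n) on success, or (None, k) if the k-th coefficient
    computed was the first with |z_k| >= limit.
    """
    n = len(y)
    nonzero = [(i, ci) for i, ci in enumerate(c) if ci != 0]
    z = []
    for k in range(n):
        acc = y[k]
        for i, ci in nonzero:
            j = k - i
            # j < 0 means the term wrapped past x^n = -1 and changes sign
            acc += ci * s[j] if j >= 0 else -ci * s[j + n]
        if abs(acc) >= limit:
            return None, k + 1  # abort: remaining coefficients are never computed
        z.append(acc)
    return z, n


y, c, s = [5, -3, 2, 0], [1, 0, 0, -1], [1, 2, 3, 4]
print(response_with_early_abort(y, c, s, 10))  # ([8, 2, 9, 3], 4)
print(response_with_early_abort(y, c, s, 9))   # (None, 3)
```

The second call stops after three of the four coefficients, which is the saving early evaluation targets when a signing attempt is going to be rejected anyway.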

3. Algorithmic Structure and Practical Details

The overall PSPM-TEE process involves:

  • Precomputing the packed arrays for the secret vectors $\mathbf{s}_1$ and $\mathbf{s}_2$.
  • For each output slot, accumulating the packed word for that slot as explained above.
  • Decoding the relevant coefficients by recursive division/modulo by $M$.
  • Applying $\mathsf{LowBits}$ where the check requires it and the $\ell_\infty$-norm bounds as dictated.
  • Upon first failure in any coefficient check, the algorithm aborts and restarts, so the average cost drops in proportion to how often a signing attempt is rejected.
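Why the order of the checks matters can be seen from a small expected-cost calculation. The sketch below assumes independent checks and uses made-up costs and failure probabilities; none of the numbers come from the paper.

```python
def expected_cost(checks):
    """checks: list of (cost, failure_probability), run in order until one fails."""
    total, p_reach = 0.0, 1.0
    for cost, p_fail in checks:
        total += p_reach * cost   # this check only runs if all earlier ones passed
        p_reach *= 1.0 - p_fail
    return total


check_a = (100.0, 0.5)    # hypothetical: cost 100, fails half the time
check_b = (100.0, 0.25)   # hypothetical: cost 100, fails a quarter of the time

print(expected_cost([check_a, check_b]))  # 150.0
print(expected_cost([check_b, check_a]))  # 175.0
```

With equal costs, running the check that fails more often first gives the lower expected cost, which matches the descending-failure-probability ordering described for TEE.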

The method requires modest extra memory for M-encoded lookup tables (e.g., about 8 KiB for Dilithium III/V and 4 KiB for Dilithium II), and is particularly effective for parallel SIMD implementations due to its reliance on additions and shifts rather than modular multiplications.

4. Comparative Analysis: PSPM-TEE versus NTT

The NTT-based approach for challenge polynomial multiplication is characterized by a forward transform, pointwise multiplication, and inverse transform for each operand, totaling $O(n \log n)$ butterfly operations (e.g., $\tfrac{n}{2}\log_2 n = 1024$ per transform for $n = 256$), each requiring modular multiplications and additions.

In contrast, PSPM-TEE for typical Dilithium parameter sets requires only $O(n\tau)$ 32-bit adds and shifts, operations that are more efficiently vectorized and less computationally intensive on modern hardware. For Dilithium III, the relevant parameters are $n = 256$ and $\tau = 49$ (so $n\tau = 12{,}544$ scalar operations dominated by adds and shifts).
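The operation counts can be tabulated with a few lines of Python. The values of $\tau$ are the standard Dilithium parameters (39, 49, and 60 for levels 2, 3, and 5); the helper names are hypothetical. The two counts are not directly comparable: a butterfly includes a modular multiplication, whereas one packed add updates several products at once.

```python
N = 256
TAU = {"Dilithium2": 39, "Dilithium3": 49, "Dilithium5": 60}


def pspm_word_adds(n, tau):
    """Packed-word additions for one sparse multiplication pass: n * tau."""
    return n * tau


def ntt_butterflies(n):
    """Butterflies in one full radix-2 transform of length n: (n/2) * log2(n)."""
    return (n // 2) * (n.bit_length() - 1)


for name, tau in TAU.items():
    print(name, pspm_word_adds(N, tau))
print("butterflies per length-256 transform:", ntt_butterflies(N))
```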

Empirical benchmarks at 2.5 GHz on Intel i7-11700F demonstrate the following (full details in Table VII of (Zheng et al., 2023)):

Operation                  PSPM-TEE AVX2 (cycles)   PSPM-TEE AVX-512 (cycles)   NTT AVX2 (cycles)   NTT AVX-512 (cycles)   Cycle reduction (AVX2 / AVX-512)
$c\,\mathbf{s}_1$ (Dil3)   6,748                    2,556                       14,636              6,740                  54% / 62%
$c\,\mathbf{s}_2$ (Dil3)   8,358                    2,560                       16,010              7,480                  48% / 66%

This corresponds to a 48%–66% reduction in measured cycle count for PSPM-TEE relative to NTT on challenge multiplications.
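The percentage column can be recomputed from the cycle counts in the table. The sketch below does that and also prints the equivalent speedup ratio; the generic row labels are placeholders.

```python
# Cycle counts copied from the table: (AVX2, AVX-512) for each method.
ROWS = {
    "row 1": {"pspm": (6748, 2556), "ntt": (14636, 6740)},
    "row 2": {"pspm": (8358, 2560), "ntt": (16010, 7480)},
}


def reduction_percent(pspm, ntt):
    """Share of NTT cycles saved, rounded to a whole percent."""
    return round(100 * (1 - pspm / ntt))


def speedup(pspm, ntt):
    """NTT cycles divided by PSPM cycles."""
    return round(ntt / pspm, 2)


summary = {
    name: [(reduction_percent(p, t), speedup(p, t)) for p, t in zip(r["pspm"], r["ntt"])]
    for name, r in ROWS.items()
}
for name, cols in summary.items():
    print(name, cols)
```

The rounded reductions come out as 54% and 62% for the first row and 48% and 66% for the second, matching the table; expressed as ratios, the same figures range from about 1.9 to 2.9.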

5. AVX2 / AVX-512 Vectorization Strategies

SIMD vectorization is central to the performance of PSPM-TEE. The structure allows packing one M-encoded accumulator per 32-bit lane, giving:

  • 8-way parallelism on AVX2,
  • 16-way parallelism on AVX-512.

Key instructions and considerations:

  • vpsrlq for logical right shifts, which implement division by a power of two.
  • vpmadd52luq (AVX-512IFMA) for fused integer multiply-add: it multiplies 52-bit operands and adds the low 52 bits of the product to a 64-bit accumulator.
  • Masking and shuffle operations (e.g., vpermq, vpblendmd, vshufi32x4, vpunpcklqdq, vpunpckhqdq) are used to implement butterfly stages.
  • Tailored reduction modulo $q$ is realized in two SIMD instructions: logical shift and multiply-add, minimizing memory traffic and instruction count.
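The per-lane behavior of the two arithmetic instructions named above can be modeled in plain Python. This is only a semantic model for reasoning about lane contents, with lists standing in for the 64-bit lanes of a vector register; the function names are hypothetical and no real SIMD is involved.

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, a, b):
    """Model of vpmadd52luq per 64-bit lane:
    acc + low 52 bits of (a[51:0] * b[51:0]), wrapping at 2^64."""
    return [
        (x + (((u & MASK52) * (v & MASK52)) & MASK52)) & MASK64
        for x, u, v in zip(acc, a, b)
    ]


def srl(lanes, count):
    """Model of vpsrlq per 64-bit lane: logical right shift,
    i.e. floor division by 2**count."""
    return [(x & MASK64) >> count for x in lanes]


print(madd52lo([1] * 8, [3] * 8, [5] * 8))  # eight lanes, each 1 + 3*5 = 16
print(srl([1024, 7], 3))                    # [128, 0]
```

Note that only the low 52 bits of the product are added, so a product of exactly $2^{52}$ contributes nothing to the accumulator.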

With careful data layout and instruction-level scheduling, AVX-512 implementations show substantial measured improvements over highly optimized AVX2 baselines in key generation, signing, and verification for Dilithium 2/3/5; the per-operation figures are tabulated in (Zheng et al., 2023).

6. Applicability, Limitations, and Observed Performance

PSPM-TEE is applicable to sparse challenge polynomial multiplications as found in CRYSTALS-Dilithium's signing but does not supersede NTT for generic dense×dense polynomial products. Its efficiency gains are tightly coupled to the challenge polynomial's sparsity and the size of $\tau$, which is $39$–$60$ in deployed parameters.

Limitations include:

  • Gains are proportional to early exit probability, which depends on the probability of violating norm bounds on the response vector $\mathbf{z}$ or the low-bits vector $\mathbf{r}_0$; absolute savings may be lower when these probabilities are small.
  • Slightly increased memory usage for tables.
  • Maximum benefit is realized on platforms with wide SIMD (AVX-512); on lesser hardware improvements are still present but somewhat reduced.

Across full CRYSTALS-Dilithium signing, end-to-end cycle counts for Dilithium III are reported for three implementations: a C baseline, an AVX2 baseline, and AVX-512. The AVX-512 implementation is faster than both the AVX2 and the C baselines; the exact cycle counts and ratios are given in (Zheng et al., 2023).

7. Significance in Post-Quantum Cryptography

By providing a highly optimized, vectorized approach for a critical operation in one of the NIST-standardized post-quantum signature schemes, PSPM-TEE has significant implications for implementation efficiency, particularly on x86-64 platforms supporting AVX-512. Its methodology—combining algebraic structure, sparsity, parallelization, and early abort—offers a template for further improvements in lattice-based cryptographic protocols, though its direct scope is constrained to operations involving sparse challenge polynomials (Zheng et al., 2023).

Key empirical, algorithmic, and architectural details, along with complete performance tables, are provided in the foundational reference (Zheng et al., 2023).
