Number-Theoretic Transform (NTT)

Updated 23 February 2026
  • NTT is a discrete finite-field transform that efficiently computes polynomial multiplications and convolutions central to post-quantum cryptography and fully homomorphic encryption.
  • Hardware implementations use radix-2 Cooley–Tukey butterfly iterations combined with Montgomery reduction to perform modular arithmetic with minimal delay and area overhead.
  • Integrated fault-detection methods, REMO for the butterfly logic and Memory Rule Checkers for address generation, enable error identification in FPGA architectures, with reported coverage of 87% to 100% for logic faults and 51% to 100% for memory faults.

The Number-Theoretic Transform (NTT) is a discrete, finite-field analogue of the classical Discrete Fourier Transform (DFT), enabling efficient computation of the polynomial multiplications and convolutions central to modern post-quantum cryptography (PQC) and fully homomorphic encryption (FHE). In cryptographic hardware and embedded systems, NTT architectures need both high computational throughput and resilience against hardware faults, whether natural or adversarial. Recent advances center on lightweight, logic-embedded fault-detection strategies suitable for Field Programmable Gate Array (FPGA) realization without incurring prohibitive area, delay, or energy overheads (Paul et al., 5 Aug 2025).

1. Mathematical Definition and Transform Structure

Let $q$ be a prime such that $q \equiv 1 \pmod n$ and $\omega$ a primitive $n$-th root of unity in $\mathbb{F}_q$. For a vector $a = (a_0, \ldots, a_{n-1}) \in \mathbb{F}_q^n$, the NTT and its inverse are defined as:

$$\mathrm{NTT}(a)_k = \sum_{j=0}^{n-1} a_j \, \omega^{j\,k} \bmod q, \quad k=0,\ldots,n-1$$

$$\mathrm{NTT}^{-1}(A)_j = n^{-1} \sum_{k=0}^{n-1} A_k\,\omega^{-j\,k} \bmod q, \quad j=0,\ldots,n-1$$

where $n^{-1}$ is the modular inverse of $n$ modulo $q$.
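As a minimal sketch of the two definitions above, the following Python evaluates the sums directly in O(n^2) time. The toy parameters (q = 17, n = 4, omega = 4) and the function names are assumptions chosen for illustration; they do not come from the cited work.

```python
def ntt_naive(a, omega, q):
    """Direct evaluation of NTT(a)_k = sum_j a_j * omega^(j*k) mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


def intt_naive(A, omega, q):
    """Inverse transform: n^{-1} * sum_k A_k * omega^(-j*k) mod q."""
    n = len(A)
    n_inv = pow(n, -1, q)          # modular inverse of n (Python 3.8+)
    omega_inv = pow(omega, -1, q)  # omega^{-1} mod q
    return [n_inv * sum(A[k] * pow(omega_inv, j * k, q) for k in range(n)) % q
            for j in range(n)]


# Toy parameters (assumed): 17 = 1 (mod 4), and 4 has order 4 modulo 17
# because 4^2 = 16 = -1 (mod 17) and 4^4 = 1 (mod 17).
q, n, omega = 17, 4, 4
a = [1, 2, 3, 4]
A = ntt_naive(a, omega, q)
assert intt_naive(A, omega, q) == a  # the inverse recovers the input
```

The round-trip assertion only checks that the two sums are mutual inverses for these parameters; practical implementations use the fast butterfly structure rather than this quadratic form.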

The standard hardware implementation follows a radix-2 Cooley–Tukey butterfly iteration: at each stage, pairs of elements $(U, V)$ are combined using a twiddle factor $\omega$ as

$$T = (V \cdot \omega) \bmod q; \quad X' = (U + T) \bmod q; \quad Y' = (U - T) \bmod q$$

A fully pipelined architecture divides this computation into stages: buffering, modular multiplication, and modular addition/subtraction.
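The butterfly iteration can be sketched in software as an iterative radix-2 decimation-in-time loop. This is a plain Python model under stated assumptions, not the pipelined hardware design: the bit-reversal ordering, the root-search helper, and all function names are illustrative choices.

```python
def find_root_of_unity(n, q):
    """Search for a primitive n-th root of unity mod prime q.

    Assumes n is a power of two dividing q - 1; then w is primitive
    exactly when w^(n/2) = -1 (mod q).
    """
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)
        if pow(w, n // 2, q) == q - 1:
            return w
    raise ValueError("no primitive n-th root of unity found")


def butterfly(U, V, w, q):
    """One Cooley-Tukey butterfly: T = V*w, outputs (U + T, U - T) mod q."""
    T = (V * w) % q
    return (U + T) % q, (U - T) % q


def bit_reverse_copy(a):
    """Return a copy of a with indices bit-reversed (len(a) a power of two)."""
    n = len(a)
    bits = n.bit_length() - 1
    return [a[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(n)]


def ntt_ct(a, omega, q):
    """Iterative radix-2 NTT: bit-reversed input order, natural output order."""
    n = len(a)
    x = bit_reverse_copy(a)
    length = 2
    while length <= n:
        w_len = pow(omega, n // length, q)  # primitive length-th root
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                U, V = x[start + k], x[start + k + half]
                x[start + k], x[start + k + half] = butterfly(U, V, w, q)
                w = (w * w_len) % q
        length *= 2
    return x


# Example with n = 256, q = 3329 (q - 1 = 13 * 256, so a root exists).
q, n = 3329, 256
omega = find_root_of_unity(n, q)
spectrum = ntt_ct(list(range(n)), omega, q)
```

Each pass of the outer loop corresponds to one butterfly stage, so a length-n transform uses log2(n) stages of n/2 butterflies each.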

2. Modular Reduction and Butterfly Realization

Hardware-efficient NTTs replace direct modular reductions with Montgomery reduction. Given $R = 2^l > q$, the reduction

$$\mathrm{MontRed}(t) = t\,R^{-1} \bmod q$$

uses precomputed $q' \equiv -q^{-1} \pmod R$ and chunk-wise, word-oriented operations, facilitating deployment onto FPGA DSP and logic slices. The Cooley–Tukey Butterfly Unit (CT-BU) thus incorporates modular multiplication and addition/subtraction entirely via digital logic, optimizing resource utilization.
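A single-word Montgomery reduction can be sketched as follows. The choice l = 16 (so R = 2^16) with q = 3329 is an assumption for illustration, as are the helper names; the word-oriented hardware variant splits the same computation into smaller chunks.

```python
Q = 3329
L_BITS = 16                      # assumed word length l, so R = 2^16 > Q
R = 1 << L_BITS
Q_PRIME = (-pow(Q, -1, R)) % R   # q' = -q^{-1} mod R (Python 3.8+)


def mont_red(t):
    """Return t * R^{-1} mod Q for 0 <= t < Q * R, without dividing by Q."""
    m = ((t & (R - 1)) * Q_PRIME) & (R - 1)  # m = (t mod R) * q' mod R
    u = (t + m * Q) >> L_BITS                # t + m*Q is divisible by R
    return u - Q if u >= Q else u            # one conditional subtraction


def to_mont(x):
    """Map x to the Montgomery domain: x * R mod Q."""
    return (x * R) % Q


def mont_mul(a_m, b_m):
    """Multiply two Montgomery-domain values; the result stays in the domain."""
    return mont_red(a_m * b_m)


# 5 * 7 computed in the Montgomery domain, then mapped back.
product = mont_red(mont_mul(to_mont(5), to_mont(7)))
assert product == 35
```

The only operations on the hot path are multiplications, masks, shifts, and one conditional subtraction, which is why this form maps well onto DSP blocks and simple logic.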

3. Fault-Detection via REMO (Recomputation With Modular Offset)

REMO introduces a structural, ultra-light fault-detection primitive directly embedded within the word-wise Montgomery reduction block. It operates as follows:

  • For each $w$-bit window $aw_i$ of an $l$-bit operand $\alpha$, define a "fault-encoded" version $aw_i^f = aw_i + Kq$.
  • Compute both the normal reduction and the offset reduction in parallel at each stage $i$:

$$\mu_i = ((\gamma_{0..w-1} + aw_i \cdot \beta_0) \, q') \bmod 2^w$$

$$\gamma_i = (\gamma_i + aw_i \beta + \mu_i q) / 2^w$$

$$\mu_i^f = ((\gamma_{0..w-1}^f + aw_i^f \cdot \beta_0) \, q') \bmod 2^w$$

$$\gamma_i^f = (\gamma_i^f + aw_i^f \beta + \mu_i^f q) / 2^w$$

  • A fault is flagged if $\gamma_i \ne \gamma_i^f$ for any $i$.

In the absence of faults, the modular offset $Kq$ cancels out, since it is a multiple of $q$, so the two paths agree and correct operation is preserved. The delay and area overhead relative to the baseline logic is described as negligible. Reported fault coverage ranges from 87.2% to 100% across random and burst fault modes and carries over to different word sizes and operating configurations (Paul et al., 5 Aug 2025).
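The offset cancellation can be illustrated with a small software model of word-wise Montgomery multiplication. This is a sketch under several assumptions, not the published circuit: the two paths are compared modulo q at each stage (the offset path carries extra multiples of q), the fault model is a single bit flip in the accumulator of the normal path, and the parameters w = 4, m = 3, K = 1 and all names are hypothetical.

```python
def wordwise_montmul_trace(alpha, beta, q, w, m, offset=0, fault=None):
    """Word-wise Montgomery multiplication of alpha * beta with R = 2^(w*m).

    Returns the accumulator after each stage, reduced mod q.
    offset is added to every w-bit window of alpha (K*q in the offset path).
    fault = (stage, bit) flips one accumulator bit (illustrative fault model).
    """
    mask = (1 << w) - 1
    q_prime = (-pow(q, -1, 1 << w)) & mask   # q' = -q^{-1} mod 2^w
    beta0 = beta & mask                      # lowest w-bit word of beta
    gamma = 0
    trace = []
    for i in range(m):
        aw = ((alpha >> (w * i)) & mask) + offset
        mu = (((gamma & mask) + aw * beta0) * q_prime) & mask
        gamma = (gamma + aw * beta + mu * q) >> w   # exact division by 2^w
        if fault is not None and fault[0] == i:
            gamma ^= 1 << fault[1]                  # inject a bit flip
        trace.append(gamma % q)
    return trace


def remo_check(alpha, beta, q, w, m, K=1, fault=None):
    """Return True if the normal and offset paths disagree at any stage."""
    normal = wordwise_montmul_trace(alpha, beta, q, w, m, 0, fault)
    encoded = wordwise_montmul_trace(alpha, beta, q, w, m, K * q)
    return any(g != gf for g, gf in zip(normal, encoded))


q, w, m = 3329, 4, 3          # assumed toy configuration: R = 2^12 > q
alpha, beta = 1234, 2345
fault_free_flag = remo_check(alpha, beta, q, w, m)             # False
faulty_flag = remo_check(alpha, beta, q, w, m, fault=(1, 0))   # True
```

In this toy model every single bit flip in the accumulator is caught, because it changes the residue modulo q; the model is not meant to reproduce the coverage figures reported for the hardware.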

4. Memory Fault Detection: Memory Rule Checkers

NTT datapaths involve multiple memory units: RAMs for polynomial data and ROMs for twiddle factors. Two independent rule checkers, together referred to as Memory RC, monitor for address-generation faults:

  • RAM Checker (i–k rule): For each butterfly in stage $i$, ensures that index $k$ satisfies $0 \leq k < s_i$ (with $s_i = n/2^t$).
  • ROM Checker (i–j rule): Within stage $i$, verifies that the twiddle index satisfies $0 \leq j < 2^i$.

Out-of-bounds or repeated access is immediately flagged as a soft fault. Empirical results demonstrate detection rates from 50.7% to 100%, with higher detection for burst errors and simultaneous RAM + ROM faults.
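The two rules can be modelled in software as simple range checks on the generated indices. The sketch below is hypothetical: it takes the RAM bound s_i as a parameter, the example assumes s_i = n / 2^(i+1) for a 0-indexed stage i (one plausible reading, not confirmed by the source), and treating a repeated (j, k) pair as the "repeated access" condition is also an assumption.

```python
def ram_rule_ok(k, s_i):
    """i-k rule: butterfly index k must satisfy 0 <= k < s_i."""
    return 0 <= k < s_i


def rom_rule_ok(i, j):
    """i-j rule: twiddle index j must satisfy 0 <= j < 2^i."""
    return 0 <= j < (1 << i)


def check_stage(i, s_i, accesses):
    """Check a stream of (j, k) index pairs generated during stage i.

    Returns a list of (kind, j, k) tuples, one per detected violation.
    """
    seen = set()
    faults = []
    for j, k in accesses:
        if not rom_rule_ok(i, j):
            faults.append(("rom", j, k))
        elif not ram_rule_ok(k, s_i):
            faults.append(("ram", j, k))
        elif (j, k) in seen:
            faults.append(("repeat", j, k))
        seen.add((j, k))
    return faults


# Toy example: n = 8, stage i = 1, assumed bound s_i = n / 2^(i+1) = 2.
n, i = 8, 1
s_i = n // (1 << (i + 1))
good = [(j, k) for j in range(1 << i) for k in range(s_i)]
bad = [(0, 0), (2, 0), (1, 5), (0, 0)]   # out-of-range j, out-of-range k, repeat

clean_report = check_stage(i, s_i, good)   # no violations
fault_report = check_stage(i, s_i, bad)    # three violations
```

Because both rules are comparisons against bounds that depend only on the stage counter, a hardware version would need little more than a comparator per memory port.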

5. Empirical FPGA Evaluation and Resource-Performance Trade-offs

All methods were validated on Xilinx Artix-7 targets using Kyber-768 parameters ($n=256$, $q=3329$). Results:

| Variant | Slices | DSPs | Power (mW) | SEC Overhead | Coverage |
|---|---|---|---|---|---|
| Baseline CT-BU | 73 | 1 | 104 | - | - |
| + REMO | 81 | 2 | 106 | +17% area | 87–100% |
| + Memory RC | 89 | 2 | 107 | +8.5% area | 51–100% |

The total throughput is maintained at the baseline; area and power overheads remain under 8.5% and 2% respectively. Compared to prior approaches (e.g. RENO recomputation [Sarker et al.]: 15–24% area, 8–22% delay for ~99.5% logic coverage), this integrated defense achieves comparable or higher coverage at a fraction of the resource cost.

6. Context in PQC, Comparative Approaches, and Broader Impact

NTT-based polynomial multiplication is the computational linchpin of lattice-based PQC schemes (Kyber, NTRU, Ring-LWE, etc.), so its reliability directly affects the security and throughput of these protocols. The REMO + Memory RC architecture targets lightweight, application-integrated hardware fault tolerance, combining in-butterfly recomputation with modular offset and rule-based address-space checking to cover the full datapath (Paul et al., 5 Aug 2025).

Relative to Hamming-code memories (Khan et al.) and to more general algorithm-level error detection (Ahmadi et al., 2024), this work is distinguished by sub-10% area and zero-latency cost at 87–100% logic and 51–100% memory coverage. These methods generalize natively across word sizes, bit-widths, and NTT parameter regimes, making them readily suitable for deployment in high-speed PQC network processors and side-channel-constrained cryptographic FPGAs.

7. Conclusions and Architectural Guidelines

Integrating REMO and memory rule-checking logic in the core of the NTT pipeline enables robust, low-overhead hardware fault detection. This preserves critical performance metrics (area, energy, speed) while delivering near-complete detection of both transient and injected hardware faults. Such approaches are necessary for future network security processors and cryptographic accelerators that must operate reliably under both environmental noise and targeted adversarial conditions in post-quantum settings (Paul et al., 5 Aug 2025).
