NTT: Number Theoretic Transform

Updated 14 September 2025

NTT is a discrete transform defined over finite fields or rings, enabling efficient, rounding-free arithmetic for polynomial multiplication.
It uses fast divide-and-conquer algorithms, such as Cooley–Tukey and Gentleman–Sande, to reduce computational complexity from quadratic to quasilinear time.
NTT optimizations are critical for lattice-based cryptography, homomorphic encryption, and secure digital signal processing, powering high-throughput applications.

The Number Theoretic Transform (NTT) is a discrete transform defined over finite fields or rings, serving as a finite-field analogue of the Discrete Fourier Transform (DFT). In contrast to the DFT, which employs complex exponentials over real or complex numbers, the NTT operates strictly within the algebraic constraints of modular arithmetic and is a cornerstone of efficient polynomial multiplication in lattice-based cryptography, homomorphic encryption, and secure digital signal processing applications. By using roots of unity in carefully chosen rings, the NTT enables quasilinear time algorithms for operations that, in their naive form, require quadratic time, while guaranteeing exact, rounding-free arithmetic.

1. Mathematical Definition and Transform Variants

Let $n$ be a positive integer, $q$ an integer modulus (typically prime), and $\omega$ a primitive $n$-th root of unity in $\mathbb{Z}_q$, meaning $\omega^n \equiv 1 \pmod{q}$ but $\omega^k \not\equiv 1 \pmod{q}$ for $0 < k < n$. For prime $q$, such an $\omega$ exists exactly when $n \mid q - 1$. For $\mathbf{v} = (v_0,\ldots,v_{n-1}) \in \mathbb{Z}_q^n$, the Forward NTT is:

$\hat{v}_j = \sum_{i=0}^{n-1} v_i \cdot \omega^{ij} \pmod{q},\quad j = 0,\ldots,n-1$

The Inverse NTT is:

$v_i = n^{-1} \sum_{j=0}^{n-1} \hat{v}_j \cdot \omega^{-ij} \pmod{q}$
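The two definitions can be checked directly with a quadratic-time reference implementation. The sketch below is illustrative only: the helper names (`find_primitive_root_of_unity`, `ntt_naive`, `intt_naive`) and the toy parameters $q = 17$, $n = 8$ are assumptions chosen for readability, and a prime modulus is assumed so that the required inverses exist. It needs Python 3.8 or later for `pow(x, -1, q)`.

```python
def find_primitive_root_of_unity(n, q):
    """Brute-force search for a primitive n-th root of unity modulo a prime q."""
    if (q - 1) % n != 0:
        raise ValueError("for prime q, n must divide q - 1")
    for g in range(2, q):
        w = pow(g, (q - 1) // n, q)  # the order of w divides n
        # w is primitive iff no smaller positive power equals 1
        if all(pow(w, k, q) != 1 for k in range(1, n)):
            return w
    raise ValueError("no primitive n-th root of unity found")


def ntt_naive(v, omega, q):
    """Forward NTT by direct evaluation of the defining sum, O(n^2)."""
    n = len(v)
    return [sum(v[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]


def intt_naive(v_hat, omega, q):
    """Inverse NTT: uses omega^{-1} and scales by n^{-1} mod q."""
    n = len(v_hat)
    n_inv = pow(n, -1, q)
    omega_inv = pow(omega, -1, q)
    return [n_inv * sum(v_hat[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]


# Toy parameters (assumed for illustration): 8 divides 17 - 1.
q, n = 17, 8
omega = find_primitive_root_of_unity(n, q)
v = [1, 2, 3, 4, 5, 6, 7, 8]
v_hat = ntt_naive(v, omega, q)
assert intt_naive(v_hat, omega, q) == v  # the round trip recovers the input
```

All arithmetic stays in integers modulo $q$, so the round trip is exact, with none of the rounding error a floating-point DFT would introduce.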

When polynomial multiplication is performed in the ring $\mathbb{Z}_q[x]/(x^n - 1)$ (“cyclic” or positive-wrapped convolution) or $\mathbb{Z}_q[x]/(x^n + 1)$ (“negacyclic” or negative-wrapped convolution), the NTT can be adapted accordingly. Negative-wrapped NTTs typically use a primitive $2n$-th root of unity $\psi$ with $\psi^{2} = \omega$ and $\psi^n \equiv -1 \pmod{q}$ (Sengupta et al., 7 Sep 2025, Liang et al., 2022).
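As a rough illustration of the negacyclic case, the sketch below weights coefficients by powers of $\psi$, applies a cyclic NTT, multiplies pointwise, and undoes the weighting, then compares the result with a schoolbook product reduced modulo $x^n + 1$. The function names and the toy parameters ($q = 17$, $n = 4$, $\psi = 2$, so that $\psi^4 \equiv -1$) are assumptions made for this example, and the transforms are the quadratic-time definitions rather than a fast algorithm.

```python
def ntt_naive(v, omega, q):
    n = len(v)
    return [sum(v[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]


def intt_naive(v_hat, omega, q):
    n = len(v_hat)
    n_inv, omega_inv = pow(n, -1, q), pow(omega, -1, q)
    return [n_inv * sum(v_hat[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]


def negacyclic_schoolbook(a, b, q):
    """Reference product in Z_q[x]/(x^n + 1): x^n wraps around as -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c


def negacyclic_via_ntt(a, b, psi, q):
    """Same product via psi-weighting around a length-n cyclic NTT."""
    n = len(a)
    omega = psi * psi % q
    psi_inv = pow(psi, -1, q)
    a_w = [a[i] * pow(psi, i, q) % q for i in range(n)]   # pre-multiply by psi^i
    b_w = [b[i] * pow(psi, i, q) % q for i in range(n)]
    c_hat = [x * y % q for x, y in zip(ntt_naive(a_w, omega, q),
                                       ntt_naive(b_w, omega, q))]
    c_w = intt_naive(c_hat, omega, q)
    return [c_w[i] * pow(psi_inv, i, q) % q for i in range(n)]  # post-multiply by psi^-i


q, psi = 17, 2            # toy values: 2^4 = 16 = -1 (mod 17)
a, b = [1, 2, 3, 4], [5, 6, 7, 8]
assert pow(psi, len(a), q) == q - 1
assert negacyclic_via_ntt(a, b, psi, q) == negacyclic_schoolbook(a, b, q)
```

The weighting works because substituting $x \mapsto \psi x$ turns $x^n + 1$ into $-(x^n - 1)$, so the negacyclic product becomes an ordinary cyclic one on the weighted coefficients.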

2. Computational Techniques and Fast Algorithms

The NTT supports fast divide-and-conquer algorithms closely related to the Fast Fourier Transform (FFT). Two principal radix-2 algorithms are widely employed (Liang et al., 2022):

Cooley–Tukey (CT) Decomposition: Uses recursive "butterfly" operations on even and odd-indexed coefficients:

$\begin{aligned} \hat{a}_j &= \hat{a}'_j + \omega^j \hat{a}''_j \\ \hat{a}_{j + n/2} &= \hat{a}'_j - \omega^j \hat{a}''_j \end{aligned}$

with $\hat{a}'_j$, $\hat{a}''_j$ being NTTs of the even/odd subsequences, and input or output optionally in bit-reversed order.
Gentleman–Sande (GS) Algorithm: A "decimation-in-frequency" approach suitable for the inverse NTT, employing analogous butterfly operations but differentiated in summation order and multiplicative structure.
Negative Wrapped Convolution (NWC): For negacyclic convolution, input coefficients are pre-multiplied by $\psi^{i}$ and the inverse-transform outputs are post-multiplied by $\psi^{-i}$, which computes products in $\mathbb{Z}_q[x]/(x^n + 1)$ with length-$n$ transforms and no zero-padding (Sengupta et al., 7 Sep 2025, Chiu et al., 2023).
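The butterfly structures above can be sketched in a few lines. The following is a minimal illustration, not an optimized or reference implementation: `ntt_ct_recursive` follows the Cooley–Tukey recurrence as written, `ntt_gs_dif` is an iterative Gentleman–Sande (decimation-in-frequency) pass that produces bit-reversed output, and `intt_gs` wraps it into an inverse transform. All function names and the toy parameters ($q = 17$, $\omega = 9$, $n = 8$) are assumptions for this example; $n$ is assumed to be a power of two and $q$ prime.

```python
def ntt_ct_recursive(a, omega, q):
    """Cooley-Tukey: split into even/odd subsequences, combine with butterflies."""
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % q
    even = ntt_ct_recursive(a[0::2], omega_sq, q)
    odd = ntt_ct_recursive(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for j in range(n // 2):
        t = w * odd[j] % q
        out[j] = (even[j] + t) % q            # a_hat[j]
        out[j + n // 2] = (even[j] - t) % q   # a_hat[j + n/2]
        w = w * omega % q
    return out


def ntt_gs_dif(a, omega, q):
    """Gentleman-Sande: natural-order input, bit-reversed-order output."""
    a = list(a)
    n = len(a)
    half, w_stage = n // 2, omega
    while half >= 1:
        for start in range(0, n, 2 * half):
            w = 1
            for j in range(start, start + half):
                u, v = a[j], a[j + half]
                a[j] = (u + v) % q
                a[j + half] = (u - v) * w % q  # twiddle applied after the subtraction
                w = w * w_stage % q
        w_stage = w_stage * w_stage % q
        half //= 2
    return a


def bit_reverse_permute(a):
    """Reorder a length-2^k list by reversing the bits of each index."""
    n = len(a)
    bits = n.bit_length() - 1
    out = [0] * n
    for i in range(n):
        r = int(format(i, "b").zfill(bits)[::-1], 2) if bits else 0
        out[r] = a[i]
    return out


def intt_gs(a_hat, omega, q):
    """Inverse NTT: GS pass with omega^{-1}, undo bit reversal, scale by n^{-1}."""
    n_inv = pow(len(a_hat), -1, q)
    tmp = bit_reverse_permute(ntt_gs_dif(a_hat, pow(omega, -1, q), q))
    return [x * n_inv % q for x in tmp]


q, omega = 17, 9          # toy values: 9 has multiplicative order 8 modulo 17
a = [1, 2, 3, 4, 5, 6, 7, 8]
a_hat = ntt_ct_recursive(a, omega, q)
assert intt_gs(a_hat, omega, q) == a
```

The recursive form is easy to compare against the recurrence but allocates new lists at every level; in-place iterative variants, such as the GS pass here, are closer to what hardware and optimized software typically use.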

State-of-the-art hardware and software implementations exploit in-place computation, tailored memory layouts (bit-reversed addressing), and minimal data movement to maximize throughput (Zhang et al., 2023, Li et al., 2022, Park et al., 2023).

3. Practical Applications in Cryptography and Signal Processing

NTT-based algorithms are a central component of lattice-based cryptosystems, including key-encapsulation mechanisms (e.g., Kyber), signature schemes (e.g., Dilithium, Falcon), homomorphic encryption, and RLWE/NTRU variants (Liang et al., 2022, Pedrouzo-Ulloa et al., 2016, Sengupta et al., 7 Sep 2025). Major applications include:

Polynomial Multiplication: Efficiently implemented in $\mathbb{Z}_q[x]/(f(x))$ for large $n$ using the NTT, reducing complexity from $O(n^2)$ to $O(n \log n)$ for convolutions and associated multiplications (Liang et al., 2022, Chiu et al., 2023).
Encrypted Signal Processing: NTTs allow secure polynomial convolutions, filtering, and matrix operations directly on encrypted data using RLWE-based homomorphic encryption (Pedrouzo-Ulloa et al., 2016). Through pre/post-coding (element-wise scaling) and relinearization, various signal processing primitives are realized without exposing plaintext.
Homomorphic Encryption Bootstrapping: In bootstrappable HE, where large-degree polynomial products over CRT-decomposed moduli are required, the NTT is the dominant performance bottleneck, which makes its acceleration critical (Kim et al., 2020, Cui et al., 16 Feb 2025, Ding et al., 18 May 2024).
Batching and SIMD: Packing and manipulating multiple logical messages into the coefficients of an encrypted polynomial is facilitated by Chinese Remainder Theorem-based batching and NTT-based transforms (Pedrouzo-Ulloa et al., 2016).
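The batching idea in the last item can be illustrated on the plaintext side alone. The sketch below assumes a prime plaintext modulus $t$ with $t \equiv 1 \pmod{2n}$, so that $x^n + 1$ splits into $n$ linear factors and a polynomial is determined by its values at the odd powers of a primitive $2n$-th root $\psi$. The names `encode_slots`, `decode_slots`, and `negacyclic_mul`, and the toy parameters $t = 17$, $n = 4$, $\psi = 2$, are hypothetical; no encryption is involved, and this is not claimed to be the exact construction of the cited work.

```python
def decode_slots(coeffs, psi, t):
    """Slot j is the polynomial evaluated at psi^(2j+1), a root of x^n + 1."""
    n = len(coeffs)
    return [sum(coeffs[i] * pow(psi, i * (2 * j + 1), t) for i in range(n)) % t
            for j in range(n)]


def encode_slots(slots, psi, t):
    """Interpolate: inverse NTT of the slots, then unweight by psi^{-i}."""
    n = len(slots)
    omega_inv = pow(psi * psi % t, -1, t)
    psi_inv, n_inv = pow(psi, -1, t), pow(n, -1, t)
    return [n_inv * pow(psi_inv, i, t)
            * sum(slots[j] * pow(omega_inv, i * j, t) for j in range(n)) % t
            for i in range(n)]


def negacyclic_mul(a, b, t):
    """Schoolbook product in Z_t[x]/(x^n + 1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k, sign = (i + j, 1) if i + j < n else (i + j - n, -1)
            c[k] = (c[k] + sign * a[i] * b[j]) % t
    return c


t, psi = 17, 2                      # toy values: 2^4 = -1 (mod 17), n = 4
s1, s2 = [1, 2, 3, 4], [5, 6, 7, 8]
m1, m2 = encode_slots(s1, psi, t), encode_slots(s2, psi, t)

# One polynomial multiplication acts on all slots at once (SIMD-style).
prod = decode_slots(negacyclic_mul(m1, m2, t), psi, t)
assert prod == [x * y % t for x, y in zip(s1, s2)]

# Coefficient-wise addition likewise adds slot-wise.
total = decode_slots([(x + y) % t for x, y in zip(m1, m2)], psi, t)
assert total == [(x + y) % t for x, y in zip(s1, s2)]
```

In an RLWE-based scheme the same encoding would be applied before encryption, so that homomorphic ring operations act on every slot in parallel.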

4. Advanced Implementation and Hardware Acceleration

A substantial body of recent work addresses the efficient realization of NTTs on modern hardware, focusing on both general-purpose and cryptographic-specific platforms:

SRAM and In-Memory Acceleration: MeNTT (Li et al., 2022) and BP-NTT (Zhang et al., 2023) place computation directly in SRAM or cache arrays, minimizing data movement and power; bit-serial and bit-parallel modular multipliers are tightly integrated to match butterfly throughput with in-memory storage bandwidth.
Digital and FPGA Pipelines: Designs including pipelined FIFO architectures (Heidarpur et al., 21 Jan 2025), unified FFT/NTT engines (Shrivastava et al., 15 Apr 2025), and digit-serial modular pipelines (Alexakis et al., 16 Jul 2025) demonstrate ways to harness parallelism and pipelining for low-latency high-throughput polynomial arithmetic, often using carefully constructed redundant representations to suppress intermediate reductions.
GPU Acceleration: Techniques include batching independent NTTs, employing high-radix butterfly kernels, register/block-sharing, and memory access optimization to overcome bandwidth bottlenecks. The introduction of on-the-fly twiddling schemes (computing roots of unity values within the butterfly) reduces memory demand and improves scalability (Kim et al., 2020, Cui et al., 16 Feb 2025).
Superconductor Electronics (SCE): SCE-NTT (Razmkhah et al., 28 Aug 2025) uses single flux quantum logic and shift-register-based memory in a deeply pipelined topology, achieving extremely high frequency and throughput, with the Shoup modular multiplier as the preferred primitive for its low path-balancing overhead and speed.
Error Detection and Fault Tolerance: NTT hardware is subject to fault attacks. Proposed schemes such as REMO (recomputation with a modular offset) and memory rule checkers (Paul et al., 5 Aug 2025), as well as algorithm-level detection leveraging invariants in the NTT or the structure of negative wrapped convolution (Ahmadi et al., 2 Mar 2024), achieve coverage near 100% without significant resource or latency penalties.
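Several of the designs above rely on modular multiplication by a fixed twiddle factor, for which the Shoup technique is a common choice. The sketch below shows the general idea in software: a per-constant value $w' = \lfloor w \cdot 2^k / q \rfloor$ is precomputed so that each product needs only multiplications, a shift, and one conditional subtraction. This illustrates the technique as commonly described, not the circuit of any cited design; the function names, the word size $k = 16$, and the modulus $q = 3329$ are assumptions chosen for the example.

```python
def shoup_precompute(w, q, k):
    """Precompute w' = floor(w * 2^k / q) for a fixed constant 0 <= w < q."""
    if not 0 <= w < q:
        raise ValueError("w must be reduced modulo q")
    return (w << k) // q


def shoup_mul(x, w, w_shoup, q, k):
    """Return x * w mod q for 0 <= x < 2^k, without a division."""
    hi = (x * w_shoup) >> k   # estimate of floor(x * w / q), low by at most 1
    r = x * w - hi * q        # hence 0 <= r < 2q
    if r >= q:                # a single conditional correction suffices
        r -= q
    return r


q, k = 3329, 16               # example modulus and word size (assumed)
w = 17                        # some fixed constant, e.g. a twiddle factor
w_shoup = shoup_precompute(w, q, k)
for x in range(q):
    assert shoup_mul(x, w, w_shoup, q, k) == x * w % q
```

The quotient estimate is low by at most one whenever $x < 2^k$, which is why one conditional subtraction is enough. Fixed-width hardware additionally needs $2q \le 2^k$ so that the intermediate remainder fits in a word.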

5. Flexibility, Parameterization, and Algorithmic Extensions

Recent research advances the use of the NTT beyond traditional constraints:

Relaxed Modulus Constraints: Techniques including truncated/incomplete NTT (cropping recursion levels), splitting the polynomial ring, or using larger "NTT-friendly" moduli extend NTT benefits to settings where the modulus $q$ does not satisfy $q \equiv 1 \pmod{2n}$ (Liang et al., 2022, Chiu et al., 2023).
Algorithmic Variants and Hybrid Architectures:
- NTTSuite (Ding et al., 18 May 2024) classifies and implements seven NTT algorithms (DIT, DIF, Flat-NTT, Pease, Pease_nc, Six-step, Stockham) for CPU, GPU, and FPGA, benchmarked for trade-offs in data movement and parallelism.
- Unified FFT/NTT accelerators (Shrivastava et al., 15 Apr 2025) enable the hardware reuse of deep digital signal processing pipelines for cryptography, requiring only minor adjustments (modular reduction, twiddle ROM).
Fault-Tolerance and Security Enhancements: Provisions for side-channel resistance, error checking through recomputation (using modular offsets or result invariants), and structured memory access rule checking are proposed for robust implementations in adversarial environments (Ahmadi et al., 2 Mar 2024, Paul et al., 5 Aug 2025).
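To give a flavor of invariant-based checking, the sketch below tests two algebraic identities that any correct cyclic NTT output satisfies: $\hat{v}_0 = \sum_i v_i$ and $\sum_j \hat{v}_j = n \cdot v_0 \pmod{q}$. This is a generic illustration under a prime modulus, not the REMO scheme or any other cited method; the function names and toy parameters are assumptions, and passing both checks does not prove the output is fault-free.

```python
def ntt_naive(v, omega, q):
    n = len(v)
    return [sum(v[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]


def ntt_invariants_hold(v, v_hat, q):
    """Cheap consistency checks between an input and its claimed NTT."""
    n = len(v)
    dc_ok = v_hat[0] % q == sum(v) % q            # omega^0 row: plain sum of inputs
    sum_ok = sum(v_hat) % q == (n * v[0]) % q     # columns of powers sum to 0 except i = 0
    return dc_ok and sum_ok


q, omega = 17, 9                       # toy values: 9 has order 8 modulo 17
v = [3, 1, 4, 1, 5, 9, 2, 6]
v_hat = ntt_naive(v, omega, q)
assert ntt_invariants_hold(v, v_hat, q)

# Simulate a single-coefficient fault in the output.
faulty = list(v_hat)
faulty[3] = (faulty[3] + 1) % q
assert not ntt_invariants_hold(v, faulty, q)
```

A fault confined to one output coefficient always changes the output sum and is therefore caught, whereas multi-coefficient faults whose errors cancel modulo $q$ would slip through. Dedicated schemes combine stronger checks or recomputation for this reason.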

6. Complexity and Performance Analysis

The asymptotic operation count for a radix-2 $n$-point NTT is $O(n \log n)$. For practical hardware, the actual number of modular multiplications can be reduced further, as the following per-transform multiplication counts show:

Method	NTT multiplications	INTT multiplications
Zero-padded	$n\log_2(2n)$	$n\log_2(2n)+2n$
NWC	$(n/2)\log_2 n + n$	$(n/2)\log_2 n + 2n$
LC-NWC	$(n/2)\log_2 n$	$(n/2)\log_2 n$

LC-NWC (low-complexity negative wrapped convolution; Chiu et al., 2023) achieves significant savings, with multiplier reductions on the order of $45\%$ to $60\%$ relative to naive zero-padded methods.
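The formulas in the table are easy to evaluate for a concrete size. The short script below is only a calculator for those three rows; the function names are hypothetical, and $n$ is assumed to be a power of two.

```python
def mult_counts(n):
    """Return {method: (NTT multiplications, INTT multiplications)} from the table."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    lg = n.bit_length() - 1                      # log2(n) for a power of two
    return {
        "zero_padded": (n * (lg + 1), n * (lg + 1) + 2 * n),   # n*log2(2n), plus 2n
        "nwc": (n // 2 * lg + n, n // 2 * lg + 2 * n),
        "lc_nwc": (n // 2 * lg, n // 2 * lg),
    }


def reduction_percent(baseline, improved):
    """Relative saving of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


counts = mult_counts(1024)
for method, (fwd, inv) in counts.items():
    print(f"{method:12s} NTT={fwd:6d}  INTT={inv:6d}")

saving = reduction_percent(counts["zero_padded"][0], counts["lc_nwc"][0])
print(f"forward NTT saving, LC-NWC vs zero-padded: {saving:.1f}%")
```

For $n = 1024$ the forward-transform count drops from 11264 (zero-padded) to 5120 (LC-NWC), a saving of about 54.5%, which falls within the range quoted above.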

Performance in hardware is further characterized by metrics such as area-delay or area-time product (ADP, ATP), throughput-per-area, power consumption, and scalability:

Accelerators such as HF-NTT (Meng et al., 7 Oct 2024), BP-NTT (Zhang et al., 2023), and pipelined digit-serial designs (Alexakis et al., 16 Jul 2025) report lower latency (as low as 2.7 μs for $n = 4096$ at high parallelism), higher clock frequency (>600 MHz), significant area and power savings, or a combination of these, often outperforming baseline and contemporary architectures under equivalent resource constraints.
SCE-NTT implementations (Razmkhah et al., 28 Aug 2025), operating at frequencies >30 GHz, project polynomial multiplication throughput exceeding CMOS by two orders of magnitude.
In software, the use of PyTorch or similar GPU frameworks allows matrix-based NTT computation (GNTT family (Cui et al., 16 Feb 2025)), with 62x speedup over “Fast-NTT” CPU code through tensor operations and precomputation.

7. Significance and Impact in Post-Quantum Cryptography

By enabling efficient, exact, and scalable arithmetic on high-degree polynomials, the NTT is essential to the viability of lattice-based cryptography and privacy-preserving computation. Because polynomial multiplication dominates the computational cost of RLWE, NTRU, and homomorphic operations, NTT-optimized hardware is the linchpin of practical deployment.

The ability to accelerate core cryptographic routines, including key encapsulation, signature generation, ciphertext multiplication, matrix operations, and batched SIMD-style computations, has influenced standardization (e.g., Kyber and Dilithium, standardized as ML-KEM and ML-DSA) and security frameworks for FHE and privacy-preserving machine learning (Liang et al., 2022, Ding et al., 18 May 2024, Sengupta et al., 7 Sep 2025). NTT-based computation also underlies the composability and signal processing capabilities of encrypted-domain protocols (Pedrouzo-Ulloa et al., 2016).

Conclusion

The Number Theoretic Transform is a mathematically rigorous primitive that bridges discrete signal processing and cryptographic computation. Its design, implementation, and optimization across hardware, algorithmic, and arithmetic levels are central to the efficiency, robustness, and security of post-quantum and fully homomorphic systems. The developments summarized above show the breadth of NTT research and its continued evolution as demands for secure, scalable, high-throughput computation intensify.