Number-Theoretic Transform (NTT)
- NTT is a discrete finite-field transform that efficiently computes polynomial multiplications and convolutions central to post-quantum cryptography and fully homomorphic encryption.
- Hardware implementations use radix-2 Cooley–Tukey butterfly iterations combined with Montgomery reduction to perform modular arithmetic with minimal delay and area overhead.
- Integrated fault-detection methods, such as REMO and Memory Rule Checkers, enable robust error identification in FPGA architectures, achieving coverage from 87% to 100%.
The Number-Theoretic Transform (NTT) is a discrete, finite-field analogue of the classical Discrete Fourier Transform (DFT), enabling efficient computation of polynomial multiplications and convolutions central to modern post-quantum cryptography (PQC) and fully homomorphic encryption (FHE). In cryptographic hardware and embedded systems, robust NTT architectures necessitate both high computational throughput and resilience against hardware faults, natural or adversarial. Recent advances center on lightweight, logic-embedded fault-detection strategies suitable for Field Programmable Gate Array (FPGA) realization without incurring prohibitive area, delay, or energy overheads (Paul et al., 5 Aug 2025).
1. Mathematical Definition and Transform Structure
Let $q$ be a prime with $q \equiv 1 \pmod{n}$ and $\omega$ a primitive $n$-th root of unity in $\mathbb{Z}_q$. For a vector $a = (a_0, \ldots, a_{n-1}) \in \mathbb{Z}_q^n$, the NTT and its inverse are defined as:

$$\hat{a}_j = \sum_{i=0}^{n-1} a_i\,\omega^{ij} \bmod q, \qquad a_i = n^{-1} \sum_{j=0}^{n-1} \hat{a}_j\,\omega^{-ij} \bmod q,$$

where $n^{-1}$ is the modular inverse of $n$ modulo $q$.
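The definition above can be exercised directly with a toy parameter set (the values $q = 17$, $n = 4$, $\omega = 4$ are illustrative, not the paper's; $4$ is a primitive 4th root of unity mod 17):

```python
# Naive O(n^2) NTT and inverse, straight from the definition.
# Toy parameters (assumed for illustration): q = 17, n = 4, omega = 4.
q, n, omega = 17, 4, 4

def ntt(a):
    """Forward NTT: a_hat[j] = sum_i a[i] * omega^(i*j) mod q."""
    return [sum(a[i] * pow(omega, i * j, q) for i in range(n)) % q
            for j in range(n)]

def intt(a_hat):
    """Inverse NTT: a[i] = n^{-1} * sum_j a_hat[j] * omega^(-i*j) mod q."""
    n_inv = pow(n, -1, q)            # modular inverse of n mod q (Python >= 3.8)
    omega_inv = pow(omega, -1, q)
    return [n_inv * sum(a_hat[j] * pow(omega_inv, i * j, q) for j in range(n)) % q
            for i in range(n)]

a = [1, 2, 3, 4]
assert intt(ntt(a)) == a             # the round trip recovers the input
```

Production implementations replace this $O(n^2)$ loop with the $O(n \log n)$ butterfly network described next.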
The standard hardware implementation follows a radix-2 Cooley–Tukey butterfly iteration: at each stage, pairs of elements $(a, b)$ are combined using a twiddle factor $\omega^k$ as

$$a' = a + \omega^k b \bmod q, \qquad b' = a - \omega^k b \bmod q.$$

A fully pipelined architecture divides this computation into three stages: buffering, modular multiplication, and modular addition/subtraction.
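A single butterfly is small enough to sketch in a few lines (a software analogue; the paper's pipelined hardware maps the multiply and add/subtract to DSP and logic slices):

```python
def ct_butterfly(a, b, twiddle, q):
    """One radix-2 Cooley-Tukey butterfly:
    returns (a + w*b mod q, a - w*b mod q) for twiddle factor w."""
    t = (twiddle * b) % q                 # modular multiplication stage
    return (a + t) % q, (a - t) % q       # modular addition/subtraction stage

# Example with the toy values q = 17, twiddle = 4 (illustrative only):
print(ct_butterfly(1, 2, 4, 17))
```

Each NTT stage applies this operation to $n/2$ index pairs; the buffering stage of the pipeline supplies the correctly ordered operand pairs.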
2. Modular Reduction and Butterfly Realization
Hardware-efficient NTTs replace direct modular reductions with Montgomery reduction. Given a product $T < qR$ with $R = 2^k$ and $\gcd(R, q) = 1$, the reduction

$$\mathrm{REDC}(T) = T\,R^{-1} \bmod q$$

uses the precomputed constant $q' = -q^{-1} \bmod R$ and chunk-wise, word-oriented operations, facilitating deployment onto FPGA DSP and logic slices. The Cooley–Tukey Butterfly Unit (CT-BU) thus realizes modular multiplication and addition/subtraction entirely in digital logic, optimizing resource utilization.
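A minimal single-word REDC sketch illustrates the algorithm (hardware versions split the operand into $w$-bit words; the parameter choices below are assumptions for demonstration, using Kyber's $q = 3329$):

```python
def montgomery_redc(T, q, R, q_prime):
    """Return T * R^{-1} mod q, given q_prime = -q^{-1} mod R and T < q*R."""
    m = (T * q_prime) % R        # low word: makes T + m*q divisible by R
    t = (T + m * q) // R         # exact division by R (a right shift in hardware)
    return t - q if t >= q else t

q, k = 3329, 12                  # Kyber modulus; R = 2^12 = 4096 (illustrative R)
R = 1 << k
q_prime = (-pow(q, -1, R)) % R

# Check against the direct computation for a sample product:
T = 1234 * 5678                  # T < q*R holds for these operands
assert montgomery_redc(T, q, R, q_prime) == (T * pow(R, -1, q)) % q
```

The appeal in hardware is that the division by $R$ is a wire shift and the only multiplies are by $q'$ and $q$, both precomputed constants.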
3. Fault-Detection via REMO (Recomputation With Modular Offset)
REMO introduces a structural, ultra-lightweight fault-detection primitive embedded directly within the word-wise Montgomery reduction block. It operates as follows:
- For each $w$-bit window of the operand, define a "fault-encoded" version by applying a known modular offset.
- Compute both the normal reduction and the offset reduction in parallel at each stage $s$.
- A fault is flagged if, at any stage $s$, the two results disagree once the known offset is removed.
This method guarantees that in the absence of faults, the modular offset cancels out, ensuring correct operation and negligible delay and area overhead compared to the baseline logic. Fault coverage achieved ranges from 87.2% to 100% across random and burst fault modes, and generalizes robustly to different word sizes and operating configurations (Paul et al., 5 Aug 2025).
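A software analogue conveys the principle (the offset encoding and comparison rule below are simplified assumptions, not the paper's exact word-wise scheme): reduce the operand twice, once with a known offset applied, and require that the offset cancel exactly.

```python
def checked_mod(x, q, delta=1):
    """Reduce x mod q twice, the second time with a known offset delta;
    if the offset does not cancel exactly, a hardware fault is assumed."""
    r = x % q                    # normal reduction path
    r_offset = (x + delta) % q   # offset (fault-encoded) reduction path
    if (r_offset - r) % q != delta % q:
        raise RuntimeError("fault detected in modular reduction")
    return r

print(checked_mod(100, 17))      # fault-free path returns the ordinary residue
```

Because both paths share the same datapath structure, a transient fault in the reduction logic is unlikely to corrupt both results in the exact way that preserves the offset relation, which is what gives the high coverage at near-zero delay cost.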
4. Memory Fault Detection: Memory Rule Checkers
NTT datapaths involve multiple memory units: RAMs for polynomial data and ROMs for twiddle factors. Two independent rule checkers, collectively termed MemoryRC, monitor for address-generation faults:
- RAM Checker (i–k rule): for each butterfly in stage $s$, ensures that the paired data indices $i$ and $k$ obey the stage's stride relation and remain within the valid address range.
- ROM Checker (i–j rule): within stage $s$, verifies that the twiddle index $j$ lies in the legal range for that stage.
Out-of-bounds or repeated access is immediately flagged as a soft fault. Empirical results demonstrate detection rates from 50.7% to 100%, with higher detection for burst errors and simultaneous RAM + ROM faults.
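A toy address-rule checker in the spirit of MemoryRC might look as follows (the concrete stride and range rules here are generic assumptions; the paper's i–k and i–j rules are stage-specific):

```python
def check_ram_access(i, k, stride, n):
    """i-k rule sketch: the butterfly pair (i, k) must differ by the
    stage stride and both indices must stay inside the RAM's address range."""
    return 0 <= i < n and 0 <= k < n and k - i == stride

def check_rom_access(j, lo, hi):
    """i-j rule sketch: the twiddle index j must fall in the legal
    ROM window [lo, hi) for the current stage."""
    return lo <= j < hi

# Example: stage with stride 4 over an 8-element RAM, 4-entry twiddle window.
print(check_ram_access(0, 4, stride=4, n=8))   # legal pair
print(check_ram_access(7, 11, stride=4, n=8))  # out-of-bounds partner -> fault
```

In hardware these comparisons are a handful of LUTs per checker, which is why the reported area overhead stays below 8.5%.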
5. Empirical FPGA Evaluation and Resource-Performance Trade-offs
All methods were validated on Xilinx Artix-7 targets using Kyber-768 parameters ($n = 256$, $q = 3329$). Results:
| Variant | Slices | DSPs | Power (mW) | SEC Overhead | Coverage |
|---|---|---|---|---|---|
| Baseline CT-BU | 73 | 1 | 104 | — | — |
| + REMO | 81 | 2 | 106 | +17% area | 87–100% |
| + Memory RC | 89 | 2 | 107 | +8.5% area | 51–100% |
Total throughput is maintained at the baseline; area and power overheads remain under 8.5% and 2%, respectively. Compared to prior approaches (e.g., RENO recomputation [Sarker et al.]: 15–24% area and 8–22% delay overhead for ~99.5% logic coverage), this integrated defense achieves comparable or higher coverage at a fraction of the resource cost.
6. Context in PQC, Comparative Approaches, and Broader Impact
NTT-based polynomial multiplication is the computational linchpin of lattice-based PQC schemes (Kyber, NTRU, Ring-LWE, etc.) and so its reliability directly affects the security and throughput of these protocols. The presented REMO + Memory RC architecture sets a new benchmark in lightweight, application-integrated hardware fault tolerance, combining in-butterfly recomputation with modular offset and address-space rule-aware checking for full datapath resilience (Paul et al., 5 Aug 2025).
Relative to Hamming-code-protected memories (Khan et al.) and to more general algorithm-level error detection (Ahmadi et al., 2024), this work is distinguished by sub-10% area overhead and zero latency cost at 87–100% logic and 51–100% memory coverage. The methods generalize natively across word sizes, bit-widths, and NTT parameter regimes, making them readily suitable for deployment in high-speed PQC network processors and side-channel-constrained cryptographic FPGAs.
7. Conclusions and Architectural Guidelines
Integrating REMO with modular offset and rule-checking logic in the core of the NTT pipeline enables robust, low-overhead hardware fault detection. This preserves critical performance metrics (area, energy, speed) while delivering near-complete detection of both transient and injected hardware faults. Such approaches are necessary for future network security processors and cryptographic accelerators that must operate reliably under both environmental noise and targeted adversarial conditions in post-quantum settings (Paul et al., 5 Aug 2025).