MXU-Centric RNS Lazy Reduction
- The paper introduces a hardware-tailored method that recasts Montgomery modular multiplication into low-precision matrix operations, eliminating fine-grained carry chains.
- It leverages an extended-RNS representation with byte-level decomposition to exploit the parallelism of AI ASICs’ MXUs, ensuring non-overflow and efficient residue computations.
- Empirical results show up to 90× speedup and significant energy efficiency gains in modular multiplications, enhancing performance in zero-knowledge proof systems.
MXU-centric RNS lazy reduction is a hardware-tailored numerical method that reformulates high-precision modular arithmetic, specifically for large-prime-field computations pivotal in zero-knowledge proof systems, to optimally leverage the matrix multiplication units (MXUs) present in AI ASICs such as TPUs. MORPH, the framework introducing this technique, recasts cryptographic primitives—most notably Montgomery modular multiplication—into dense, low-precision matrix multiplications, eliminating fine-grained carry chains and maximizing hardware throughput efficiency (Tong et al., 20 Apr 2026).
1. Mathematical Foundation: Extended-RNS Representation
The method operates over large prime fields $\mathbb{F}_p$, with $p$ ranging from 256 to 753 bits. An extended modulus $M = \prod_{i=1}^{I} m_i$, where each $m_i$ is a 32-bit pairwise-coprime modulus, is constructed large enough that all intermediate products remain non-overflowing modulo $M$. Each element $x$ is encoded via the residue number system (RNS) as a vector $(x_1, \dots, x_I)$, where $x_i = x \bmod m_i$. Every 32-bit residue component $x_i$ is further byte-decomposed into four 8-bit slices $x_{i,0}, \dots, x_{i,3}$ with $x_i = \sum_{b=0}^{3} x_{i,b}\, 2^{8b}$, enabling direct mapping onto TPU MXUs, which are optimized for 8-bit matrix operations.
RNS addition and multiplication are performed residue-wise, independently for each modulus $m_i$:
$$(a + b)_i = (a_i + b_i) \bmod m_i$$
$$(a \cdot b)_i = (a_i \cdot b_i) \bmod m_i$$
This extended representation enables massive parallelism and eliminates cross-word carry propagation.
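The residue-wise arithmetic and byte decomposition described above can be sketched in a few lines of plain Python. This is a minimal illustration, not MORPH's implementation: the four moduli in `MODULI` and the helper names (`to_rns`, `from_rns`, `byte_slices`) are assumptions chosen for the example, and the CRT reconstruction in `from_rns` is included only to check the result.

```python
from math import gcd, prod

# Hypothetical basis of pairwise-coprime moduli just below 2**32
# (illustrative values, not the basis used by MORPH).
MODULI = [2**32 - 1, 2**32 - 2, 2**32 - 3, 2**32 - 5]
assert all(gcd(a, b) == 1 for k, a in enumerate(MODULI) for b in MODULI[k + 1:])

def to_rns(x):
    """Encode x as its vector of residues x_i = x mod m_i."""
    return [x % m for m in MODULI]

def rns_add(a, b):
    """Residue-wise addition: each lane is independent, no cross-lane carry."""
    return [(x + y) % m for x, y, m in zip(a, b, MODULI)]

def rns_mul(a, b):
    """Residue-wise multiplication: each lane is independent."""
    return [(x * y) % m for x, y, m in zip(a, b, MODULI)]

def from_rns(r):
    """Reconstruct the integer in [0, M) via the Chinese remainder theorem."""
    M = prod(MODULI)
    total = 0
    for r_i, m in zip(r, MODULI):
        M_i = M // m
        total += r_i * M_i * pow(M_i, -1, m)
    return total % M

def byte_slices(r_i):
    """Split one 32-bit residue into four 8-bit slices, least significant first."""
    return [(r_i >> (8 * b)) & 0xFF for b in range(4)]

def merge_bytes(slices):
    """Inverse of byte_slices."""
    return sum(s << (8 * b) for b, s in enumerate(slices))

x = 0x123456789ABCDEF01357   # 77-bit operand
y = 0xFEDCBA987654           # 48-bit operand; x * y stays below M
product = from_rns(rns_mul(to_rns(x), to_rns(y)))
```

Because `x * y` is smaller than the product of the moduli, `product` equals the ordinary integer product, which is the non-overflow property the extended modulus is meant to guarantee.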
2. Algorithmic Workflow and Lazy Reduction
The core workflow comprises precomputation followed by a runtime Montgomery multiplication entirely performed in the extended-RNS domain.
Preprocessing
- Compute the extended modulus $M$ and the related constants required for Montgomery reduction.
- For each modulus $m_i$, calculate the per-modulus constants that facilitate lazy reduction and the constant used in the post-processing correction.
- Precompute a 4D tensor holding the byte-level transformation coefficients consumed by the runtime GEMM.
Runtime: MXU-LazyMont Routine
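- Compute the residue-wise intermediate values for all moduli $m_i$ as 32-bit vector operations.
- Compute the shifted intermediate via a 32-bit vector shift.
- Decompose each 32-bit residue into 8-bit slices.
- For each target modulus and output byte, accumulate the products of the input byte slices with the precomputed tensor coefficients, executed as a single large 8-bit matrix multiplication (GEMM) on the MXU.
- Merge the resulting byte slices to reconstruct 32-bit residues.
- Apply the final correction as a vector operation.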
The runtime workload is thus reduced to a single GEMM and supporting vector operations.
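To make the byte-level GEMM step concrete, the sketch below evaluates a generic linear map on residues, $y_j = \sum_i x_i c_{i,j} \bmod m_j$, using only 8-bit by 8-bit products. It is a hedged illustration of the general technique rather than the MXU-LazyMont routine itself: the coefficient matrix `coeff` is random, the moduli are hypothetical, and the tensor layout `T[i][b][j][h]` is an assumption chosen to match the $(I \cdot B) \times (H \cdot I)$ GEMM shape.

```python
import random

MODULI = [2**32 - 1, 2**32 - 2, 2**32 - 3, 2**32 - 5]  # hypothetical basis
I = len(MODULI)  # number of residues
B = 4            # input bytes per 32-bit residue
H = 4            # output bytes per 32-bit coefficient

def precompute_tensor(coeff):
    """T[i][b][j][h] = byte h of ((coeff[i][j] * 2**(8*b)) mod m_j)."""
    return [[[[(((coeff[i][j] << (8 * b)) % MODULI[j]) >> (8 * h)) & 0xFF
               for h in range(H)]
              for j in range(I)]
             for b in range(B)]
            for i in range(I)]

def byte_gemm_transform(x_res, T):
    """Compute y_j = sum_i x_i * coeff[i][j] mod m_j via 8-bit products only."""
    # Row vector of I*B input bytes.
    x_bytes = [(x_res[i] >> (8 * b)) & 0xFF for i in range(I) for b in range(B)]
    # (I*B) x (I*H) matrix of 8-bit coefficients, flattened from T.
    W = [[T[i][b][j][h] for j in range(I) for h in range(H)]
         for i in range(I) for b in range(B)]
    # Integer GEMM: every accumulator is at most I*B*255*255 < 2**32,
    # so no modular reduction is needed inside the sum (lazy reduction).
    acc = [sum(x_bytes[k] * W[k][c] for k in range(I * B)) for c in range(I * H)]
    # Merge the H byte columns of each target modulus and reduce once.
    return [sum(acc[j * H + h] << (8 * h) for h in range(H)) % MODULI[j]
            for j in range(I)]

def reference_transform(x_res, coeff):
    """Direct wide-integer evaluation, used only for checking."""
    return [sum(x_res[i] * coeff[i][j] for i in range(I)) % MODULI[j]
            for j in range(I)]

rng = random.Random(0)
coeff = [[rng.randrange(MODULI[j]) for j in range(I)] for i in range(I)]
x_res = [rng.randrange(m) for m in MODULI]
T = precompute_tensor(coeff)
y = byte_gemm_transform(x_res, T)
```

With a batch of inputs, `x_bytes` becomes a matrix with one row per element and the accumulation turns into a matrix-matrix product, which is the shape an MXU is built for.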
3. Mapping to TPU MXU Architecture
Computation is carefully partitioned between the TPU’s vectorized processing units (VPU) and its high-throughput MXU:
| Step | Hardware Unit | Operation Type |
|---|---|---|
| Preprocessing and vector dot/shifts | VPU (32-bit) | Scalar and vector ops (lines 1–2, 5–6) |
| Byte (de)composition and merging | VPU (8/32-bit) | Simple lane extraction & merging |
| Large batched byte-level matrix mult. | MXU (8-bit) | (I·B)×(H·I) GEMM (main computational hotspot) |
Tiling the matrix multiplication on the MXU allows the computationally dominant part, the byte-level accumulation across all residues and bytes, to be amortized over the MXU’s high bandwidth, which offers on the order of 16× more MAC bandwidth than the VPU.
4. Carry-Chain Elimination
A salient feature is the systematic removal of inter-lane carry chains. All arithmetic after the initial RNS residue split operates exclusively on fixed-width quantities (8 or 32 bits) in independent lanes. At no point does the method reconstruct the full 256- to 753-bit integer or propagate carries across it. Even the conditional subtraction step inherent in Montgomery reduction is subsumed within the final vector operation and deferred until after batch processing or complete proof aggregation. This design both avoids fine-grained shuffles and matches the hardware’s natural granularity.
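The deferred conditional subtraction can be illustrated with ordinary integer Montgomery multiplication. The sketch below is the standard lazy variant, not the paper's extended-RNS routine: it assumes an illustrative prime `P` and picks `R > 4P`, so that every intermediate stays in $[0, 2P)$ and a single subtraction at the very end suffices. The names `mont_mul_lazy`, `finalize`, and `lazy_product` are hypothetical.

```python
P = 2**255 - 19                    # illustrative prime, not tied to the paper's fields
R_BITS = 258                       # R = 2**258 > 4*P keeps lazy results below 2*P
R = 1 << R_BITS
P_NEG_INV = (-pow(P, -1, R)) % R   # -P^{-1} mod R

def mont_mul_lazy(a, b):
    """Return a*b/R mod P as a value in [0, 2P), with no conditional subtraction.

    For a, b < 2P: a*b < 4P^2 < R*P, so (a*b + m*P)/R < P + P = 2P.
    """
    t = a * b
    m = ((t & (R - 1)) * P_NEG_INV) & (R - 1)
    return (t + m * P) >> R_BITS   # exact division: t + m*P is a multiple of R

def to_mont(x):
    """Map x into Montgomery form x*R mod P."""
    return (x * R) % P

def finalize(x):
    """The one deferred conditional subtraction, mapping [0, 2P) to [0, P)."""
    return x - P if x >= P else x

def lazy_product(values):
    """Multiply all values modulo P, reducing to [0, P) only once at the end."""
    acc = to_mont(1)
    for v in values:
        acc = mont_mul_lazy(acc, to_mont(v))
        assert acc < 2 * P         # the lazy invariant holds at every step
    return finalize(mont_mul_lazy(acc, 1))   # leave Montgomery form, then correct

values = [3, 5, 2**200 + 7, P - 1]
result = lazy_product(values)
```

Keeping intermediates in a slightly wider range trades one extra bit or two of headroom for the removal of a data-dependent branch from every multiplication, which is the same trade the source describes for batch processing.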
5. Complexity and Performance Analysis
The Big-T hardware-aware complexity metric captures pipeline and memory bottlenecks overlooked by conventional Big-O analysis. In reference radix-Montgomery implementations, time-to-completion is dictated by carry-propagation shuffles. In contrast, MXU-centric RNS lazy reduction compresses the reduction into a single MXU GEMM, so its time-to-completion is governed by the MXU’s high matrix-multiply throughput.
Empirically, on a TPUv6e8, this yields over 90× speedup on a single modular multiplication compared to radix-based methods (Tong et al., 20 Apr 2026).
6. End-to-End System and Applications
Within a complete zero-knowledge proof (ZKP) system realized on MORPH:
- 753-bit NTT throughput achieves up to 10× higher rates than GZKP on an Nvidia V100.
- Batch modular multiplications (256/377/753 bits) are 50–157× faster than radix Montgomery, with gains increasing at higher bit-widths.
- 753-bit MSM achieves a 1.14–1.20× throughput increase compared to GZKP.
- Energy efficiency: proof-generation rates per watt are 2–10× higher than on GPU baselines.
The method is especially impactful for workloads dominated by multi-scalar multiplication and number-theoretic transforms—critical bottlenecks in contemporary ZKPs.
7. Significance and Limitations
MXU-centric extended-RNS lazy reduction provides a paradigm for recasting cryptographic modular arithmetic into a dataflow-aligned workload for AI accelerators. By structurally eliminating carry chains and engineering the modular reduction to fit the MXU’s low-precision matrix operations, the approach asymptotically reduces the computational span. The technique is explicitly matched to large prime fields and systems equipped with high-throughput matrix hardware.
A plausible implication is that similar mapping techniques could yield comparable benefits for other cryptographic or scientific domains relying on large-integer, modular arithmetic, provided hardware with comparable matrix-multiply units is available.
Referenced study: "Enabling AI ASICs for Zero Knowledge Proof" (Tong et al., 20 Apr 2026).