
MXU-Centric RNS Lazy Reduction

Updated 23 April 2026
  • The paper introduces a hardware-tailored method that recasts Montgomery modular multiplication into low-precision matrix operations, eliminating fine-grained carry chains.
  • It leverages an extended-RNS representation with byte-level decomposition to exploit the parallelism of AI ASICs’ MXUs, ensuring non-overflow and efficient residue computations.
  • Empirical results show up to 90× speedup and significant energy efficiency gains in modular multiplications, enhancing performance in zero-knowledge proof systems.

MXU-centric RNS lazy reduction is a hardware-tailored numerical method that reformulates high-precision modular arithmetic, specifically the large-prime-field computations central to zero-knowledge proof systems, so that it runs on the matrix multiplication units (MXUs) of AI ASICs such as TPUs. MORPH, the framework that introduces the technique, recasts cryptographic primitives, most notably Montgomery modular multiplication, as dense, low-precision matrix multiplications, eliminating fine-grained carry chains and raising hardware throughput (Tong et al., 20 Apr 2026).

1. Mathematical Foundation: Extended-RNS Representation

The method operates over large prime fields $\mathbb{F}_p$, with $p$ ranging from 256 to 753 bits (for instance, $p \approx 2^{256}, 2^{377}, 2^{753}$). An extended modulus $M = \prod_{i=0}^{I-1} q_i$, where the $q_i$ are pairwise-coprime 32-bit moduli, is constructed such that $M > p^2$. This ensures that every product $a \cdot b$ with $a, b \in \mathbb{F}_p$ is represented without overflow modulo $M$. Each element $x \in [0,p)$ is encoded in the residue number system (RNS) as a vector $(x_0, \ldots, x_{I-1})$, where $x_i = x \bmod q_i$. Every 32-bit residue component $x_i$ is further decomposed into four 8-bit slices $x_{i,0}, \ldots, x_{i,3}$, enabling direct mapping onto TPU MXUs, which are optimized for 8-bit matrix operations.
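The encoding can be illustrated with a minimal pure-Python sketch. The prime `2**255 - 19`, the greedy modulus selection, and all function names are assumptions chosen for illustration; the paper's actual moduli and selection procedure are not given here.

```python
from math import gcd, prod

def pick_moduli(p):
    """Greedily collect pairwise-coprime 32-bit moduli until M > p^2."""
    moduli, candidate = [], 2**32 - 1
    while prod(moduli) <= p * p:
        if all(gcd(candidate, q) == 1 for q in moduli):
            moduli.append(candidate)
        candidate -= 2  # stay on odd candidates just below 2^32
    return moduli

def to_rns(x, moduli):
    """Encode x as the residue vector (x mod q_0, ..., x mod q_{I-1})."""
    return [x % q for q in moduli]

def to_byte_slices(residue):
    """Split a 32-bit residue into four 8-bit slices, least significant first."""
    return [(residue >> (8 * b)) & 0xFF for b in range(4)]

def from_rns(residues, moduli):
    """CRT reconstruction, used here only to check the round trip."""
    M = prod(moduli)
    total = 0
    for r, q in zip(residues, moduli):
        Mi = M // q
        total += r * Mi * pow(Mi, -1, q)  # modular inverse, Python 3.8+
    return total % M

p = 2**255 - 19                # illustrative prime, not taken from the paper
moduli = pick_moduli(p)
x = pow(3, 1000, p)            # an arbitrary field element
residues = to_rns(x, moduli)
slices = [to_byte_slices(r) for r in residues]
assert from_rns(residues, moduli) == x
```

The CRT reconstruction appears only as a correctness check; the method itself is described as never rebuilding the full-width integer at runtime.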

RNS addition and multiplication are performed residue-wise:

$$(a + b)_i = (a_i + b_i) \bmod q_i$$

$$(a \cdot b)_i = (a_i \cdot b_i) \bmod q_i$$

This extended representation enables massive parallelism and eliminates cross-word carry propagation.
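A small sketch can make the residue-wise rule concrete. The field prime `2**61 - 1`, the four moduli, and the helper names below are illustrative assumptions, not values from the paper. Because $M > p^2$, the CRT reconstruction of the residue-wise product equals the exact integer product, with no carry passing between lanes.

```python
from itertools import combinations
from math import gcd, prod

P = 2**61 - 1                                   # illustrative prime (assumption)
MODULI = [2**32 - 1, 2**32 - 3, 2**32 - 5, 2**32 - 9]
M = prod(MODULI)
assert all(gcd(a, b) == 1 for a, b in combinations(MODULI, 2))
assert M > P * P                                # products of field elements fit

def to_rns(x):
    return [x % q for q in MODULI]

def rns_add(a, b):
    """Lane-wise addition: each residue is handled independently."""
    return [(ai + bi) % q for ai, bi, q in zip(a, b, MODULI)]

def rns_mul(a, b):
    """Lane-wise multiplication: no carries cross between residues."""
    return [(ai * bi) % q for ai, bi, q in zip(a, b, MODULI)]

def from_rns(r):
    """CRT reconstruction, only to verify the lane-wise results."""
    return sum(ri * (M // q) * pow(M // q, -1, q) for ri, q in zip(r, MODULI)) % M

a, b = 1234567890123456789, 987654321987654321  # both below P
total = from_rns(rns_add(to_rns(a), to_rns(b)))
product = from_rns(rns_mul(to_rns(a), to_rns(b)))
assert total == a + b and product == a * b
```

Reduction modulo `P` is deliberately left out here; in the method described, that step is the job of the Montgomery routine.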

2. Algorithmic Workflow and Lazy Reduction

The core workflow comprises precomputation followed by a runtime Montgomery multiplication entirely performed in the extended-RNS domain.

Preprocessing

  • Compute the extended modulus $M$ and the related constants needed for Montgomery reduction.
  • For each modulus $q_i$, calculate the per-modulus constants that facilitate lazy reduction, together with the constants used in the post-processing correction.
  • Precompute a 4D tensor holding the byte-level transformation coefficients.

Runtime: MXU-LazyMont Routine

  1. Compute the per-residue intermediate values for all moduli $q_i$ as 32-bit vector ops.
  2. Compute the next intermediate via a 32-bit vector shift.
  3. Decompose each resulting 32-bit residue into 8-bit slices.
  4. For each target modulus and output byte, accumulate the products of the input byte slices with the precomputed tensor coefficients. This summation is executed as a large 8-bit matrix multiplication (GEMM) on the MXU.
  5. Merge the output byte slices to reconstruct 32-bit residues.
  6. Apply the final correction.

The runtime workload is thus reduced to a single GEMM and supporting vector operations.
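The byte-level GEMM with deferred reduction can be sketched in plain Python under stated assumptions. The coefficient matrix `C` is a hypothetical stand-in for the paper's precomputed constants, and the tensor layout `T[i][b][h][j]` is inferred from the (I·B)×(H·I) GEMM shape rather than taken from the paper. The sketch evaluates $y_j = \sum_i x_i C_{i,j} \bmod q_j$ using only 8-bit by 8-bit products, with a single reduction per target modulus after accumulation.

```python
import random

MODULI = [2**32 - 1, 2**32 - 3, 2**32 - 5, 2**32 - 9]  # illustrative moduli
I, B, H = len(MODULI), 4, 4   # residues, input bytes, output bytes

def byte_slices(v):
    """Four 8-bit slices of a 32-bit value, least significant first."""
    return [(v >> (8 * k)) & 0xFF for k in range(4)]

def build_tensor(C):
    """T[i][b][h][j] = byte h of (C[i][j] * 2^(8b) mod q_j). Precomputed once."""
    return [[[[byte_slices((C[i][j] << (8 * b)) % MODULI[j])[h]
               for j in range(I)] for h in range(H)]
             for b in range(B)] for i in range(I)]

def lazy_linear_map(x, T):
    """Byte-sliced evaluation of y_j = sum_i x_i * C[i][j] mod q_j."""
    xs = [byte_slices(xi) for xi in x]          # I x B input slices
    # The GEMM: inner dimension I*B, output dimension H*I, no reduction inside.
    Z = [[sum(xs[i][b] * T[i][b][h][j] for i in range(I) for b in range(B))
          for j in range(I)] for h in range(H)]
    # Merge the byte planes, then reduce once per target modulus (lazy reduction).
    return [sum(Z[h][j] << (8 * h) for h in range(H)) % MODULI[j]
            for j in range(I)]

def reference(x, C):
    """Direct wide-integer evaluation, for comparison only."""
    return [sum(x[i] * C[i][j] for i in range(I)) % MODULI[j] for j in range(I)]

random.seed(0)
C = [[random.randrange(MODULI[j]) for j in range(I)] for i in range(I)]
x = [random.randrange(q) for q in MODULI]
T = build_tensor(C)
assert lazy_linear_map(x, T) == reference(x, C)
```

On an accelerator, the nested sum inside `lazy_linear_map` is the part that would run as one integer matrix multiplication; the slicing and merging would stay on the vector unit.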

3. Mapping to TPU MXU Architecture

Computation is partitioned between the TPU’s vector processing unit (VPU) and its high-throughput MXU:

| Step | Hardware Unit | Operation Type |
| --- | --- | --- |
| Preprocessing and vector dot/shifts | VPU (32-bit) | Scalar and vector ops (steps 1–2, 5–6) |
| Byte (de)composition and merging | VPU (8/32-bit) | Simple lane extraction & merging |
| Large batched byte-level matrix mult. | MXU (8-bit) | (I·B)×(H·I) GEMM (main computational hotspot) |

Tiling the matrix multiplication onto the MXU allows the computationally dominant part, the byte-level accumulation across all residues and bytes, to be amortized over the MXU’s high bandwidth, which offers on the order of 16× more MAC bandwidth than the VPU.
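A rough sizing check for the 753-bit case follows from the constraints already stated (32-bit moduli, $M > p^2$, four byte slices per residue). It assumes unsigned byte slices and an accumulator of at least 32 bits; the paper's actual modulus count may be larger than this lower bound.

```latex
\begin{align*}
q_i < 2^{32},\quad M = \prod_{i=0}^{I-1} q_i > p^2 \approx 2^{1506}
  &\;\Longrightarrow\; I \ge \lceil 1506 / 32 \rceil = 48, \\
\text{GEMM inner dimension: } I \cdot B &= 48 \cdot 4 = 192, \\
\text{worst-case accumulator: } I \cdot B \cdot 255^2 &= 192 \cdot 65025 = 12{,}484{,}800 < 2^{24}.
\end{align*}
```

Under these assumptions the accumulated byte products stay well inside a 32-bit lane, which is consistent with deferring the modular reduction until after the GEMM.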

4. Carry-Chain Elimination

A salient feature is the systematic removal of inter-lane carry chains. All arithmetic after the initial RNS residue split operates on fixed-width quantities (8 or 32 bits) within independent lanes. At no point does the method reconstruct the full 256- to 753-bit integer or propagate carries across it. Even the conditional subtraction step inherent in Montgomery reduction is folded into the final vector operation and deferred until after batch processing or complete proof aggregation. This design avoids fine-grained shuffles and matches the hardware’s natural granularity.

5. Complexity and Performance Analysis

The Big-T hardware-aware complexity metric captures pipeline and memory bottlenecks overlooked by conventional Big-O analysis. In reference radix-Montgomery implementations, the time-to-completion is dictated by carry-propagation shuffles. In contrast, MXU-centric RNS lazy reduction compresses the reduction into a single MXU GEMM, fully utilizing the MXU’s high matrix-multiply throughput.

Empirically, on a TPUv6e8, this yields over 90× speedup on a single modular multiply compared to radix-based methods (Tong et al., 20 Apr 2026).

6. End-to-End System and Applications

Within a complete zero-knowledge proof (ZKP) system realized on MORPH:

  • 753-bit NTT throughput achieves up to 10× higher rates than GZKP on an Nvidia V100.
  • Batch modular multiplications (256/377/753 bits) are 50–157× faster than radix Montgomery, with gains increasing at higher bit-widths.
  • 753-bit MSM achieves a 1.14–1.20× throughput increase compared to GZKP.
  • Energy efficiency: proof-generation rates per watt are 2–10× higher than on GPU baselines.

The method is especially impactful for workloads dominated by multi-scalar multiplication and number-theoretic transforms—critical bottlenecks in contemporary ZKPs.

7. Significance and Limitations

MXU-centric extended-RNS lazy reduction provides a paradigm for recasting cryptographic modular arithmetic into a dataflow-aligned workload for AI accelerators. By structurally eliminating carry chains and engineering the modular reduction to fit the MXU’s low-precision matrix operations, the approach asymptotically reduces the computational span relative to radix-based Montgomery multiplication. The technique is explicitly matched to large prime fields and systems equipped with high-throughput matrix hardware.

A plausible implication is that similar mapping techniques could yield comparable benefits for other cryptographic or scientific domains relying on large-integer, modular arithmetic, provided hardware with comparable matrix-multiply units is available.

Referenced study: "Enabling AI ASICs for Zero Knowledge Proof" (Tong et al., 20 Apr 2026).
