MXU-Centric RNS Lazy Reduction

Updated 23 April 2026

The paper introduces a hardware-tailored method that recasts Montgomery modular multiplication into low-precision matrix operations, eliminating fine-grained carry chains.
It leverages an extended-RNS representation with byte-level decomposition to exploit the parallelism of AI ASICs’ MXUs, ensuring non-overflow and efficient residue computations.
Empirical results show up to 90× speedup and significant energy efficiency gains in modular multiplications, enhancing performance in zero-knowledge proof systems.

MXU-centric RNS lazy reduction is a hardware-tailored numerical method that reformulates high-precision modular arithmetic so that it runs efficiently on the matrix multiplication units (MXUs) of AI ASICs such as TPUs. It targets the large-prime-field computations that dominate zero-knowledge proof systems. MORPH, the framework introducing this technique, recasts cryptographic primitives, most notably Montgomery modular multiplication, into dense, low-precision matrix multiplications, eliminating fine-grained carry chains and maximizing hardware throughput (Tong et al., 20 Apr 2026).

1. Mathematical Foundation: Extended-RNS Representation

The method operates over large prime fields $\mathbb{F}_p$, with $p$ ranging from 256 to 753 bits (for instance, $p \approx 2^{256}, 2^{377}, 2^{753}$). An extended modulus $M = \prod_{i=0}^{I-1} q_i$, where the $q_i$ are pairwise-coprime 32-bit moduli, is constructed such that $M > p^2$. This ensures that every product $a \cdot b$ with $a, b \in \mathbb{F}_p$ is represented modulo $M$ without overflow. Each element $x \in [0,p)$ is encoded in the residue number system (RNS) as a vector $(x_0, x_1, \ldots, x_{I-1})$, where $x_i = x \bmod q_i$. Every 32-bit residue component $x_i$ is further byte-decomposed into four 8-bit slices $x_{i,0}, \ldots, x_{i,3}$ with $x_i = \sum_{b=0}^{3} 2^{8b} x_{i,b}$, enabling direct mapping onto TPU MXUs, which are optimized for 8-bit matrix operations.
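The encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the MORPH implementation: the helper names (`pick_moduli`, `to_rns`, `to_bytes`, `from_bytes`) and the downward search for coprime moduli are assumptions made for the example, and the source does not say how its moduli are chosen.

```python
from math import gcd

def pick_moduli(p_bits: int) -> list[int]:
    """Collect pairwise-coprime 32-bit moduli until their product exceeds p^2.

    Hypothetical selection rule: walk downward over odd 32-bit values and keep
    each one that is coprime to everything chosen so far.
    """
    bound = 1 << (2 * p_bits)      # any p with p_bits bits satisfies p^2 < 2^(2*p_bits)
    moduli, product = [], 1
    candidate = (1 << 32) - 1
    while product <= bound:
        if all(gcd(candidate, q) == 1 for q in moduli):
            moduli.append(candidate)
            product *= candidate
        candidate -= 2
    return moduli

def to_rns(x: int, moduli: list[int]) -> list[int]:
    """Encode x as its residues x_i = x mod q_i."""
    return [x % q for q in moduli]

def to_bytes(residue: int) -> list[int]:
    """Split one 32-bit residue into four 8-bit slices, least significant first."""
    return [(residue >> (8 * b)) & 0xFF for b in range(4)]

def from_bytes(slices: list[int]) -> int:
    """Inverse of to_bytes: x_i = sum_b 2^(8b) * x_{i,b}."""
    return sum(s << (8 * b) for b, s in enumerate(slices))
```

Under this selection rule a 256-bit field needs 17 moduli, since 16 moduli below $2^{32}$ cannot reach a product above $2^{512}$.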

RNS arithmetic (addition and multiplication) is performed residue-wise:

$(a + b)_i = (a_i + b_i) \bmod q_i$

$(a \cdot b)_i = (a_i \cdot b_i) \bmod q_i$

This extended representation enables massive parallelism and eliminates cross-word carry propagation.
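A small check of residue-wise arithmetic, assuming four illustrative moduli near $2^{32}$ (chosen here for the example, not taken from the source) and a textbook CRT reconstruction used only to verify the result:

```python
from math import prod

# Illustrative pairwise-coprime moduli; not the moduli used in the source.
MODULI = [2**32 - 1, 2**32 - 2, 2**32 - 3, 2**32 - 5]

def to_rns(x):
    return [x % q for q in MODULI]

def rns_add(a, b):
    """Lane-wise addition: no carry crosses from one residue to another."""
    return [(ai + bi) % q for ai, bi, q in zip(a, b, MODULI)]

def rns_mul(a, b):
    """Lane-wise multiplication, again independent per residue."""
    return [(ai * bi) % q for ai, bi, q in zip(a, b, MODULI)]

def from_rns(residues):
    """Textbook CRT reconstruction, used here only to check results."""
    M = prod(MODULI)
    x = 0
    for r, q in zip(residues, MODULI):
        Mi = M // q
        x += r * Mi * pow(Mi, -1, q)   # pow(..., -1, q) is the modular inverse
    return x % M
```

As long as the true result is smaller than the product of the moduli, reconstructing from the residues returns it exactly, which is why the source requires $M > p^2$ for products of field elements.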

2. Algorithmic Workflow and Lazy Reduction

The core workflow comprises precomputation followed by a runtime Montgomery multiplication entirely performed in the extended-RNS domain.
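For orientation, the quantity a Montgomery multiplication produces can be written as a plain big-integer reference. This sketch is not the extended-RNS routine: it uses Python integers directly, and the names `mont_mul` and `to_mont` and the radix `R` are assumptions for illustration.

```python
def mont_mul(a: int, b: int, p: int, R: int) -> int:
    """Return a * b * R^{-1} mod p, assuming gcd(R, p) == 1, R > p and a, b < p."""
    p_neg_inv = (-pow(p, -1, R)) % R     # -p^{-1} mod R, normally precomputed once
    t = a * b
    m = (t * p_neg_inv) % R
    u = (t + m * p) // R                 # exact: t + m*p is a multiple of R
    return u - p if u >= p else u        # single conditional subtraction

def to_mont(x: int, p: int, R: int) -> int:
    """Map x into Montgomery form x * R mod p."""
    return (x * R) % p
```

Multiplying two values in Montgomery form with `mont_mul` yields the Montgomery form of their product, so chains of multiplications never need a division by `p`.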

Preprocessing

Compute the extended modulus $M$ and the related constants required for Montgomery reduction.
For each modulus $q_i$, calculate the per-modulus constants that facilitate lazy reduction, together with the constants used for the post-processing correction.
Precompute a 4D tensor holding the byte-level transformation coefficients.

Runtime: MXU-LazyMont Routine

Compute the residue-wise values for all moduli $q_i$ as 32-bit vector ops.
Compute the intermediate residues via a 32-bit vector shift.
Decompose each 32-bit intermediate residue into 8-bit slices.
For each target modulus and output byte, sum the products of the input byte slices with the precomputed tensor coefficients, evaluated as a large 8-bit matrix multiplication (GEMM) on the MXU.
Merge the resulting byte slices to reconstruct 32-bit residues.
Apply the final correction using the precomputed post-processing constants.

The runtime workload is thus reduced to a single GEMM and supporting vector operations.
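The following sketch shows one way a modular linear transform across residues can be expressed as a single byte-level matrix product with reduction deferred to the end. It is an assumption-laden illustration rather than the source's routine: the coefficient table `coeff`, the tensor layout, and the function names are hypothetical, and pure-Python loops stand in for the MXU GEMM.

```python
import random

MODULI = [2**32 - 1, 2**32 - 2, 2**32 - 3, 2**32 - 5]   # illustrative, pairwise coprime
I, B, H = len(MODULI), 4, 4                              # residues, input bytes, output bytes

def byte_slices(v, n=4):
    return [(v >> (8 * k)) & 0xFF for k in range(n)]

def build_matrix(coeff):
    """Flatten T[i][b][h][j] = byte h of (coeff[i][j] * 2^(8b) mod q_j)
    into an (I*B) x (H*I) matrix of 8-bit entries."""
    mat = [[0] * (H * I) for _ in range(I * B)]
    for i in range(I):
        for b in range(B):
            for j, qj in enumerate(MODULI):
                reduced = (coeff[i][j] << (8 * b)) % qj
                for h, byte in enumerate(byte_slices(reduced, H)):
                    mat[i * B + b][h * I + j] = byte
    return mat

def lazy_transform(batch, mat):
    """For each residue vector z compute sum_i z_i * coeff[i][j] mod q_j for every j.
    The inner sums are plain integer accumulations; reduction happens once per output."""
    out = []
    for residues in batch:
        row = [s for r in residues for s in byte_slices(r, B)]          # length I*B
        acc = [sum(row[k] * mat[k][c] for k in range(I * B))            # the "GEMM"
               for c in range(H * I)]
        out.append([sum(acc[h * I + j] << (8 * h) for h in range(H)) % qj
                    for j, qj in enumerate(MODULI)])                     # merge, then reduce
    return out
```

The matrix has shape (I·B)×(H·I), and every entry and every input slice is an 8-bit value, so the accumulation step is exactly the kind of workload an 8-bit matrix unit accepts.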

3. Mapping to TPU MXU Architecture

Computation is carefully partitioned between the TPU’s vectorized processing units (VPU) and its high-throughput MXU:

Step	Hardware Unit	Operation Type
Preprocessing and vector dot/shifts	VPU (32-bit)	Scalar and vector ops (routine steps 1–2 and 5–6)
Byte (de)composition and merging	VPU (8/32-bit)	Simple lane extraction and merging
Large batched byte-level matrix multiplication	MXU (8-bit)	(I·B)×(H·I) GEMM (main computational hotspot)

Tiling the matrix multiplication across the MXU places the computationally dominant part, the byte-level accumulations across all residues and bytes, on the unit with the highest throughput: the MXU offers on the order of 16× more MAC bandwidth than the VPU.
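A quick headroom estimate shows why byte-level accumulation can run without intermediate reduction. The sketch assumes unsigned 8-bit operands and a signed 32-bit accumulator; both are assumptions for illustration, not details confirmed by the source.

```python
def max_accumulator(num_moduli: int, bytes_per_residue: int = 4) -> int:
    """Largest possible dot-product value over I*B terms of 8-bit x 8-bit products."""
    return num_moduli * bytes_per_residue * 255 * 255

def fits_int32(num_moduli: int, bytes_per_residue: int = 4) -> bool:
    """True if the worst-case accumulation stays below 2^31 (assumed accumulator width)."""
    return max_accumulator(num_moduli, bytes_per_residue) < 2**31
```

For a 753-bit field, $M > p^2$ needs roughly 1506 bits, that is 48 moduli of 32 bits, and the worst-case accumulation is 48 · 4 · 255² = 12,484,800, far below $2^{31}$.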

4. Carry-Chain Elimination

A salient feature is the systematic removal of inter-lane carry chains. All arithmetic after the initial RNS residue split operates exclusively on fixed-width quantities (8 or 32 bits) within independent lanes. At no point does the method reconstruct the full 256- to 753-bit integer or require carry propagation across it. Even the conditional subtraction step inherent in Montgomery reduction is subsumed within the final vector operation, and deferred until after batch processing or complete proof aggregation. This design both avoids fine-grained shuffles and matches the hardware’s natural granularity.
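Deferring the conditional subtraction can be illustrated with a standard big-integer argument: if the Montgomery radix satisfies $R > 4p$ and inputs lie in $[0, 2p)$, each product also lies in $[0, 2p)$, so a whole chain needs only one subtraction at the end. The sketch below uses plain Python integers and hypothetical names; it shows the principle, not the source's vectorized routine.

```python
def mont_mul_lazy(a: int, b: int, p: int, R: int, p_neg_inv: int) -> int:
    """Montgomery product with no conditional subtraction.

    For a, b in [0, 2p) and R > 4p:
    (a*b + m*p) / R < (4p^2 + R*p) / R < 2p, so the result stays in [0, 2p).
    """
    t = a * b
    m = (t * p_neg_inv) % R
    return (t + m * p) // R

def mont_chain(values: list[int], p: int, R: int) -> int:
    """Multiply a list of values, subtracting p at most once at the very end."""
    p_neg_inv = (-pow(p, -1, R)) % R
    acc = values[0]
    for v in values[1:]:
        acc = mont_mul_lazy(acc, v, p, R, p_neg_inv)
    return acc - p if acc >= p else acc      # the only conditional subtraction
```

With $n$ inputs the chain returns their product times $R^{-(n-1)}$ modulo $p$. The condition $M > p^2$ stated in the source implies $M > 4p$ for any $p > 4$, so an extended modulus of that size leaves this kind of headroom.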

5. Complexity and Performance Analysis

The Big-T hardware-aware complexity metric captures pipeline and memory bottlenecks overlooked by conventional Big-O. In reference Radix-Montgomery implementations, the time-to-completion is dictated by carry-propagation shuffles.

In contrast, MXU-centric RNS lazy reduction compresses the reduction into a single MXU GEMM, fully utilizing the unit's high matrix-multiply throughput.

Empirically, on a TPUv6e8, this yields over 90× speedup on a single modular multiply compared to radix-based methods (Tong et al., 20 Apr 2026).

6. End-to-End System and Applications

Within a complete zero-knowledge proof (ZKP) system realized on MORPH:

753-bit NTT throughput achieves up to 10× higher rates than GZKP on an Nvidia V100.
Batch modular multiplications (256/377/753 bits) are 50–157× faster than radix Montgomery, with gains increasing at higher bit-widths.
753-bit MSM achieves a 1.14–1.20× throughput increase compared to GZKP.
Energy efficiency: proof-generation rates per watt are 2–10× higher than on GPU baselines.

The method is especially impactful for workloads dominated by multi-scalar multiplication and number-theoretic transforms—critical bottlenecks in contemporary ZKPs.

7. Significance and Limitations

MXU-centric extended-RNS lazy reduction provides a paradigm for recasting cryptographic modular arithmetic into a dataflow-aligned workload for AI accelerators. By structurally eliminating carry chains and engineering the modular reduction to fit the MXU’s low-precision matrix operations, the approach asymptotically reduces the computational span relative to radix-based Montgomery multiplication. The technique is explicitly matched to large prime fields and systems equipped with high-throughput matrix hardware.

A plausible implication is that similar mapping techniques could yield comparable benefits for other cryptographic or scientific domains relying on large-integer, modular arithmetic, provided hardware with comparable matrix-multiply units is available.

Referenced study: "Enabling AI ASICs for Zero Knowledge Proof" (Tong et al., 20 Apr 2026).


References (1)

Enabling AI ASICs for Zero Knowledge Proof (2026)
