TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

Published 10 May 2026 in cs.LG | (2605.09281v1)

Abstract: Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet their massive parameters in experts pose significant challenges for deployment. While low-rank quantization offers a promising route to compress MoE models, existing methods still incur nonnegligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts down additional memory usage up to 10× and reduces inference latency to ~5% while preserving state-of-the-art accuracy.


Authors (4)

Summary

The paper presents a fine-tuning-free PTQ method using 2D tiling and activation-aware clustering to enable shared low-rank quantization across expert weights.
It uses a fused inference algorithm to limit memory overhead and latency, cutting the additional memory spent on low-rank components by up to 10× while nearly matching the reconstruction accuracy of independent per-expert decompositions.
Experimental results demonstrate robust performance at extreme 2-bit precision, with inference latency reduced to less than 5% of baseline methods.

Introduction and Motivation

Mixture-of-Experts (MoE) architectures are increasingly adopted in large-scale LLMs to achieve high parameter counts while reducing per-token computation via sparse expert activation. However, the practical deployment of MoE models remains constrained by their aggregate memory usage, as all expert parameters must remain resident in memory even for sparse inference scenarios. Although post-training quantization (PTQ) and low-rank factorization offer mechanisms for compressing such models, state-of-the-art solutions suffer significant trade-offs: extreme quantization leads to unacceptable degradation in accuracy, and naive low-rank approaches still incur considerable memory and computational overhead, especially as the number of experts increases.

TileQ introduces a novel, fine-tuning-free PTQ approach for MoE models that leverages a two-dimensional tiling of expert weights informed by singular subspace clustering, enabling parameter sharing across both input and output dimensions. This 2D-tiling structure combines with a fused, optimized inference algorithm to achieve significant reductions in both extra memory overhead and inference latency, while maintaining or surpassing the quantization quality of previous methods.

Methodology

2D-Tiling Low-Rank Quantization

Unlike per-expert or 1D-shared methods, TileQ arranges the weight matrices of all experts in a two-dimensional grid. The arrangement is obtained by biclustering activation-aware singular-vector embeddings of the expert weights. Experts in the grid share low-rank factors (U for input, V for output), and a greedy assignment strategy places experts with similar singular subspaces in the same tiles while avoiding assignment conflicts.
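One way to picture the sharing is the sketch below. It assumes the experts have already been assigned to grid cells, stacks them into one block matrix, and takes a truncated SVD, so that every expert in a grid row reuses one slice of the left factor and every expert in a grid column reuses one slice of the right factor. This is a reading of the description above, not the paper's implementation; the function names, the dict-based grid, and the use of a plain (not activation-aware) SVD are all assumptions made for illustration.

```python
import numpy as np

def tile_low_rank(experts, grid, rank):
    """Shared low-rank factors for experts laid out on an R x C grid.

    experts: dict mapping (row, col) -> array of shape (d_out, d_in)
    grid:    (R, C); every cell is assumed to hold exactly one expert
    """
    R, C = grid
    d_out, d_in = experts[(0, 0)].shape
    # Stack the experts into one (R*d_out) x (C*d_in) block matrix.
    big = np.block([[experts[(i, j)] for j in range(C)] for i in range(R)])
    # Truncated SVD of the tiled matrix (plain SVD, no activation weighting).
    U, s, Vt = np.linalg.svd(big, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]
    row_factors = U.reshape(R, d_out, rank)     # one slice per grid row
    col_factors = Vt.T.reshape(C, d_in, rank)   # one slice per grid column
    return row_factors, s, col_factors

def tile_term(row_factors, s, col_factors, i, j):
    """Low-rank component of the expert stored at grid cell (i, j)."""
    return (row_factors[i] * s) @ col_factors[j].T
```

With R rows and C columns, this layout stores R + C factor slices instead of two per expert, which is where the saving over per-expert factors would come from under these assumptions.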

The tiling process is fine-tuning-free and covers scaling, low-rank extraction (via SVD or a fast Gaussian sketch approximation), biclustering, tile placement, global low-rank decomposition, and residual quantization with Hessian-aware approaches such as GPTQ or GPTVQ. Shared low-rank factors and mixed-precision storage (fp8 for U/V, fp16 for singular values) sharply reduce storage requirements, and the savings grow in models with a large expert count.
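The "fast Gaussian sketch approximation" is not specified further here. A common technique that fits the description is the randomized range finder: multiply the weight matrix by a random Gaussian matrix, orthonormalize the result, and run a small SVD in the reduced space. The sketch below shows that generic technique; whether TileQ uses exactly this variant is an assumption, and the function name and the `oversample` parameter are illustrative.

```python
import numpy as np

def sketched_low_rank(W, rank, oversample=8, seed=0):
    """Approximate top-`rank` SVD of W via a Gaussian sketch."""
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, min(W.shape))
    # Gaussian test matrix: W @ omega samples the column space of W.
    omega = rng.standard_normal((W.shape[1], k))
    # Orthonormal basis for the sampled column space.
    Q, _ = np.linalg.qr(W @ omega)
    # Project W onto that basis and decompose the small k x d_in matrix.
    B = Q.T @ W
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]
```

The full SVD is replaced by one matrix product, one QR factorization of a tall thin matrix, and an SVD of a k-row matrix, which is cheaper when the target rank is much smaller than the weight dimensions.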

Fused Inference Algorithm

The inference path is optimized to match. Rather than launching many independent small GEMM kernels or handling irregular sparse dispatch, TileQ projects all inputs through the shared tiling factors, so the low-rank computation becomes a few batch-fused, dense GEMMs whose memory and compute patterns suit modern accelerators. This removes the kernel-launch and bandwidth overheads that quantized MoE models typically incur and keeps latency low in both the prefill and decode stages.
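The idea of trading many small products for a few dense ones can be illustrated as follows. The sketch assumes each token is routed to a single expert at grid cell (row, col), and that the low-rank term of that expert is built from a per-row factor, shared singular values, and a per-column factor. The fused version uses two dense matrix multiplications plus gathers and is checked against a per-token loop. It is not the paper's kernel: the names, the one-expert-per-token routing, and the gather-based selection are assumptions.

```python
import numpy as np

def low_rank_loop(X, rows, cols, row_f, s, col_f):
    """Reference: one small product chain per token."""
    out = np.zeros((X.shape[0], row_f.shape[1]))
    for t in range(X.shape[0]):
        i, j = rows[t], cols[t]
        out[t] = (row_f[i] * s) @ (col_f[j].T @ X[t])
    return out

def low_rank_fused(X, rows, cols, row_f, s, col_f):
    """Same result from two dense GEMMs plus gathers."""
    T = X.shape[0]
    C, d_in, r = col_f.shape
    R, d_out, _ = row_f.shape
    # GEMM 1: project every token through all column factors at once.
    V_cat = col_f.transpose(1, 0, 2).reshape(d_in, C * r)
    P = (X @ V_cat).reshape(T, C, r)
    # Keep each token's own column block, then apply the singular values.
    H = P[np.arange(T), cols] * s
    # GEMM 2: expand through all row factors at once.
    U_cat = row_f.reshape(R * d_out, r)
    Y = (H @ U_cat.T).reshape(T, R, d_out)
    # Keep each token's own row block.
    return Y[np.arange(T), rows]
```

This version does redundant arithmetic (every token visits every row and column factor) in exchange for regular, dense operations. A production kernel would likely group tokens to avoid that waste; the sketch only shows that shared factors make a dense formulation possible.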

Theoretical Analysis

The authors provide a rigorous error analysis that bounds the quantization-induced reconstruction error by two terms: the per-expert low-rank approximation error and the additional error from sharing low-rank factors within clusters (the tiling-induced error). Using activation-aware clustering, they show that when experts have coherent singular subspaces (an empirical property often satisfied in routing-based MoE), TileQ achieves nearly the same reconstruction error as fully independent decompositions with far less parameter redundancy.
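The two-term structure of such a bound follows from the triangle inequality. The form below is a generic illustration and not the paper's theorem. All notation is assumed: W_e is the weight of expert e, W_e^{(r)} its best rank-r approximation, and L_e the low-rank term built from the shared factors of the tile that holds e.

```latex
\[
\underbrace{\lVert W_e - L_e \rVert_F}_{\text{error with shared factors}}
\;\le\;
\underbrace{\lVert W_e - W_e^{(r)} \rVert_F}_{\text{per-expert low-rank error}}
\;+\;
\underbrace{\lVert W_e^{(r)} - L_e \rVert_F}_{\text{tiling-induced error}}
\]
```

When the experts in a tile have nearly the same singular subspaces, the second term is small, so sharing costs little beyond the unavoidable rank-r truncation error.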

Compression ratio analysis establishes that the average extra bit-width per parameter in TileQ is reduced by a factor of up to 10× as compared to per-expert low-rank decomposition, and by up to 8× compared to 1D-shared low-rank methods, especially as the number of experts increases. Time complexity analysis confirms that the approach scales optimally with batch size, number of experts, and rank, with practical gains stemming from optimal hardware utilization and the removal of inefficient, memory-bound routines in conventional PTQ for MoE.
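The scaling with expert count can be seen with simple storage arithmetic. The sketch below counts the extra bits per weight parameter under two layouts: independent factors for every expert, and factors shared along the rows and columns of an R x C grid. The formulas, the 8-bit factor and 16-bit singular-value defaults, and the function names are assumptions chosen to match the description in this overview. They are not taken from the paper and do not reproduce its reported figures.

```python
def extra_bits_per_expert(d_out, d_in, rank, factor_bits=8):
    """Extra bits per weight when each expert stores its own factors."""
    factor_bits_total = rank * (d_out + d_in) * factor_bits
    return factor_bits_total / (d_out * d_in)

def extra_bits_tiled(d_out, d_in, rank, grid, factor_bits=8, sv_bits=16):
    """Extra bits per weight when factors are shared across an R x C grid."""
    R, C = grid
    n_experts = R * C
    # R row slices of size d_out x rank and C column slices of size d_in x rank.
    factor_bits_total = rank * (R * d_out + C * d_in) * factor_bits
    # One set of singular values for the whole grid (an assumption).
    sv_bits_total = rank * sv_bits
    return (factor_bits_total + sv_bits_total) / (n_experts * d_out * d_in)

def saving_ratio(d_out, d_in, rank, grid):
    """How many times smaller the tiled overhead is at equal rank."""
    return extra_bits_per_expert(d_out, d_in, rank) / extra_bits_tiled(
        d_out, d_in, rank, grid
    )
```

At equal rank, a square grid of E experts reduces the overhead by roughly the square root of E, so the gain grows with expert count. A shared decomposition may need a higher rank than a per-expert one to reach the same accuracy, which this arithmetic does not model.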

Experimental Results

Evaluations are conducted on multiple representative open-source MoE models (including Qwen1.5-MoE-A2.7B, Qwen3-30B-A3B, Mixtral-8x7B, and DeepSeek-MoE-16B) and benchmarked against both general and MoE-specific PTQ baselines: GPTQ, GPTVQ, MOEQUANT, LoPRo, MILO, and MXMOE. Main findings include:

Robustness under Extreme Quantization: At 2-bit precision, where baselines (e.g., GPTQ) exhibit sharp performance drops (perplexity increases from 3.87 to 15.3 on Mixtral-8x7B), TileQ retains high fidelity (PPL 4.78—close to FP16).
Superiority in Memory Efficiency: The average extra bit-width allocated to low-rank components in TileQ is 0.04, yielding up to 10× reduction in additional memory over per-expert low-rank methods.
Inference Latency: MoE MLP inference latency with TileQ is less than 5% of baseline in both prefill and decode, compared to over 50% for prior shared low-rank approaches (MILO) in sparse models.
Scalability and Generalization: Across models with different numbers of experts and varying sparsity, TileQ's accuracy and efficiency gains persist, establishing its generality as a PTQ framework for both dense and sparse MoE settings.
Ablation Studies: The combination of 2D tiling, vector or scalar quantization, and rotation produces synergistic improvement. Removing the low-rank mechanism or tiling yields severe degradation or significantly higher memory usage.
Comparison to Concurrent Methods: TileQ consistently outperforms recent MoE quantization schemes such as MILO and MXMOE, both in downstream accuracy and efficiency metrics, even under matched evaluation protocols.

Practical and Theoretical Implications

The practical implications of TileQ are pronounced for edge deployment, cloud serving, and efficient experimentation with large-scale MoE and sparse LLM architectures. The reduction in memory footprint enables more experts per device, improved throughput, and lowered inference costs. Its hardware-friendly inference mechanism bypasses long-standing bottlenecks in MoE quantization, providing a direct path to production-level deployment without model retraining or architecture-specific fine-tuning.

From a theoretical perspective, 2D biclustering and activation-aware singular subspace sharing add to the set of scalable low-rank quantization strategies available for PTQ. They motivate further research on structured parameter sharing, data-dependent factorization, and joint optimization of quantization and expert routing. The connections drawn with activation-induced subspace clustering point to possible intersections with automated model compression, pruning, and knowledge distillation.

Future Directions

The main bottleneck in TileQ now shifts to the quantization step for low-rank structured weights (especially with large numbers of small experts). Research directions include custom quantization algorithms optimized for low-rank, tiled structures and the joint application of other compression techniques like structured expert pruning or distillation. Exploration of tile assignment strategies optimized for knowledge transfer or robustness (instead of pure subspace similarity) may yield further improvements.

Conclusion

TileQ establishes an efficient, scalable, and highly accurate post-training quantization pipeline for Mixture-of-Experts models via 2D-tiling of low-rank expert weights (2605.09281). The method exhibits strong performance in both compression ratio and inference speed across diverse MoE models, with theoretical guarantees grounded in spectral clustering and low-rank approximation theory. TileQ offers a viable solution for deployment of next-generation sparse architectures under memory and latency constraints, and serves as a foundation for future work at the intersection of quantization, compression, and expert modeling.
