ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

Published 13 Apr 2026 in cs.CV and cs.AI | (2604.11080v1)

Abstract: Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of LLMs. Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.

Summary

  • The paper introduces ReSpinQuant, which efficiently merges dense layer-wise rotations into LLM weights for effective low-bit quantization.
  • It employs learnable orthogonal rotations and a novel subspace residual approximation to reduce computational complexity from quadratic to linear.
  • Empirical results on LLaMA models show lower perplexity and minimal latency overhead, highlighting its practical deployment benefits.

Motivation and Context

Quantization is essential for deploying LLMs on resource-constrained hardware. However, activation outliers in Transformer layers cause large quantization errors, particularly under aggressive low-bit settings (W4A4, W3A3). Prior work addressed this with rotation-based post-training quantization (PTQ), using either a single global rotation (which can be fused offline but has limited expressivity) or full layer-wise rotations (which are more accurate but incur substantial online computation). This paper proposes ReSpinQuant, which reconciles this expressivity-efficiency trade-off in two steps: it fuses full layer-wise rotations into the weights so that inference overhead stays negligible, and it resolves the resulting basis mismatch in residual connections with a subspace residual rotation approximation.

Figure 1: Comparison of rotation paradigms: global rotation (efficiency, limited expressivity), layer-wise rotation (expressivity, high overhead), and ReSpinQuant's solution achieving both.

Methodology

Layer-Wise Rotation with Offline Fusion

ReSpinQuant assigns unique learnable orthogonal rotation matrices to each quantization-sensitive sub-block within each layer (e.g., MHSA, FFN). These rotations are initialized using Hadamard matrices and optimized on the orthogonal manifold via the Cayley optimizer. To maintain inference efficiency, all rotation transformations applicable to linear blocks are fully merged into their corresponding weight matrices before deployment. This strategy retains the parameterization and flexibility of unconstrained, dense layer-wise transformations during the PTQ phase, while reverting the runtime cost to nearly that of global-rotation systems.

Figure 2: ReSpinQuant architecture, featuring per-layer rotation assignment, offline merging into weights, and subspace approximate rotation for residuals.
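
The offline fusion described above rests on the identity $(W R^\top)(R x) = W x$ for any orthogonal $R$. The following is a minimal NumPy sketch of that identity, not the paper's implementation: the helper names `hadamard` and `fuse_rotation`, the toy sizes, and the synthetic outlier are all assumptions made for illustration, and the rotation is left at its Hadamard initialization rather than being learned.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of a normalized n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # scaling makes H orthogonal: H @ H.T == I

def fuse_rotation(W, R):
    """Fold an orthogonal rotation R into W so the layer consumes rotated inputs R @ x."""
    return W @ R.T

rng = np.random.default_rng(0)
D, D_out = 8, 4                      # toy sizes (assumed)
W = rng.standard_normal((D_out, D))  # stand-in for a linear layer's weight
x = rng.standard_normal(D)
x[3] = 50.0                          # synthetic activation outlier (assumed)

R = hadamard(D)                      # rotation at its Hadamard initialization
W_fused = fuse_rotation(W, R)        # computed once, offline
x_rot = R @ x                        # rotated activation seen by the fused layer

y_ref = W @ x
y_fused = W_fused @ x_rot            # same output, but x_rot has a flatter range
```

In this toy case the rotated activation has a much smaller peak magnitude than the original, which is the property that makes it easier to quantize, while the layer output is unchanged up to floating-point error.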

Subspace Residual Rotation Approximation

The key technical barrier is the basis mismatch in residual connections when distinct rotations are assigned to different blocks. Direct alignment would incur $\mathcal{O}(D^2)$ complexity, prohibitive for large $D$. Empirical evidence shows that post-optimization, the layer-wise rotations deviate minimally from initialization, indicating that the residual misalignment matrix is numerically close to identity and dominated by low-rank structure.

ReSpinQuant exploits this observation by projecting the residual difference into a principal subspace (identified via SVD), performing the rotation only within this subspace, and re-projecting back. The resulting operation is mathematically a low-rank correction, reducing computational complexity from quadratic to linear in $D$.

Figure 3: Visualization of learned rotation matrix blocks and their relative transformation, displaying strong diagonal structure and low-rank deviation.

Figure 4: Training dynamics of rotation matrices: Frobenius deviation and cosine similarity confirm small, structured departures from initialization.

Figure 5: Subspace residual rotation approximation pipeline: project, rotate in subspace, and reproject for basis alignment with $\mathcal{O}(D)$ cost.
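
The project, rotate, and reproject pipeline can be sketched as a low-rank correction to the identity. The NumPy example below is an illustration under stated assumptions rather than the authors' code: it takes a misalignment matrix `M` (standing in for the relative rotation between two blocks), truncates the SVD of `M - I` to rank `r`, and applies `x + A @ (B @ x)` in place of the dense product `M @ x`. The helper names and the synthetic construction of `M` (a rotation acting only inside an `r`-dimensional subspace, so the deviation has rank exactly `r`) are hypothetical.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def low_rank_correction(M, r):
    """Rank-r factors (A, B) such that M is approximately I + A @ B."""
    U, S, Vt = np.linalg.svd(M - np.eye(M.shape[0]))
    A = U[:, :r] * S[:r]   # D x r: reproject from the subspace, scaled by singular values
    B = Vt[:r, :]          # r x D: project into the principal subspace
    return A, B

def apply_alignment(x, A, B):
    """Approximate M @ x with two thin matmuls, O(D * r) per token instead of O(D^2)."""
    return x + A @ (B @ x)

rng = np.random.default_rng(1)
D, r = 64, 8               # toy sizes (assumed)

# Synthetic misalignment: identity everywhere except a rotation G inside span(Q).
Q, _ = np.linalg.qr(rng.standard_normal((D, r)))   # D x r orthonormal basis
G = random_orthogonal(r, rng)
M = np.eye(D) + Q @ (G - np.eye(r)) @ Q.T          # orthogonal, M - I has rank <= r

A, B = low_rank_correction(M, r)
x = rng.standard_normal(D)
x_dense = M @ x
x_lowrank = apply_alignment(x, A, B)
```

Because the deviation here is exactly rank `r`, the low-rank path reproduces the dense result up to floating-point error. For learned rotations the deviation would only be approximately low-rank, so some approximation error would remain and would depend on the chosen rank.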

Empirical Results

Accuracy and Efficiency

ReSpinQuant is evaluated against RTN, GPTQ, QuaRot, SpinQuant, OSTQuant, and FlatQuant on the LLaMA2/3 series. It consistently outperforms the strongest global-rotation baseline (SpinQuant) and matches or exceeds the best layer-wise alternatives in both WikiText-2 perplexity and a nine-task zero-shot accuracy mean.

  • On Llama-3 8B (W4A4): SpinQuant PPL is 7.50, while ReSpinQuant achieves 7.24.
  • On Llama-3.2 1B (W3A3): SpinQuant PPL is 69.70, ReSpinQuant reduces this to 49.90, demonstrating much-improved robustness under extreme quantization.

The method maximizes train-time parameterization (e.g., 1091M learnable parameters on Llama-3 8B), with <0.1% of these active during inference; online overhead is reduced to ~0.2% of total MACs, comparable to global rotation methods. The reported latency increase versus global rotation is marginal: 1.7% TTIT latency overhead at batch size 16. Calibration completes within an hour even for 8B-scale models.

Ablative and Analytical Findings

An ablation on the subspace approximation rank $r$ reveals that even very low ranks (e.g., $r=32 \ll D$) recover nearly full performance, supporting the low-rank nature of the basis mismatch. Increasing $r$ yields diminishing returns and minimal accuracy differences (<0.3 PPL) past $r=32$.
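
To give a sense of scale for $r=32$, the following back-of-the-envelope count compares one dense alignment with one rank-$r$ correction per token. It assumes $D=4096$ (the hidden size of Llama-3 8B) and that the correction is applied as two thin matrix products; these counts are an illustration only, not figures reported in the paper.

```latex
\begin{align*}
\text{dense alignment:}\quad & D^2 = 4096^2 = 16{,}777{,}216 \ \text{MACs per token} \\
\text{rank-}r\text{ correction:}\quad & 2Dr = 2 \cdot 4096 \cdot 32 = 262{,}144 \ \text{MACs per token} \\
\text{ratio:}\quad & \frac{D^2}{2Dr} = \frac{D}{2r} = 64
\end{align*}
```

Under these assumptions, each alignment is 64 times cheaper than the dense product, and the saving grows linearly with $D$ for a fixed $r$.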

Theoretical and Practical Implications

ReSpinQuant’s methodology demonstrates that fully expressive layer-wise rotation can be merged into static model weights provided that structured, numerically small mismatches are corrected in a principal subspace. Theoretically, this implies that the dominant error structure in rotation-aligned LLM activations is amenable to very efficient approximation. Practically, this design offers a scalable route to high-accuracy, low-bit quantization suitable for edge or low-resource deployment without the severe throughput penalty seen in prior layer-wise quantization schemes.

ReSpinQuant also shifts the Pareto frontier: quantized larger models can be more accurate than full-precision smaller ones while also using less memory. This will likely enable more widespread adoption of quantized LLMs in bandwidth- and energy-limited applications.

Future Directions

The present work is focused on standard cross-entropy calibration for LLMs up to 13B scale. Integration with advanced, possibly layer-wise, calibration objectives may yield further accuracy improvements. Moreover, the development of dedicated hardware kernels for low-bit, low-rank subspace corrections would realize even greater deployment efficiency. Finally, extension to even larger-scale LLMs and more diverse architectures remains an open avenue.

Conclusion

ReSpinQuant resolves the longstanding expressivity-efficiency trade-off in post-training quantization by merging dense, layer-wise rotations into LLM weights and addressing residual basis mismatch via a subspace rotation approximation. It achieves state-of-the-art quantization performance with negligible inference cost, as experimentally validated across multiple LLM families and challenging quantization settings. This framework sets a new standard for efficient, accurate deployment of LLMs under aggressive resource constraints, with clear avenues for future methodological and practical enhancements.


Reference:

"ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation" (2604.11080)
