
Cayley SGD: Manifold Optimization for PTQ in LLMs

Updated 4 March 2026
  • Cayley SGD is a manifold optimization technique that uses the Cayley transform to maintain strict orthogonality of rotation matrices, essential for post-training quantization in LLMs.
  • It updates matrices on the Stiefel manifold without extra re-orthogonalization, ensuring efficient and stable learning of rotations.
  • Empirical studies show that Cayley SGD reduces quantization errors and narrows performance gaps, enhancing the accuracy of large-scale deep learning models.

Cayley Stochastic Gradient Descent (Cayley SGD) is a manifold optimization technique for efficiently learning orthogonal or unitary matrices, that is, points on the Stiefel manifold. Because each update is built from the Cayley transform, the iterates remain on the manifold throughout the optimization process. The method has been adopted as a key subroutine for learning rotation matrices in large-scale deep learning applications, particularly in post-training quantization (PTQ) for LLMs (Liu et al., 2024).

1. Mathematical Foundation and Manifold Constraints

The Stiefel manifold $\mathrm{St}(n,p)$ is the set of all $n \times p$ matrices with orthonormal columns, i.e., $R^\top R = I_p$. When rotations (orthogonal transformations) need to be learned, such as for weight or activation matrix reparameterizations, a naive application of standard SGD fails to preserve orthogonality, leading to drift off the manifold and violating crucial invariances.
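The drift described above is easy to observe numerically. The following is a minimal sketch, assuming NumPy and a random matrix as a stand-in for a real gradient; the size `n`, the step size `eta`, and the helper `orthogonality_error` are illustrative choices, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # hypothetical dimension

# Random orthogonal starting point: the Q factor of a Gaussian matrix.
R, _ = np.linalg.qr(rng.standard_normal((n, n)))

def orthogonality_error(M):
    """Frobenius norm of M^T M - I; zero exactly on the manifold."""
    return np.linalg.norm(M.T @ M - np.eye(M.shape[1]))

err_before = orthogonality_error(R)

# One unconstrained SGD step with a stand-in (random) gradient.
G = rng.standard_normal((n, n))
eta = 0.1
R_sgd = R - eta * G

err_after = orthogonality_error(R_sgd)
print(f"before: {err_before:.2e}, after one plain SGD step: {err_after:.2e}")
```

A single unconstrained step already leaves the manifold, which is why a constrained update or a re-orthogonalization step is needed.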

Cayley SGD circumvents this by updating a current orthogonal matrix $R$ using the Cayley transform. Given a skew-symmetric matrix $Y$, the update is

$$R_{\text{new}} = \left(I - \frac{\eta}{2}Y\right)^{-1} \left(I + \frac{\eta}{2}Y\right) R,$$

where $\eta$ is the learning rate. The skew-symmetric $Y$ is constructed from the Riemannian gradient of the loss $\mathcal{L}_Q$ with respect to $R$:

$$G = \nabla_R \mathcal{L}_Q, \quad \widehat{G} = G R^\top - \frac{1}{2} R R^\top G R^\top, \quad Y = \widehat{G} - \widehat{G}^\top.$$

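This update guarantees that $R_{\text{new}}^\top R_{\text{new}} = I$ for any learning rate $\eta$: because $Y$ is skew-symmetric, $I - \frac{\eta}{2}Y$ is always invertible and the Cayley factor multiplying $R$ is itself orthogonal. The iterate therefore never leaves the manifold, which avoids the extra re-orthogonalization or projection steps required by alternative schemes.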
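As a numerical check of this guarantee, here is a minimal NumPy sketch of the update for a square orthogonal $R$. The function names, the 6×6 size, and the random stand-in gradient are assumptions for illustration; the sketch solves a linear system rather than forming the inverse, which is a common numerical choice rather than something the source prescribes.

```python
import numpy as np

def cayley_step(R, G, eta):
    """One Cayley update of a square orthogonal R, following the formulas above."""
    I = np.eye(R.shape[0])
    G_hat = G @ R.T - 0.5 * R @ R.T @ G @ R.T
    Y = G_hat - G_hat.T                  # skew-symmetric by construction
    A = 0.5 * eta * Y
    # Solve (I - A) R_new = (I + A) R instead of forming the inverse explicitly.
    return np.linalg.solve(I - A, (I + A) @ R)

def orthogonality_error(M):
    """Frobenius norm of M^T M - I."""
    return np.linalg.norm(M.T @ M - np.eye(M.shape[1]))

rng = np.random.default_rng(0)
R0, _ = np.linalg.qr(rng.standard_normal((6, 6)))
G = rng.standard_normal((6, 6))          # stand-in gradient, not from a real model

errors = {eta: orthogonality_error(cayley_step(R0, G, eta))
          for eta in (0.01, 1.5, 100.0)}
for eta, err in errors.items():
    print(f"eta={eta:>6}: ||R^T R - I|| = {err:.2e}")
```

The orthogonality error stays at the level of floating-point round-off even for very large step sizes.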

2. Applications to Quantization in Deep Learning

The primary application of Cayley SGD to date is in the context of PTQ for LLMs, with the introduction of SpinQuant (Liu et al., 2024). In this paradigm, model weights, activations, and Key-Value (KV) caches are quantized to low precision (e.g., 4 bits), but outliers in their distributions severely degrade quantization fidelity. Learned rotations redistribute (“mix”) outlier components across channels, significantly reducing kurtosis and the resulting quantization error.
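The effect of mixing outliers can be illustrated on synthetic data. The sketch below assumes NumPy, a toy activation matrix with two artificially inflated channels, a simple per-row symmetric 4-bit quantizer, and a random orthogonal matrix standing in for a learned rotation. None of these choices come from SpinQuant itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 512, 64                     # hypothetical sizes

# Synthetic activations: a Gaussian bulk plus two large-magnitude outlier channels.
X = rng.standard_normal((n_tokens, d))
X[:, :2] *= 20.0

def kurtosis(M):
    """Pearson kurtosis of all entries (3 for a Gaussian)."""
    v = M.ravel() - M.mean()
    return np.mean(v**4) / np.mean(v**2) ** 2

def quantize_sym(M, bits=4):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(M).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(M / scale), -qmax - 1, qmax) * scale

def relative_error(M):
    """Relative Frobenius error introduced by quantizing M."""
    return np.linalg.norm(quantize_sym(M) - M) / np.linalg.norm(M)

# A random orthogonal matrix stands in for a learned rotation.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
X_rot = X @ R

kurt_before, kurt_after = kurtosis(X), kurtosis(X_rot)
err_before, err_after = relative_error(X), relative_error(X_rot)
print(f"kurtosis:    {kurt_before:.1f} -> {kurt_after:.1f}")
print(f"4-bit error: {err_before:.3f} -> {err_after:.3f}")
```

Even an unlearned rotation spreads the outlier energy over all channels in this toy setting; the point of learning the rotation is to improve on such a baseline for the actual network.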

Orthogonal matrices $R$ are introduced at positions in the network where, by the invariance of linear maps under an orthogonal change of basis, folding $R$ and $R^\top$ into adjacent layers leaves the full-precision function unchanged but alters the quantization distortion landscape. Cayley SGD is employed to optimize these rotations based on a task-centric loss (e.g., cross-entropy over a calibration set) and quantization error metrics, with every iterate remaining orthogonal.
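A two-layer linear toy model makes the invariance concrete. This is a sketch under stated assumptions: the layer sizes, the weights `W1` and `W2`, and the quantizer are hypothetical, and no non-linearity sits between the two layers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 16, 16, 8         # hypothetical sizes

W1 = rng.standard_normal((d_hidden, d_in))
W2 = rng.standard_normal((d_out, d_hidden))
x = rng.standard_normal(d_in)

R, _ = np.linalg.qr(rng.standard_normal((d_hidden, d_hidden)))

# Fold R^T into the producer and R into the consumer of the hidden state.
W1_rot = R.T @ W1                         # hidden state becomes R^T h
W2_rot = W2 @ R                           # (W2 R)(R^T h) = W2 h

y_ref = W2 @ (W1 @ x)
y_rot = W2_rot @ (W1_rot @ x)
print("full precision unchanged:", np.allclose(y_ref, y_rot))

def quantize_sym(W, bits=4):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

# After quantizing the second layer, the two parameterizations no longer agree.
err_plain = np.linalg.norm(quantize_sym(W2) @ (W1 @ x) - y_ref)
err_rot = np.linalg.norm(quantize_sym(W2_rot) @ (W1_rot @ x) - y_ref)
print(f"quantized output error: plain={err_plain:.4f}, rotated={err_rot:.4f}")
```

The full-precision outputs match, while the quantized outputs differ between the two parameterizations. That difference is the degree of freedom the rotation optimization exploits.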

3. Algorithmic Implementation

The Cayley SGD update is structured as follows for a given batch:

  1. Forward pass: Apply current rotations to the network and quantize weights/activations as per the targeted PTQ scheme.
  2. Loss computation: Compute task loss $\mathcal{L}_Q$ on quantized outputs.
  3. Gradient computation: Backpropagate to obtain $\nabla_R \mathcal{L}_Q$.
  4. Cayley update: Perform the manifold-preserving update described above.
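The four steps above can be sketched as a small optimization loop. Since a quantized network and backpropagation are out of scope here, the sketch replaces steps 1 to 3 with a toy differentiable loss, the squared distance to a hypothetical target rotation `R_target`. It also feeds the negative gradient into the construction of $Y$ so that each step decreases the loss; that sign convention is an assumption of this sketch. The schedule (100 iterations, step size decaying linearly from 1.5) mirrors the SpinQuant settings quoted in this section.

```python
import numpy as np

def cayley_update(R, D, eta):
    """Cayley step for a square orthogonal R along direction D (here D = -gradient)."""
    I = np.eye(R.shape[0])
    D_hat = D @ R.T - 0.5 * R @ R.T @ D @ R.T
    Y = D_hat - D_hat.T
    return np.linalg.solve(I - 0.5 * eta * Y, (I + 0.5 * eta * Y) @ R)

rng = np.random.default_rng(0)
n, T, eta0 = 8, 100, 1.5

R, _ = np.linalg.qr(rng.standard_normal((n, n)))
R_target, _ = np.linalg.qr(rng.standard_normal((n, n)))
# Cayley steps preserve det(R), so put the target in the same component of O(n).
if np.linalg.det(R) * np.linalg.det(R_target) < 0:
    R_target[:, 0] *= -1.0

def loss_and_grad(R):
    """Toy stand-in for the quantized task loss: 0.5 * ||R - R_target||_F^2."""
    diff = R - R_target
    return 0.5 * np.sum(diff**2), diff

losses = []
for t in range(T):
    eta = eta0 * (1.0 - t / T)            # linearly decayed step size
    loss, G = loss_and_grad(R)            # steps 1-3 (toy stand-in)
    losses.append(loss)
    R = cayley_update(R, -G, eta)         # step 4: manifold-preserving update

final_loss, _ = loss_and_grad(R)
orth_err = np.linalg.norm(R.T @ R - np.eye(n))
print(f"loss: {losses[0]:.3f} -> {final_loss:.2e}, ||R^T R - I|| = {orth_err:.2e}")
```

In a real PTQ setting, `loss_and_grad` would run the quantized model on a calibration batch and return the gradient with respect to the rotation.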

This approach incurs $\mathcal{O}(D^3)$ cost per update for $D \times D$ rotations, but the matrix inverse can be approximated with a few fixed-point iterations if necessary. In SpinQuant, Cayley SGD is typically run for $T=100$ iterations with a linearly decayed step size ($\eta$ starting at 1.5) and random Hadamard initialization for $R$ (Liu et al., 2024).
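One way to read the fixed-point remark is as an iteration on the defining equation of the Cayley step. Writing $A = \frac{\eta}{2}Y$, the exact update satisfies $R_{\text{new}} = R + A(R + R_{\text{new}})$, which can be iterated when the spectral norm of $A$ is below 1. The sketch below follows that interpretation; it is an assumption about what the approximation looks like, not a description of SpinQuant's code, and the iteration counts and sizes are illustrative.

```python
import numpy as np

def cayley_exact(R, Y, eta):
    """Exact Cayley step via a linear solve."""
    I = np.eye(R.shape[0])
    A = 0.5 * eta * Y
    return np.linalg.solve(I - A, (I + A) @ R)

def cayley_fixed_point(R, Y, eta, num_iters=2):
    """Approximate Cayley step: iterate Z <- R + A (R + Z) with A = eta/2 * Y."""
    A = 0.5 * eta * Y
    Z = R + 2.0 * A @ R                   # first-order initial guess R + eta * Y R
    for _ in range(num_iters):
        Z = R + A @ (R + Z)
    return Z

rng = np.random.default_rng(0)
n = 6
R, _ = np.linalg.qr(rng.standard_normal((n, n)))
M = rng.standard_normal((n, n))
Y = M - M.T                               # random skew-symmetric matrix
Y /= np.linalg.norm(Y, 2)                 # spectral norm 1
eta = 0.2                                 # so that ||eta/2 * Y||_2 = 0.1 < 1

exact = cayley_exact(R, Y, eta)
approx_errors = {k: np.linalg.norm(cayley_fixed_point(R, Y, eta, k) - exact)
                 for k in (0, 2, 5)}
for k, err in approx_errors.items():
    print(f"{k} iterations: error = {err:.2e}")
```

Each iteration costs one matrix product and contracts the error by roughly the spectral norm of $A$, so the approximation is only attractive for small effective step sizes. Unlike the exact step, the approximate iterate is orthogonal only up to the remaining approximation error.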

4. Empirical Benefits and Ablative Analysis

SpinQuant demonstrates that learned rotations, optimized via Cayley SGD, substantially close the performance gap between quantized and full-precision LLMs. On tasks including BoolQ, PIQA, HellaSwag, and WinoGrande, SpinQuant_had (combining learned rotations with online Hadamard transforms) stays within 3 percentage points of full-precision accuracy in most settings. For example, the gap is 2.9 points for LLaMA-2 7B under W4A4K4 quantization, a reduction of over 40% compared to previous methods relying on random or analytic rotations (Liu et al., 2024).

Layerwise SNR analysis shows that Cayley SGD-learned rotations disproportionately improve bottleneck layers where quantization noise is concentrated. Furthermore, across multiple initialization strategies, post-optimization accuracies are robust, and performance is stable with respect to calibration data volume and domain.

5. Theoretical Properties and Rotation Invariance

Cayley SGD leverages the property that, in full-precision transformer blocks, any orthogonal rotation $R$ inserted into the residual or MLP/attention pathways, and undone by $R^\top$ before the next non-linearity, leaves the network’s output invariant. This architectural invariance enables the learning of rotations that specifically target the reduction of quantization-induced loss, in contrast to random or analytic choices (e.g., the Hadamard rotations used in QuaRot).

Constraining the optimization to the Stiefel manifold, as Cayley SGD does, keeps the rotation matrices orthogonal at every step. This matters because the invariance above holds only for orthogonal $R$: drift accumulated over many iterations would change the full-precision function and introduce numerical instabilities that degrade downstream accuracy.
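For completeness, the orthogonality of the Cayley update can be verified in a few lines. The shorthand $A$ and $Q$ below is introduced only for this derivation. The inverses exist because a real skew-symmetric matrix has purely imaginary eigenvalues, so $I \pm A$ has no zero eigenvalue, and the middle step uses the fact that $I + A$ and $I - A$ commute, hence so do their inverses.

```latex
\begin{align*}
A &= \tfrac{\eta}{2} Y, \qquad A^\top = -A, \qquad Q = (I - A)^{-1}(I + A), \\
Q^\top &= (I + A)^\top \left((I - A)^{-1}\right)^\top = (I - A)(I + A)^{-1}, \\
Q^\top Q &= (I - A)(I + A)^{-1}(I - A)^{-1}(I + A) \\
         &= (I - A)(I - A)^{-1}(I + A)^{-1}(I + A) = I, \\
R_{\text{new}}^\top R_{\text{new}} &= R^\top Q^\top Q R = R^\top R = I.
\end{align*}
```

The argument does not depend on the value of $\eta$, which is why orthogonality holds for any learning rate.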

6. Variants and Runtime Considerations

SpinQuant offers two main deployment variants:

  • SpinQuant_no: Applies learned Cayley rotations offline, merging them into network weights (no runtime rotation overhead).
  • SpinQuant_had: Incorporates additional online Hadamard blockwise rotations (R3, R4), with minimal inference time cost (e.g., 8% overhead on MacBook M1 Pro, 3.6% on 8×H100), ensuring robustness for extreme quantization scenarios such as W4A4K4 (Liu et al., 2024).
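An online Hadamard rotation is cheap because it can be applied with a fast Walsh-Hadamard transform in $O(n \log n)$ operations instead of a dense $O(n^2)$ matrix product. The sketch below is a generic NumPy implementation for power-of-two sizes, checked against a dense Sylvester-type Hadamard matrix; it is not SpinQuant's kernel, and the function names are hypothetical.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis."""
    x = np.array(x, dtype=float)
    n = x.shape[-1]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("last dimension must be a power of two")
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # View the last axis as blocks of size 2h and apply one butterfly stage.
        y = x.reshape(*lead, n // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]
        b = y[..., 0, :] - y[..., 1, :]
        x = np.stack([a, b], axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)

def hadamard_dense(n):
    """Dense orthonormal Sylvester-type Hadamard matrix, for reference only."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16))          # 4 hypothetical activation rows
X_fast = fwht(X)
X_dense = X @ hadamard_dense(16)

print("matches dense product:", np.allclose(X_fast, X_dense))
print("norm preserved:", np.allclose(np.linalg.norm(X_fast, axis=1),
                                      np.linalg.norm(X, axis=1)))
```

Because the normalized Hadamard matrix is symmetric and orthogonal, applying `fwht` twice returns the original input.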

7. Impact and Future Directions

The introduction of Cayley SGD for optimizing network-invariant rotations has established a new state of the art for PTQ in open-source LLMs. It has demonstrated robustness to calibration domain, scaling, and quantization schemes, outperforming all previously proposed random or hand-crafted rotation baselines. The underlying principle—preserving orthogonality while optimizing task/quantization-specific objectives—may catalyze further development in manifold optimization for model compression, efficient inference, and neural architecture search (Liu et al., 2024).
