Cayley SGD: Manifold Optimization for PTQ in LLMs
- Cayley SGD is a manifold optimization technique that uses the Cayley transform to maintain strict orthogonality of rotation matrices, essential for post-training quantization in LLMs.
- It updates matrices on the Stiefel manifold without extra re-orthogonalization, ensuring efficient and stable learning of rotations.
- Empirical studies show that Cayley SGD reduces quantization errors and narrows performance gaps, enhancing the accuracy of large-scale deep learning models.
Cayley Stochastic Gradient Descent (Cayley SGD) is a manifold optimization technique for efficiently learning orthogonal or unitary matrices, i.e., points on the Stiefel manifold, while respecting the orthogonality constraint. By using the Cayley transform, it keeps every iterate strictly on the manifold throughout optimization. It has been adopted as a key subroutine for learning rotation matrices in large-scale deep learning applications, particularly in post-training quantization (PTQ) for LLMs (Liu et al., 2024).
1. Mathematical Foundation and Manifold Constraints
The Stiefel manifold is the set of all matrices with orthonormal columns, i.e., $\mathrm{St}(n, p) = \{R \in \mathbb{R}^{n \times p} : R^\top R = I_p\}$; for square rotations, $p = n$. When rotations (orthogonal transformations) need to be learned, for example for weight or activation matrix reparameterizations, a naive application of standard SGD fails to preserve orthogonality: iterates drift off the manifold and violate the invariances the reparameterization relies on.
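To make the drift concrete, the Gram matrix after one plain gradient step can be expanded directly. The symbols $\alpha$ (learning rate) and $G$ (Euclidean gradient of the loss with respect to $R$) are introduced here for illustration, and $R^\top R = I$ is assumed before the step:

```latex
(R - \alpha G)^\top (R - \alpha G)
  = R^\top R - \alpha \left( R^\top G + G^\top R \right) + \alpha^2 \, G^\top G
  = I - \alpha \left( R^\top G + G^\top R \right) + \alpha^2 \, G^\top G .
```

The first-order term vanishes only when $R^\top G$ is skew-symmetric, which a generic gradient does not satisfy, so orthogonality is already lost at order $\alpha$.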
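Cayley SGD circumvents this by updating the current orthogonal matrix $R$ using the Cayley transform. Given a skew-symmetric matrix $Y$ (i.e., $Y^\top = -Y$), the update is
$$R' = \left(I + \tfrac{\alpha}{2} Y\right)^{-1} \left(I - \tfrac{\alpha}{2} Y\right) R,$$
where $\alpha$ is the learning rate. The skew-symmetric $Y$ is constructed from the Riemannian gradient of the loss $\mathcal{L}$ with respect to $R$. Writing $G = \nabla_R \mathcal{L}$ for the Euclidean gradient,
$$Y = \hat{G} - \hat{G}^\top, \qquad \hat{G} = G R^\top - \tfrac{1}{2} R R^\top G R^\top.$$
With this sign convention the step is a descent direction; some references absorb the minus sign into $Y$ and write $\left(I - \tfrac{\alpha}{2} Y\right)^{-1}\left(I + \tfrac{\alpha}{2} Y\right)$ instead.
This update guarantees that $R'^\top R' = I$ for any learning rate $\alpha$, so iterates cannot leave the manifold and the extra re-orthogonalization or projection steps required by alternative schemes are avoided.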
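A minimal NumPy sketch of this update, compared against a plain gradient step, may help. It assumes the descent-form Cayley update $R' = (I + \tfrac{\alpha}{2} Y)^{-1}(I - \tfrac{\alpha}{2} Y) R$ with $Y = \hat{G} - \hat{G}^\top$ and $\hat{G} = G R^\top - \tfrac{1}{2} R R^\top G R^\top$. The function names and the random stand-in gradient are hypothetical and are not taken from any SpinQuant code.

```python
import numpy as np

def cayley_step(R, G, lr):
    """One Cayley update of R (orthonormal columns) given the Euclidean gradient G."""
    n = R.shape[0]
    G_hat = G @ R.T - 0.5 * R @ (R.T @ G @ R.T)
    Y = G_hat - G_hat.T                      # skew-symmetric by construction
    I = np.eye(n)
    # Solve (I + lr/2 Y) R_new = (I - lr/2 Y) R rather than forming an explicit inverse.
    return np.linalg.solve(I + 0.5 * lr * Y, (I - 0.5 * lr * Y) @ R)

def orth_error(M):
    """Frobenius distance of M^T M from the identity."""
    return np.linalg.norm(M.T @ M - np.eye(M.shape[1]))

rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((6, 6)))   # random orthogonal starting point
G = rng.standard_normal((6, 6))                     # stand-in gradient (hypothetical)

for lr in (0.01, 1.5, 100.0):
    plain = R - lr * G                              # ordinary SGD step
    cayley = cayley_step(R, G, lr)
    print(f"lr={lr}: plain error={orth_error(plain):.2e}, "
          f"Cayley error={orth_error(cayley):.2e}")
```

The plain step leaves the manifold at every learning rate, while the Cayley step stays orthogonal up to floating-point round-off even for very large steps.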
2. Applications to Quantization in Deep Learning
The primary application of Cayley SGD to date is in the context of PTQ for LLMs, with the introduction of SpinQuant (Liu et al., 2024). In this paradigm, model weights, activations, and Key-Value (KV) caches are quantized to low precision (e.g., 4 bits), but outliers in their distributions severely degrade quantization fidelity. Learned rotations allow for the redistribution (“mixing”) of outlier components, significantly reducing kurtosis and resulting quantization error.
Orthogonal matrices $R$ are introduced at positions in the network where, by the invariance of linear maps under orthogonal change of basis, recombining $R$ and $R^\top$ across adjacent layers leaves the full-precision function unchanged but alters the quantization distortion landscape. Cayley SGD is employed to optimize these rotations based on task-centric loss (e.g., cross-entropy over a calibration set) and quantization error metrics, with every iterate remaining orthogonal.
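The following toy example illustrates both halves of this argument under simplifying assumptions. It uses a fixed 4×4 Hadamard matrix as the rotation (not a learned one), a hypothetical weight matrix `W`, and a simple symmetric per-tensor 4-bit quantizer `fake_quant` written for this sketch.

```python
import numpy as np

def fake_quant(x, bits=4):
    """Symmetric per-tensor round-to-nearest quantization, returned as floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

# Normalized 4x4 Hadamard matrix: orthogonal, so R @ R.T is the identity.
R = 0.5 * np.array([[1, 1, 1, 1],
                    [1, -1, 1, -1],
                    [1, 1, -1, -1],
                    [1, -1, -1, 1]], dtype=float)

x = np.array([10.0, 1.0, -1.0, 1.0])      # activation with one outlier channel
W = np.array([[0.5, -1.0, 2.0, 0.25],
              [1.5, 0.5, -0.5, 1.0]])     # hypothetical weight of the next layer

# Full precision: (W R^T)(R x) equals W x, so the function is unchanged.
W_rot, x_rot = W @ R.T, R @ x
fp_same = np.allclose(W_rot @ x_rot, W @ x)

# Activation quantization error, measured back in the original basis.
err_plain = np.sum((fake_quant(x) - x) ** 2)
err_rot = np.sum((R.T @ fake_quant(x_rot) - x) ** 2)
print(fp_same, err_plain, err_rot)
```

The rotation spreads the outlier across channels, which shrinks the quantization step size and lowers the squared error for this input while the full-precision product is unchanged.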
3. Algorithmic Implementation
The Cayley-SGD update is structured as follows for a given batch:
- Forward pass: Apply current rotations to the network and quantize weights/activations as per the targeted PTQ scheme.
- Loss computation: Compute task loss on quantized outputs.
- Gradient computation: Backpropagate to obtain $G = \nabla_R \mathcal{L}$, the gradient of the loss with respect to each rotation $R$.
- Cayley update: Perform the manifold-preserving update described above.
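The steps above can be sketched as a small training loop. This is a toy illustration under loud assumptions: the quantized network and backpropagation are replaced by a smooth fourth-moment proxy loss with an analytic gradient, the step direction is normalized for stability, and the start point is a small random rotation rather than a Hadamard matrix. All names and settings are hypothetical.

```python
import numpy as np

def cayley_step(R, Y, lr):
    """R <- (I + lr/2 Y)^{-1} (I - lr/2 Y) R for a skew-symmetric Y."""
    I = np.eye(R.shape[0])
    return np.linalg.solve(I + 0.5 * lr * Y, (I - 0.5 * lr * Y) @ R)

def proxy_loss_and_grad(X, R):
    """Fourth-moment proxy for outlier severity of Z = X R, and its gradient in R."""
    Z = X @ R                                  # forward pass through the rotation
    loss = np.mean(Z ** 4)
    G = 4.0 * X.T @ (Z ** 3) / Z.size          # analytic dL/dR (stands in for backprop)
    return loss, G

rng = np.random.default_rng(0)
n, tokens, steps, lr0 = 8, 256, 100, 0.1      # hypothetical toy settings
X = rng.standard_normal((tokens, n))
X[:, 0] *= 20.0                                # one outlier channel

# Start from a small random rotation so the improvement is easy to see.
S = rng.standard_normal((n, n))
R = cayley_step(np.eye(n), 0.02 * (S - S.T), 1.0)

losses = []
for t in range(steps):
    loss, G = proxy_loss_and_grad(X, R)        # forward pass, loss, gradient
    losses.append(loss)
    A = G @ R.T
    Y = 0.5 * (A - A.T)                        # skew-symmetric direction for square R
    Y /= np.linalg.norm(Y) + 1e-12             # normalized step: toy-only assumption
    lr = lr0 * (1.0 - t / steps)               # linearly decayed step size
    R = cayley_step(R, Y, lr)                  # manifold-preserving Cayley update

final_loss, _ = proxy_loss_and_grad(X, R)
print(losses[0], final_loss, np.linalg.norm(R.T @ R - np.eye(n)))
```

In a real PTQ setting the proxy would be replaced by the task loss on the quantized model, and the gradient would come from automatic differentiation through the quantizer.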
This approach incurs $O(n^3)$ cost per update for $n \times n$ rotations, since each step solves a dense linear system, but can use fixed-point approximations if necessary. In SpinQuant, Cayley-SGD is typically run for a fixed number of iterations with a linearly decayed step size (learning rate starting at 1.5) and random Hadamard initialization for the rotations (Liu et al., 2024).
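One way to avoid the linear solve is a fixed-point iteration on the Cayley equation. The sketch below assumes the descent-form update $R' = (I + \tfrac{\alpha}{2} Y)^{-1}(I - \tfrac{\alpha}{2} Y) R$, which is equivalent to $R' = R - \tfrac{\alpha}{2} Y (R + R')$. It is an illustration of the idea, not necessarily the exact scheme used in SpinQuant, and the function names are hypothetical.

```python
import numpy as np

def cayley_exact(R, Y, lr):
    """Exact Cayley update via a dense linear solve."""
    I = np.eye(R.shape[0])
    return np.linalg.solve(I + 0.5 * lr * Y, (I - 0.5 * lr * Y) @ R)

def cayley_fixed_point(R, Y, lr, iters=5):
    """Approximate the Cayley update using only matrix products.

    Iterates R_new = R - (lr/2) Y (R + R_new), an identity the exact update
    satisfies. The iteration contracts when the spectral norm of (lr/2) Y
    is below 1.
    """
    R_new = R - lr * (Y @ R)                   # first-order initial guess
    for _ in range(iters):
        R_new = R - 0.5 * lr * (Y @ (R + R_new))
    return R_new

rng = np.random.default_rng(0)
n = 16
R, _ = np.linalg.qr(rng.standard_normal((n, n)))
S = rng.standard_normal((n, n))
Y = S - S.T                                    # skew-symmetric test direction
Y /= np.linalg.norm(Y, 2)                      # unit spectral norm (toy normalization)
lr = 0.5

exact = cayley_exact(R, Y, lr)
for iters in (1, 3, 8):
    approx = cayley_fixed_point(R, Y, lr, iters)
    print(iters, np.linalg.norm(approx - exact))
```

The approximation error shrinks geometrically with the number of iterations, so the result is orthogonal only up to the tolerance of the iteration rather than to machine precision.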
4. Empirical Benefits and Ablative Analysis
SpinQuant demonstrates that learned rotations, optimized via Cayley SGD, substantially close the performance gap between quantized and full-precision LLMs. On tasks including BoolQ, PIQA, HellaSwag, and WinoGrande, SpinQuant_had (which combines learned rotations with online Hadamard transforms) stays within 3 percentage points of full-precision accuracy in most settings. For example, the gap is 2.9 points for LLaMA-2 7B under W4A4K4 quantization, a reduction of over 40% compared to previous methods relying on random or analytic rotations (Liu et al., 2024).
Layerwise SNR analysis shows that Cayley SGD-learned rotations disproportionately improve bottleneck layers where quantization noise is concentrated. Furthermore, across multiple initialization strategies, post-optimization accuracies are robust, and performance is stable with respect to calibration data volume and domain.
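A layerwise SNR of this kind can be computed as below. The sketch assumes the standard signal-to-quantization-noise definition, $10 \log_{10}(\lVert x \rVert^2 / \lVert x - \hat{x} \rVert^2)$, which may differ in detail from the metric used in the cited analysis. The layer names and tensors are made up for illustration.

```python
import numpy as np

def sqnr_db(x, x_q):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = np.sum(np.square(x))
    noise = np.sum(np.square(x - x_q))
    return 10.0 * np.log10(signal / noise)

# Hypothetical per-layer pairs: (full-precision tensor, dequantized tensor).
layers = {
    "layer_0": (np.array([3.0, 4.0]), np.array([3.3, 4.4])),
    "layer_1": (np.array([3.0, 4.0]), np.array([3.03, 4.04])),
}
for name, (x, x_q) in layers.items():
    print(name, round(sqnr_db(x, x_q), 2))
```

Layers with the lowest value are the bottlenecks; here `layer_0` is about 20 dB and `layer_1` about 40 dB.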
5. Theoretical Properties and Rotation Invariance
Cayley SGD leverages the property that, in full-precision transformer blocks, any orthogonal rotation $R$ inserted into the residual or MLP/attention pathways and undone by $R^\top$ before non-linearities leaves the network's output invariant. This architectural invariance enables the learning of rotations that specifically target the reduction of quantization-induced loss, distinct from random or analytic choices (e.g., Hadamard, QuaRot).
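A two-dimensional example shows why the inverse rotation has to be applied before an elementwise non-linearity. The weights `W1`, `W2` and the use of ReLU are hypothetical choices for this sketch.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

R = np.array([[0.0, -1.0],
              [1.0, 0.0]])                  # 90-degree rotation, R @ R.T is identity
x = np.array([1.0, -1.0])
W1 = np.array([[1.0, -2.0],
               [1.0, 0.5]])                 # hypothetical first-layer weights
W2 = np.array([[0.5, -1.0]])                # hypothetical second-layer weights

reference = W2 @ relu(W1 @ x)

# Rotation applied to the input and undone inside W1, before the ReLU.
undone = W2 @ relu((W1 @ R.T) @ (R @ x))

# Rotation carried across the ReLU and undone only afterwards.
crossed = (W2 @ R.T) @ relu(R @ (W1 @ x))

print(reference, undone, crossed)
```

The first variant reproduces the reference output, while the second does not, because an elementwise non-linearity does not commute with a general rotation.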
Optimizing over the Stiefel manifold with Cayley SGD keeps the rotation matrices strictly orthogonal at every step, and prevents the numerical drift that would otherwise degrade downstream accuracy after many iterations.
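The orthogonality guarantee follows from a short calculation. In the sketch below, $A = \tfrac{\alpha}{2} Y$ with $Y^\top = -Y$, and $C = (I + A)^{-1}(I - A)$ denotes the Cayley factor; these symbols are introduced here for the derivation.

```latex
\begin{aligned}
x^\top (I + A)\, x &= x^\top x + x^\top A x = \lVert x \rVert^2 > 0 \quad (x \neq 0)
  &&\Rightarrow\ I + A \text{ is invertible for every real } \alpha, \\
C^\top &= (I - A)^\top \left( (I + A)^{-1} \right)^\top = (I + A)(I - A)^{-1}, \\
C^\top C &= (I + A)(I - A)^{-1}(I + A)^{-1}(I - A) = I
  &&\text{(all factors are functions of } A \text{ and commute)}, \\
R'^\top R' &= R^\top C^\top C R = R^\top R = I .
\end{aligned}
```

The first line uses $x^\top A x = 0$ for skew-symmetric $A$, which is why no restriction on the learning rate is needed.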
6. Variants and Runtime Considerations
SpinQuant offers two main deployment variants:
- SpinQuant_no: Applies learned Cayley rotations offline, merging them into network weights (no runtime rotation overhead).
- SpinQuant_had: Incorporates additional online Hadamard blockwise rotations (R3, R4), with minimal inference time cost (e.g., 8% overhead on MacBook M1 Pro, 3.6% on 8×H100), ensuring robustness for extreme quantization scenarios such as W4A4K4 (Liu et al., 2024).
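The low runtime cost of online Hadamard rotations comes from the fast Walsh-Hadamard transform, which needs $O(n \log n)$ additions instead of a dense $O(n^2)$ product. The sketch below is a plain NumPy version for power-of-two sizes, checked against a dense Sylvester-ordered Hadamard matrix. It is illustrative only and is not the kernel used in SpinQuant.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis.

    The last dimension must be a power of two.
    """
    x = np.array(x, dtype=float)             # work on a copy
    n = x.shape[-1]
    if n & (n - 1):
        raise ValueError("last dimension must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b           # butterfly: sums
            x[..., i + h:i + 2 * h] = a - b   # butterfly: differences
        h *= 2
    return x / np.sqrt(n)

def hadamard(n):
    """Dense Sylvester-ordered Hadamard matrix, used here only as a reference."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

x = np.zeros(8)
x[0] = 8.0                                    # a single outlier channel
print(fwht(x))                                # energy spread evenly over 8 channels
```

Because the transform is orthonormal, it preserves vector norms, and a matching transform folded into the adjacent weights keeps the full-precision output unchanged.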
7. Impact and Future Directions
The introduction of Cayley SGD for optimizing network-invariant rotations has established a new state of the art for PTQ in open-source LLMs. It has demonstrated robustness to calibration domain, scaling, and quantization schemes, outperforming all previously proposed random or hand-crafted rotation baselines. The underlying principle—preserving orthogonality while optimizing task/quantization-specific objectives—may catalyze further development in manifold optimization for model compression, efficient inference, and neural architecture search (Liu et al., 2024).