Orthogonal Memory Compression
- Orthogonal memory compression is a technique that uses orthonormal transforms to reduce the size of neural parameters, activations, and KV caches while retaining essential geometric and spectral properties.
- It employs methods like SVD, QR, and Cayley transforms to project high-dimensional data onto low-dimensional subspaces, balancing storage efficiency with minimal information loss.
- The approach facilitates practical deployment in LLM inference, fine-tuning, and dynamic memory architectures by ensuring non-redundant encoding and efficient reversibility, and it adapts to various neural architectures.
Orthogonal memory compression refers to a family of techniques that utilize orthogonality constraints and orthogonal transformations to compress and manage the storage, transfer, and computation of neural model parameters, intermediate activations, or memory states. Exploiting the mathematical properties of orthonormal bases, these approaches achieve high compression ratios, preserve critical geometric and spectral structures in the data, minimize information redundancy, and enable efficient deployment of large-scale models—particularly in applications such as LLM inference, parameter-efficient fine-tuning, structured weight compression, attention variants, and dynamic memory architectures.
1. Fundamental Principles of Orthogonal Memory Compression
Orthogonal memory compression is grounded in the properties of orthonormal matrices, which preserve inner products, spectral norms, and angular relationships under linear transformation. The core mechanisms typically involve projecting high-dimensional representations (parameters, activations, or memory slots) onto low-dimensional subspaces using orthogonal or nearly-orthogonal operators. Key advantages of orthogonality include:
- Information preservation: Orthogonal transforms retain vector norms and pairwise angles, minimizing distortions.
- Non-redundant encoding: Orthogonal memory slots or bases are maximally distinct, providing a compact yet expressive summary of the data.
- Efficient reversibility and implementation: Structured representations (e.g., SVD, QR, Cayley transforms) enable efficient storage, manipulation, and recovery.
These principles appear across methods focused on compressing LLM KV caches, activations, fine-tuning updates, transformer weights, or the memory states of recurrent and attention-based architectures (Lin et al., 16 Oct 2024, Wu et al., 16 May 2025, Grishina et al., 3 Jun 2025, Shi et al., 27 Sep 2025, Wang et al., 2021, Zhang et al., 2023, Karami et al., 8 Apr 2025, S et al., 24 Nov 2025).
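These properties can be demonstrated directly in a few lines of NumPy; the dimensions, random basis, and variable names below are illustrative and not taken from any of the cited methods.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 64, 16                      # full and compressed feature dimensions (hypothetical)

# Build an orthonormal basis U (D x d) of a random d-dimensional subspace via QR.
U, _ = np.linalg.qr(rng.standard_normal((D, d)))

x = rng.standard_normal(D)
y = rng.standard_normal(D)

# Compression: project onto the subspace; decompression: map back with U.
cx, cy = U.T @ x, U.T @ y          # d-dimensional codes
x_hat, y_hat = U @ cx, U @ cy      # reconstructions in the original space

# Orthonormality => inner products of the codes equal those of the projections.
assert np.isclose(cx @ cy, x_hat @ y_hat)

# Norms are preserved exactly for vectors that already lie in the subspace.
z = U @ rng.standard_normal(d)
assert np.isclose(np.linalg.norm(U.T @ z), np.linalg.norm(z))

# Reconstruction error equals the energy of x outside the chosen subspace.
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```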
2. Orthogonal Projection for Memory and KV-Cache Compression
In transformer inference and long-context language modeling, the key-value (KV) cache—typically a four-dimensional tensor—dominates memory usage. Orthogonal memory compression approaches reduce this bottleneck via low-rank projections.
MatryoshkaKV (Trainable Orthogonal Projection)
- Model: Applies trainable, Cayley-parameterized orthonormal matrices to compress the feature dimension of KV tensors (Lin et al., 16 Oct 2024).
- End-to-end objective: Minimizes a next-token-prediction KL-distillation loss while the orthogonal projections are dynamically truncated to a reduced number of leading columns.
- Matryoshka training: Simulates multiple compression rates by sampling the truncation rank at every step, resulting in a single orthogonal matrix U whose column prefixes progressively capture more information (see the sketch after this list).
- Adaptive allocation: Greedily assigns per-layer/head rates to meet global memory budgets, optimizing accuracy/storage trade-off.
- Empirical results: At 62.5% KV-cache compression (d / D ≈ 0.375), over 93% of the uncompressed model's zero-shot accuracy is retained, whereas PCA-based projections degrade catastrophically.
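The two core ingredients, a Cayley-parameterized orthogonal matrix and prefix truncation of its columns, can be sketched as follows. The dimensions, the random skew-symmetric initialization, and the variable names are illustrative assumptions; the method itself learns the matrix end-to-end with distillation rather than using a random one.

```python
import numpy as np

def cayley_orthogonal(A):
    """Map a matrix to an orthogonal matrix via the Cayley transform of its skew part."""
    S = A - A.T                                  # ensure skew-symmetry
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + S, I - S)         # Q = (I + S)^{-1} (I - S), orthogonal

rng = np.random.default_rng(0)
D = 64                                           # per-head KV feature dimension (hypothetical)
Q = cayley_orthogonal(0.1 * rng.standard_normal((D, D)))
assert np.allclose(Q.T @ Q, np.eye(D), atol=1e-8)

K = rng.standard_normal((128, D))                # cached keys for 128 timesteps

# Matryoshka-style truncation: any column prefix Q[:, :d] is itself orthonormal,
# so one trained matrix serves every compression rate d/D.
for d in (16, 32, 48):
    P = Q[:, :d]                                 # D x d orthonormal projection
    K_compressed = K @ P                         # stored at d dims per token
    K_restored = K_compressed @ P.T              # approximate reconstruction
    err = np.linalg.norm(K - K_restored) / np.linalg.norm(K)
    print(f"d/D = {d/D:.3f}, relative reconstruction error = {err:.3f}")
```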
SWAN (Decompression-Free Orthogonal KV-Cache Compression)
- Preprocessing: Computes an orthogonal rotation (via SVD on joint QK or VO statistics) for each attention layer/head (S et al., 24 Nov 2025).
- Compression: KV-cache tensors are rotated and only the leading (highest-energy) dimensions are retained per timestep; the remaining dimensions are aggressively pruned and stored sparsely.
- No decompression: Attention is computed directly on the rotated, pruned cache—leveraging invariance under orthogonal transformation so there is no need for explicit inversion.
- Runtime-tunability: The retention ratio is adjustable, permitting dynamic control of memory/accuracy trade-off.
- Empirical trade-off: Up to 60% memory savings with under a 5% performance drop, and the approach applies to off-the-shelf LLMs with no weight changes (a simplified sketch follows this list).
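A simplified, single-head sketch of the decompression-free idea: an orthogonal rotation estimated offline from calibration keys (here via an SVD of their covariance, one plausible choice of statistic) is applied to both queries and keys, the rotated keys are pruned to their leading coordinates, and attention scores are computed directly on the pruned cache. All names, shapes, and the synthetic key statistics are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 64, 24                                    # head dimension and retained dims (hypothetical)

# Offline calibration: estimate a rotation that concentrates key energy in the
# leading coordinates (eigenvectors of the calibration keys' covariance).
A = rng.standard_normal((D, D)) * np.geomspace(1.0, 0.01, D)   # synthetic anisotropic keys
K_calib = rng.standard_normal((2048, D)) @ A
R, _, _ = np.linalg.svd(K_calib.T @ K_calib)     # D x D orthogonal rotation

# At inference time the cache holds rotated, pruned keys; nothing is rotated back.
K = rng.standard_normal((256, D)) @ A            # live keys with the same statistics
K_pruned = (K @ R)[:, :k]                        # only k numbers stored per key

q = rng.standard_normal(D)                       # incoming query
q_rot = (q @ R)[:k]                              # query enters the same rotated basis

# Orthogonality gives q @ K.T == (q @ R) @ (K @ R).T, so dropping only the
# low-energy tail yields attention scores close to the exact ones.
scores_exact = K @ q
scores_pruned = K_pruned @ q_rot
print("score correlation:", np.corrcoef(scores_exact, scores_pruned)[0, 1])
```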
3. Orthogonal Fine-Tuning and Structured Weight Compression
Orthogonal transformations are also exploited for parameter-efficient adaptation and structured model compression.
MOFT (Memory-Efficient Orthogonal Fine-Tuning)
- Hyperspherical structure: Aims to preserve a layer's hyperspherical energy (the sum of inverse angular distances between neuron directions), a quantity that orthogonal maps leave invariant (Wu et al., 16 May 2025).
- Principal subspace adaptation: Projects the frozen weight into its top-r SVD subspace (with r much smaller than the layer dimension), performs a trainable orthogonal rotation within this subspace, and reconstructs the weight together with the untouched residual.
- Scaling relaxation: Two learnable scaling vectors applied before and after the rotation allow partial norm/angle flexibility (see the sketch after this list).
- Memory impact: Achieves 43–70% reduction in activation/GPU memory compared to prior orthogonal methods, often with better or equal accuracy.
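A hedged NumPy sketch of the principal-subspace mechanics described above: the frozen weight is split into a top-r SVD part plus a residual, a trainable r × r orthogonal rotation (here a random orthogonal matrix standing in for a learned one) acts inside the subspace, and two scaling vectors relax strict isometry. The rank, shapes, and composition order are illustrative assumptions rather than MOFT's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 16                    # layer shape and adaptation rank (hypothetical)

W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen pretrained weight

# Split W into its top-r principal subspace plus a frozen residual.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]
W_residual = W - U_r @ np.diag(s_r) @ Vt_r

# Trainable pieces: an r x r orthogonal rotation and two scaling vectors.
R, _ = np.linalg.qr(rng.standard_normal((r, r)))          # stand-in for a learned rotation
scale_in = np.ones(r) + 0.01 * rng.standard_normal(r)     # relaxes strict norm preservation
scale_out = np.ones(r) + 0.01 * rng.standard_normal(r)

# Adapted weight: rotate (and lightly rescale) only inside the principal subspace.
core = np.diag(scale_out) @ R @ np.diag(scale_in) @ np.diag(s_r)
W_adapted = U_r @ core @ Vt_r + W_residual

# Only O(r^2 + r) parameters are trained; W itself and W_residual stay frozen.
print("trainable params:", R.size + scale_in.size + scale_out.size,
      "vs full weight:", W.size)
```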
ProcrustesGPT (Orthogonal+Structured Model Compression)
- Key observation: The output of a transformer is invariant under orthogonal transformations of its weight matrices, provided the inverse rotations are absorbed by the adjacent linear maps; this freedom is exploited to make the transformed weight well approximated by a matrix from a structured class (Grishina et al., 3 Jun 2025).
- Compression pipeline:
- Solve a two-sided Procrustes problem: find orthogonal matrices and a structured matrix S from a low-parametric class such that the orthogonally transformed weight is closely approximated by S.
- Alternate SVD/QR-based updates between the orthogonal factors and the structured approximation.
- Fuse the orthogonal factors into the incoming/outgoing linear transforms, so that only S is stored and applied.
- Structured target classes: low-rank, sum-of-Kronecker, block-sparse, Group-and-Shuffle, and tensor-train formats.
- Empirical results: Achieves 25–36% parameter reduction and 1.2–1.5× inference speedup, with <15% perplexity degradation at budgets equivalent to row/column pruning (a toy version of the alternation is sketched below).
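The alternation at the heart of the pipeline can be illustrated in the simplest setting: a one-sided rotation and a block-diagonal structured class (standing in for the richer classes above). Fixing the structured matrix, the best orthogonal factor is the classical orthogonal-Procrustes solution (an SVD); fixing the rotation, the best structured approximation is an entrywise projection. This is a toy sketch under those assumptions, not the paper's two-sided pipeline.

```python
import numpy as np

def procrustes_rotation(W, S):
    """Orthogonal Q minimizing ||W @ Q - S||_F (classical orthogonal Procrustes solution)."""
    U, _, Vt = np.linalg.svd(W.T @ S)
    return U @ Vt

def project_block_diagonal(M, block):
    """Frobenius-optimal projection onto block-diagonal matrices with the given block size."""
    S = np.zeros_like(M)
    for i in range(0, M.shape[0], block):
        S[i:i + block, i:i + block] = M[i:i + block, i:i + block]
    return S

rng = np.random.default_rng(0)
n, block = 128, 16                               # weight size and block size (hypothetical)
W = rng.standard_normal((n, n))

Q = np.eye(n)
for _ in range(50):                              # alternate the two closed-form updates
    S = project_block_diagonal(W @ Q, block)     # fit the structured matrix to the rotated weight
    Q = procrustes_rotation(W, S)                # re-fit the orthogonal factor to the structure

err_plain = np.linalg.norm(W - project_block_diagonal(W, block)) / np.linalg.norm(W)
err_rotated = np.linalg.norm(W @ Q - project_block_diagonal(W @ Q, block)) / np.linalg.norm(W)
print(f"block-diagonal fit error: {err_plain:.3f} without rotation, {err_rotated:.3f} with rotation")
# In deployment, Q would be fused into the adjacent linear map and only the
# structured matrix would be stored and applied.
```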
4. Orthogonal Compression in Activations and Sequence Memory
Beyond weights or KV caches, orthogonality enables efficient activation storage and compact dynamic memory states.
LoRAct (Low-Rank Orthogonal Activation Compression)
- Forward/backward mechanism: In the forward pass, each layer's activation is factorized into an orthonormal basis and a small coefficient matrix of rank r, with r much smaller than the activation dimensions (Shi et al., 27 Sep 2025).
- Sampling-based orthogonal decomposition: Uses subsampled rows (Nyström-like) and a few power iterations instead of Gaussian projections.
- Backward pass: The activation is reconstructed from the stored factors as needed for gradient computation, so stored memory scales with the rank r rather than with the full activation size.
- Empirical savings: Achieves 80–90% activation memory reduction with negligible accuracy loss at moderate rank (see the sketch after this list).
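A sketch of sampling-based low-rank orthogonal factorization of an activation matrix in the spirit described above: a random subset of the activation's own columns plays the role of the test matrix, a couple of power iterations sharpen the range estimate, and a QR factorization yields the orthonormal factor. The shapes, oversampling amount, and exactly low-rank synthetic activation are illustrative assumptions.

```python
import numpy as np

def sampled_orthogonal_factorization(X, r, power_iters=2, oversample=8, rng=None):
    """Return (Q, C) with Q orthonormal and X ~= Q @ C, using column subsampling."""
    if rng is None:
        rng = np.random.default_rng()
    # Nystrom-like range sketch: a random subset of X's own columns instead of a
    # dense Gaussian test matrix.
    cols = rng.choice(X.shape[1], size=min(r + oversample, X.shape[1]), replace=False)
    Y = X[:, cols]
    for _ in range(power_iters):                 # power iterations sharpen the subspace
        Y = X @ (X.T @ Y)
    Q, _ = np.linalg.qr(Y)
    Q = Q[:, :r]                                 # orthonormal basis for the activation's range
    C = Q.T @ X                                  # small coefficient matrix
    return Q, C

rng = np.random.default_rng(0)
n, d, r = 4096, 1024, 64                         # tokens, hidden width, kept rank (hypothetical)
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))   # low-rank-ish activation

Q, C = sampled_orthogonal_factorization(X, r, rng=rng)
X_hat = Q @ C                                    # reconstruction used in the backward pass
stored = Q.size + C.size
print("memory ratio:", stored / X.size,
      "relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```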
LAVO and Lattice (Orthogonal Memory in Attention and RNNs)
- LAVO: Summarizes an input sequence by projecting it onto a fixed set of orthonormal basis vectors; the compressed “orthogonal memory” collects sequence-wide information with maximal non-redundancy (Zhang et al., 2023).
- Lattice: Updates each memory slot at each timestep with only the component of the new input that is orthogonal to the slot's current state, ensuring the written information is non-redundant and minimizing slot interference (Karami et al., 8 Apr 2025); the update rule is sketched after this list.
- Theoretical property: Each slot update is orthogonally projected, so slot directions span an increasingly informative subspace, and slot updates do not erase previous content.
- Scalability: Enables sub-quadratic time/space attention and sequence modeling, with empirical advantages at extremely large context lengths.
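A minimal sketch of the orthogonal write rule for a single slot, assuming a plain additive update (the actual gating and normalization in Lattice are richer): only the component of the input orthogonal to the slot's current direction is written, so the write is non-redundant and leaves the existing content along the old direction untouched.

```python
import numpy as np

def orthogonal_slot_update(m, x):
    """Add to slot m only the component of input x orthogonal to m's current state."""
    m_hat = m / (np.linalg.norm(m) + 1e-8)
    x_orth = x - (x @ m_hat) * m_hat              # novel (non-redundant) part of x
    return m + x_orth

rng = np.random.default_rng(0)
d = 32
m = rng.standard_normal(d)                        # one memory slot (hypothetical size)
x = rng.standard_normal(d)                        # incoming token representation

m_hat = m / np.linalg.norm(m)
m_new = orthogonal_slot_update(m, x)

# Non-redundancy: the written update is orthogonal to what the slot already stores.
print("update . old direction :", float((m_new - m) @ m_hat))          # ~ 0
# No erasure: the slot's content along its previous direction is untouched.
print("old component before/after:", float(m @ m_hat), float(m_new @ m_hat))
```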
5. Enhanced SVD and Redundancy Elimination in Orthogonal Bases
Classical orthogonal decompositions such as SVD and QR, when used for compression, carry parameter redundancy due to their strict orthonormality constraints. E-SVD removes this redundancy by directly parameterizing the orthonormal factors U and V with a minimal set of Givens rotation angles.
- Parameter reduction: For a rank-r factorization of an m × n data matrix, E-SVD reduces storage from roughly (m + n + 1)r numbers (classical SVD) to the intrinsic (m + n − r)r degrees of freedom, achieving up to 25% extra memory savings at the SVD's compressibility limit (Wang et al., 2021).
- Algorithmic steps: Converts the orthonormal factors to the Givens-angle parametrization and reconstructs them from the stored angles efficiently (see the sketch after this list).
- Applications: Extends to QR, eigendecomposition, Tucker/HOSVD, any orthonormal encoding.
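A toy illustration of the angle encoding (not the paper's implementation): a column-orthonormal factor is reduced to the form [diag(±1); 0] by a fixed sequence of Givens rotations, the rotation angles (mr − r(r+1)/2 of them) are stored as the parameters, and the factor is rebuilt by replaying the rotations in reverse.

```python
import numpy as np

def givens_angles(V):
    """Encode a column-orthonormal V (m x r) as Givens angles plus column signs."""
    V = V.copy()
    m, r = V.shape
    angles = []                                   # mr - r(r+1)/2 angles in total
    for j in range(r):
        for i in range(m - 1, j, -1):             # zero out V[i, j] against V[j, j]
            theta = np.arctan2(V[i, j], V[j, j])
            c, s = np.cos(theta), np.sin(theta)
            rj, ri = V[j, :].copy(), V[i, :].copy()
            V[j, :] = c * rj + s * ri
            V[i, :] = -s * rj + c * ri
            angles.append((j, i, theta))
    signs = np.sign(np.diag(V[:r, :]))            # V is now [diag(signs); 0]
    return angles, signs

def from_givens_angles(angles, signs, m):
    """Reconstruct the orthonormal factor from its Givens-angle parameterization."""
    r = len(signs)
    V = np.zeros((m, r))
    V[:r, :] = np.diag(signs)
    for j, i, theta in reversed(angles):          # replay inverse rotations in reverse order
        c, s = np.cos(theta), np.sin(theta)
        rj, ri = V[j, :].copy(), V[i, :].copy()
        V[j, :] = c * rj - s * ri
        V[i, :] = s * rj + c * ri
    return V

rng = np.random.default_rng(0)
m, r = 12, 4
V, _ = np.linalg.qr(rng.standard_normal((m, r))) # column-orthonormal factor, e.g. from an SVD
angles, signs = givens_angles(V)

print("raw entries:", m * r, "Givens parameters:", len(angles))      # 48 vs 38
print("max reconstruction error:", np.abs(from_givens_angles(angles, signs, m) - V).max())
```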
6. Practical Considerations, Limitations, and Applications
Orthogonal memory compression is highly adaptive, robust, and compatible with both pretrained and fine-tuned transformer architectures.
- No main-model change: Leading schemes (e.g., MatryoshkaKV, SWAN, ProcrustesGPT) require only the insertion of orthogonal projections or fusing rotations into existing linear maps; no base weights are altered (Lin et al., 16 Oct 2024, S et al., 24 Nov 2025, Grishina et al., 3 Jun 2025).
- Heterogeneous compression: Many methods support per-layer/head (or per-slot) selection of compression rates, allocating a limited memory budget where it matters most (a greedy allocation sketch follows this list).
- Calibration/data requirements: Some approaches (e.g., MatryoshkaKV, SWAN) require a small calibration set for optimal projection learning, but others (LAVO, LoRAct, E-SVD) can operate online or analytically.
- Limits and extensions: Absolute optimality of greedy allocation is not guaranteed; at high compression, nonlinear effects on model accuracy may arise. Future directions include combining feature-dimension compression with sequence-pruning or token-merging for ultra-long context modeling (Lin et al., 16 Oct 2024).
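As a concrete illustration of heterogeneous allocation, and of why greedy allocation is only a heuristic, the following is a minimal sketch assuming each layer exposes a synthetic "estimated loss at rank d" curve: the budget is spent one dimension at a time wherever the next dimension buys the largest estimated loss reduction. The curves and budget are made up; real systems would calibrate them empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, D = 8, 64                              # layers and full per-layer rank (hypothetical)
budget = n_layers * D // 2                       # keep 50% of the total feature budget

def synthetic_loss_curve(D, rng):
    """loss_curve[d-1] = estimated accuracy loss when a layer keeps only d dimensions."""
    energies = np.sort(rng.random(D))[::-1]      # descending per-dimension importance
    tail = np.cumsum(energies[::-1])[::-1]       # energy dropped if dimensions >= i are cut
    return np.append(tail[1:], 0.0)

loss = [synthetic_loss_curve(D, rng) for _ in range(n_layers)]

ranks = [1] * n_layers                           # every layer starts at the minimum rank
used = sum(ranks)
while used < budget:
    # Greedy step: grow the layer whose next dimension reduces estimated loss the most.
    gains = [loss[l][ranks[l] - 1] - loss[l][ranks[l]] if ranks[l] < D else -np.inf
             for l in range(n_layers)]
    ranks[int(np.argmax(gains))] += 1
    used += 1

print("per-layer ranks:", ranks, "| total kept:", used, "of", n_layers * D)
```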
Orthogonal memory compression is now a standard methodology in the scalable deployment of deep networks, particularly in LLM inference, streaming attention, activation checkpointing, and low-footprint fine-tuning. It balances accuracy, storage, and compute cost while offering interpretability and broad extensibility. Current work spans plug-and-play cache compression, optimized structured format selection, online orthogonal updates for memory-augmented LMs, and rigorous redundancy elimination in classical matrix decompositions.