LinearlyCompressedGPT Compression Techniques
- LinearlyCompressedGPT is a family of methods that reduce computational and memory footprints in GPT-style models through linear modifications, factorization, or quantization without full retraining.
- The approach employs techniques like blockwise quantization, Kronecker and tensor-train decompositions, and hierarchical grouping to achieve up to 10× reduction in model size with minimal loss in accuracy.
- These methods enable efficient deployment and inference acceleration on hardware such as FPGAs, balancing trade-offs between compression and performance for real-world applications.
LinearlyCompressedGPT refers to a broad family of techniques for reducing the computational and memory footprint of GPT-style (decoder-only Transformer) LLMs through purely linear structural modifications, factorization, or quantization, usually without full retraining. Unlike distillation, pruning, or non-linear rewiring, these methods compress the large matrix multiplications and embeddings at the core of Transformer models using quantization, structured matrix approximations, hierarchical blockwise dimension reductions, or sparse and grouped computation. This enables deployment to resource-constrained environments while keeping the performance loss acceptable. LinearlyCompressedGPT has been realized through a variety of frameworks; major lines of work include blockwise quantization, Kronecker and tensor-train (TT) decomposition, hierarchical dynamic grouping, and progressive depthwise linear projection.
1. Blockwise Quantization: The BCT Approach
Blockwise Compression of Transformers (BCT) implements LinearlyCompressedGPT through blockwise shift quantization on all linear matrices and bias vectors, without retraining (Dong et al., 2023). The method partitions each matrix into discrete B×B blocks and applies independent scale quantization within each. This sharply reduces quantization-induced distribution shift compared to per-layer schemes, eliminating retraining requirements.
Core Quantization Process:
Let $W$ be a weight matrix partitioned into $B \times B$ blocks, and let $b$ denote the quantization bit-width.
- Compute the block’s shift: $s = 2^{\lceil \log_2 \max_{i,j} |W_{ij}| \rceil}$, the smallest power of two covering the block’s magnitude range.
- Quantize: $q_{ij} = \operatorname{round}\big(W_{ij} \cdot (2^{b-1}-1)/s\big)$, clipped to $[-2^{b-1},\, 2^{b-1}-1]$.
- Decompress: $\hat{W}_{ij} = q_{ij} \cdot s / (2^{b-1}-1)$.
At inference, GEMMs are computed in low-bit, exponent-aligned blocks.
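A minimal NumPy sketch of blockwise shift quantization in this spirit, assuming a power-of-two per-block shift and symmetric rounding (the paper’s exact rounding rule and metadata layout may differ):

```python
import numpy as np

def bct_quantize_block(block, bits=8):
    """Blockwise shift quantization (hedged sketch, not BCT's reference code).
    The shift is the nearest power of two covering the block's max magnitude,
    so decompression is exponent-aligned (a bit shift in fixed-point hardware)."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros(block.shape, dtype=np.int32), 0.0
    shift = 2.0 ** np.ceil(np.log2(max_abs))      # power-of-two scale
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(block / shift * qmax), -qmax - 1, qmax)
    return q.astype(np.int32), shift

def bct_dequantize_block(q, shift, bits=8):
    qmax = 2 ** (bits - 1) - 1
    return q.astype(np.float64) * shift / qmax

def bct_quantize_matrix(W, B=64, bits=8):
    """Partition W into BxB tiles and quantize each tile independently,
    returning the dequantized approximation for error inspection."""
    W_hat = np.empty_like(W, dtype=np.float64)
    for i in range(0, W.shape[0], B):
        for j in range(0, W.shape[1], B):
            q, s = bct_quantize_block(W[i:i+B, j:j+B], bits)
            W_hat[i:i+B, j:j+B] = bct_dequantize_block(q, s, bits)
    return W_hat
```

Because each tile carries its own shift, the elementwise error stays within half a quantization bin of that tile’s scale, never of the layer-wide range.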
Error Bound and Theoretical Guarantees:
Elementwise quantization error is bounded by half a quantization bin, i.e. $s/(2(2^{b-1}-1))$ for block shift $s$ and bit-width $b$, so per-layer error is governed by the per-block shifts rather than the layer-wide dynamic range. Residuals do not induce out-of-distribution behavior globally, and empirical boxplots demonstrate that per-block error does not propagate destructively.
Empirical Results:
- BERT-base (as a stand-in for GPT): $4$-bit weights plus $8$-bit activations yield a large model-size reduction with only a small accuracy loss (e.g., on GLUE SST-2). Pure $8$-bit quantization yields an approximately $4\times$ reduction over fp32 with near-zero loss.
- fp8 ($8$-bit float) BCT likewise achieves an approximately $4\times$ size reduction with minimal accuracy loss.
Block/Bit Parameterization and Trade-offs:
Block size trades metadata overhead against quantization granularity: larger $B$ reduces per-block metadata but coarsens the quantization, while smaller $B$ tracks local statistics more tightly. The smallest bit-width and largest block size that maintain acceptable perplexity are recommended.
2. Hierarchical Dynamic Grouping and Attention
Hierarchical models, exemplified by GPTHF, fundamentally restructure the Transformer, compressing token sequences into fixed-size sentence embeddings and operating subsequent transformations at the sentence level (Gu et al., 14 Mar 2025).
Architecture:
- A word-level Transformer encoder processes tokens within a sentence using block-local self-attention.
- Sentence-level representations are pooled, forming Pooling(wlt_encoder(...)) for each sentence.
- A second Transformer “body” operates causally over sentence embeddings.
Dynamic Sparse Attention:
- During encoding, attention is masked to intra-sentence blocks.
- At the sentence level, embeddings attend to all prior sentences.
Inference and Caching Optimization:
By caching finished sentence embeddings, GPTHF reuses computation, yielding per-token complexity that scales with the number of sentences rather than the number of tokens.
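An illustrative sketch of the caching idea, with mean pooling standing in for GPTHF’s word-level Transformer encoder (class and function names are assumptions):

```python
import numpy as np

def pool_sentence(token_embs):
    """Mean-pool token embeddings into one fixed-size sentence embedding —
    a stand-in for the block-local word-level Transformer encoder."""
    return token_embs.mean(axis=0)

class SentenceCache:
    """Reuse embeddings of already-encoded sentences, so the cost of a new
    generation step scales with the number of *new* sentences, not with
    all previously seen tokens. Illustrative sketch only."""
    def __init__(self):
        self._cache = {}

    def encode(self, sentences):
        out = []
        for sent in sentences:                       # sent: (tokens, dim)
            key = tuple(sent.flatten().tolist())     # hashable token key
            if key not in self._cache:               # encode only once
                self._cache[key] = pool_sentence(sent)
            out.append(self._cache[key])
        return np.stack(out)
```

The sentence-level Transformer body would then attend causally over the stacked embeddings.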
Empirical Trade-offs:
GPTHF achieves substantial reductions in FLOPs and corresponding inference speedups on certain tasks, at the expense of a modest perplexity penalty. Sentence-splitting is critical; generation quality can be affected by sentence-boundary prediction (Gu et al., 14 Mar 2025).
3. Structured Linear Factorizations: Kronecker, TT, and Orthogonal Transformations
Several schemes target the replacement of dense matrices with mathematically structured, low-parametric forms:
3.1 Kronecker Products
Krony-PT and KnGPT2 both compress transformer and embedding matrices via Kronecker factorizations (Ayad et al., 2024, Edalati et al., 2021). For a weight $W \in \mathbb{R}^{m_1 m_2 \times n_1 n_2}$, choose factor shapes $(m_1, n_1)$ and $(m_2, n_2)$, then approximate $W \approx A \otimes B$ with $A \in \mathbb{R}^{m_1 \times n_1}$ and $B \in \mathbb{R}^{m_2 \times n_2}$, with storage dropping to $m_1 n_1 + m_2 n_2$ from $m_1 m_2 n_1 n_2$.
- Krony-PT: Either single- or multi-factor decompositions, leveraging Van Loan SVD initialization or pruning-based methods. Compressing the GPT-2 FFN matrices yields an effective model of $81$M parameters versus the original $124$M; perplexity outperforms distillation baselines (Ayad et al., 2024).
- KnGPT2: Compresses half of all linear layers and embedding via rank-1 factorizations, recovers performance with only minimal pretraining (intermediate-layer KD) (Edalati et al., 2021).
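The Van Loan SVD initialization mentioned above can be sketched as follows: rearranging $W$ turns the best Frobenius-norm Kronecker approximation $A \otimes B$ into a rank-1 SVD problem (shapes are illustrative):

```python
import numpy as np

def nearest_kronecker(W, m1, n1, m2, n2):
    """Van Loan's nearest-Kronecker-product construction: rearrange W so
    that the best A (x) B in Frobenius norm is the rank-1 truncated SVD of
    the rearranged matrix. Requires W.shape == (m1*m2, n1*n2)."""
    assert W.shape == (m1 * m2, n1 * n2)
    # Each row of R flattens one m2 x n2 block of W; the row index encodes
    # the block's (i1, j1) position.
    R = np.empty((m1 * n1, m2 * n2))
    for i1 in range(m1):
        for j1 in range(n1):
            block = W[i1 * m2:(i1 + 1) * m2, j1 * n2:(j1 + 1) * n2]
            R[i1 * n1 + j1] = block.reshape(-1)
    U, S, Vt = np.linalg.svd(R, full_matrices=False)
    # Rank-1 factors of R give the Kronecker factors of W.
    A = (np.sqrt(S[0]) * U[:, 0]).reshape(m1, n1)
    B = (np.sqrt(S[0]) * Vt[0]).reshape(m2, n2)
    return A, B
```

When $W$ is exactly a Kronecker product, the rearranged matrix has rank 1 and the recovery is exact; otherwise this yields the Frobenius-optimal single-factor initialization.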
3.2 Tensor-Train (TT) Decomposition
TTD reshapes large matrices into high-order tensors and factors them into sequential “cores” (Huang et al., 31 Jan 2025, Xu et al., 2023). Given $W \in \mathbb{R}^{m \times n}$ with $m = \prod_{k=1}^{d} m_k$ and $n = \prod_{k=1}^{d} n_k$, tensorize $W$ into a tensor indexed by $d$-tuples, then represent multiplication by $W$ as a chain of TT-cores $G_k \in \mathbb{R}^{r_{k-1} \times m_k n_k \times r_k}$ (with $r_0 = r_d = 1$). The compression ratio is $\frac{mn}{\sum_{k=1}^{d} r_{k-1}\, m_k n_k\, r_k}$.
- TTD achieves substantial layer-level compression and whole-network compression of roughly $1.6\times$ and above, with minimal loss in perplexity and C-Eval accuracy for ChatGLM3-6B and LLaMA2-7B (Huang et al., 31 Jan 2025).
TT is particularly effective for the embedding layer, where experiments show multi-fold compression with negligible loss (Xu et al., 2023).
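The sequential-SVD construction (TT-SVD) behind these decompositions can be sketched as follows; `max_rank` caps the TT-ranks $r_k$, and the Transformer-specific tensorization of weight matrices is omitted:

```python
import numpy as np

def tt_svd(T, max_rank=8):
    """TT-SVD sketch: peel off one mode at a time with a truncated SVD,
    yielding 3-way cores of shape (r_{k-1}, n_k, r_k) with r_0 = r_d = 1."""
    dims = T.shape
    cores, r_prev = [], 1
    M = np.asarray(T, dtype=float)
    for n in dims[:-1]:
        M = M.reshape(r_prev * n, -1)          # unfold current mode
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        r = min(max_rank, len(S))              # truncate to the rank cap
        cores.append(U[:, :r].reshape(r_prev, n, r))
        M = S[:r, None] * Vt[:r]               # carry the remainder forward
        r_prev = r
    cores.append(M.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the core chain back into a dense tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape([c.shape[1] for c in cores])
```

With `max_rank` at least the true TT-ranks the reconstruction is exact; smaller caps trade accuracy for the compression ratio given above.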
3.3 Orthogonal Transforms and Structured Projections
ProcrustesGPT rotates weights via an orthogonal matrix $Q$ to maximize compressibility under structured families such as Kronecker-sum or permutation-sparse matrices (Grishina et al., 3 Jun 2025). Layerwise alternating minimization is performed:
- Step A: Project the rotated weights into the chosen structured class, given the current $Q$.
- Step B: Solve a weighted Orthogonal Procrustes Problem to update $Q$.
Without fine-tuning, weight compression of $14\%$ and above is attainable with consistently lower perplexity than other fine-tuning-free baselines (Grishina et al., 3 Jun 2025).
4. Architectural Linear Projection Variants
A distinct technique modifies the GPT stack by inserting linear dimensionality reductions between groups of layers. In the LinearGPT architecture (lc-gpt), the hidden dimension is halved by an intermediate linear projection after every two blocks (Suresh et al., 2024).
Structural Recursion:
This substantially reduces the total parameter count and speeds up training, with no measurable loss in task performance on code-completion objectives.
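The shape bookkeeping of this progressive halving can be sketched with stubbed linear blocks (layer internals, widths, and depth here are illustrative assumptions, not lc-gpt’s exact configuration):

```python
import numpy as np

def build_halving_stack(d0=256, n_blocks=8, halve_every=2, seed=0):
    """Stack of linear 'blocks' where the hidden dimension is halved by a
    projection after every `halve_every` blocks (except after the last).
    Block internals are stubbed with single matrices; the point is how
    parameter count shrinks as the width decays geometrically."""
    rng = np.random.default_rng(seed)
    layers, d = [], d0
    for i in range(1, n_blocks + 1):
        layers.append(("block", rng.standard_normal((d, d)) / np.sqrt(d)))
        if i % halve_every == 0 and i < n_blocks:
            layers.append(("proj", rng.standard_normal((d, d // 2)) / np.sqrt(d)))
            d //= 2
    return layers

def forward(layers, x):
    for _, Wm in layers:
        x = x @ Wm
    return x
```

With width halved every two blocks, later blocks cost a quarter, a sixteenth, and so on of the first blocks’ parameters, which is where the overall savings come from.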
5. Vocabulary and Output Layer Compression
High memory and compute cost in the output head can be dominated by the vocabulary projection. A two-level grouping approach partitions the vocabulary using BPE merges, then applies shared per-group linear transformations with per-group scale and shift (Vennam et al., 2024).
- For a $V$-way softmax, introduce $G$ groups with $V/G$ tokens per group. The softmax is decomposed as $p(w) = p(g(w)) \cdot p(w \mid g(w))$: a $G$-way softmax over groups multiplied by a within-group softmax computed through the shared, per-group scaled-and-shifted projection.
- Reduces activation memory and improves throughput substantially, with negligible drop in human-rated TinyStories metrics.
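The two-level decomposition can be sketched as follows, with assumed parameter names: a $G$-way group softmax multiplied by a within-group softmax computed from a single shared projection modulated by per-group scale and shift:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def two_level_probs(h, W_g, W_shared, scales, shifts):
    """Sketch of the grouped vocabulary softmax. h: (d,) hidden state;
    W_g: (d, G) group head; W_shared: (d, T) shared within-group head;
    scales: (G,), shifts: (G, T) per-group affine modulation. All
    parameter names are assumptions; the paper groups tokens by BPE merges."""
    G = W_g.shape[1]
    p_group = softmax(h @ W_g)                             # (G,)
    shared = h @ W_shared                                  # computed once
    within = np.stack([softmax(scales[g] * shared + shifts[g])
                       for g in range(G)])                 # (G, T)
    return (p_group[:, None] * within).ravel()             # joint over V = G*T
```

The shared projection is computed once per token, so the output-head cost scales with $G + V/G$ rather than $V$.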
6. Hardware Mapping and Inference Acceleration
TTD-compressed models mapped to hardware such as FPGAs via Group Vector Systolic Array (GVSA) architectures deliver further acceleration (Huang et al., 31 Jan 2025). Execution of TT-sharded matrix multiplies is serviced by parallel vector PEs, with pipelined partial-sum reordering. ChatGLM3-6B and LLaMA2-7B deployed in this format achieved marked first-token latency reductions and throughput exceeding optimized GPU baselines.
7. Trade-Offs, Limitations, and Selection Guidelines
- Compression vs. Accuracy: Aggressive quantization (e.g., $4$-bit) or deep low-rank factorization provides the largest compression, typically at a modest loss in perplexity or task accuracy, depending on the scheme (Dong et al., 2023, Gu et al., 14 Mar 2025, Ayad et al., 2024).
- Block/Rank Choices: An empirical sweep of block size, bit-width, Kronecker rank, or TT-rank is essential. Default recommendations include moderate block sizes, bit-widths of $4$ or $8$, small Kronecker ranks, and TT-ranks chosen to keep the per-layer error within an acceptable perplexity budget.
- No-Retrainability: Methods such as BCT and ProcrustesGPT can be applied directly to a pretrained model with calibration data only, avoiding expensive retraining loops (Dong et al., 2023, Grishina et al., 3 Jun 2025).
- Applicability: Structure-based methods (Kronecker, TT) are amenable both to encoder and decoder (GPT) architectures and can be combined with other compression regimes (pruning, quantization).
- Limitations: Sentence-compression and hierarchical methods can induce sentence boundary and generation quality artifacts, especially on small models without auxiliary modeling (Gu et al., 14 Mar 2025).
References
- BCT: Blockwise Compression of Transformer-based Models without Retraining (Dong et al., 2023)
- GPTHF: Text Compression for Efficient Language Generation (Gu et al., 14 Mar 2025)
- ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations (Grishina et al., 3 Jun 2025)
- Krony-PT: GPT2 compressed with Kronecker Products (Ayad et al., 2024)
- KnGPT2: Kronecker Decomposition for GPT Compression (Edalati et al., 2021)
- TensorGPT: Efficient Compression of LLMs based on Tensor-Train Decomposition (Xu et al., 2023)
- LLM Vocabulary Compression for Low-Compute Environments (Vennam et al., 2024)
- LC-GPT: Towards smaller, faster decoder-only transformers (Suresh et al., 2024)
- TTD on FPGA: A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator (Huang et al., 31 Jan 2025)
LinearlyCompressedGPT frameworks thus offer a flexible design space, ranging from quantization to low-rank tensorization, for shrinking GPT-family models while retaining their essential generative capacity.