
LinearlyCompressedGPT Compression Techniques

Updated 19 February 2026
  • LinearlyCompressedGPT is a family of methods that reduce computational and memory footprints in GPT-style models through linear modifications, factorization, or quantization without full retraining.
  • The approach employs techniques like blockwise quantization, Kronecker and tensor-train decompositions, and hierarchical grouping to achieve up to 10× reduction in model size with minimal loss in accuracy.
  • These methods enable efficient deployment and inference acceleration on hardware such as FPGAs, balancing trade-offs between compression and performance for real-world applications.

LinearlyCompressedGPT refers to a broad family of techniques for reducing the computational and memory footprint of GPT-style (decoder-only Transformer) LLMs through purely linear structural modifications, factorization, or quantization, usually without full retraining. Unlike distillation, pruning, or non-linear rewiring, these methods compress the large matrix multiplications and embeddings at the core of Transformer models using quantization, structured matrix approximations, hierarchical blockwise dimension reductions, or sparse and grouped computation. This enables deployment to resource-constrained environments while maintaining acceptable loss in performance. LinearlyCompressedGPT has been realized through a variety of frameworks; major lines of work include blockwise quantization, Kronecker and tensor-train (TT) decomposition, hierarchical dynamic grouping, and progressive depthwise linear projection.

1. Blockwise Quantization: The BCT Approach

Blockwise Compression of Transformers (BCT) implements LinearlyCompressedGPT through blockwise shift quantization on all linear matrices and bias vectors, without retraining (Dong et al., 2023). The method partitions each matrix into discrete B×B blocks and applies independent scale quantization within each. This sharply reduces quantization-induced distribution shift compared to per-layer schemes, eliminating retraining requirements.

Core Quantization Process:

Let $x\in\mathbb{R}^{B\times B}$, let $k$ denote the bit-width, and let $I_k=[-2^{k-1},\,2^{k-1}-1]$ be the signed integer range.

  • Compute the block’s shift:

$$\text{shift}(x) = \left\lfloor \log_2\!\left(\max_{i,j}|x_{i,j}| \,/\, 2^{k-1}\right) \right\rfloor$$

  • Quantize:

$$x_c = \text{clip}\!\left(\text{round}\!\left(x \cdot 2^{-\text{shift}}\right),\, -2^{k-1},\, 2^{k-1}-1\right)$$

  • Decompress:

$$Q^{-1}(x_c;\text{shift}) = x_c \cdot 2^{\text{shift}}$$

At inference, GEMMs are computed in low-bit, exponent-aligned blocks.
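The shift/quantize/decompress steps above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: in particular, the shift here is rounded up (`ceil`) so that the block's peak magnitude lands inside the signed $k$-bit range, which keeps elementwise error within about one quantization bin; the exact rounding convention is an assumption.

```python
import numpy as np

def bct_quantize_block(x, k=4):
    """Shift-quantize one B x B block (sketch of blockwise shift quantization).

    Rounding the shift up (ceil) is an assumed convention chosen so the
    block's largest magnitude fits the signed k-bit range before clipping.
    """
    max_abs = np.max(np.abs(x))
    if max_abs == 0.0:
        return np.zeros_like(x, dtype=np.int32), 0
    # One power-of-two scale shared by the whole block.
    shift = int(np.ceil(np.log2(max_abs / 2 ** (k - 1))))
    # Quantize: scale by 2^-shift, round, clip to [-2^(k-1), 2^(k-1)-1].
    codes = np.clip(np.round(x * 2.0 ** (-shift)),
                    -2 ** (k - 1), 2 ** (k - 1) - 1).astype(np.int32)
    return codes, shift

def bct_dequantize(codes, shift):
    # Decompress: multiply the integer codes back by 2^shift.
    return codes.astype(np.float64) * 2.0 ** shift
```

A full matrix would be processed by tiling it into $B\times B$ blocks and applying this pair of functions per block, storing only the integer codes and one shift per block.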

Error Bound and Theoretical Guarantees:

Elementwise quantization error is bounded by half a quantization bin; per-layer error is $O(2^{-k})$. Residuals do not induce out-of-distribution behavior globally, and empirical boxplots show that per-block error does not propagate destructively.

Empirical Results:

  • BERT-base (as a stand-in for GPT): $4$-bit weights plus $8$-bit activations yield a $7.988\times$ model size reduction with $<1\%$ accuracy loss (e.g., $-0.8\%$ on GLUE SST-2). Pure $8$-bit quantization yields a $4\times$ reduction with near-zero loss.
  • fp8 (8-bit float) BCT achieves a $4\times$ size reduction with $<0.01\%$ accuracy loss.

Block/Bit Parameterization and Trade-offs:

Block sizes $B=32$–$128$ are typical; $k=4$ is used for aggressive shrinkage and $k=8$ for near-lossless compression. Larger $B$ reduces metadata but coarsens the quantization; the smallest $k$ and largest $B$ that maintain acceptable perplexity are recommended.
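The metadata trade-off can be made concrete with a back-of-the-envelope storage count: $k$ bits per weight plus one shared shift per $B\times B$ block. The 8-bit shift width below is an illustrative assumption.

```python
import math

def bct_storage_bits(rows, cols, B=64, k=4, shift_bits=8):
    """Bits to store a blockwise-quantized matrix: k bits per weight plus
    one shared shift per B x B block (shift width of 8 bits is assumed)."""
    n_blocks = math.ceil(rows / B) * math.ceil(cols / B)
    return rows * cols * k + n_blocks * shift_bits

def compression_vs_fp32(rows, cols, **kw):
    # Ratio of 32-bit float storage to blockwise-quantized storage.
    return 32 * rows * cols / bct_storage_bits(rows, cols, **kw)
```

For a $768\times768$ weight at $k=4$, $B=64$, this gives a ratio just under $8\times$; halving $B$ to $32$ quadruples the number of shifts and slightly lowers the ratio, illustrating why larger blocks reduce metadata overhead.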

2. Hierarchical Dynamic Grouping and Attention

Hierarchical models, exemplified by GPTHF, fundamentally restructure the Transformer, compressing token sequences into fixed-size sentence embeddings and operating subsequent transformations at the sentence level (Gu et al., 14 Mar 2025).

Architecture:

  • A word-level Transformer encoder processes tokens within a sentence using block-local self-attention.
  • Sentence-level representations are pooled, forming $e_i = \text{Pooling}_{s_i}(\text{wlt\_encoder}(\ldots))$ for each sentence.
  • A second Transformer “body” operates causally over sentence embeddings.

Dynamic Sparse Attention:

  • During encoding, attention is masked to intra-sentence blocks.
  • At the sentence level, embeddings attend to all prior sentences.

Inference and Caching Optimization:

By caching finished sentence embeddings, GPTHF reuses computation, yielding per-token complexity that scales as $O(L_s^2+S^2)$ (where $L_s$ is the per-sentence token count and $S$ the number of sentences), rather than $O(N^2)$ over all $N$ tokens.
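The asymptotic claim can be illustrated with a simple operation count; constants, head counts, and the exact caching model are simplified away, so this is only a sketch of the scaling argument.

```python
def full_attention_cost(n_tokens):
    # Dense causal attention: every token attends over the sequence, O(N^2).
    return n_tokens ** 2

def hierarchical_attention_cost(n_sentences, sent_len):
    # Block-local word-level attention within each sentence (S * L_s^2)
    # plus sentence-level attention over cached embeddings (S^2).
    return n_sentences * sent_len ** 2 + n_sentences ** 2
```

For $N=2048$ tokens split into $S=128$ sentences of $L_s=16$ tokens, the hierarchical count is well over an order of magnitude smaller than dense attention, consistent with the FLOP reductions reported below.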

Empirical Trade-offs:

GPTHF achieves up to a $10\times$ reduction in FLOPs and a $3\times$ speedup on certain tasks, at the expense of a $\sim 5$-point perplexity penalty. Sentence splitting is critical; generation quality can be affected by sentence-boundary prediction (Gu et al., 14 Mar 2025).

3. Structured Linear Factorizations: Kronecker, TT, and Orthogonal Transformations

Several schemes target the replacement of dense matrices with mathematically structured, low-parametric forms:

3.1 Kronecker Products

Krony-PT and KnGPT2 both compress transformer and embedding matrices via Kronecker factorizations (Ayad et al., 2024, Edalati et al., 2021). For a weight $W \in \mathbb{R}^{m\times n}$, choose $m=m_1 m_2$, $n=n_1 n_2$, then approximate $W \approx A \otimes B$ with $A\in\mathbb{R}^{m_1\times n_1}$ and $B\in\mathbb{R}^{m_2\times n_2}$, dropping storage from $mn$ to $m_1 n_1 + m_2 n_2$.

  • Krony-PT: Either single- or multi-factor, leveraging Van Loan SVD initialization or pruning-based methods. Compressing the GPT-2 FFN matrices ($3072\times768$) by factors of $4$ and $1$ gives effective models of $81$M parameters versus the original $124$M; perplexity outperforms distillation baselines (Ayad et al., 2024).
  • KnGPT2: Compresses half of all linear layers and the embedding via rank-1 factorizations, recovering performance with only minimal pretraining (intermediate-layer knowledge distillation) (Edalati et al., 2021).
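The Van Loan construction mentioned above can be sketched directly: reshaping $W$ into a grid of $m_2\times n_2$ blocks turns the nearest-Kronecker-product problem into a rank-1 approximation solvable by SVD. This assumes a single factor pair; the multi-factor variants sum several such terms.

```python
import numpy as np

def nearest_kronecker(W, m1, n1, m2, n2):
    """Best single-pair Kronecker approximation W ≈ A ⊗ B (Van Loan / SVD).

    Assumes W has shape (m1*m2, n1*n2); argument names are illustrative.
    """
    # Rearrange W so each row of R is the (row-major) vec of one
    # m2 x n2 block; the Kronecker problem becomes rank-1 approximation.
    R = (W.reshape(m1, m2, n1, n2)
          .transpose(0, 2, 1, 3)
          .reshape(m1 * n1, m2 * n2))
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Split the leading singular value evenly between the two factors.
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B
```

When $W$ is exactly a Kronecker product, the rank-1 SVD recovers it exactly (up to a joint sign flip of $A$ and $B$); for general weights it returns the Frobenius-optimal single-pair approximation.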

3.2 Tensor-Train (TT) Decomposition

TTD reshapes large matrices into high-order tensors and factors them into sequential “cores” (Huang et al., 31 Jan 2025, Xu et al., 2023). Given $W\in\mathbb{R}^{M\times N}$, tensorize $M$ and $N$ into $d$-tuples $(m_1,\ldots,m_d)$ and $(n_1,\ldots,n_d)$, then represent $W$ as multiplication through a chain of $d$ TT-cores. The compression ratio is:

$$\text{CR} = \frac{MN}{\sum_{k=1}^d m_k n_k\, r_{k-1} r_k}$$

  • TTD achieves layer-level compression up to $1000\times$ and whole-network compression of $1.6$–$1.94\times$ with minimal loss: e.g., $+2.62$ PPL and $-4.21$ C-Eval for ChatGLM3-6B and LLaMA2-7B (Huang et al., 31 Jan 2025).

TT is particularly effective for the embedding layer (experimentally, $2\times$–$3.3\times$ compression with negligible loss) (Xu et al., 2023).
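The CR formula above is easy to evaluate for candidate tensorizations; the helper below counts the $r_{k-1} m_k n_k r_k$ parameters of each core against the $MN$ dense entries.

```python
import math

def tt_compression_ratio(m_factors, n_factors, ranks):
    """Compression ratio of a TT-factored M x N matrix, per the CR formula.

    m_factors, n_factors: d-tuples with prod(m_factors) = M, prod(n_factors) = N.
    ranks: TT-ranks (r_0, ..., r_d) with boundary ranks r_0 = r_d = 1.
    """
    d = len(m_factors)
    assert len(n_factors) == d and len(ranks) == d + 1
    assert ranks[0] == 1 and ranks[-1] == 1
    dense = math.prod(m_factors) * math.prod(n_factors)  # M * N entries
    # Core k holds r_{k-1} * m_k * n_k * r_k parameters.
    tt = sum(ranks[k] * m_factors[k] * n_factors[k] * ranks[k + 1]
             for k in range(d))
    return dense / tt
```

For a $512\times512$ matrix tensorized as $(8,8,8)\times(8,8,8)$ with ranks $(1,4,4,1)$, this already exceeds $100\times$ at the layer level, showing how quickly modest TT-ranks compress large matrices.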

3.3 Orthogonal Transforms and Structured Projections

ProcrustesGPT rotates weights via an orthogonal matrix $Q$ to maximize compressibility under structured families such as Kronecker-sum or permutation-sparse matrices (Grishina et al., 3 Jun 2025). Layerwise alternating minimization is performed:

  • Step A: Project the weights into the chosen structured class, given $Q$.
  • Step B: Solve a weighted Orthogonal Procrustes Problem to update $Q$.
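Step B reduces, in its classic unweighted form, to a single SVD: the minimizer of $\|WQ - S\|_F$ over orthogonal $Q$ is the polar factor of $W^\top S$. The weighted variant used in ProcrustesGPT is more involved; this sketch shows only the standard solve.

```python
import numpy as np

def procrustes_rotation(W, S):
    """Solve min_Q ||W Q - S||_F over orthogonal Q (unweighted sketch).

    Classic Orthogonal Procrustes solution: Q = U V^T from the SVD of
    W^T S. The weighted problem in ProcrustesGPT differs from this.
    """
    U, _, Vt = np.linalg.svd(W.T @ S)
    return U @ Vt
```

In the alternating scheme, $S$ would be the structured approximation produced by Step A, and the two steps iterate until the residual stabilizes.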

Without fine-tuning, $14$–$25\%$ weight compression is attainable with consistently lower perplexity than other fine-tuning-free baselines (Grishina et al., 3 Jun 2025).

4. Architectural Linear Projection Variants

A distinct technique modifies the GPT stack by inserting linear dimensionality reductions between groups of layers. In the LinearGPT architecture (lc-gpt), the hidden dimension is linearly halved after every two blocks via intermediate linear layers $W^{(i)}\in\mathbb{R}^{D_{i+1}\times D_i}$ (Suresh et al., 2024).

Structural Recursion:

$$\mathbf{x}^{(i+1)} = W^{(i+1)}\, \mathrm{Block}_{D_i}\!\left(\mathrm{Block}_{D_i}\left(\mathbf{x}^{(i)}\right)\right)$$

This reduces the total parameter count by $36\%$ and speeds up training by $19\%$ with no measurable loss in task performance on code-completion objectives.
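The halving recursion can be sketched as a shape walk through the stack. The Transformer blocks themselves are stubbed out here, and all names and the random initialization are illustrative, not the paper's parameterization.

```python
import numpy as np

def linear_gpt_dims(d0, n_groups):
    """Hidden widths D_0, D_1, ... : each group of two blocks is followed
    by a projection that halves the hidden dimension."""
    dims = [d0]
    for _ in range(n_groups - 1):
        dims.append(dims[-1] // 2)
    return dims

def forward_sketch(x, dims, rng):
    """x: (seq_len, D_0). At each stage, two Block_{D_i} applications
    would run (identity stub here), then W^(i) in R^{D_{i+1} x D_i}
    projects the hidden states down."""
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        # ... two Transformer blocks at width d_in would act here ...
        W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        x = x @ W.T
    return x
```

With $D_0=768$ and four stages, the widths fall to $384$, $192$, and $96$, which is where the parameter and training-time savings come from: later blocks operate on progressively narrower states.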

5. Vocabulary and Output Layer Compression

High memory and compute cost in the output head can be dominated by the vocabulary projection. A two-level grouping approach partitions the vocabulary using BPE merges, then applies shared per-group linear transformations with per-group scale and shift (Vennam et al., 2024).

  • For a $|v|$-way softmax, introduce $G=\sqrt{|v|}$ groups with $S=\sqrt{|v|}$ tokens per group. The softmax is decomposed as:

$$p_\text{vocab}(g\cdot S + t) = p_\text{group}[g]\cdot p_{\text{token}\mid g}[t]$$

  • This reduces activation memory by up to $3.4\times$ and improves throughput by up to $3\times$, with a negligible drop in human-rated TinyStories metrics.
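The factorized distribution can be sketched as two small softmaxes whose outer product recovers the full vocabulary distribution. The parameter shapes (`W_group`, `group_heads`) are illustrative stand-ins, not the paper's exact parameterization with shared per-group transforms.

```python
import numpy as np

def two_level_softmax(h, W_group, group_heads):
    """Factorized vocab softmax: p(g*S + t) = p_group[g] * p(token t | group g).

    h: (d,) hidden state; W_group: (G, d); group_heads: (G, S, d).
    All shapes/names are illustrative assumptions.
    """
    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    p_group = softmax(W_group @ h)                        # (G,)
    p_token_given_g = softmax(group_heads @ h, axis=-1)   # (G, S)
    # Vocabulary index g*S + t gets probability p_group[g] * p_{t|g}[t].
    return (p_group[:, None] * p_token_given_g).ravel()
```

Because each conditional distribution sums to one, the outer product is itself a valid distribution over all $G\cdot S$ tokens, while only $G$-way and $S$-way softmaxes (each $\sqrt{|v|}$-sized) are ever materialized.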

6. Hardware Mapping and Inference Acceleration

TTD-compressed models mapped to hardware such as FPGAs via Group Vector Systolic Array (GVSA) architectures deliver further acceleration (Huang et al., 31 Jan 2025). Execution of TT-sharded matrix multiplies is serviced by parallel vector PEs with pipelined partial-sum reordering. ChatGLM3-6B and LLaMA2-7B deployed in this format achieved $1.45\times$–$1.57\times$ first-token latency reductions and throughput exceeding optimized GPU baselines.

7. Trade-Offs, Limitations, and Selection Guidelines

  • Compression vs. Accuracy: Aggressive quantization ($k=4$) or deep low-rank factorization provides compression up to $8\times$, typically at $<1\%$–$5\%$ loss in perplexity or task accuracy, depending on the scheme (Dong et al., 2023, Gu et al., 14 Mar 2025, Ayad et al., 2024).
  • Block/Rank Choices: Empirical sweeps of block size, bit-width, Kronecker rank, or TT-rank are essential; default recommendations include block size $B=64$, $k=4$ or $8$, Kronecker rank $r=2$–$4$, and TT-ranks chosen to keep per-layer error within $+0.5\%$ PPL.
  • No-Retrainability: Methods such as BCT and ProcrustesGPT can be applied directly to a pretrained model with calibration data only, avoiding expensive retraining loops (Dong et al., 2023, Grishina et al., 3 Jun 2025).
  • Applicability: Structure-based methods (Kronecker, TT) are amenable both to encoder and decoder (GPT) architectures and can be combined with other compression regimes (pruning, quantization).
  • Limitations: Sentence-compression and hierarchical methods can induce sentence boundary and generation quality artifacts, especially on small models without auxiliary modeling (Gu et al., 14 Mar 2025).


LinearlyCompressedGPT frameworks thus offer a flexible design space, ranging from quantization to low-rank tensorization, for shrinking GPT-family models while retaining their essential generative capacity.
