Compressed LLMs: Efficiency & Trade-offs

Updated 4 March 2026
  • Compressed LLMs are compact, transformer-based models with reduced parameters and precision to lower energy and hardware requirements.
  • They employ techniques such as quantization, pruning, low-rank decomposition, and knowledge distillation to optimize performance and efficiency.
  • Empirical compression laws quantify trade-offs between model size, accuracy, and speed, informing strategies for various deployment scenarios.

Compressed LLMs are compact, resource-efficient versions of modern Transformer-based LLMs in which memory footprint, parameter count, arithmetic precision, or context-processing workload is reduced by applying model compression techniques. Compression is motivated by the prohibitive cost, latency, and energy associated with deploying full-scale LLMs on single-GPU servers, edge/mobile devices, or other resource-constrained settings. The field encompasses quantization, pruning, low-rank and vector decomposition, knowledge distillation, search for efficient sub-networks, entropy coding, and specialized techniques for long-context inference. Recent research has established empirical “compression laws” that quantitatively model the trade-offs between compression ratio, intrinsic/extrinsic performance, speed, and memory savings across large model families.

1. Compression Principles and Scaling Laws

Model compression exploits redundancy in LLMs by reducing the effective number of parameters or the bit-width allotted to each weight, neuron, or computation unit. The overarching goal is to shrink storage, runtime memory, bandwidth, and/or computational requirements while minimizing the loss in accuracy and linguistic capability.

Recent scaling analyses show structured compression induces predictable performance degradation, captured via empirical laws such as:

  • Test cross-entropy loss increases quadratically with compression ratio $r$:

$$L(r) = L_0^{\alpha}\,(1+r)^{\beta}$$

with $\alpha \approx 0.74$ and $\beta \approx 2.02$ for structured pruning (Sengupta et al., 6 Apr 2025).

  • Zero-shot accuracy drops linearly with $r$:

$$A(r) \approx A_0 - 1.05\,A_0\,r$$

Recovery fine-tuning (RFT) after pruning can partially remediate this loss, yielding up to 55% improvement in intrinsic loss and 14% in extrinsic accuracy; the recovery is modeled as an additional power law in the RFT data size (Sengupta et al., 6 Apr 2025).
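
The two laws above can be applied directly as rough planning estimates. A minimal Python sketch, assuming the quoted coefficients for structured pruning and an illustrative base model (not figures from the cited work):

```python
def predicted_loss(l0: float, r: float, alpha: float = 0.74, beta: float = 2.02) -> float:
    """Projected test cross-entropy after compression at ratio r: L(r) = L0^alpha * (1 + r)^beta."""
    return (l0 ** alpha) * (1.0 + r) ** beta

def predicted_accuracy(a0: float, r: float, slope: float = 1.05) -> float:
    """Projected zero-shot accuracy after compression at ratio r: A(r) = A0 - slope * A0 * r."""
    return a0 * (1.0 - slope * r)

if __name__ == "__main__":
    # Hypothetical base model: cross-entropy 2.1 nats/token, 62% zero-shot accuracy.
    for r in (0.1, 0.3, 0.5):
        print(r, round(predicted_loss(2.1, r), 3), round(predicted_accuracy(0.62, r), 3))
```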

Compression also delivers substantial inference speedups at high ratios: up to 60% for large models (>7B parameters), but only 24–35% for models at or below 7B. Smaller models and other deployment regimes therefore call for more selective choices of compression depth and fine-tuning.

2. Methods of LLM Compression

Compression approaches subdivide into weight optimization (quantization, pruning, decomposition) and architectural optimization (sub-network search, context compression). Representative families are:

A. Quantization

  • Post-training quantization (PTQ) or quantization-aware training maps continuous weights (or activations) to low-bit representations (e.g., INT8, INT4, 2–3 bit); a minimal per-channel example is sketched after this list. Advanced PTQ techniques (GPTQ, AWQ, SpQR, QuIP#) leverage calibration data and second-order statistics to quantize LLMs to 3–4 bits with negligible accuracy loss for many tasks (Jaiswal et al., 2023, Zhu et al., 2023, Liu et al., 20 Apr 2025).
  • Mixed-precision, activation-aware, and outlier-targeted quantization avoid overflow on rare large activations (Zhu et al., 2023).
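
For intuition, the following is a minimal sketch of symmetric per-channel round-to-nearest PTQ in NumPy. It is a generic baseline, not GPTQ, AWQ, SpQR, or QuIP# themselves, and uses no calibration data; the tensor shapes are illustrative:

```python
import numpy as np

def quantize_per_channel(w: np.ndarray, n_bits: int = 4):
    """Symmetric per-channel round-to-nearest quantization: each output row gets
    its own scale, so one channel's outliers do not widen every other channel's grid."""
    qmax = 2 ** (n_bits - 1) - 1                     # 7 for 4-bit signed
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)       # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(16, 64).astype(np.float32)
    q, s = quantize_per_channel(w, n_bits=4)
    print("mean abs quantization error:", float(np.abs(w - dequantize(q, s)).mean()))
```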

B. Pruning

  • Pruning removes individual weights, channels, attention heads, or entire blocks. Unstructured (magnitude) pruning yields limited practical speedup, whereas semi-structured (N:M) and fully structured variants map better to hardware but degrade knowledge-intensive performance more sharply beyond 20–30% sparsity (Ma et al., 2023, Jaiswal et al., 2023).

C. Low-Rank and Vector Decomposition

  • Weight matrices are factored into low-rank components (e.g., NSVD) or quantized codebooks via data-aware vector quantization; hybrid quantization-plus-low-rank schemes such as CALDERA reach below 2.5 bits per parameter (Lu et al., 21 Mar 2025, Saha et al., 2024).

D. Knowledge Distillation

  • A smaller student model is trained to reproduce the behavior of a larger teacher; in multi-stage pipelines, distillation typically precedes structured pruning and quantization (Girija et al., 5 May 2025).

E. Sub-Network and Architecture Search

  • Pareto-optimal sub-networks (attention heads, neurons, layers) are automatically selected using self-supervised or evolutionary NAS, yielding block-sparse variants that are empirically superior to uniform pruning (Sukthanker et al., 2024, Shen et al., 2024).
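
As an illustration of the selection step, a toy Pareto filter over candidate sub-networks is sketched below, assuming each candidate has already been scored by accuracy and latency (the names and numbers are hypothetical; the cited NAS pipelines involve much richer search spaces and objectives):

```python
from typing import List, Tuple

def pareto_front(candidates: List[Tuple[str, float, float]]) -> List[Tuple[str, float, float]]:
    """Keep candidates not dominated by any other (higher accuracy AND lower latency)."""
    front = []
    for name, acc, lat in candidates:
        dominated = any(
            (a >= acc and l <= lat) and (a > acc or l < lat)
            for _, a, l in candidates
        )
        if not dominated:
            front.append((name, acc, lat))
    return front

# Hypothetical sub-network candidates: (name, zero-shot accuracy, latency in ms/token)
subnets = [("drop_2_layers", 0.71, 38.0), ("drop_4_layers", 0.69, 30.0),
           ("drop_8_heads", 0.70, 40.0), ("drop_4_mlp", 0.68, 29.0)]
print(pareto_front(subnets))  # drop_8_heads is dominated and filtered out
```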

F. Entropy Coding and Double Compression

  • After quantization (e.g., INT8), statistical redundancy in the quantized weights can be exploited by entropy coders (e.g., ANS), with further sparsification (pruning) before coding (Wang et al., 21 Feb 2025). Speed-adaptive partial compression enables full-throughput inference with substantial memory savings.
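
A minimal sketch of the double-compression idea, assuming INT8 weights, magnitude-based sparsification, and zlib as a stand-in for the ANS entropy coder used in the cited work:

```python
import zlib
import numpy as np

def prune_int8(q: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction of an INT8 weight tensor."""
    k = int(sparsity * q.size)
    if k == 0:
        return q
    threshold = np.partition(np.abs(q).ravel(), k - 1)[k - 1]
    return np.where(np.abs(q) <= threshold, 0, q).astype(np.int8)

rng = np.random.default_rng(0)
# Hypothetical INT8-quantized weight matrix
q = (rng.standard_normal((512, 512)) * 20).clip(-127, 127).astype(np.int8)

for sparsity in (0.0, 0.5):
    pruned = prune_int8(q, sparsity)
    packed = zlib.compress(pruned.tobytes(), level=6)
    print(f"sparsity={sparsity:.1f}: {q.nbytes} B -> {len(packed)} B")  # more zeros, smaller stream
```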

G. Context Compression

  • For long-context tasks, context compression using sentence-anchored learned gist tokens allows 2×–8× reduction in prefix memory and compute, with proper attention masking and fine-tuning on top of base LLMs (Tarasov et al., 11 Nov 2025).
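
The memory side of this trade-off can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes hypothetical model dimensions and only counts KV-cache bytes; it does not model the learned gist tokens or the attention-mask adaptation themselves:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size: keys + values for every layer and KV head, at FP16 by default."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

prefix_tokens = 32_000  # hypothetical long-context prefix
for factor in (2, 4, 8):
    full = kv_cache_bytes(prefix_tokens, n_layers=32, n_kv_heads=8, head_dim=128)
    gist = kv_cache_bytes(prefix_tokens // factor, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"{factor}x gist compression: {full / 2**30:.2f} GiB -> {gist / 2**30:.2f} GiB")
```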

3. Empirical Benchmarks and Evaluation

Performance is assessed both intrinsically (perplexity, cross-entropy) and extrinsically (task-specific accuracy), and intrinsic results often fail to predict extrinsic ones (Jaiswal et al., 2023). Benchmarks such as LLM-KICK systematically probe knowledge-intensive tasks including closed-book QA, in-context retrieval, summarization, and instruction following.

  • Pruning beyond 20–30% unstructured or to moderate structured (N:M) levels causes catastrophic breakdowns on fact recall, even with negligible perplexity increase. Quantization to 4 bits is less disruptive for reasoning and knowledge-intensive tasks, with <2% average drop in dense accuracy (Jaiswal et al., 2023).
  • Newer zero-shot shape-preserving frameworks (e.g., NoWag) and Q+LR approaches (e.g., CALDERA) reach or exceed the performance of established methods in both perplexity and downstream tasks, often with less calibration data and no fine-tuning (Liu et al., 20 Apr 2025, Saha et al., 2024).
  • Multilingual LLM compression must balance accuracy across linguistic resource densities; calibration data proportional to pre-training corpus shares, as in Multilingual Brain Surgeon, effectively preserves low-resource performance (Zeng et al., 2024).
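
A minimal sketch of proportional calibration allocation in the spirit of MBS, with illustrative corpus shares and budget (not the paper's numbers):

```python
def allocate_calibration(shares: dict, total_sequences: int) -> dict:
    """Split a calibration budget across languages in proportion to pre-training
    corpus share, guaranteeing at least one sequence per language."""
    norm = sum(shares.values())
    return {lang: max(1, round(total_sequences * s / norm)) for lang, s in shares.items()}

# Hypothetical corpus shares for a multilingual model
corpus_shares = {"en": 0.62, "zh": 0.12, "de": 0.08, "sw": 0.01, "yo": 0.002}
print(allocate_calibration(corpus_shares, total_sequences=512))
```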

A representative selection of methods and benchmarks is summarized:

| Methodology | Representative Papers | Task Regimes |
|---|---|---|
| PTQ (GPTQ, AWQ) | Jaiswal et al., 2023; Liu et al., 20 Apr 2025 | 3–4 bit; English/factual; Llama-2/3 |
| Structured pruning | Sengupta et al., 6 Apr 2025; Ma et al., 2023 | 10–50% block/attention/MLP pruning |
| Low-rank + quantization | Lu et al., 21 Mar 2025; Saha et al., 2024 | <2.5 bits per parameter; multi-task/multilingual |
| NAS sub-network | Sukthanker et al., 2024; Shen et al., 2024 | Pareto optima on 10–20 tasks |
| Double compression | Wang et al., 21 Feb 2025 | GPU/edge deployment after INT8 |
| Context/gist compression | Tarasov et al., 11 Nov 2025 | Long-context; summarization/retrieval |

4. Architectures, Algorithms, and Theoretical Guarantees

Compression algorithms operate over entire LLMs or per-layer, subject to various constraints:

  • Pruning can be unstructured (random or magnitude; limited practical speedup), group-wise (dependency- and structure-aware), or semi/fully structured (block, N:M, attention head, MLP channel) (Ma et al., 2023, Sukthanker et al., 2024, Jaiswal et al., 2023); a minimal N:M sketch follows this list. Group coupling is critical to maintain graph validity, especially in transformers.
  • Quantization is typically performed post-training with calibration sets of 128–2,048 sequences, with more advanced variants (AWQ, GPTQ, NoWag) using per-channel normalizations to minimize activation/weight quantization error (Liu et al., 20 Apr 2025, Jaiswal et al., 2023).
  • Vector quantization applies data-aware K-means in block- or row-wise units, while low-rank methods (NSVD, CALDERA) allocate SVD rank budgets between activation-dominant and weight-dominant components to avoid overfitting calibration statistics (Lu et al., 21 Mar 2025, Saha et al., 2024).
  • Empirical and theoretical analyses yield explicit trade-offs and error upper bounds as functions of target sparsity, rank, and bit-width (Sengupta et al., 6 Apr 2025, Saha et al., 2024).
  • Entropy coding coupled with INT8 quantization and aggressive pruning leverages enhanced zero run-lengths for >2× additional compression, with negligible loss if decompression speed is managed (Wang et al., 21 Feb 2025).
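
The N:M pattern mentioned in the first bullet can be sketched compactly. The example below implements generic 2:4 magnitude pruning along rows; it is not any specific cited algorithm and ignores group coupling and recovery fine-tuning:

```python
import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    """Within every group of 4 consecutive weights along a row, keep the 2 largest magnitudes."""
    rows, cols = w.shape
    assert cols % 4 == 0, "row length must be a multiple of the group size"
    groups = w.reshape(rows, cols // 4, 4)
    # indices of the 2 smallest magnitudes in each group of 4
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.random.randn(4, 8).astype(np.float32)
print(prune_2_of_4(w))  # at most two non-zeros remain per group of four
```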

5. Practical Guidelines and Deployment Considerations

Best practices are now well established:

  • For factual/knowledge-intensive tasks, limit pruning to ≤30% unstructured or use block/structured methods with careful calibration; quantization to 4 bits is preferred if hardware and downstream tasks permit (Jaiswal et al., 2023, Zhu et al., 2023).
  • Always validate on downstream benchmarks beyond perplexity, as compressed models may pass intrinsic tests but fail extrinsically (Jaiswal et al., 2023).
  • Apply normalization (e.g., NoWag) or randomized transforms before quantization or VQ to prevent outlier domination and optimize block-wise K-means (Liu et al., 20 Apr 2025).
  • Sequencing distillation → structured pruning → quantization → entropy coding yields maximal compression with balanced performance (Girija et al., 5 May 2025).
  • For on-device/edge scenarios, prioritize transparent, shape-preserving, and calibration-driven methods, and where applicable, deploy double compression with speed-adaptive chunking (Wang et al., 21 Feb 2025).
  • For long-context applications, use learned compression tokens with attention mask adaptations for 2×–8× reductions in KV-cache and prefix FLOPs (Tarasov et al., 11 Nov 2025).
  • Multilingual models require calibration-sampling proportional to language resource share and similarity, as in MBS (Zeng et al., 2024).

Pitfalls include hardware underutilization from unstructured sparsity, overfitting to small calibration distributions, and phase transitions at high compression ratios where theoretical bounds may break down (Sengupta et al., 6 Apr 2025, Jaiswal et al., 2023, Lu et al., 21 Mar 2025).

6. Advanced Topics and Emerging Directions

Several active research trajectories are apparent:

7. Benchmarks, Metrics, and Limitations

Model assessment is multifaceted:

| Metric | Description |
|---|---|
| Perplexity | Intrinsic cross-entropy / predictive power |
| Downstream accuracy | Task-specific (QA, reasoning, etc.) |
| Compression ratio | Original size / compressed size |
| Bits per parameter | Total bits / number of parameters |
| Inference latency | ms per token or ms per batch |
| Memory footprint | GB per model (runtime/deployment) |
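
Small helper functions for the size and intrinsic metrics above (the example numbers are illustrative, not results from any cited paper):

```python
import math

def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    return original_bytes / compressed_bytes

def bits_per_parameter(total_bits: int, n_params: int) -> float:
    return total_bits / n_params

def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the per-token cross-entropy in nats."""
    return math.exp(cross_entropy_nats)

n_params = 7_000_000_000                     # hypothetical 7B-parameter model
fp16_bits, quant_bits = 16 * n_params, 4 * n_params
print(compression_ratio(fp16_bits // 8, quant_bits // 8))  # 4.0
print(bits_per_parameter(quant_bits, n_params))            # 4.0
print(round(perplexity(2.1), 2))                           # ~8.17
```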

Limitations remain: many results are limited to post-training compression, intrinsic benchmarks, or single-language settings; hardware-accelerator deployment of block-sparse or context-compressed models remains an area of active engineering; and model robustness under high compression is less understood for generation and few-shot settings (Sengupta et al., 6 Apr 2025, Lu et al., 21 Mar 2025, Jaiswal et al., 2023).

