LLM-as-a-Compressor Benchmark

Updated 22 November 2025
  • LLM-as-a-Compressor Benchmark is a unified evaluation framework that measures LLMs’ compression capabilities using metrics like cross-entropy and compression ratio.
  • It assesses both model weight and prompt-level compression while ensuring that key reasoning and semantic integrity are preserved.
  • The benchmark guides practical optimization methods such as quantization, pruning, and prompt compression to balance efficiency with overall performance.

An LLM-as-a-Compressor Benchmark provides a standardized, rigorous evaluation suite for measuring the compressive, representational, and operational efficiency of LLMs in compression-centric roles. This class of benchmarks spans algorithmic compression (both intrinsic model-weight compression and data compression using LLM priors), prompt-level semantic compression for input minimization, and evaluation of the reasoning and intelligence retained under model and/or prompt compression.

1. Theoretical and Methodological Foundations

LLM-as-a-Compressor benchmarks are grounded in classic information theory, particularly Shannon entropy and the Solomonoff prior. The central observation is that autoregressive LLMs pretrained by maximum likelihood implicitly learn a probabilistic model $p_M(x)$, assigning high likelihood, and thus shorter codes (via arithmetic coding), to in-domain sequences. Compression performance can be measured directly by the cross-entropy $H(p, M) = -\sum_x p(x)\log_2 p_M(x)$ and the code length $L_M(x) = -\log_2 p_M(x)$, with the compression ratio $R = |\text{original}| / |\text{compressed}|$ as the metric of interest (Li et al., 24 Jun 2024, Guo et al., 20 Jun 2024). The tight coupling between model predictive power (cross-entropy or NLL) and compressibility underlies unified evaluation protocols: lossless compression rate becomes a global marker of LLM quality, generalization, and domain transfer.
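This coupling can be made concrete in a few lines. The following is a minimal sketch, assuming a Hugging Face causal LM (the `gpt2` checkpoint and the single-context-window input are illustrative assumptions, not part of any cited benchmark): it estimates the arithmetic-coding cost $L_M(x)$ as the cumulative per-token negative log-likelihood in bits and reports the implied ratio $R$ against the raw UTF-8 size, without producing an actual bitstream.

```python
# Hedged sketch: compression ratio implied by a causal LM's code length L_M(x),
# computed from cumulative NLL in bits (no arithmetic-coding bitstream is built).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative choice; any causal LM works the same way

def compression_ratio(text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

    ids = tokenizer(text, return_tensors="pt").input_ids          # [1, T]
    with torch.no_grad():
        logits = model(ids).logits                                # [1, T, vocab]

    # L_M(x) = -sum_t log2 p_M(x_t | x_<t); the first token is not coded,
    # which is negligible for long inputs.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    compressed_bits = -token_lp.sum().item() / math.log(2)

    original_bits = 8 * len(text.encode("utf-8"))
    return original_bits / compressed_bits                        # R = |original| / |compressed|

if __name__ == "__main__":
    print(f"R = {compression_ratio('The quick brown fox jumps over the lazy dog.'):.2f}")
```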

From the model compression perspective, the evaluation concerns the preservation of intelligence and utility in compressed forms: weights (sparsification/pruning, quantization), activation statistics, and external representations (distillation, low-rank approximations, KV-cache compression). Benchmarks like LLMCBench (Yang et al., 28 Oct 2024) instantiate multidimensional tracks measuring model efficiency, generalization, real-world deployment cost, and trustworthiness, directly connecting physical resource metrics (GPU RAM, latency, model storage) to ability/accuracy preservation.

2. Benchmark Design Principles and Dimensions

To ensure generality, reproducibility, and broad applicability, LLM-as-a-Compressor benchmarks adopt several core design principles:

  • Unified Protocols: All evaluated compression methods and LLMs are tested on shared backbones, datasets, and hardware/runtimes. This controls for confounding variables and enables multi-axis trade-off analysis (Yang et al., 28 Oct 2024, Chavan et al., 2 Feb 2024).
  • Comprehensive Modalities: State-of-the-art benchmarks extend across textual, visual, audio, point cloud, and cross-modal data, using standard corpora (e.g., Text8, WikiText2, CLIC2019, LibriSpeech, PileOfLaw, MPEG point clouds) (Li et al., 24 Jun 2024, Ye et al., 16 Aug 2024).
  • Task Aggregation: Downstream tasks encompass language modeling (perplexity), classification/QA, summarization, robustness (AdvGLUE), truthfulness (TruthfulQA), generation throughput, and semantic preservation under prompt compression (Yang et al., 28 Oct 2024, Zakazov et al., 15 Nov 2025).
  • Multi-faceted Metrics: Benchmarks explicitly decouple compression ratio (CR), inference speedup (SR), memory footprint reduction (MF), wall-clock and GPU consumption, as well as composite quadratic-mean metrics for overall ability retention (e.g., $OM_{\mathrm{perf}}$) (Yang et al., 28 Oct 2024).

Table: Example Evaluation Tracks and Metrics in LLMCBench (Yang et al., 28 Oct 2024)

| Track                 | Metric(s)             | Example Dataset/Model               |
|-----------------------|-----------------------|-------------------------------------|
| Compression Perf      | $OM_{\mathrm{perf}}$  | MMLU, HellaSwag, ARC, WikiText2     |
| Generalization        | $OM_{\mathrm{gen}}$   | WikiText2 on LLaMA, OPT, Vicuna     |
| Training Consumption  | $OM_{\mathrm{train}}$ | Compression time, peak GPU RAM      |
| Inference Consumption | $OM_{\mathrm{inf}}$   | Model/GPU memory, speed             |
| Hardware Acceleration | $OM_{\mathrm{hard}}$  | tokens/s on vLLM, MLC-LLM, TensorRT |
| Trustworthiness       | $OM_{\mathrm{trust}}$ | AdvGLUE (robustness), TruthfulQA    |

3. Classes of Compression Benchmarks

(a) LLM-Driven Data Compression

Autoregressive LLMs serve as universal compressors via their predictive distributions. LMCompress (Li et al., 24 Jun 2024) operationalizes this by tokenizing input data, using the next-token distribution for arithmetic coding, and achieving compression ratios well beyond traditional algorithms: for example, $R = 6.55$ on CLIC2019 images compared to JPEG-XL's $R = 2.93$, and $R = 10.48$–$16.81$ on legal/medical text versus $R = 2.96$–$5.13$ for zlib/brotli. Similar paradigms hold for video, audio, and point cloud data (Ye et al., 16 Aug 2024). Computing the compression ratio directly from the cumulative model NLL streamlines benchmarking without explicit bitstreams (Guo et al., 20 Jun 2024).
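As a sense check of how such comparisons are benchmarked, the LLM-derived ratio can be set against a general-purpose codec on the same bytes. A hedged sketch, reusing the hypothetical `compression_ratio` helper from Section 1 and using `zlib` as a stand-in for the zlib/brotli baselines (the file name is a placeholder, not a dataset from the cited papers):

```python
# Illustrative side-by-side of LLM-prior vs. traditional compression ratios
# on identical text; zlib stands in for the zlib/brotli baselines.
import zlib

def zlib_ratio(text: str) -> float:
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level=9))

sample = open("domain_text.txt", encoding="utf-8").read()   # placeholder file
print(f"LLM-prior ratio: {compression_ratio(sample):.2f}")  # helper from the Section 1 sketch
print(f"zlib ratio     : {zlib_ratio(sample):.2f}")
```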

(b) Model Compression for Deployment

Benchmarks such as LLMCBench (Yang et al., 28 Oct 2024) and Faster and Lighter LLMs (Chavan et al., 2 Feb 2024) catalog and evaluate structured (column/head, 2:4 block) and unstructured (random, SparseGPT, WANDA) pruning, quantization (AWQ, SmoothQuant, GPTQ, OmniQuant), LoRA-style low-rank adaptation, and knowledge distillation. They provide empirically measured speedups, memory reductions, and quality degradation (perplexity, task accuracy). Hardware support and engine compatibility (TensorRT-LLM, vLLM, MLC-LLM, llama.cpp, ExLlama) are integral to inference cost benchmarking.
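As a concrete illustration of the simplest family above, here is a minimal sketch of unstructured magnitude pruning on a single PyTorch linear layer. Calibration-based methods such as SparseGPT and WANDA refine the scoring criterion; nothing below is specific to any benchmark's implementation.

```python
# Hedged sketch: unstructured magnitude pruning of a linear layer, in place.
import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    w = linear.weight.data
    k = int(w.numel() * sparsity)
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest |w|
    w.mul_((w.abs() > threshold).to(w.dtype))          # keep only larger magnitudes

layer = torch.nn.Linear(4096, 4096)
magnitude_prune_(layer, sparsity=0.5)
print(f"zeroed weights: {(layer.weight == 0).float().mean():.1%}")
```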

(c) Semantic and Prompt Compression

Prompt abstraction benchmarks (e.g., Cmprsr (Zakazov et al., 15 Nov 2025)) focus on compressing lengthy context inputs into brief, semantically rich representations using small LLMs, evaluated both for adherence to the requested compression rate and for downstream information preservation. Evaluation covers task-specific metrics (BERTScore-F1, QA EM/accuracy), the cost-quality trade-off, generalizability across domains (MeetingBank, LongBench), and deviation from the target CR.
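To make the scoring concrete, here is a minimal sketch of how a single compressed prompt might be checked for rate adherence and semantic preservation. It assumes the `bert_score` package as the semantic proxy; the whitespace tokenization, placeholder strings, and the 4x target are illustrative assumptions rather than Cmprsr's actual pipeline.

```python
# Hedged sketch: Delta CR (adherence to the requested rate) plus a
# semantic-preservation proxy for one prompt-compression example.
from bert_score import score as bertscore

def cr(original: str, compressed: str, tokenize=str.split) -> float:
    return len(tokenize(original)) / len(tokenize(compressed))

original = "..."    # long source context (placeholder)
compressed = "..."  # compressor output (placeholder)
target_cr = 4.0     # e.g., "compress 4x"

delta_cr = cr(original, compressed) - target_cr
_, _, f1 = bertscore([compressed], [original], lang="en")

print(f"Delta CR = {delta_cr:+.2f}, BERTScore-F1 = {f1.item():.3f}")
```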

4. Evaluation Tasks, Metrics, and Criteria

Core Metrics

  • Compression Ratio (CR): $\mathrm{CR} = \frac{|\text{original}|}{|\text{compressed}|}$
  • Speedup Ratio (SR): $\mathrm{SR} = \frac{T_{\text{original}}}{T_{\text{compressed}}}$
  • Memory Footprint Reduction (MF): $\mathrm{MF} = S_{\text{original}} - S_{\text{compressed}}$
  • Task-specific Ability Retention: quadratic-mean aggregation of ability preservation for knowledge ($OM_{\mathrm{ka}}$), inference ($OM_{\mathrm{ia}}$), and robustness/truthfulness ($OM_{\mathrm{trust}}$), with explicit reporting of trade-offs (a minimal aggregation sketch follows this list).
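The following sketch computes these core metrics from raw measurements; the field names and the plain quadratic mean over per-task retention ratios are illustrative assumptions, not LLMCBench's exact formulas.

```python
# Hedged sketch: core efficiency metrics plus a quadratic-mean ability aggregate.
from dataclasses import dataclass
from math import sqrt

@dataclass
class Measurement:
    size_orig_gb: float        # S_original
    size_comp_gb: float        # S_compressed
    latency_orig_s: float      # T_original
    latency_comp_s: float      # T_compressed
    ability_retention: list    # per-task retained fraction, e.g. acc_comp / acc_orig

def metrics(m: Measurement) -> dict:
    cr = m.size_orig_gb / m.size_comp_gb      # CR = |original| / |compressed|
    sr = m.latency_orig_s / m.latency_comp_s  # SR = T_original / T_compressed
    mf = m.size_orig_gb - m.size_comp_gb      # MF = S_original - S_compressed
    om = sqrt(sum(r * r for r in m.ability_retention) / len(m.ability_retention))
    return {"CR": cr, "SR": sr, "MF_GB": mf, "OM": om}

print(metrics(Measurement(13.5, 3.6, 1.00, 0.42, [0.99, 0.97, 0.95])))
```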

Semantic Compression/Prompt Compression Metrics

  • Semantic Preservation: BERTScore-F1, n-gram overlap, cross-entropy difference
  • QA Performance: Exact Match or accuracy over synthetic tasks
  • Adherence to Compression Rate: $\Delta \mathrm{CR} = \mathrm{CR}_{\text{real}} - \mathrm{CR}_{\text{target}}$

Model Compression Benchmarks

Experimental results report, for example, that INT8/INT4 quantization (AWQ, GPTQ) retains $\geq 98\%$ of downstream performance with up to $2$–$4\times$ speed-up, while structured sparsity yields a $1.6\times$ inference speed-up where supported (Yang et al., 28 Oct 2024). Weight-only quantization generalizes best across model families ($OM_{\mathrm{gen}} > 93$), whereas activation quantization lags outside of its originally targeted architectures. Prompt compression models (e.g., Cmprsr) show a performance drop of fewer than $2$ points while closely tracking target compression ratios (Zakazov et al., 15 Nov 2025).

5. Functional Abilities: Beyond Standard Metrics

The Lottery LLM Hypothesis (Tang et al., 24 Feb 2025) posits that effective compression must preserve not only traditional metrics (perplexity, QA accuracy) but also five essential capabilities:

  1. Prompt Retrieval (NIAH): Robust extraction from long/noisy context.
  2. Resource/Tool Identification: Correct use of retrieval-augmented knowledge and API/tool calls.
  3. Planning/Decomposition: Maintenance of complex, multi-step reasoning and composition.
  4. Computational Expressivity: Accurate simulation of stack/memory operations and control flow.
  5. Long-context Reasoning: Retention of accuracy/perplexity as sequence length grows.

Precise benchmarks for these abilities use targeted tasks: Needle-In-A-Haystack retrieval, RAG-QA, arithmetic with tool integration, logical reasoning with solver augmentations, synthetic control-flow correctness, and long-context summarization QA. Compression is deemed functionally successful only if the compressed model retains $\geq 90\%$ of teacher performance on each core task.
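A minimal sketch of the retention check this criterion implies; the task names and scores below are illustrative placeholders, not results from the cited paper.

```python
# Illustrative functional-retention check: the compressed model must keep
# >= 90% of teacher performance on every core task to pass.
RETENTION_THRESHOLD = 0.90

teacher = {"niah": 0.98, "rag_qa": 0.81, "planning": 0.74, "ctrl_flow": 0.88, "long_ctx": 0.69}
compressed = {"niah": 0.95, "rag_qa": 0.77, "planning": 0.61, "ctrl_flow": 0.85, "long_ctx": 0.66}

retention = {task: compressed[task] / teacher[task] for task in teacher}
failures = sorted(task for task, ratio in retention.items() if ratio < RETENTION_THRESHOLD)

print("retention:", {task: round(ratio, 3) for task, ratio in retention.items()})
print("functionally successful" if not failures else f"fails on: {failures}")
```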

6. Limitations, Open Challenges, and Benchmark Evolution

Current LLM-as-a-Compressor benchmarks still face several limitations:

  • Tokenization and Vocabulary Drift: Different LLMs use incompatible token sets; fair cross-model comparison requires fixed tokenizations (Guo et al., 20 Jun 2024). A byte-level normalization sketch follows this list.
  • Computational Overhead: LM-based lossless compression is $10$–$100\times$ slower and more memory-intensive than traditional codecs (Li et al., 24 Jun 2024), challenging real-time or on-device deployment. Prompt compressors (Cmprsr) mitigate this via offline precomputation (Zakazov et al., 15 Nov 2025).
  • Under-specified Evaluation: Standard perplexity or QA does not adequately evaluate preservation of planning, tool use, or control flow under compression; the consensus is shifting toward comprehensive functional evaluation suites, as in the blueprint of Tang et al. (24 Feb 2025).
  • Hardware/Engine Dependencies: Speedup and resource benefits depend on engine support and hardware-specific kernels, especially for structured sparsity and sub-8-bit quantization (Yang et al., 28 Oct 2024, Chavan et al., 2 Feb 2024).
  • Extension to Multimodal and Cross-domain Compression: Recent advances (e.g., point cloud compression (Ye et al., 16 Aug 2024)) generalize LLM priors to non-textual data, but require bespoke adaptation (token mapping invariance, semantic-loss regularization).
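One common mitigation for the tokenization issue in the first bullet is to report code length per raw byte (bits per byte) rather than per token, so that models with different vocabularies share the same denominator. The sketch below illustrates that normalization; it is a generic workaround, not a procedure prescribed by the cited benchmarks, and the model name is an illustrative assumption.

```python
# Hedged sketch: bits-per-byte (BPB) normalizes model code length by raw UTF-8
# bytes, enabling cross-tokenizer comparison on the same data.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(text: str, model_name: str = "gpt2") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll_nats = -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()
    return (nll_nats / math.log(2)) / len(text.encode("utf-8"))

print(f"{bits_per_byte('Byte-level normalization sidesteps tokenizer differences.'):.3f} bits/byte")
```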

7. Synthesis and Future Prospects

LLM-as-a-Compressor benchmarks advance the field by aligning information-theoretic metrics, practical deployment constraints, and functional capability assessment into unified evaluation frameworks. They provide clear actionable guidelines: when constrained on inference latency or memory, quantization dominates (AWQ/GPTQ, INT8/INT4); for domain or model generalization, weight-only quantization is superior; and for maximal prompt cost savings, pre-trained prompt compressors like Cmprsr enable fine-grained control over the cost-quality curve (Yang et al., 28 Oct 2024, Zakazov et al., 15 Nov 2025).

A plausible implication is that future benchmarks will further interleave functional task evaluation (retrieval, planning, tool-use) with compression metrics, and broaden to multimodal, cross-lingual, and hardware-adaptive regimes. Theoretical links to Solomonoff induction and universal Bayesian compressibility suggest a deeper convergence between model design, learning dynamics, and optimal data representations (Li et al., 24 Jun 2024). The continued evolution of these benchmarks will critically inform the responsible and efficient deployment of LLMs in real-world, resource-constrained settings.
