LLM-as-a-Compressor Benchmark

Updated 22 November 2025
  • LLM-as-a-Compressor Benchmark is a unified evaluation framework that measures LLMs’ compression capabilities using metrics like cross-entropy and compression ratio.
  • It assesses both model weight and prompt-level compression while ensuring that key reasoning and semantic integrity are preserved.
  • The benchmark guides practical optimization methods such as quantization, pruning, and prompt compression to balance efficiency with overall performance.

An LLM-as-a-Compressor Benchmark provides a standardized, rigorous evaluation suite for measuring the compressive, representational, and operational efficiency of LLMs in compression-centric roles. This class of benchmarks spans algorithmic compression (both intrinsic model-weight compression and data compression using LLM priors), prompt-level semantic compression for input minimization, and evaluation of the reasoning and intelligence retained under model and/or prompt compression.

1. Theoretical and Methodological Foundations

LLM-as-a-Compressor benchmarks are grounded in classic information theory, particularly Shannon entropy and the Solomonoff prior. The central observation is that autoregressive LLMs pretrained by maximum likelihood implicitly learn a probabilistic model $p_M(x)$, assigning high likelihood, and thus shorter codes (via arithmetic coding), to in-domain sequences. Compression performance can be measured directly by the cross-entropy $H(p, M) = -\sum_x p(x)\log_2 p_M(x)$ and the code length $L_M(x) = -\log_2 p_M(x)$, with the compression ratio $R = |\text{original}| / |\text{compressed}|$ as the metric of interest (Li et al., 24 Jun 2024, Guo et al., 20 Jun 2024). The tight coupling between model predictive power (cross-entropy or NLL) and compressibility underlies unified evaluation protocols: lossless compression rate becomes a global marker of LLM quality, generalization, and domain transfer.
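This coupling can be made concrete in a few lines. The following is a minimal sketch, assuming a Hugging Face causal LM (the `gpt2` checkpoint and the single-context-window input are illustrative assumptions, not part of any cited benchmark): it estimates the arithmetic-coding cost $L_M(x)$ as the cumulative per-token negative log-likelihood in bits and reports the implied ratio $R$ against the raw UTF-8 size, without producing an actual bitstream.

```python
# Hedged sketch: compression ratio implied by a causal LM's code length L_M(x),
# computed from cumulative NLL in bits (no arithmetic-coding bitstream is built).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # illustrative choice; any causal LM works the same way

def compression_ratio(text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

    ids = tokenizer(text, return_tensors="pt").input_ids          # [1, T]
    with torch.no_grad():
        logits = model(ids).logits                                # [1, T, vocab]

    # L_M(x) = -sum_t log2 p_M(x_t | x_<t); the first token is not coded,
    # which is negligible for long inputs.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    compressed_bits = -token_lp.sum().item() / math.log(2)

    original_bits = 8 * len(text.encode("utf-8"))
    return original_bits / compressed_bits                        # R = |original| / |compressed|

if __name__ == "__main__":
    print(f"R = {compression_ratio('The quick brown fox jumps over the lazy dog.'):.2f}")
```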

From the model compression perspective, the evaluation concerns the preservation of intelligence and utility in compressed forms: weights (sparsification/pruning, quantization), activation statistics, and external representations (distillation, low-rank approximations, KV-cache compression). Benchmarks like LLMCBench (Yang et al., 28 Oct 2024) instantiate multidimensional tracks measuring model efficiency, generalization, real-world deployment cost, and trustworthiness, directly connecting physical resource metrics (GPU RAM, latency, model storage) to ability/accuracy preservation.

2. Benchmark Design Principles and Dimensions

To ensure generality, reproducibility, and broad applicability, LLM-as-a-Compressor benchmarks adopt several core design principles:

  • Unified Protocols: All evaluated compression methods and LLMs are tested on shared backbones, datasets, and hardware/runtimes. This controls for confounding variables and enables multi-axis trade-off analysis (Yang et al., 28 Oct 2024, Chavan et al., 2 Feb 2024).
  • Comprehensive Modalities: State-of-the-art benchmarks extend across textual, visual, audio, point cloud, and cross-modal data, using standard corpora (e.g., Text8, WikiText2, CLIC2019, LibriSpeech, PileOfLaw, MPEG point clouds) (Li et al., 24 Jun 2024, Ye et al., 16 Aug 2024).
  • Task Aggregation: Downstream tasks encompass language modeling (perplexity), classification/QA, summarization, robustness (AdvGLUE), truthfulness (TruthfulQA), generation throughput, and semantic preservation under prompt compression (Yang et al., 28 Oct 2024, Zakazov et al., 15 Nov 2025).
  • Multi-faceted Metrics: Benchmarks explicitly decouple compression ratio (CR), inference speedup (SR), memory footprint reduction (MF), wall-clock and GPU consumption, as well as composite quadratic-mean metrics for overall ability retention (e.g., $OM_{\mathrm{perf}}$) (Yang et al., 28 Oct 2024).

Table: Example Evaluation Tracks and Metrics in LLMCBench (Yang et al., 28 Oct 2024)

| Track                 | Metric(s)             | Example Dataset/Model               |
|-----------------------|-----------------------|-------------------------------------|
| Compression Perf      | $OM_{\mathrm{perf}}$  | MMLU, HellaSwag, ARC, WikiText2     |
| Generalization        | $OM_{\mathrm{gen}}$   | WikiText2 on LLaMA, OPT, Vicuna     |
| Training Consumption  | $OM_{\mathrm{train}}$ | Compression time, peak GPU RAM      |
| Inference Consumption | $OM_{\mathrm{inf}}$   | Model/GPU memory, speed             |
| Hardware Acceleration | $OM_{\mathrm{hard}}$  | tokens/s on vLLM, MLC-LLM, TensorRT |
| Trustworthiness       | $OM_{\mathrm{trust}}$ | AdvGLUE (robustness), TruthfulQA    |

3. Classes of Compression Benchmarks

(a) LLM-Driven Data Compression

Autoregressive LLMs serve as universal compressors via their predictive distributions. LMCompress (Li et al., 24 Jun 2024) operationalizes this by tokenizing input data, using the next-token distribution for arithmetic coding, and achieving compression ratios well beyond traditional algorithms: for example, $R = 6.55$ on CLIC2019 images compared to JPEG-XL's $R = 2.93$, and $R = 10.48$–$16.81$ on legal/medical text versus $R = 2.96$–$5.13$ for zlib/brotli. Similar paradigms hold for video, audio, and point cloud data (Ye et al., 16 Aug 2024). Computing the compression ratio directly from the cumulative model NLL streamlines benchmarking without explicit bitstreams (Guo et al., 20 Jun 2024).
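As a sense check of how such comparisons are benchmarked, the LLM-derived ratio can be set against a general-purpose codec on the same bytes. A hedged sketch, reusing the hypothetical `compression_ratio` helper from Section 1 and using `zlib` as a stand-in for the zlib/brotli baselines (the file name is a placeholder, not a dataset from the cited papers):

```python
# Illustrative side-by-side of LLM-prior vs. traditional compression ratios
# on identical text; zlib stands in for the zlib/brotli baselines.
import zlib

def zlib_ratio(text: str) -> float:
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, level=9))

sample = open("domain_text.txt", encoding="utf-8").read()   # placeholder file
print(f"LLM-prior ratio: {compression_ratio(sample):.2f}")  # helper from the Section 1 sketch
print(f"zlib ratio     : {zlib_ratio(sample):.2f}")
```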

(b) Model Compression for Deployment

Benchmarks such as LLMCBench (Yang et al., 28 Oct 2024) and Faster and Lighter LLMs (Chavan et al., 2 Feb 2024) catalog and evaluate structured (column/head, 2:4 block) and unstructured (random, SparseGPT, WANDA) pruning, quantization (AWQ, SmoothQuant, GPTQ, OmniQuant), LoRA-style low-rank adaptation, and knowledge distillation. They provide empirically measured speedups, memory reductions, and quality degradation (perplexity, task accuracy). Hardware support and engine compatibility (TensorRT-LLM, vLLM, MLC-LLM, llama.cpp, ExLlama) are integral to inference cost benchmarking.
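As a concrete illustration of the simplest family above, here is a minimal sketch of unstructured magnitude pruning on a single PyTorch linear layer. Calibration-based methods such as SparseGPT and WANDA refine the scoring criterion; nothing below is specific to any benchmark's implementation.

```python
# Hedged sketch: unstructured magnitude pruning of a linear layer, in place.
import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity: float = 0.5) -> None:
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    w = linear.weight.data
    k = int(w.numel() * sparsity)
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest |w|
    w.mul_((w.abs() > threshold).to(w.dtype))          # keep only larger magnitudes

layer = torch.nn.Linear(4096, 4096)
magnitude_prune_(layer, sparsity=0.5)
print(f"zeroed weights: {(layer.weight == 0).float().mean():.1%}")
```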

(c) Semantic and Prompt Compression

Prompt abstraction benchmarks (e.g., Cmprsr (Zakazov et al., 15 Nov 2025)) focus on compressing lengthy context inputs into brief, semantically rich representations using small LLMs, evaluated both for adherence to the requested compression rate and for downstream information preservation. Evaluation covers task-specific metrics (BERTScore-F1, QA EM/accuracy), the cost-quality trade-off, generalizability across domains (MeetingBank, LongBench), and deviation from the target CR.
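To make the scoring concrete, here is a minimal sketch of how a single compressed prompt might be checked for rate adherence and semantic preservation. It assumes the `bert_score` package as the semantic proxy; the whitespace tokenization, placeholder strings, and the 4x target are illustrative assumptions rather than Cmprsr's actual pipeline.

```python
# Hedged sketch: Delta CR (adherence to the requested rate) plus a
# semantic-preservation proxy for one prompt-compression example.
from bert_score import score as bertscore

def cr(original: str, compressed: str, tokenize=str.split) -> float:
    return len(tokenize(original)) / len(tokenize(compressed))

original = "..."    # long source context (placeholder)
compressed = "..."  # compressor output (placeholder)
target_cr = 4.0     # e.g., "compress 4x"

delta_cr = cr(original, compressed) - target_cr
_, _, f1 = bertscore([compressed], [original], lang="en")

print(f"Delta CR = {delta_cr:+.2f}, BERTScore-F1 = {f1.item():.3f}")
```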

4. Evaluation Tasks, Metrics, and Criteria

Core Metrics

  • Compression Ratio (CR): $\mathrm{CR} = \frac{|\text{original}|}{|\text{compressed}|}$
  • Speedup Ratio (SR): $\mathrm{SR} = \frac{T_{\text{original}}}{T_{\text{compressed}}}$
  • Memory Footprint Reduction (MF): $\mathrm{MF} = S_{\text{original}} - S_{\text{compressed}}$
  • Task-specific Ability Retention: quadratic-mean aggregation of ability preservation for knowledge ($OM_{\mathrm{ka}}$), inference ($OM_{\mathrm{ia}}$), and robustness/truthfulness ($OM_{\mathrm{trust}}$), with explicit reporting of trade-offs (a minimal aggregation sketch follows this list).
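The following sketch computes these core metrics from raw measurements; the field names and the plain quadratic mean over per-task retention ratios are illustrative assumptions, not LLMCBench's exact formulas.

```python
# Hedged sketch: core efficiency metrics plus a quadratic-mean ability aggregate.
from dataclasses import dataclass
from math import sqrt

@dataclass
class Measurement:
    size_orig_gb: float        # S_original
    size_comp_gb: float        # S_compressed
    latency_orig_s: float      # T_original
    latency_comp_s: float      # T_compressed
    ability_retention: list    # per-task retained fraction, e.g. acc_comp / acc_orig

def metrics(m: Measurement) -> dict:
    cr = m.size_orig_gb / m.size_comp_gb      # CR = |original| / |compressed|
    sr = m.latency_orig_s / m.latency_comp_s  # SR = T_original / T_compressed
    mf = m.size_orig_gb - m.size_comp_gb      # MF = S_original - S_compressed
    om = sqrt(sum(r * r for r in m.ability_retention) / len(m.ability_retention))
    return {"CR": cr, "SR": sr, "MF_GB": mf, "OM": om}

print(metrics(Measurement(13.5, 3.6, 1.00, 0.42, [0.99, 0.97, 0.95])))
```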

Semantic Compression/Prompt Compression Metrics

  • Semantic Preservation: BERTScore-F1, n-gram overlap, cross-entropy difference
  • QA Performance: Exact Match or accuracy over synthetic tasks
  • Adherence to Compression Rate: $\Delta \mathrm{CR} = \mathrm{CR}_{\text{real}} - \mathrm{CR}_{\text{target}}$

Model Compression Benchmarks

Experimental results report, for example, that INT8/INT4 quantization (AWQ, GPTQ) retains $\geq 98\%$ of downstream performance with up to $2$–$4\times$ speed-up, while structured sparsity yields a $1.6\times$ inference speed-up where supported (Yang et al., 28 Oct 2024). Weight-only quantization generalizes best across model families ($OM_{\mathrm{gen}} > 93$), whereas activation quantization lags outside of its originally targeted architectures. Prompt compression models (e.g., Cmprsr) show a performance drop of fewer than $2$ points while closely tracking target compression ratios (Zakazov et al., 15 Nov 2025).

5. Functional Abilities: Beyond Standard Metrics

The Lottery LLM Hypothesis (Tang et al., 24 Feb 2025) posits that effective compression must preserve not only traditional metrics (perplexity, QA accuracy) but also five essential capabilities:

  1. Prompt Retrieval (NIAH): Robust extraction from long/noisy context.
  2. Resource/Tool Identification: Correct use of retrieval-augmented knowledge and API/tool calls.
  3. Planning/Decomposition: Maintenance of complex, multi-step reasoning and composition.
  4. Computational Expressivity: Accurate simulation of stack/memory operations and control flow.
  5. Long-context Reasoning: Retention of accuracy/perplexity as sequence length grows.

Precise benchmarks for these abilities use targeted tasks: Needle-In-A-Haystack retrieval, RAG-QA, arithmetic with tool integration, logical reasoning with solver augmentations, synthetic control-flow correctness, and long-context summarization QA. Compression is deemed functionally successful only if the compressed model retains $\geq 90\%$ of teacher performance on each core task.
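A minimal sketch of the retention check this criterion implies; the task names and scores below are illustrative placeholders, not results from the cited paper.

```python
# Illustrative functional-retention check: the compressed model must keep
# >= 90% of teacher performance on every core task to pass.
RETENTION_THRESHOLD = 0.90

teacher = {"niah": 0.98, "rag_qa": 0.81, "planning": 0.74, "ctrl_flow": 0.88, "long_ctx": 0.69}
compressed = {"niah": 0.95, "rag_qa": 0.77, "planning": 0.61, "ctrl_flow": 0.85, "long_ctx": 0.66}

retention = {task: compressed[task] / teacher[task] for task in teacher}
failures = sorted(task for task, ratio in retention.items() if ratio < RETENTION_THRESHOLD)

print("retention:", {task: round(ratio, 3) for task, ratio in retention.items()})
print("functionally successful" if not failures else f"fails on: {failures}")
```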

6. Limitations, Open Challenges, and Benchmark Evolution

Current LLM-as-a-Compressor benchmarks still face several limitations:

  • Tokenization and Vocabulary Drift: Different LLMs use incompatible token sets; fair cross-model comparison requires fixed tokenizations (Guo et al., 20 Jun 2024). A byte-level normalization sketch follows this list.
  • Computational Overhead: LM-based lossless compression is $10$–$100\times$ slower and more memory-intensive than traditional codecs (Li et al., 24 Jun 2024), challenging real-time or on-device deployment. Prompt compressors (Cmprsr) mitigate this via offline precomputation (Zakazov et al., 15 Nov 2025).
  • Under-specified Evaluation: Standard perplexity or QA does not adequately evaluate preservation of planning, tool use, or control flow under compression; the consensus is shifting toward comprehensive functional evaluation suites, as in the blueprint of Tang et al. (24 Feb 2025).
  • Hardware/Engine Dependencies: Speedup and resource benefits depend on engine support and hardware-specific kernels, especially for structured sparsity and sub-8-bit quantization (Yang et al., 28 Oct 2024, Chavan et al., 2 Feb 2024).
  • Extension to Multimodal and Cross-domain Compression: Recent advances (e.g., point cloud compression (Ye et al., 16 Aug 2024)) generalize LLM priors to non-textual data, but require bespoke adaptation (token mapping invariance, semantic-loss regularization).
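One common mitigation for the tokenization issue in the first bullet is to report code length per raw byte (bits per byte) rather than per token, so that models with different vocabularies share the same denominator. The sketch below illustrates that normalization; it is a generic workaround, not a procedure prescribed by the cited benchmarks, and the model name is an illustrative assumption.

```python
# Hedged sketch: bits-per-byte (BPB) normalizes model code length by raw UTF-8
# bytes, enabling cross-tokenizer comparison on the same data.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(text: str, model_name: str = "gpt2") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll_nats = -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()
    return (nll_nats / math.log(2)) / len(text.encode("utf-8"))

print(f"{bits_per_byte('Byte-level normalization sidesteps tokenizer differences.'):.3f} bits/byte")
```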

7. Synthesis and Future Prospects

LLM-as-a-Compressor benchmarks advance the field by aligning information-theoretic metrics, practical deployment constraints, and functional capability assessment into unified evaluation frameworks. They provide clear actionable guidelines: when constrained on inference latency or memory, quantization dominates (AWQ/GPTQ, INT8/INT4); for domain or model generalization, weight-only quantization is superior; and for maximal prompt cost savings, pre-trained prompt compressors like Cmprsr enable fine-grained control over the cost-quality curve (Yang et al., 28 Oct 2024, Zakazov et al., 15 Nov 2025).

A plausible implication is that future benchmarks will further interleave functional task evaluation (retrieval, planning, tool-use) with compression metrics, and broaden to multimodal, cross-lingual, and hardware-adaptive regimes. Theoretical links to Solomonoff induction and universal Bayesian compressibility suggest a deeper convergence between model design, learning dynamics, and optimal data representations (Li et al., 24 Jun 2024). The continued evolution of these benchmarks will critically inform the responsible and efficient deployment of LLMs in real-world, resource-constrained settings.
