
LLMCBench: Benchmark for LLM Compression

Updated 10 March 2026
  • LLMCBench is a multi-track framework that benchmarks LLM compression techniques by measuring accuracy retention, efficiency, and trustworthiness.
  • It employs well-defined metrics, such as OM_perf and OM_inf, to offer actionable insights for practical deployment across diverse models and datasets.
  • The benchmark addresses limitations of previous studies by including evaluations of hardware acceleration, resource costs, and robust post-compression performance.

LLMCBench is a comprehensive, rigorously designed benchmarking framework for evaluating compression techniques applied to LLMs, with a focus on practical deployment requirements. It is explicitly intended to address the deficiencies of prior work, which often limits evaluation to idealized metrics or a small selection of models and datasets, and omits critical components such as trustworthiness and resource costs. LLMCBench provides an end-to-end protocol—empirical, transparent, and extensible—for quantifying the trade-offs between model compression, inference efficiency, robustness, generalizability, hardware acceleration, and trustworthiness, using well-defined multi-track metrics and unified evaluation settings across a diverse set of models, datasets, and real-world deployment stacks (Yang et al., 2024).

1. Motivation and Benchmarking Scope

LLMs (e.g., GPT, LLaMA) achieve advanced results in knowledge and reasoning tasks but are hampered by massive storage footprints, high GPU memory demands at inference, and burdensome fine-tuning/compression costs. Deployment is further constrained by concerns regarding the trustworthiness of compressed models in safety-critical or adversarial settings. Existing benchmarks (such as those for SmoothQuant, SparseGPT, or OmniQuant) typically report only theoretical savings (e.g., MACs or parameter counts), test on a limited selection of models, or exclude robust multi-dimensional evaluations.

LLMCBench aims to fill these gaps through:

  • A unified evaluation protocol quantifying accuracy retention, generality, compression and inference costs, real-world hardware acceleration, and core aspects of trustworthiness.
  • Coverage of multiple LLM families, sizes, and popular deployment runtimes.
  • Actionable, deployment-oriented scoring for practitioners, delivering insights into compression selection under concrete production constraints (Yang et al., 2024).

2. Evaluation Tracks and Design

LLMCBench introduces a six-track evaluation scheme; each track targets a distinct aspect of the compression/deployment landscape and defines a quadratic mean–based overall metric (OM_track) for fair aggregate scoring. The tracks are:

  1. Compression Performance (OM_perf):

    • Measures retention of task accuracy across knowledge (MMLU, ARC) and reasoning/inference (HellaSwag, PIQA, WinoGrande, QNLI, MNLI, WikiText2) datasets post-compression.
    • Metric:

    \mathrm{OM_{perf}} = 100 \sqrt{\frac{1}{N} \sum_{i=1}^N \left(\mathbb{E}_{m,d}\left[\frac{A^c_i}{A_i}\right]\right)^2}

    where A^c_i/A_i is the ratio of compressed to uncompressed accuracy on ability i, with the expectation taken over models m and datasets d.

  2. Generalization Ability (OM_gen):

    • Assesses accuracy preservation across model families and scale.
    • Metric:

    \mathrm{OM_{gen}} = 100 \sqrt{\frac{1}{M} \sum_{j=1}^M \left(\mathbb{E}_s\left[\frac{A^c_{j,s}}{A_{j,s}}\right]\right)^2}

    with j indexing model families and s indexing model sizes.

  3. Training Consumption (OM_train):

    • Combines compression runtime (GPU-minutes) and peak GPU memory.
    • Metric:

    \mathrm{OM_{train}} = 100 \sqrt{\frac{1}{2}\left(\mathbb{E}\left[\frac{T_{\max}}{T^c}\right]^2 + \mathbb{E}\left[\frac{M_{\max}}{M^c}\right]^2\right)}

  4. Inference Consumption (OM_inf):

    • Tracks inference GPU memory, model size, and MACs.
    • Metric:

    \mathrm{OM_{inf}} = 100 \sqrt{\frac{1}{3}\left(\mathbb{E}\left[\frac{M}{M^c}\right]^2 + \mathbb{E}\left[\frac{S}{S^c}\right]^2 + \mathbb{E}\left[\frac{F}{F^c}\right]^2\right)}

    where ratios are taken dense-over-compressed, so stronger compression yields scores above 100, consistent with the results table below.

  5. Hardware Acceleration (OM_hard):

    • Evaluates real inference speedup on TensorRT-LLM, vLLM, and MLC-LLM.
    • Metric:

    \mathrm{OM_{hard}} = 100 \sqrt{\frac{1}{L} \sum_{k=1}^L \left(\mathbb{E}\left[\frac{V^c_k}{V_k}\right]\right)^2}

  6. Trustworthiness (OM_trust):

    • Combines robustness (AdvGLUE) and truthfulness (TruthfulQA).
    • Metric:

    \mathrm{OM_{trust}} = 100 \sqrt{\frac{1}{2}\left(\mathbb{E}\left[\frac{A^c_{\mathrm{rob}}}{A_{\mathrm{rob}}}\right]^2 + \mathbb{E}\left[\frac{A^c_{\mathrm{tru}}}{A_{\mathrm{tru}}}\right]^2\right)}
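
All six tracks share the same aggregation shape, which can be captured in a minimal sketch (the function name and the ratio values below are illustrative, not part of the benchmark or its results):

```python
import math

def quadratic_mean_om(ratios):
    """Quadratic-mean (RMS) aggregation shared by all OM_track metrics:
    100 * sqrt(mean of the squared per-component expected ratios)."""
    return 100.0 * math.sqrt(sum(r * r for r in ratios) / len(ratios))

# OM_perf-style input: accuracy-retention ratios E[A^c_i / A_i] over N
# abilities (illustrative values, not results from the benchmark).
print(round(quadratic_mean_om([0.99, 0.97, 1.00, 0.95]), 1))  # → 97.8

# OM_trust-style input: robustness and truthfulness retention ratios.
print(round(quadratic_mean_om([0.98, 1.01]), 1))              # → 99.5
```

Because the quadratic mean is never below the arithmetic mean, components with unusually high ratios pull the aggregate score up more strongly than they would under simple averaging.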

This design ensures that all crucial efficiency, reliability, and scalability criteria are assessed empirically under end-to-end deployment conditions (Yang et al., 2024).

3. Metrics, Compression Methods, and Normalization

Beyond the OM_track metrics, baseline statistics such as the compression ratio \mathrm{CR} = \mathrm{size}_{\mathrm{dense}} / \mathrm{size}_{\mathrm{compressed}} and the raw accuracy drop \Delta A = A_{\mathrm{dense}} - A_{\mathrm{compressed}} are reported, although evaluation prioritizes quadratic-mean normalized metrics to avoid outlier bias and to facilitate comparison across heterogeneous model types.
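
A worked example of the two baseline statistics (the sizes and accuracies below are illustrative placeholders, not figures from the benchmark):

```python
# Illustrative numbers, not taken from the benchmark: a 13 GB dense
# model compressed to 3.9 GB, accuracy dropping from 68.2% to 66.9%.
size_dense, size_compressed = 13.0, 3.9   # model size in GB
acc_dense, acc_compressed = 68.2, 66.9    # task accuracy in %

cr = size_dense / size_compressed         # compression ratio CR
delta_a = acc_dense - acc_compressed      # raw accuracy drop (points)

print(f"CR = {cr:.2f}x, dA = {delta_a:.1f} points")  # → CR = 3.33x, dA = 1.3 points
```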

Compression methods evaluated within LLMCBench include:

  • Sparsity-based approaches: LLM-Pruner (structured), Wanda and SparseGPT (unstructured and block-wise).
  • Quantization methods: GPTQ (W8A16, W4A16), AWQ (W8A16, W4A16), SmoothQuant (W8A8, W4A4), and OmniQuant (W8A8, W4A4).

Quantization strategies focus on both weight-only and weight-plus-activation schemes; sparsity techniques include both structured (e.g., 50% row/column pruning) and unstructured (arbitrary element removal) patterns. The benchmark also standardizes bit-width and data format selection, hardware backend (PyTorch with Nvidia A800), and inference environment (TensorRT-LLM, vLLM, MLC-LLM) (Yang et al., 2024).
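
The two compression families can be sketched in a few lines of NumPy. The functions below are generic illustrations (round-to-nearest group-wise weight quantization and magnitude-based 2:4 pruning), not the GPTQ/AWQ or Wanda/SparseGPT algorithms themselves, which additionally use calibration data and more sophisticated weight-update rules:

```python
import numpy as np

def quantize_w4_groupwise(w, group_size=128):
    """Round-to-nearest 4-bit weight-only quantization with per-group
    scales (the W4A16 storage pattern; a generic RTN sketch)."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7)
    return q.astype(np.int8), scale  # scales kept in fp16/fp32 at deploy time

def sparsify_2_4(w):
    """2:4 semi-structured sparsity: keep the 2 largest-magnitude
    weights in every group of 4 (the pattern TensorRT-LLM accelerates)."""
    groups = w.reshape(-1, 4)
    idx = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
    mask = np.ones_like(groups)
    np.put_along_axis(mask, idx, 0.0, axis=1)        # zero them out
    return (groups * mask).reshape(w.shape)

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_w4_groupwise(w)
w_sparse = sparsify_2_4(w)
print(bool(q.min() >= -8 and q.max() <= 7))                          # → True
print(bool((w_sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2))   # → True
```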

4. Experimental Results and Comparative Analysis

LLMCBench presents quantitative results for each track, illustrating the trade-offs and practical impact of different compression techniques:

Method        OM_perf  OM_inf  OM_train  OM_hard  OM_trust
OmniQuant        99.4   246.3      15.2      232      98.5
AWQ              99.4   245.1      18.7      230      97.8
GPTQ             98.6   245.9      25.3      228      97.2
SmoothQuant      97.0   220.6      80.4      205      99.1
LLM-Pruner       68.6   161.9      12.5      155      94.1
Wanda            86.8   134.8      42.8      100      89.3
SparseGPT        88.3   134.8      45.6      102      87.9

Key comparative insights:

  • Quantization (OmniQuant, AWQ, GPTQ) achieves superior OM_perf, OM_inf, OM_hard, and OM_trust compared to sparsity approaches.
  • Structured sparsity (LLM-Pruner) produces substantial inference speedups with moderate OM_perf and OM_trust, but unstructured/block sparsity underperforms unless deployed using specific runtime libraries (notably, 2:4 block patterns gain from TensorRT-LLM support).
  • Activation-aware quantization (SmoothQuant, OmniQuant) yields marginal improvements in trustworthiness and, in some settings, knowledge retention, albeit at the cost of higher compression time and GPU memory during training.

A direct implication is that quantization is the method of choice for maximizing either accuracy or deployment speed, while weight-only schemes generalize better across architectures due to relative stability in weights versus activations (Yang et al., 2024).

5. Practical Deployment Guidelines

LLMCBench distills empirically validated, actionable guidelines:

  • For maximum accuracy retention, prefer AWQ or GPTQ with W4A16 quantization.
  • For maximum speedup, deploy INT4 quantized models with TensorRT-LLM or vLLM.
  • For heavy memory constraints during compression, choose SmoothQuant or AWQ, both of which avoid retraining overhead.
  • When robustness and truthfulness are paramount, weight-plus-activation PTQ (e.g., OmniQuant, SmoothQuant) offers slight but consistent advantages.
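
These rules of thumb can be encoded as a small lookup table (a hypothetical helper sketched here for illustration, not part of LLMCBench; the priority keys are names chosen for this example):

```python
def recommend_compression(priority):
    """Map a deployment priority to the guideline-recommended method.
    Method names come from the guidelines above; the function itself
    is an illustrative sketch."""
    guide = {
        "accuracy": "AWQ or GPTQ (W4A16)",
        "speed":    "INT4 quantization + TensorRT-LLM or vLLM",
        "memory":   "SmoothQuant or AWQ (no retraining overhead)",
        "trust":    "weight-plus-activation PTQ (OmniQuant, SmoothQuant)",
    }
    return guide.get(priority, "no guideline for this priority")

print(recommend_compression("accuracy"))  # → AWQ or GPTQ (W4A16)
```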

The benchmark also demonstrates that mixing sparsity and quantization can yield additive benefits in some settings, subject to hardware/software support. LLMCBench thus provides a reproducible reference for the compression community and facilitates transparent comparison across approaches and use-cases (Yang et al., 2024).

6. Limitations and Future Directions

Current limitations include restriction to seven “star-rated” open-source methods, with plans to expand to KV-cache compression, distillation, vision/language tasks, and code/math benchmarks. Open questions encompass optimal library support for dynamic sparsity, unification of mixed-format hardware primitives, and composition of distillation with post-training quantization (PTQ). The authors note potential for extending LLMCBench toward full-spectrum model and pipeline compression evaluation (Yang et al., 2024).

7. Significance for LLM Compression Research

LLMCBench constitutes the first end-to-end, multi-track benchmarking suite quantifying LLM compression under practically relevant, multidimensional metrics. Its exhaustive protocol, breadth of evaluated methods and models, and attention to deployment constraints position it as a canonical reference for methodologists and practitioners selecting compression pipelines for real-world LLM deployment, setting a new standard for rigorous evaluation in the field (Yang et al., 2024).
