LLMCBench: Benchmark for LLM Compression
- LLMCBench is a multi-track framework that benchmarks LLM compression techniques by measuring accuracy retention, efficiency, and trustworthiness.
- It employs well-defined metrics, such as OM_perf and OM_inf, to offer actionable insights for practical deployment across diverse models and datasets.
- The benchmark addresses limitations of previous studies by including evaluations of hardware acceleration, resource costs, and robust post-compression performance.
LLMCBench is a comprehensive, rigorously designed benchmarking framework for evaluating compression techniques applied to LLMs, with a focus on practical deployment requirements. It is explicitly intended to address the deficiencies of prior work, which often limits evaluation to idealized metrics or a small selection of models and datasets, or omits critical components such as trustworthiness and resource costs. LLMCBench provides an empirical, transparent, and extensible end-to-end protocol for quantifying the trade-offs among model compression, inference efficiency, robustness, generalizability, hardware acceleration, and trustworthiness, using well-defined multi-track metrics and unified evaluation settings across a diverse set of models, datasets, and real-world deployment stacks (Yang et al., 2024).
1. Motivation and Benchmarking Scope
LLMs (e.g., GPT, LLaMA) achieve advanced results on knowledge and reasoning tasks but are hampered by massive storage footprints, high GPU memory demands at inference, and burdensome fine-tuning/compression costs. Deployment is further constrained by concerns about the trustworthiness of compressed models in safety-critical or adversarial settings. Existing evaluations (such as those reported for SmoothQuant, SparseGPT, or OmniQuant) typically report only theoretical savings (e.g., MACs or parameter counts), test on a limited selection of models, or omit multi-dimensional assessment of trustworthiness and real hardware costs.
LLMCBench aims to fill these gaps through:
- A unified evaluation protocol quantifying accuracy retention, generality, compression and inference costs, real-world hardware acceleration, and core aspects of trustworthiness.
- Coverage of multiple LLM families, sizes, and popular deployment runtimes.
- Actionable, deployment-oriented scoring for practitioners, delivering insights into compression selection under concrete production constraints (Yang et al., 2024).
2. Evaluation Tracks and Design
LLMCBench introduces a six-track evaluation scheme; each track targets a distinct aspect of the compression/deployment landscape and defines a quadratic-mean-based overall metric (OM_track) for fair aggregate scoring (a minimal aggregation sketch follows the track list below). The tracks are:
- Compression Performance (OM_perf):
- Measures retention of task accuracy across knowledge (MMLU, ARC) and reasoning/inference (HellaSwag, PIQA, WinoGrande, QNLI, MNLI, WikiText2) datasets post-compression.
- Metric: $\mathrm{OM}_{\mathrm{perf}} = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}\left(A_{i}^{c}/A_{i}^{u}\right)^{2}}$, where $A_{i}^{c}$ and $A_{i}^{u}$ are the compressed and uncompressed accuracies on ability $i$.
- Generalization Ability (OM_gen):
- Assesses accuracy preservation across model families and scale.
- Metric: $\mathrm{OM}_{\mathrm{gen}} = \sqrt{\tfrac{1}{MK}\sum_{j=1}^{M}\sum_{k=1}^{K}\left(A_{j,k}^{c}/A_{j,k}^{u}\right)^{2}}$, with $j$ indexing model families and $k$ indexing model sizes.
- Training Consumption (OM_train):
- Combines compression runtime (GPU-minutes) and peak GPU memory.
- Metric: $\mathrm{OM}_{\mathrm{train}} = \sqrt{\tfrac{1}{2}\left(c_{\mathrm{time}}^{2} + c_{\mathrm{mem}}^{2}\right)}$, the quadratic mean of the normalized compression runtime $c_{\mathrm{time}}$ (GPU-minutes) and normalized peak GPU memory $c_{\mathrm{mem}}$ (lower is better).
- Inference Consumption (OM_inf):
- Tracks inference GPU memory, model size, and MACs.
- Metric: $\mathrm{OM}_{\mathrm{inf}} = \sqrt{\tfrac{1}{3}\left(r_{\mathrm{mem}}^{2} + r_{\mathrm{size}}^{2} + r_{\mathrm{MACs}}^{2}\right)}$, the quadratic mean of the reduction ratios for inference GPU memory, model size, and MACs.
- Hardware Acceleration (OM_hard):
- Evaluates real inference speedup on TensorRT-LLM, vLLM, and MLC-LLM.
- Metric: $\mathrm{OM}_{\mathrm{hard}} = \sqrt{\tfrac{1}{E}\sum_{e=1}^{E} s_{e}^{2}}$, where $s_{e}$ is the measured inference speedup of the compressed model on engine $e$ (TensorRT-LLM, vLLM, or MLC-LLM).
- Trustworthiness (OM_trust):
- Combines robustness (AdvGLUE) and truthfulness (TruthfulQA).
- Metric: $\mathrm{OM}_{\mathrm{trust}} = \sqrt{\tfrac{1}{2}\left(r_{\mathrm{rob}}^{2} + r_{\mathrm{truth}}^{2}\right)}$, where $r_{\mathrm{rob}}$ and $r_{\mathrm{truth}}$ are the compressed-to-uncompressed score ratios on AdvGLUE and TruthfulQA, respectively.
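To make the aggregation concrete, the following is a minimal Python sketch of a quadratic-mean (RMS) overall metric computed from per-ability retention ratios, in the spirit of the OM_perf definition above. The `overall_metric` helper and the example ability names and values are illustrative assumptions, not the benchmark's reference implementation.

```python
import math

def overall_metric(ratios):
    """Quadratic mean (RMS) of per-component ratios, expressed as a percentage."""
    ratios = list(ratios)
    return 100.0 * math.sqrt(sum(r * r for r in ratios) / len(ratios))

# Illustrative example for OM_perf: compressed/uncompressed accuracy ratio per ability.
retention = {
    "knowledge": 0.98,   # e.g., MMLU + ARC accuracy after compression vs. the original model
    "reasoning": 0.95,   # e.g., HellaSwag, PIQA, WinoGrande, ...
}
om_perf = overall_metric(retention.values())
print(f"OM_perf = {om_perf:.1f}")  # ~96.5 for these illustrative ratios
```

The same helper applies to the other tracks by swapping in the relevant ratios (speedups for OM_hard, cost or reduction ratios for OM_train and OM_inf, trust-score ratios for OM_trust).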
This design ensures that all crucial efficiency, reliability, and scalability criteria are assessed empirically under end-to-end deployment conditions (Yang et al., 2024).
3. Metrics, Compression Methods, and Normalization
Beyond the OM_track metrics, baseline statistics such as compression ratio and raw accuracy drop are reported, although evaluation prioritizes quadratic-mean normalized metrics to avoid outlier bias and facilitate comparison across heterogeneous model types.
Compression methods evaluated within LLMCBench include:
- Sparsity-based approaches: LLM-Pruner (structured), Wanda and SparseGPT (unstructured and 2:4 semi-structured).
- Quantization methods: GPTQ (W8A16, W4A16), AWQ (W8A16, W4A16), SmoothQuant (W8A8, W4A4), and OmniQuant (W8A8, W4A4).
Quantization strategies focus on both weight-only and weight-plus-activation schemes; sparsity techniques include both structured (e.g., 50% row/column pruning) and unstructured (arbitrary element removal) patterns. The benchmark also standardizes bit-width and data format selection, hardware backend (PyTorch with Nvidia A800), and inference environment (TensorRT-LLM, vLLM, MLC-LLM) (Yang et al., 2024).
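The difference between the sparsity patterns mentioned above can be illustrated with a short PyTorch sketch that builds a 50% unstructured magnitude mask and a 2:4 semi-structured mask. This is a simplified illustration of the pattern shapes only; it does not reproduce the actual Wanda, SparseGPT, or LLM-Pruner pruning criteria.

```python
import torch

def unstructured_mask(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Keep the largest-magnitude elements anywhere in the tensor; zero the rest."""
    k = int(w.numel() * sparsity)                      # number of elements to drop
    threshold = w.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    return (w.abs() > threshold).to(w.dtype)

def two_four_mask(w: torch.Tensor) -> torch.Tensor:
    """2:4 semi-structured sparsity: keep the 2 largest-magnitude weights in every group of 4."""
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(1, keep, 1.0)
    return mask.reshape(w.shape)

w = torch.randn(8, 8)
print((w * unstructured_mask(w)).eq(0).float().mean())  # ~0.5
print((w * two_four_mask(w)).eq(0).float().mean())      # exactly 0.5
```

Both masks remove half the weights, but only the regular 2:4 pattern maps onto dedicated sparse kernels, which is why the benchmark ties sparsity speedups to runtime support such as TensorRT-LLM (see Section 4).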
4. Experimental Results and Comparative Analysis
LLMCBench presents quantitative results for each track, illustrating the trade-offs and practical impact of different compression techniques:
| Method | OM_perf | OM_inf | OM_train | OM_hard | OM_trust |
|---|---|---|---|---|---|
| OmniQuant | 99.4 | 246.3 | 15.2 | 232 | 98.5 |
| AWQ | 99.4 | 245.1 | 18.7 | 230 | 97.8 |
| GPTQ | 98.6 | 245.9 | 25.3 | 228 | 97.2 |
| SmoothQuant | 97.0 | 220.6 | 80.4 | 205 | 99.1 |
| LLM-Pruner | 68.6 | 161.9 | 12.5 | 155 | 94.1 |
| Wanda | 86.8 | 134.8 | 42.8 | 100 | 89.3 |
| SparseGPT | 88.3 | 134.8 | 45.6 | 102 | 87.9 |
Key comparative insights:
- Quantization (OmniQuant, AWQ, GPTQ) achieves superior OM_perf, OM_inf, OM_hard, and OM_trust compared to sparsity approaches.
- Structured sparsity (LLM-Pruner) produces substantial inference speedups with moderate OM_perf and OM_trust, but unstructured/block sparsity underperforms unless deployed using specific runtime libraries (notably, 2:4 block patterns gain from TensorRT-LLM support).
- Activation-aware quantization (SmoothQuant, OmniQuant) yields marginal improvements in trustworthiness and, in some settings, knowledge retention, albeit at the cost of higher compression time and GPU memory during training.
A direct implication is that quantization is the method of choice for maximizing either accuracy retention or deployment speed, while weight-only schemes generalize better across architectures because weight distributions are more stable than activation distributions (Yang et al., 2024).
5. Recommended Methodologies and Practical Guidance
LLMCBench distills empirically validated, actionable guidelines:
- For maximum accuracy retention, prefer AWQ or GPTQ with W4A16 quantization.
- For maximum speedup, deploy INT4 quantized models with TensorRT-LLM or vLLM (see the serving sketch after this list).
- For heavy memory constraints during compression, choose SmoothQuant or AWQ, both of which avoid retraining overhead.
- When robustness and truthfulness are paramount, weight-plus-activation PTQ (e.g., OmniQuant, SmoothQuant) offers slight but consistent advantages.
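As an illustration of the speed-oriented recommendation above, a minimal vLLM serving sketch for an already-quantized AWQ (W4A16) checkpoint might look as follows; the model path is a placeholder, and argument names may vary across vLLM versions.

```python
from vllm import LLM, SamplingParams

# Placeholder path: any locally available AWQ-quantized (W4A16) checkpoint.
llm = LLM(model="path/to/llama-awq-w4a16", quantization="awq")

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Briefly explain post-training quantization."], params)
print(outputs[0].outputs[0].text)
```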
The benchmark also demonstrates that mixing sparsity and quantization can yield additive benefits in some settings, subject to hardware/software support. LLMCBench thus provides a reproducible reference for the compression community and facilitates transparent comparison across approaches and use-cases (Yang et al., 2024).
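To illustrate how sparsity and quantization can compose, the toy sketch below first applies 50% unstructured magnitude pruning and then symmetric round-to-nearest INT4 per-channel (fake) weight quantization. It is a didactic sketch under simplified assumptions, not the pipeline evaluated by LLMCBench.

```python
import torch

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction of the weights (unstructured)."""
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

def rtn_int4_per_channel(w: torch.Tensor) -> torch.Tensor:
    """Symmetric round-to-nearest fake quantization to 4 bits, one scale per output row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0      # symmetric int4 range: [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale                                      # dequantized ("fake-quant") weights

w = torch.randn(16, 64)
w_pruned = magnitude_prune(w, sparsity=0.5)
w_pruned_q = rtn_int4_per_channel(w_pruned)
rel_err = ((w_pruned_q - w_pruned).norm() / w_pruned.norm()).item()
sparsity = (w_pruned_q == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}, relative quantization error: {rel_err:.3f}")
```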
6. Limitations and Future Directions
Current limitations include the restriction to seven popular ("star-rated") open-source methods, with plans to expand to KV-cache compression, distillation, vision-language tasks, and code/math benchmarks. Open questions include optimal library support for dynamic sparsity, unification of mixed-format hardware primitives, and the composition of distillation with post-training quantization (PTQ). The authors note the potential for extending LLMCBench toward full-spectrum model and pipeline compression evaluation (Yang et al., 2024).
7. Significance for LLM Compression Research
LLMCBench constitutes the first end-to-end, multi-track benchmarking suite quantifying LLM compression under practically relevant, multidimensional metrics. Its exhaustive protocol, breadth of evaluated methods and models, and attention to deployment constraints position it as a canonical reference for methodologists and practitioners selecting compression pipelines for real-world LLM deployment, setting a new standard for rigorous evaluation in the field (Yang et al., 2024).