ACBench: Agent Compression Benchmark
- ACBench is a comprehensive benchmark suite that evaluates the agentic capabilities of large language models after applying post-training compression techniques such as quantization and pruning.
- It measures performance across 12 tasks in domains like action execution, workflow generation, long-context understanding, and real-world application using detailed metrics including F1 score and ERank.
- The benchmark provides actionable guidelines, showing that quantization methods like GPTQ and AWQ preserve multi-turn coordination more effectively than various pruning strategies.
The Agent Compression Benchmark (ACBench) is a comprehensive suite for evaluating the impact of post-training model compression—specifically quantization and pruning—on the agentic capabilities of LLMs. Unlike traditional benchmarks that focus on language modeling or natural language understanding accuracy, ACBench systematically assesses the ability of compressed LLMs to perform multi-step workflows, tool/function usage, long-context reasoning, and real-world action execution. It covers 12 tasks across four core agentic domains and provides actionable metrics for developers of resource-efficient, deployable LLM-based agents (2505.19433).
1. Motivation and Capability Scope
The central premise of ACBench is the inadequacy of traditional compression evaluation metrics such as perplexity or GLUE accuracy for agentic LLMs. Real-world agents built upon LLMs must perform complex, multi-step planning, invoke external tools/API calls, maintain coherent state over extended input contexts, and act robustly in interactive or embodied environments. Existing metrics, focused on next-token prediction or single-turn language understanding, do not measure these behaviors. ACBench fills this evaluative gap by explicitly targeting four agentic capability categories:
- Action Execution (T-Eval): Multi-step tool invocation, reasoning, and planning via textual queries.
- Workflow Generation (WorfBench): Structured workflow/plan generation for problem solving, including function-call chaining and embodied task decomposition.
- Long-Context Understanding: Scenarios requiring reasoning over up to 40k-token contexts, including document retrieval and multi-document question answering.
- Real-World Application (AgentBoard): Autonomous action and planning in simulation environments, games, and tool APIs, measured as progress and success rate.
2. Benchmark Structure, Task Design, and Evaluation
ACBench evaluates agentic performance through a diverse set of tasks and structured evaluation protocols:
- Action Execution (T-Eval): Composed of six skills (plan, reason, retrieve, understand, instruct, review), using 553 queries across ~40 tools. Metric: F1 score over correctly called tools in API invocation sequences (with JSON vs. plain string output differentiated).
- Workflow Generation (WorfBench): Four sub-tasks test graph-based workflow induction from high-level instructions, with ≈2,150 test scenarios. Metric: Subgraph alignment F1 based on graph-edit distance between predicted and reference plans.
- Long-Context Understanding: Includes 12 tasks from LongBench (single- and multi-document QA, summarization, few-shot prompting) and LongGenBench (GSM8K and MMLU with 40k tokens), and the Needle-in-the-Haystack retrieval. Metrics: Exact match (EM), F1, or classification accuracy, task-dependent.
- Real-World Application (AgentBoard): Five tasks spanning embodied AI (ScienceWorld), game environments (Jericho, PDDL), and interactive tool operations. Metrics: Progress rate (fraction of subtasks completed) and success rate (fully completed episodes).
Evaluation adopts task-specific scoring: sequence-level exact match, F1 on retrieval spans, subgraph F1, as well as progress and success rates in interactive settings.
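As a rough illustration of the scoring described above, the sketch below computes an F1 score over tool calls plus progress and success rates. It is a simplified stand-in, not the benchmarks' own scorers: the function names, the multiset-overlap definition of F1, and the `(completed, total)` episode format are all assumptions made for this example.

```python
from collections import Counter


def tool_call_f1(predicted, reference):
    """F1 over tool names, treating each call list as a multiset (assumed scoring rule)."""
    if not predicted or not reference:
        return 0.0
    # Counter intersection keeps the minimum count of each tool name.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def progress_rate(completed_subtasks, total_subtasks):
    """Fraction of subtasks completed in one episode."""
    return completed_subtasks / total_subtasks


def success_rate(episodes):
    """Fraction of episodes in which every subtask was completed.

    `episodes` is a list of hypothetical (completed, total) pairs.
    """
    return sum(1 for done, total in episodes if done == total) / len(episodes)


predicted = ["search", "calculator", "search"]
reference = ["search", "calculator", "weather"]
f1 = tool_call_f1(predicted, reference)          # 2 of 3 calls match on each side

episodes = [(3, 3), (1, 4), (2, 2)]
rates = [progress_rate(done, total) for done, total in episodes]
successes = success_rate(episodes)               # 2 of 3 episodes fully completed
```

An episode can therefore show partial progress while still counting as a failure for success rate, which is why the two metrics are reported together.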
3. Compression Techniques and Model Coverage
ACBench assesses both quantization and pruning methodologies:
- Quantization
  - GPTQ: Hessian-aware post-training quantization to 2, 3, 4, or 8 bits.
  - AWQ: Activation-aware quantization with per-group scale calibration.
  - SmoothQuant: Rebalances activation and weight ranges before 8-bit quantization.
- Pruning
  - Magnitude Pruning (Mag): Unstructured or 2:4 structured pruning by weight magnitude.
  - Wanda: Pruning scored by both weights and activations (unstructured or 2:4).
  - SparseGPT: Hessian-informed (second-order) one-shot pruning (unstructured or 2:4).
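To make the difference between unstructured and 2:4 sparsity concrete, here is a minimal sketch of magnitude pruning on a single flat row of weights. The function names and the list-based representation are assumptions for illustration; real implementations operate on tensors, and Wanda and SparseGPT use different importance scores than plain magnitude.

```python
def magnitude_prune_unstructured(row, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of weights anywhere in the row."""
    n_prune = int(len(row) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(row)), key=lambda i: abs(row[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(row)]


def magnitude_prune_2_4(row):
    """In every consecutive group of 4 weights, keep only the 2 largest magnitudes."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4 for 2:4 sparsity")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:2])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


row = [0.9, -0.8, 0.7, -0.6, 0.1, 0.2, -0.05, 0.01]

unstructured = magnitude_prune_unstructured(row, 0.5)
# -> [0.9, -0.8, 0.7, -0.6, 0.0, 0.0, 0.0, 0.0]

structured = magnitude_prune_2_4(row)
# -> [0.9, -0.8, 0.0, 0.0, 0.1, 0.2, 0.0, 0.0]
```

Both results are 50% sparse, but the 2:4 pattern must discard 0.7 and -0.6 while keeping the much smaller 0.1 and 0.2, because the constraint applies within each group of four rather than across the whole row.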
Evaluated models span:
- Small (<7B): Gemma-2B, Phi-3.5-3.8B, Megrez-3B, MiniCPM-4B, Qwen2.5-1.5B/3B, DS-Distill-Qwen-1.5B
- Standard (7–32B): InternLM-2.5-7B, Qwen2.5-7B/14B/32B, Mistral-7B
- Distilled LLMs: DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B
All experiments use post-training compression only, with no further fine-tuning. Calibration uses 128 sequences of 512 tokens each from the Pile validation set, and decoding runs at temperature 0 for determinism.
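The quantization methods listed above all map weights to a small set of integer levels using a scale factor. The sketch below shows only that shared core idea, as plain symmetric round-to-nearest quantization with one scale per group. It is deliberately not GPTQ, AWQ, or SmoothQuant, none of which is reproduced here; the function names and the group size of 4 are assumptions chosen to keep the example small.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group.

    Returns (integer codes, scale). Not GPTQ/AWQ: no calibration data is used.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize_per_group(row, group_size=4, bits=4):
    """Quantize a row group by group and return the dequantized approximation."""
    dequantized = []
    for start in range(0, len(row), group_size):
        codes, scale = quantize_group(row[start:start + group_size], bits)
        dequantized.extend(c * scale for c in codes)
    return dequantized


row = [0.70, -0.33, 0.12, 0.0, 0.014, -0.006, 0.002, 0.0]

per_group = quantize_per_group(row, group_size=4)   # two scales, one per group
single_scale = quantize_per_group(row, group_size=8)  # one scale for the whole row

max_error = max(abs(a - b) for a, b in zip(row, per_group))
```

With a single scale for the whole row, the small weights in the second half all round to zero, whereas per-group scales preserve them. This is one reason group-wise scaling is commonly used at 4 bits.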
4. Evaluation Metrics and Statistical Analysis
Three primary statistical metrics supplement task-level accuracy and F1:
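- Efficient Rank (ERank): For a weight matrix $W$ with singular values $\sigma_i$, let $p_i = \sigma_i / \sum_j \sigma_j$. Then
  $$\mathrm{ERank}(W) = \exp\left(-\sum_i p_i \log p_i\right).$$
- Top-k Ranking Consistency (Jaccard) $J_k$: For original and compressed logits $z$ and $z'$, let $\mathrm{Top}_k(z)$ be the set of top-$k$ tokens. Then
  $$J_k = \frac{|\mathrm{Top}_k(z) \cap \mathrm{Top}_k(z')|}{|\mathrm{Top}_k(z) \cup \mathrm{Top}_k(z')|}.$$
- Energy Score: For a classifier $f$ and temperature $T$,
  $$E(x; f) = -T \log \sum_i \exp\left(f_i(x) / T\right).$$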
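The three metrics can be sketched in a few lines of Python. This assumes the commonly used definitions: ERank as the exponential of the Shannon entropy of the normalized singular values, Jaccard overlap of the top-k token sets, and the energy score as the negative temperature-scaled log-sum-exp of the logits. The function names are hypothetical, and `erank` takes precomputed singular values rather than performing the SVD itself.

```python
import math


def erank(singular_values):
    """Exponential of the entropy of the normalized singular-value distribution."""
    total = sum(singular_values)
    # Zero singular values contribute nothing to the entropy, so skip them.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def top_k_jaccard(logits_a, logits_b, k):
    """Jaccard overlap between the top-k token indices of two logit vectors."""
    def top_k(logits):
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(ranked[:k])

    a, b = top_k(logits_a), top_k(logits_b)
    return len(a & b) / len(a | b)


def energy_score(logits, temperature=1.0):
    """Negative temperature-scaled log-sum-exp of the logits (computed stably)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


flat = erank([1.0, 1.0, 1.0, 1.0])        # all directions equally used
peaked = erank([5.0, 0.0, 0.0, 0.0])      # a single dominant direction
overlap = top_k_jaccard([3.0, 2.0, 1.0, 0.0], [3.0, 1.0, 2.0, 0.0], k=2)
```

A flat spectrum of four equal singular values gives an ERank of 4, while a single nonzero singular value gives 1, so a drop in ERank after compression signals that the weight matrix has collapsed toward fewer effective directions.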
Empirical correlations indicate that ERank reductions and Energy Score shifts track downstream degradation, while top-$k$ ranking consistency (Jaccard, Spearman’s $\rho$) correlates with perplexity and agentic task retention. For instance, under 4-bit quantization, ERank typically decreases by 1.4–2.3, and $J_3$ (Jaccard at top-3) falls from around 0.95 to 0.75.
5. Empirical Findings Across Agentic Tasks
ACBench yields detailed empirical trade-offs for quantization and pruning:
- Workflow Generation: 4-bit quantization (GPTQ, AWQ) incurs only 1–3 percentage point F1 decrease. Magnitude pruning may drop >10 points; unstructured Wanda/SparseGPT at 2:4 configuration cost ~5 points.
- Tool Use: On InternLM-2.5-7B, AWQ (4-bit) lowers accuracy from 71.4% to 68.6% (–2.8 pp), while GPTQ (4-bit) reaches 71.8% (+0.4 pp). Unstructured Wanda falls to 64.7% (–6.7 pp) and unstructured SparseGPT to 62.2% (–9.2 pp). JSON-structured outputs are more error-prone after pruning than plain-text outputs. Quantization outperforms all pruning strategies on this task.
- Long-Context Reasoning: 4-bit quantization degrades EM/F1 by 1–5 points, unstructured Wanda/SparseGPT by 2–7 points, and magnitude pruning by more than 10 points. On Needle-in-the-Haystack (40k tokens), quantized models retain ~90% retrieval EM up to 32k context, while pruned models degrade rapidly beyond 20k tokens.
- Real-World Application: 4-bit quantization loses 10–15 points in progress and success rate relative to FP16. Unstructured Wanda and unstructured SparseGPT preserve 60–70% and ~55% of progress, respectively. Distilled models (DeepSeek-R1-Distill-Qwen/Llama) show near-zero success rates, indicating a loss of agentic coordination.
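The tool-use figures quoted above mix absolute changes (percentage points) with the relative retention language used elsewhere in this section. The short sketch below recomputes both from the reported accuracies so the two views can be compared directly. The helper name `compare_to_baseline` and the rounding choices are assumptions for illustration; the numbers are the ones stated in the Tool Use item.

```python
def compare_to_baseline(baseline, compressed):
    """Return {method: (delta in percentage points, fraction of baseline retained)}."""
    return {
        name: (round(acc - baseline, 1), round(acc / baseline, 3))
        for name, acc in compressed.items()
    }


baseline_accuracy = 71.4  # uncompressed tool-use accuracy, in percent
compressed_accuracy = {
    "AWQ (4-bit)": 68.6,
    "GPTQ (4-bit)": 71.8,
    "Wanda (unstructured)": 64.7,
    "SparseGPT (unstructured)": 62.2,
}

report = compare_to_baseline(baseline_accuracy, compressed_accuracy)
for name, (delta_pp, retained) in report.items():
    print(f"{name}: {delta_pp:+.1f} pp, {retained:.1%} of baseline")
```

Reporting both quantities avoids a common confusion: a 9.2 percentage point drop from 71.4% corresponds to retaining about 87% of baseline accuracy, not about 91%.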
6. Practical Guidelines and Recommendations
Derived from benchmark results, the following actionable practices are recommended for agent-oriented LLM compression:
- For Planning and Tool Use: Employ 4-bit GPTQ or AWQ; F1 reductions remain ≤3 percentage points. If pruning is unavoidable, unstructured Wanda at 2:4 sparsity yields the best compromise (F1 drop of ~5 points).
- For Long-Context Retrieval and Reasoning: Use AWQ (INT4) for the best performance (≤5 percentage point F1/EM decrease). Avoid magnitude pruning and 2:4 structured sparsity when context exceeds 16k tokens; prefer unstructured Wanda or GPTQ.
- For Interactive Agents: Quantization (AWQ, GPTQ) preserves ~80% of agentic performance. Pruning leads to >30% loss. Do not depend on existing model distillation pipelines, as they eliminate multi-turn coordination capabilities.
- General Best Practices: Calibrate per-group quantization with 128 validation sequences. Maintain temperature=0 and deterministic decoding for reproducible tool call sequences. Use ERank and top-$k$ Jaccard consistency ($J_k$) as proxies for early-stage retention assessment.
7. Broader Significance and Implications
ACBench establishes that post-training 4-bit quantization (especially using GPTQ and AWQ) can reliably decrease memory and computational cost of LLMs while retaining their ability to plan, reason, and operate as agents in real-world workflows. Unstructured pruning (Wanda, SparseGPT) serves as a secondary solution with greater trade-offs; magnitude pruning and naive distillation pipelines are not suitable for agentic or multi-turn deployments. This benchmark provides practitioners with concrete methodologies and metrics focused on the aspects of LLM behavior ignored by perplexity-centric evaluations, enabling more systematic engineering of efficient, reliable LLM-based agents (2505.19433).