
ACEBench: LLM Tool-Use Benchmark

Updated 17 September 2025
  • ACEBench is a comprehensive benchmark that assesses LLM tool usage competence across atomic API calls, ambiguous instructions, and multi-turn agentic tasks.
  • It employs automated AST parsing, rule-based checks, and sandbox simulation to deliver precise metrics on end, process, and overall accuracy.
  • The benchmark's granular categorization and detailed error analysis guide targeted fine-tuning and advances in agentic intelligence research.

ACEBench is a comprehensive benchmark specifically designed to assess tool usage competence in LLMs, with a focus on both atomic API calls and complex multi-turn, agentic interactions. It introduces three principal evaluation categories—Normal, Special, and Agent—thereby enabling granular assessment across a wide spectrum of realistic tool-using scenarios. The benchmark covers nearly 2,000 annotated evaluation items and pools approximately 4,500 APIs from eight distinct domains, supporting rigorous, context-aware model evaluation. The framework integrates automated AST parsing, rule- and model-based evaluation protocols, and a sandboxed execution environment, allowing fine-grained measurement of tool invocation skill, robustness to ambiguous instructions, and the orchestration of multi-step agentic tasks. ACEBench has become a central testbed for recent advances in agentic LLMs, such as Kimi K2 and AgentScaler, revealing nuanced model performance profiles and informing future directions in agentic intelligence.

1. Benchmark Structure and Evaluation Design

ACEBench organizes its test data into three main evaluation categories (an illustrative item sketch follows the list):

  • Normal: Assesses basic tool-calling competence in single-turn and multi-turn scenarios. Subtypes include atomic cases (focusing on specific parameter types), multi-turn dialogue flows, similar API discrimination, and preference-based selection that takes user history or profile into account.
  • Special: Evaluates handling of ambiguous, under-specified, or error-prone user instructions. Scenarios include missing required parameters (Lose Param), incorrect parameter formats or values (Error Param), and user requests that exceed the functional capacity of any available tool (Exceed Function Ability).
  • Agent: Simulates complex, multi-agent, multi-turn tasks reflecting real-world decision-making and tool-orchestration. Examples include ordering takeout, managing reminders, and executing multi-step processes requiring context preservation and rigorous confirmation.
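
The public ACEBench release defines its own item schema, which is not reproduced here. Purely as an illustration, a Normal atomic case and a Special "Lose Param" case might look like the following Python sketch, where every field name ("category", "apis", "query", "expected_call", "expected_behavior") is assumed for exposition rather than taken from the benchmark.

```python
# Hypothetical evaluation items illustrating the Normal and Special categories.
# Field names are invented for illustration and do not reflect the official
# ACEBench data schema.

normal_atomic_item = {
    "category": "Normal/atomic",
    "apis": [{
        "name": "set_reminder",
        "parameters": {
            "time": {"type": "string", "required": True},     # e.g. "2025-09-18T09:00"
            "message": {"type": "string", "required": True},
            "repeat": {"type": "string", "enum": ["none", "daily", "weekly"]},
        },
    }],
    "query": "Remind me to submit the report tomorrow at 9am, repeating daily.",
    # Ground-truth call that the model's output is compared against (via AST parsing).
    "expected_call": 'set_reminder(time="2025-09-18T09:00", message="submit the report", repeat="daily")',
}

special_lose_param_item = {
    "category": "Special/Lose Param",
    "apis": normal_atomic_item["apis"],
    # The user omits the required "time" parameter; the correct behavior is to
    # request clarification rather than fabricate a value.
    "query": "Set a reminder to submit the report.",
    "expected_behavior": "request_missing_parameter",
}
```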

Evaluation integrates AST parsing for functional output decomposition, explicit parameter/type checking, and process-level tracking. Metrics include:

  • End Accuracy: Fraction of tasks where the final outcome is fully correct.
  • Process Accuracy: Proportion of steps matching an ideal event sequence.
  • Overall Accuracy: Weighted sum, calculated as:

$$\text{Overall Accuracy} = A \cdot \text{Acc}_{\text{Normal}} + B \cdot \text{Acc}_{\text{Special}} + C \cdot \text{Acc}_{\text{Agent}}$$

Here, A, B, and C are weights proportional to the square root of the sample count in each category.
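
Interpreting the weights as normalized square roots of the per-category sample counts (an assumption based on the description above; the exact normalization in the ACEBench implementation may differ), the aggregation can be sketched as follows:

```python
import math

def overall_accuracy(per_category: dict[str, tuple[float, int]]) -> float:
    """Weighted overall accuracy.

    per_category maps a category name ("Normal", "Special", "Agent") to
    (accuracy, sample_count). Weights are proportional to the square root
    of the sample count, normalized to sum to 1 -- an assumption, not the
    official ACEBench implementation.
    """
    raw_weights = {cat: math.sqrt(n) for cat, (_, n) in per_category.items()}
    total = sum(raw_weights.values())
    return sum(raw_weights[cat] / total * acc
               for cat, (acc, _) in per_category.items())

# Example with hypothetical per-category results and sample counts.
print(overall_accuracy({
    "Normal":  (0.85, 1200),
    "Special": (0.60, 400),
    "Agent":   (0.45, 300),
}))
```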

The dataset is pooled from a synthetic API corpus constructed via hierarchical API context trees and multi-stage scenario generation. Data verification combines rule-based checks, automatic model judgments, and expert annotation.
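
The construction pipeline itself is only summarized above; purely as a mental model, a hierarchical API context tree could be represented as in the sketch below, with all class and field names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional

@dataclass
class APIContextNode:
    """One node in a hierarchical API context tree (hypothetical structure).

    A root node might represent a domain (e.g. "food delivery"), inner nodes
    represent sub-scenarios, and leaves hold concrete API specifications from
    which function-calling scenarios are generated.
    """
    name: str
    api_spec: Optional[dict] = None                       # set only on leaf nodes
    children: list["APIContextNode"] = field(default_factory=list)

    def leaves(self) -> Iterator[dict]:
        """Yield every API specification in this subtree."""
        if self.api_spec is not None:
            yield self.api_spec
        for child in self.children:
            yield from child.leaves()

# Example: a tiny two-level tree for a "food delivery" domain.
tree = APIContextNode("food delivery", children=[
    APIContextNode("ordering", api_spec={"name": "place_order"}),
    APIContextNode("tracking", api_spec={"name": "track_order"}),
])
assert [spec["name"] for spec in tree.leaves()] == ["place_order", "track_order"]
```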

2. Evaluation Scenarios and Error Dimensions

ACEBench encompasses a broad array of evaluation settings:

  • Atomic Parameter Cases: Precise assessment of data types (enums, numbers, lists, booleans, objects).
  • Similar API Selection: Tests whether the model can discriminate between nearly identical tool specifications (see the sketch after this list).
  • Preference Retrieval: Requires the model to incorporate contextual user information in API selection.
  • Multi-turn Agentic Tasks: Requires context-sensitive orchestration of tool calls and state memory across turns.
  • Ambiguity and Error Handling: Challenges include under-specified inputs and out-of-distribution requests.
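
For the similar-API setting, a hypothetical pair of near-identical tool specifications illustrates what the model must discriminate between; the API names and fields below are invented, not drawn from the benchmark.

```python
# Hypothetical "Similar API Selection" case: two lexically similar tools,
# only one of which can satisfy the query.
similar_apis = [
    {
        "name": "search_flights",
        "description": "Search for available flights between two cities on a date.",
        "parameters": {"origin": "string", "destination": "string", "date": "string"},
    },
    {
        "name": "search_flight_status",
        "description": "Look up the live status of a specific flight by number.",
        "parameters": {"flight_number": "string", "date": "string"},
    },
]

# The model is scored on whether it selects search_flight_status rather than
# the superficially similar search_flights.
query = "Is flight CA1832 on time today?"
```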

Error analysis reveals specific failure modalities:

| Data Type | Dominant Error Type | Description |
|---|---|---|
| Normal | Wrong Value Errors | Correct API and type, incorrect parameter value |
| Special | Error Blindness | Failure to detect or report incomplete/invalid input |
| Agent | Context/Process Violation | Incorrect sequencing, skipped steps, poor integration |

Additional error phenomena include AST parsing inconsistencies and vague error reporting.

3. Comparative Experimental Findings

Extensive evaluation demonstrates a significant spread in performance among open-source and proprietary LLMs:

  • GPT-4 models consistently lead in overall accuracy across all ACEBench categories.
  • Open-source models such as Qwen2.5 variants exhibit narrowing performance gaps, especially in atomic tool-calling.
  • Accuracy is markedly lower on the Special and Agent categories, exposing weaknesses in error awareness and process management.
  • Fine-tuned models show relative gains on Normal data but lose generality in Special scenarios.
  • Realistic, long-context, multi-turn dialogues in Agent tracks remain challenging for all but the largest models.

Experimental results highlight that, even when most models master basic API format and parameter type, achieving precise value generation and robust error handling is nontrivial. Process Accuracy on multi-step agentic tasks is frequently below 50%, a limiting factor for deployment in real-world orchestration.

4. Impact on Agentic Intelligence Research and Applications

ACEBench has set a stringent standard for the identification and measurement of agentic and tool-use capabilities in LLMs:

  • Reveals granular insight into error mechanisms, guiding targeted fine-tuning and more robust code-based adaptation strategies.
  • Enables robust cross-model comparison by decoupling API invocation skill, natural language integration, and process management.
  • Directly influences the design and post-training reinforcement tuning of modern agentic models, including Kimi K2 and AgentScaler (Team et al., 28 Jul 2025, Fang et al., 16 Sep 2025).
  • Demonstrates that specialized agentic reinforcement learning frameworks (e.g., MUA-RL (Zhao et al., 26 Aug 2025)) driven by LLM-simulated user feedback yield marked improvements in multi-turn dialogue performance (e.g., 82.5 on ACEBench Agent).
  • Underscores the persistent challenge in scaling agentic reasoning to long-horizon tool-chaining tasks, informing future curriculum and environment design.

The benchmark’s automated evaluation reduces dependence on costly real API executions and large model-based judgment, further enabling scalable research and efficient development cycles.

5. Methodological Innovations and Benchmarking Protocols

ACEBench advances benchmarking methodology with:

  • Synthetic API Pool Construction: Hierarchical tree synthesis and dependency graph clustering produce diverse domains and realistic function-calling scenarios.
  • Automated Data Verification: Combined rule-based and model-based quality control, with human annotation to ensure correctness.
  • AST-Based Output Parsing: Consistency checks for code-based tool invocation and parameter matching (a minimal parsing sketch follows this list).
  • Sandboxed Multi-Turn Simulation: Realistic agentic interaction pipelines to simulate multi-step processes, verified outcome matching, and event tracking.
  • Weighted Metric Aggregation: Statistical weighting to normalize across heterogeneous evaluation volumes.
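
The actual ACEBench evaluator is not reproduced here; the sketch below shows how AST-based parsing and matching of a predicted tool call against a ground-truth call can work in principle, using Python's standard ast module.

```python
import ast

def parse_tool_call(call_str: str) -> tuple[str, dict]:
    """Parse a tool call such as 'set_reminder(time="09:00", repeat="daily")'
    into (function_name, keyword_arguments) without executing anything.

    A minimal sketch, not the ACEBench evaluator: it handles only a single
    call with literal keyword arguments.
    """
    tree = ast.parse(call_str, mode="eval")
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple function call: {call_str!r}")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return call.func.id, kwargs

def calls_match(predicted: str, expected: str) -> bool:
    """Check that two tool-call strings invoke the same function with the
    same literal keyword arguments (order-insensitive)."""
    return parse_tool_call(predicted) == parse_tool_call(expected)

# Argument order does not matter once the call is decomposed via the AST.
assert calls_match(
    'set_reminder(repeat="daily", time="09:00")',
    'set_reminder(time="09:00", repeat="daily")',
)
```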

The benchmark’s granular categorization of scenarios affords fine-grained performance analysis, while its comprehensive coverage facilitates diagnostic error tracing and model profiling.

6. Future Directions and Ongoing Challenges

Ongoing and future research focuses on:

  • Scaling Scenario Diversity: Expanding the benchmark with new domains, API types, and increasingly sophisticated user interaction protocols.
  • Adaptive Benchmarking: Incorporating ideas from active learning for dynamic task generation and more responsive capability assessment (Afkanpour et al., 22 May 2025).
  • Long-Horizon Process Evaluation: Designing evaluation protocols robust to extended tool-call chains, real-time database state mutation, and context dependency.
  • Agentic Data Synthesis: Leveraging larger, more intricate pipelines for generating tool-use demonstrations and agent trajectories.
  • Cross-lingual and Cross-domain Generalization: Extending ACEBench to multilingual and out-of-distribution scenarios to test broader agentic intelligence (Fang et al., 16 Sep 2025).

Persistent challenges remain in precisely evaluating and improving contextual integration, confirmation reasoning, and recovery from ambiguous or incomplete instructions.


ACEBench represents a pivotal advance in LLM tool-use evaluation, serving as both a rigorous scientific instrument for benchmarking agentic capabilities and a guidepost for future innovation in interactive, real-world AI systems. Its structure, protocols, and wide adoption have positioned it at the core of agentic intelligence research, accelerating progress toward robust deployment-ready models.
