LLM Agent Evaluation Taxonomy
- The taxonomy of LLM agent evaluation is a systematic framework that categorizes assessment along dimensions such as capability, behavior, and safety.
- It integrates objective performance metrics with protocol-driven methodologies, including scenario simulations and multi-turn evaluations.
- This structured approach standardizes evaluation pipelines, fostering reproducibility and guiding future research on robustness and scalability.
LLM agent evaluation is a diverse and rapidly evolving discipline dedicated to systematically characterizing the competencies, limitations, and behaviors of LLM-driven autonomous systems. As LLM agents progress from static completion engines to dynamic entities capable of adaptive planning, multi-turn conversation, tool-mediated reasoning, and multi-agent collaboration, traditional benchmark-driven “task success” metrics have proven inadequate for capturing the full spectrum of emergent phenomena. In response, recent literature delineates formal taxonomies that organize agent evaluation along multiple orthogonal axes such as capability, behavior, reliability, safety, scenario specificity, and interaction modality. These frameworks aim to standardize assessment pipelines and enable robust cross-framework and cross-domain comparisons, while probing nuanced factors such as social cognition, consensus dynamics, tool-use fidelity, and failure modes. The following sections synthesize current taxonomies, metrics, protocol formalizations, and scenario-based strategies, referencing pivotal research (Reza, 1 Oct 2025, Mohammadi et al., 29 Jul 2025, Ferrag et al., 28 Apr 2025, Luo et al., 27 Mar 2025, Li, 2024, Cemri et al., 17 Mar 2025, Guan et al., 28 Mar 2025).
1. High-Level Taxonomic Dimensions of LLM Agent Evaluation
LLM agent evaluation frameworks consistently partition the assessment space into distinct dimensions representing what is measured and how measurement is performed. The two-dimensional taxonomy introduced by Mohammadi et al. (Mohammadi et al., 29 Jul 2025) and the scenario- and role-driven frameworks of Ferrag et al. (Ferrag et al., 28 Apr 2025) are representative.
- Evaluation Objectives (“What”):
- Agent Behavior (task completion, output quality, latency)
- Agent Capabilities (tool-use, planning, memory, multi-agent coordination)
- Reliability (consistency, robustness to perturbation)
- Safety & Alignment (fairness, harm/toxicity, compliance)
- Evaluation Process (“How”):
- Interaction Mode (static, dynamic, continuous in simulators or live environments)
- Metric Computation (code-based checks, LLM-as-Judge, human-in-the-loop)
- Datasets and Benchmarks (synthetic, real-world, domain-specific)
- Tooling & Context (automation frameworks, leaderboards, enterprise sandboxes)
This is further detailed in protocol-centric frameworks that instantiate context-rich multi-agent debates, scenario-driven simulations, and modular tool-integration tests (Reza, 1 Oct 2025, Zhao et al., 25 Aug 2025, Ferrag et al., 28 Apr 2025).
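As a concrete illustration, the two-dimensional taxonomy above can be encoded as a small tagging schema so that every evaluation run is labeled along both the "what" and "how" axes. The sketch below is a minimal Python rendering under assumed names; `EvalObjective`, `EvaluationSpec`, and the benchmark identifier are illustrative and not drawn from any cited framework.

```python
from dataclasses import dataclass
from enum import Enum, auto


class EvalObjective(Enum):
    """The 'what' axis: which property of the agent is being measured."""
    BEHAVIOR = auto()       # task completion, output quality, latency
    CAPABILITY = auto()     # tool use, planning, memory, coordination
    RELIABILITY = auto()    # consistency, robustness to perturbation
    SAFETY = auto()         # fairness, harm/toxicity, compliance


class InteractionMode(Enum):
    """Part of the 'how' axis: how the agent is exercised."""
    STATIC = auto()
    DYNAMIC = auto()
    CONTINUOUS = auto()


class MetricComputation(Enum):
    """Part of the 'how' axis: how scores are produced."""
    CODE_BASED = auto()
    LLM_AS_JUDGE = auto()
    HUMAN_IN_THE_LOOP = auto()


@dataclass(frozen=True)
class EvaluationSpec:
    """Tags a single evaluation run along both taxonomy axes."""
    objective: EvalObjective
    interaction: InteractionMode
    metric: MetricComputation
    benchmark: str  # dataset or sandbox identifier (illustrative)


# Example: a dynamic, LLM-judged capability evaluation on a hypothetical benchmark.
spec = EvaluationSpec(EvalObjective.CAPABILITY,
                      InteractionMode.DYNAMIC,
                      MetricComputation.LLM_AS_JUDGE,
                      benchmark="tool-use-sandbox-v1")
```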
2. Core Metric Families and Formal Definitions
The empirical evaluation of LLM agents is underpinned by a suite of metrics tailored to outcome, behavioral dynamics, system-level efficiency, and psychometric properties.
A. Performance and Task Completion
- Success Rate (SR): fraction of attempted tasks completed successfully, $\mathrm{SR} = N_{\text{success}} / N_{\text{total}}$.
- Pass@k: probability that at least one of $k$ sampled runs succeeds; with $c$ successes among $n$ runs, the standard unbiased estimator is $1 - \binom{n-c}{k} / \binom{n}{k}$.
- Average Reward: mean episode reward across evaluation runs, $\bar{R} = \frac{1}{N}\sum_{i=1}^{N} R_i$.
- Tool-Use Accuracy: fraction of tool invocations that are well-formed and match the expected call (these outcome metrics are sketched in code after this list).
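A minimal sketch of these outcome metrics, assuming boolean task outcomes, scalar episode rewards, and (predicted, expected) tool-call pairs; the helper names are illustrative, and Pass@k uses the standard combinatorial unbiased estimator.

```python
from math import comb
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """SR = N_success / N_total over independent task attempts."""
    return sum(outcomes) / len(outcomes)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled runs succeeds,
    given n total runs of which c succeeded (unbiased estimator)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_reward(rewards: Sequence[float]) -> float:
    """Mean episode reward, R_bar = (1/N) * sum_i R_i."""
    return sum(rewards) / len(rewards)


def tool_use_accuracy(calls: Sequence[tuple[str, str]]) -> float:
    """Fraction of tool invocations matching the expected call,
    given (predicted, expected) pairs -- one possible operationalization."""
    correct = sum(pred == gold for pred, gold in calls)
    return correct / len(calls)


# Example usage with toy numbers.
print(success_rate([True, False, True, True]))                         # 0.75
print(pass_at_k(n=10, c=3, k=5))                                       # ~0.917
print(average_reward([1.0, 0.2, 0.6]))                                 # 0.6
print(tool_use_accuracy([("search", "search"), ("calc", "search")]))   # 0.5
```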
B. Semantic and Psychometric Metrics (Reza, 1 Oct 2025)
- Final Stance Convergence: degree to which debaters' final stances agree at the end of a debate.
- Total Stance Shift: cumulative magnitude of stance change across debate rounds.
- Semantic Diversity: dispersion of agents' responses in embedding space within a round (one possible operationalization is sketched after this list).
- Psychometric Profiles: self-reported scales of argument confidence, cognitive effort, empathy, and cognitive dissonance.
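One plausible operationalization of per-round semantic diversity is the mean pairwise cosine distance among agents' response embeddings. The sketch below assumes embeddings are already computed and is not necessarily the exact formula of the cited work.

```python
import numpy as np


def semantic_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance among a round's response embeddings.
    `embeddings` has shape (n_agents, dim); larger values indicate more
    diverse stances. Illustrative operationalization, not the paper's formula."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                 # pairwise cosine similarities
    iu = np.triu_indices(len(embeddings), k=1)  # unique agent pairs only
    return float(np.mean(1.0 - sims[iu]))


# Toy example: three agent responses embedded in 4-d space.
rng = np.random.default_rng(0)
print(semantic_diversity(rng.normal(size=(3, 4))))
```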
C. System and Human-Centric Metrics (Luo et al., 27 Mar 2025)
- Latency: wall-clock time from task issuance to final agent response.
- Throughput: number of tasks (or tokens) processed per unit time.
- Preference Rate: fraction of pairwise comparisons in which human or LLM judges prefer the agent's output.
A multi-metric profile yields a vector-valued agent assessment, supporting protocol-specific analytic pipelines (see the table below and the sketch that follows it).
| Metric Category | Exemplars | Protocol/Benchmark Examples |
|---|---|---|
| Outcome | SR, Pass@k, average reward | AgentBench, Multi-Judge Debate, SWE-bench |
| Dynamics | Stance shift, semantic diversity, bias/sentiment, planning | Mind2Web, LongEval, Tree-of-Thoughts |
| Psychometrics | Confidence, cognitive effort, empathy, dissonance | Social Laboratory, persona swap protocols |
| System | Latency, throughput | ChainEval, LangChain microbenchmarks |
| Human-Centric | Preference rate, SUS, engagement | ChatArena, WebGPT, HumanRankEval |
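To illustrate the vector-valued profile mentioned above, the following sketch collects one exemplar per metric family into a single vector and applies application-specific weights. The field names, weights, and the combination rule are illustrative only, and metrics are assumed pre-normalized to comparable scales.

```python
from dataclasses import dataclass, asdict

import numpy as np


@dataclass
class AgentProfile:
    """Vector-valued assessment with one illustrative exemplar per metric family."""
    success_rate: float      # outcome
    stance_shift: float      # dynamics
    confidence: float        # psychometrics
    latency_norm: float      # system metric, already normalized to [0, 1]
    preference_rate: float   # human-centric

    def as_vector(self) -> np.ndarray:
        return np.array(list(asdict(self).values()), dtype=float)


# Application-specific weights (illustrative); a weighted sum is only meaningful
# once all components are normalized to comparable scales.
weights = np.array([0.4, 0.1, 0.1, 0.1, 0.3])
profile = AgentProfile(0.82, 0.15, 0.70, 0.35, 0.64)
print(float(profile.as_vector() @ weights))  # single weighted score for ranking
```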
3. Frameworks, Personas, and Protocol Instantiation
Agent evaluation frameworks systematically operationalize agent and moderator personas—prompt templates conferring specific incentives or behavioral patterns (Reza, 1 Oct 2025). This design space forms "evaluation protocols" through factorial combinations:
- Debater Personas: Evidence-driven analyst (truth), values-focused ethicist (persuasion), contrarian debater (persistent disagreement)
- Moderator Personas: Neutral (impartial arbiter), consensus builder (agreement-fostering)
- Debate Length: Number of rounds as an independent variable
Protocols probe constructs such as consensus tendency, persona-induced cognition, adversarial robustness, and environmental alignment. This supports targeted experimental regimes capturing global (final stance, overall agreement), intermediate (per-round diversity, sentiment trajectories), and internal (psychometric states) agent phenomena.
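The factorial protocol design can be made concrete with a few lines that enumerate every combination of debater persona, moderator persona, and debate length. The persona labels below are illustrative stand-ins for the full prompt templates the framework would use.

```python
from itertools import product

# Illustrative persona labels; in practice each maps to a full prompt template.
DEBATER_PERSONAS = ["evidence_analyst", "values_ethicist", "contrarian_debater"]
MODERATOR_PERSONAS = ["neutral", "consensus_builder"]
DEBATE_ROUNDS = [3, 5, 10]

# One protocol cell per factorial combination of the independent variables.
protocols = [
    {"debater": d, "moderator": m, "rounds": r}
    for d, m, r in product(DEBATER_PERSONAS, MODERATOR_PERSONAS, DEBATE_ROUNDS)
]

print(len(protocols))   # 3 x 2 x 3 = 18 protocol cells in the factorial design
print(protocols[0])     # {'debater': 'evidence_analyst', 'moderator': 'neutral', 'rounds': 3}
```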
4. Taxonomies for Multi-Agent and Failure Mode Evaluation
Multi-agent LLM systems necessitate failure-mode-centric taxonomies. The MAST framework (Cemri et al., 17 Mar 2025) organizes evaluation along three top-level error categories specific to multi-agent orchestration:
- Specification & System Design: Disobey task/role specification, step repetition, loss of conversation history, termination unawareness
- Inter-Agent Misalignment: Conversation reset, failure to clarify, task derailment, information withholding, ignored input, reasoning-action mismatch
- Task Verification & Termination: Premature termination, incomplete verification, incorrect verification
Automated pipelines ("LLM-as-Judge") apply formal decision rules to label traces, and their labels are benchmarked against human annotations via inter-rater agreement (Cohen's $\kappa$).
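A hedged sketch of the agreement computation: Cohen's κ between an automated judge's failure-mode labels and a human annotator's labels over the same traces. The label set here is illustrative rather than MAST's exact category names.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators over the same traces."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Illustrative failure-mode labels on five traces (not MAST's exact taxonomy).
human = ["task_derailment", "premature_termination", "ok", "ok", "step_repetition"]
judge = ["task_derailment", "incomplete_verification", "ok", "ok", "step_repetition"]
print(round(cohens_kappa(human, judge), 3))  # ~0.737 on this toy sample
```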
5. Scenario-Specific and Benchmark-Based Taxonomies
Evaluation taxonomies are stratified by domain, modality, and interactivity (Ferrag et al., 28 Apr 2025), with more than sixty benchmarks categorized across eight groups:
- Academic/general knowledge reasoning (MMLU, BIG-Bench Extra Hard, HLE)
- Mathematical problem solving (MATH, ProcessBench, DABStep)
- Code and software engineering (Codex, ComplexFuncBench, SWE-Lancer, CASTLE)
- Factual grounding and retrieval (FACTS Grounding, CRAG)
- Domain-specific (ZODIAC, LegalAgentBench, MedAgent-Pro)
- Multimodal and embodied tasks (GAIA, EmbodiedEval, ENIGMAEVAL)
- Task selection and quality (FineTasks)
- Agentic and interactive evaluations (MultiAgentBench, τ-bench, Agent-as-a-Judge)
Benchmarks are mapped on modality (text, multimodal, embodied), interactivity (static, interactive, agentic/multi-agent), and domain specificity, pinpointing dataset gaps for future work.
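A small registry keyed on the three mapping axes makes the gap analysis mechanical: enumerate all modality-interactivity cells and report those with no benchmark. The entries and attribute assignments below are illustrative and not the survey's own mapping.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Benchmark:
    name: str
    modality: str        # "text" | "multimodal" | "embodied"
    interactivity: str   # "static" | "interactive" | "agentic"
    domain: str          # e.g. "general", "medical", "legal", "software"


# Illustrative attribute assignments only.
REGISTRY = [
    Benchmark("MMLU", "text", "static", "general"),
    Benchmark("SWE-Lancer", "text", "agentic", "software"),
    Benchmark("MedAgent-Pro", "multimodal", "agentic", "medical"),
    Benchmark("GAIA", "multimodal", "interactive", "general"),
    Benchmark("EmbodiedEval", "embodied", "interactive", "general"),
]

covered = {(b.modality, b.interactivity) for b in REGISTRY}
all_cells = set(product(["text", "multimodal", "embodied"],
                        ["static", "interactive", "agentic"]))
print(sorted(all_cells - covered))  # modality/interactivity cells with no benchmark yet
```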
6. Methodological Taxonomies: Aggregating "What" and "How"
Multi-turn conversational evaluation (Guan et al., 28 Mar 2025) employs dual taxonomies: one for evaluation goals and one for evaluation methods.
- Evaluation Goals: Task completion (TSR), response quality (BLEU, ROUGE, METEOR, BERTScore), user experience/safety, memory/context retention (context retention score), planning/tool integration (tool accuracy, hallucination rate).
- Methodological Families: Annotation-based (human gold), automated metrics (C-PMI, pairwise scoring), hybrid human–LLM, and self-judging LLM rubrics.
Best-practice frameworks combine automated filtering, specialized benchmarks for memory/tool use, and calibrated human–LLM hybrid scoring, reporting standardized metrics.
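A hedged sketch of calibrated hybrid scoring: automated, LLM-judge, and (where available) human scores are blended per dialogue, renormalizing when the human label is missing. The weights and field names are placeholders, not a prescribed scheme.

```python
from dataclasses import dataclass


@dataclass
class DialogueScores:
    """Per-dialogue scores from the three methodological families (all in [0, 1])."""
    automated: float     # e.g. normalized BLEU/BERTScore composite
    llm_judge: float     # rubric score from a judge model
    human: float | None  # gold annotation, available for a calibration subset only


def hybrid_score(s: DialogueScores, w_auto: float = 0.3, w_judge: float = 0.3,
                 w_human: float = 0.4) -> float:
    """Weighted blend; when no human label exists, renormalize over the other two.
    Weights are illustrative and would be tuned against the calibration subset."""
    if s.human is None:
        total = w_auto + w_judge
        return (w_auto * s.automated + w_judge * s.llm_judge) / total
    return w_auto * s.automated + w_judge * s.llm_judge + w_human * s.human


print(hybrid_score(DialogueScores(0.72, 0.80, 0.90)))   # fully annotated dialogue
print(hybrid_score(DialogueScores(0.72, 0.80, None)))   # automated + judge only
```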
7. Challenges, Evolving Benchmarks, and Future Directions
Challenges persist around consistency, cross-domain generalization, scalability, bias in self-judging LLM evaluators, and the need for unified, dynamic evaluation pipelines (Mohammadi et al., 29 Jul 2025, Luo et al., 27 Mar 2025).
Key areas for future research include:
- Composite holistic metrics incorporating multi-dimensional weights according to application priorities
- End-to-end enterprise-grade benchmarks integrating role-based access control, real policies, and persistent memory
- Protocol robustness (MCP, ACP, A2A) against security and privacy threats
- Automated scaling via prompt-to-leaderboard (P2L) and adversarial stress protocols
- Expanded simulation of long-horizon, multi-agent, multimodal, and domain-specific interaction
These efforts aim to ensure rigorous, reproducible, and scalable evaluation regimes for next-generation LLM agents, fostering robust agent deployment across scientific, industrial, and social applications.