LLMs as Agents: Architectures & Evaluation
- LLMs as agents are autonomous systems that embed pretrained language models into POMDP frameworks, enabling them to interpret instructions and manage multi-turn interactive tasks.
- They employ integrated prompting, chain-of-thought reasoning, and hierarchical decision policies to effectively utilize external tools and execute precise actions.
- Empirical studies reveal that closed models excel in maintaining context and strategic planning, while open-source models often face challenges with format compliance and long-horizon reasoning.
LLMs as agents refers to the use of LLMs—pretrained, autoregressive neural networks with instruction-following and reasoning capacities—embedded within interactive environments to autonomously interpret instructions, plan actions, execute decisions, and adapt strategies across diverse, open-ended tasks. Departing from traditional single-turn NLP applications, the agentic paradigm exposes LLMs to multi-turn, stateful scenarios modeled as POMDPs or sequential environments. Here, LLMs must not only generate linguistically plausible responses but also close the loop on environment interaction, actively manipulating simulated systems, external tools, or real devices to achieve user-specified goals or optimize performance under environmental feedback.
1. Conceptual Foundation and Agentic Architecture
LLMs are treated as high-capacity policies within a dynamic environment, formalized as a Partially Observable Markov Decision Process (POMDP) $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R})$, where the LLM output forms the action $a_t$, given observations $o_t$ and the current context. AgentBench, for example, defines a conversational trajectory as $\tau = (u_0, a_0, u_1, a_1, \ldots, u_t, a_t)$, where $u_i$ denotes user or environment utterances and $a_i$ denotes LLM-generated actions, which may include shell commands, SQL queries, tool invocations, or structured outputs (Liu et al., 2023).
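A minimal sketch of this rollout loop, assuming a hypothetical `llm_act` policy function and a generic `env` interface (`reset`/`step` names are illustrative, not AgentBench's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Alternating utterances u_i (user/environment) and actions a_i (LLM)."""
    turns: list = field(default_factory=list)  # [(u_0, a_0), (u_1, a_1), ...]

def run_episode(env, llm_act, max_turns=20):
    """Roll out the POMDP loop: observe u_t, let the LLM emit action a_t."""
    traj = Trajectory()
    u = env.reset()  # initial instruction/observation u_0
    for _ in range(max_turns):
        a = llm_act(history=traj.turns, observation=u)  # policy: context -> action
        traj.turns.append((u, a))
        u, done = env.step(a)  # environment feedback becomes u_{t+1}
        if done:
            break
    return traj
```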
Architectural patterns include:
- Integrated Prompting and Chain-of-Thought (CoT): The LLM receives rich context including prior messages, observations, and task history, coupled with structured instructions to elicit stepwise, reflective reasoning.
- Hierarchical Decision Policies: Agents can comprise a high-level (planning) policy generating temporally extended goals and a low-level controller tasked with selecting atomic actions to satisfy said goals (cf. GLIDER (Hu et al., 26 May 2025), LLM-augmented HRL (Prakash et al., 2023)).
- Modular Action Systems: Agentic LLMs frequently rely on action libraries, API schemas, or tool repositories to ground abstract decisions into executable functions—see Global Action Repository and multi-tool setups (Kulkarni, 3 Feb 2025, Loffredo et al., 14 Mar 2025); a minimal sketch of this pattern follows the list.
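As a concrete illustration of integrated prompting plus a modular action system, here is a hedged sketch; the registry, prompt layout, and dispatch names are hypothetical, not the Global Action Repository API:

```python
import json

# Hypothetical action registry grounding abstract decisions in executable functions.
ACTIONS = {
    "run_shell": lambda cmd: f"(stub) would execute: {cmd}",
    "query_db": lambda sql: f"(stub) would run SQL: {sql}",
}

def build_prompt(history, observation):
    """Integrated prompting: context plus instructions eliciting stepwise reasoning."""
    schema = json.dumps({name: "str -> str" for name in ACTIONS})
    return (
        f"Available tools: {schema}\n"
        f"History: {history}\nObservation: {observation}\n"
        "Think step by step, then reply with JSON: "
        '{"thought": ..., "tool": ..., "argument": ...}'
    )

def dispatch(llm_reply):
    """Parse the LLM's structured output and invoke the chosen tool."""
    msg = json.loads(llm_reply)  # Invalid Format errors surface here in practice
    return ACTIONS[msg["tool"]](msg["argument"])
```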
2. Evaluation Methodologies and Benchmarking
Robust evaluation of LLM agents requires multi-turn, open-ended environments that demand not only text generation but also strategic, sequential decision-making. AgentBench introduced eight distinct simulated settings categorized into code-grounded (OS, DB, KG), game-grounded (card games, puzzles, household tasks), and web-grounded (shopping, browsing) environments (Liu et al., 2023). Each presents a stateful, interactive API and often enforces output-format and action-legality constraints while demanding long-horizon plan coherence.
Evaluation frameworks typically employ:
- Success Rate Metrics: Normalized and weighted averages across task categories to summarize overall ability.
- Failure Taxonomies: Detailed logging and classification of Context Limit Exceeded, Invalid Format, Invalid Action, Task Limit Exceeded, and Complete outcomes, mapping failure incidence to underlying shortcomings in decision-making, format compliance, or context management (a minimal scoring-and-taxonomy sketch follows this list).
- Automated Test Pipelines: ALMITA leverages graph-driven conversation sampling to comprehensively assess coverage and robustness to perturbation in customer support workflows, extending beyond shallow function call accuracy to multi-turn procedure adherence (Arcadinho et al., 24 Sep 2024).
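A schematic of how such scoring and failure bookkeeping might be combined; the outcome labels follow the taxonomy above, but the aggregation function itself is an assumption rather than AgentBench's exact formula:

```python
from collections import Counter

OUTCOMES = {"Complete", "Context Limit Exceeded", "Invalid Format",
            "Invalid Action", "Task Limit Exceeded"}

def weighted_score(per_category_scores, weights):
    """Weighted average of normalized per-category scores (assumed aggregation)."""
    total_w = sum(weights[c] for c in per_category_scores)
    return sum(weights[c] * s for c, s in per_category_scores.items()) / total_w

def failure_taxonomy(episode_outcomes):
    """Count terminal outcomes to localize shortcomings (format vs. context, etc.)."""
    counts = Counter(episode_outcomes)
    unknown = set(counts) - OUTCOMES
    assert not unknown, f"unclassified outcomes: {unknown}"
    return counts

# Example: two categories with normalized scores and equal weights.
print(weighted_score({"db": 0.62, "os": 0.38}, {"db": 1.0, "os": 1.0}))  # 0.5
```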
3. Empirical Performance Characteristics
Empirical studies consistently show that commercial API models (e.g., GPT-4, Claude-2) markedly outperform open-source solutions (e.g., Llama-2-70B) in agentic tasks. For instance, on AgentBench, closed models achieve overall normalized scores above 1.0 while open-source models cluster near 0.5—reflecting deficiencies in instruction following, long-horizon planning, and multi-turn consistency. Performance gaps are particularly acute in environments involving complex tool selection (Knowledge Graph), strategic action selection (card games), or extended reasoning under partial observability (Liu et al., 2023).
Key determinants of failure include:
- Context Length Constraints: Models with short context windows (e.g., 2,048 tokens) cannot retain the conversational and environmental state that accumulates over an episode, forcing truncated histories and loss of critical information (see the truncation sketch after this list).
- Format and Action Compliance: Rigid output requirements (e.g., SQL or shell command syntax) expose LLM weaknesses in faithful, template-based generation under noisy or long context.
- Decision-Making Under Uncertainty: Multi-turn and partially observable settings challenge a model's ability to plan, reconsider, or correct trajectories in the face of feedback.
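One common mitigation for the context-length failure mode is to drop the oldest turns so the history fits a token budget. A minimal sketch, assuming a whitespace token counter as a stand-in for the model's real tokenizer:

```python
def truncate_history(turns, budget=2048, count_tokens=lambda s: len(s.split())):
    """Keep the most recent (u, a) pairs whose combined size fits the budget.

    `count_tokens` is a crude whitespace proxy; a real agent would use the
    model's tokenizer. Dropping old turns trades completeness for validity.
    """
    kept, used = [], 0
    for u, a in reversed(turns):
        cost = count_tokens(u) + count_tokens(a)
        if used + cost > budget:
            break
        kept.append((u, a))
        used += cost
    return list(reversed(kept))
```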
4. Strategies for Improving Agentic LLMs
Several orthogonal strategies have been identified to enhance agentic performance:
- Code and Multi-Turn Alignment Data: Training on high-quality procedural and code datasets increases task fidelity, e.g., CodeLlama exhibits higher completion rates on stepwise environments, although sometimes at the expense of nuanced game strategies (Liu et al., 2023).
- Multi-Turn Human Alignment: Fine-tuning on aligned, high-quality dialogues (from advanced models or curated user traces) improves multi-turn consistency, reduces Invalid Action/Format errors, and boosts instruction compliance.
- Hierarchical/Hybrid Approaches: For long-horizon or temporally extended tasks, integrating LLMs as high-level planners within hierarchical RL frameworks yields significant sample-efficiency improvements and rapid convergence and, importantly, decouples agent runtime policies from expensive LLM inference at deployment (Prakash et al., 2023, Hu et al., 26 May 2025); a decoupling sketch follows this list.
- Adaptive Training Environments: Using LLMs to generate and adapt training regimes (e.g., EnvGen) enables lightweight RL agents to target weaknesses, drastically reducing the number of necessary LLM calls and thus cost, while accelerating acquisition of complex skills (Zala et al., 18 Mar 2024).
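A sketch of the decoupling idea behind the hierarchical/hybrid approaches: the LLM is queried sparsely for subgoals while a cheap learned controller acts at every step. All names here are illustrative, not the GLIDER or EnvGen APIs:

```python
def hierarchical_rollout(env, llm_plan, low_level_policy, replan_every=25):
    """LLM proposes a temporally extended subgoal; the controller executes it.

    The expensive LLM call happens once per `replan_every` steps, so runtime
    cost is dominated by the lightweight low-level policy.
    """
    obs, done, step, subgoal = env.reset(), False, 0, None
    while not done:
        if step % replan_every == 0:
            subgoal = llm_plan(obs)  # e.g., "collect wood, then craft a table"
        action = low_level_policy(obs, subgoal)  # cheap per-step inference
        obs, done = env.step(action)
        step += 1
```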
5. Multi-Domain and Specialized Agentic Contexts
LLM-based agents extend well beyond generic simulated environments. Instances include:
- Medical Record Manipulation: MedAgentBench evaluates both retrieval and record-modification (FHIR POST) capabilities in a full EHR simulation, emphasizing format compliance, planning, and error resilience (Jiang et al., 24 Jan 2025); a generic FHIR sketch follows this list.
- Political Science Research: CongressRA demonstrates agentic RAG—LLMs equipped with modular toolchains for function calls (e.g., to Congress.gov or vector databases)—to autonomously conduct summarization, variable extraction, and complex metric reporting (Loffredo et al., 14 Mar 2025).
- Procedural Automation: Agent-S operationalizes SOPs in e-commerce customer care via text-delimited workflow logs, repository-based action matching, dynamic memory, and multi-agent LLM coordination, showcasing generalizability to diverse operational domains (Kulkarni, 3 Feb 2025).
- Knowledge Discovery: LLM agents can autonomously explore experimental simulation black boxes by iteratively proposing, executing, and synthesizing experiments, as shown in atomic layer processing discovery, where outcomes are strongly shaped by path-dependent exploration and trial-and-error (Werbrouck et al., 30 Sep 2025).
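For intuition about the record-modification side of such EHR tasks, here is a hedged sketch of an agent-issued FHIR POST; the endpoint shape and resource fields follow generic FHIR conventions, not MedAgentBench's actual fixtures:

```python
import json
from urllib import request

def post_observation(base_url, patient_id, value_mg_dl):
    """Create a FHIR Observation resource via HTTP POST (generic FHIR shape)."""
    resource = {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": "glucose"},  # illustrative coding, not a benchmark value
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": value_mg_dl, "unit": "mg/dL"},
    }
    req = request.Request(
        f"{base_url}/Observation",
        data=json.dumps(resource).encode(),
        headers={"Content-Type": "application/fhir+json"},
        method="POST",
    )
    with request.urlopen(req) as resp:  # the agent must still check status/errors
        return resp.status, json.loads(resp.read())
```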
6. Limitations, Safety, and Future Challenges
Despite rapid progress, significant obstacles remain:
- Poor Handling of Long-Span Reasoning: Even state-of-the-art models struggle with long-term credit assignment, sparse reward exploration, and context retention when faced with multi-step, ill-posed objectives (Liu et al., 2023, Hu et al., 26 May 2025).
- Safety and Refusal Robustness: Refusal-trained LLMs can often be jailbroken in agentic settings (notably in browser agents) by context switching or adversarial prompting, dramatically increasing attack success rates relative to chat scenarios; agentic architectures amplify contextual and multi-step vulnerabilities (Kumar et al., 11 Oct 2024).
- Evaluation and Benchmark Coverage: There is a lack of standardized, multi-dimensional metrics for assessing agentic success, failure, robustness, and user alignment. Many current benchmarks fail to capture error accumulation in extended, realistic tasks (Arcadinho et al., 24 Sep 2024).
- Resource and Efficiency Tradeoffs: Direct LLM agent deployment can be cost- and latency-prohibitive; hybridization, distillation, and environment generation offer mitigation but at the cost of flexibility or final policy optimality (Zala et al., 18 Mar 2024, Prakash et al., 2023).
7. Toolkits, Datasets, and Infrastructure
Research in LLM agents is supported by open-source toolkits and multitask benchmarks, typically providing:
- Plug-and-Play Evaluation: Containerized, Docker-based architectures with server-client separation to ensure environment isolation and reproducibility; HTTP-based APIs for agent-environment interaction (e.g., AgentBench, MedAgentBench (Liu et al., 2023, Jiang et al., 24 Jan 2025)).
- Task Scheduling Algorithms: Assignment mechanisms (e.g., Edmonds–Karp max-flow for benchmark job scheduling) to coordinate large-scale multi-model evaluation; a max-flow sketch follows this list.
- Comprehensive Datasets: High-quality, human-authored tasks spanning code execution, games, healthcare, knowledge extraction, and web interaction, structured for cross-model comparison.
- Extensibility: Benchmark frameworks are structured for easy addition of new environments, evaluation protocols, or agentic paradigms.
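To illustrate the max-flow scheduling idea, a small sketch using networkx; the worker/task graph construction is an assumption, and the actual scheduler in AgentBench may differ in its capacities and constraints:

```python
import networkx as nx

def assign_tasks(workers, tasks, can_run):
    """Assign evaluation tasks to model workers via max flow (Edmonds-Karp).

    `can_run[w]` lists the tasks worker w supports; every worker and task
    has unit capacity, so each saturated edge is a one-to-one assignment.
    """
    G = nx.DiGraph()
    for w in workers:
        G.add_edge("src", w, capacity=1)
        for t in can_run[w]:
            G.add_edge(w, t, capacity=1)
    for t in tasks:
        G.add_edge(t, "sink", capacity=1)
    _, flow = nx.maximum_flow(G, "src", "sink",
                              flow_func=nx.algorithms.flow.edmonds_karp)
    return {w: t for w in workers for t, f in flow[w].items() if f == 1}

print(assign_tasks(["w1", "w2"], ["os", "db"],
                   {"w1": ["os", "db"], "w2": ["db"]}))  # e.g., {'w1': 'os', 'w2': 'db'}
```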
LLMs as agents represent a paradigm shift from isolated language modeling toward fully autonomous, tool-using reasoning entities capable of acting in dynamic, multi-turn environments. Comprehensive evaluation frameworks and technically grounded agentic architectures are necessary to chart the strengths and ongoing limitations inherent in this approach. Continued progress will require advances in multi-turn alignment, hybrid learning strategies, procedural compliance, and secure deployment protocols to close the gap between current LLM capabilities and robust, general-purpose real-world agents (Liu et al., 2023, Prakash et al., 2023, Kulkarni, 3 Feb 2025, Werbrouck et al., 30 Sep 2025, Kumar et al., 11 Oct 2024).