LLMs as Intelligent Agents
- LLMs as intelligent agents are systems that integrate large language models with perception, reasoning, and action modules to autonomously interact with their environment.
- They employ modular architectures—including memory management, tool integration, and self-reflection—to execute long-horizon planning and coordinated multi-agent tasks.
- Evaluation metrics focus on task success, safety, efficiency, and real-world adaptability, highlighting challenges in scalability and trust in complex applications.
LLMs as intelligent agents constitute a central paradigm in contemporary AI, blending large-scale neural networks with explicit perception–reasoning–action loops, external memory and tool-use, context modeling, and—often—multi-agent collaboration. Unlike passive completion engines, agentic LLMs are architected to observe and act within an environment, pursue long-term objectives, coordinate, adapt, and, in advanced cases, demonstrate contingent social reasoning and self-reflection. This article surveys the theoretical foundations, architectural mechanisms, applied domains, evaluation frameworks, and emergent properties of LLMs as intelligent agents, synthesizing results from fundamental surveys, technical frameworks, and applied case studies spanning the scientific, industrial, and social domains.
1. Formal Definitions and Theoretical Foundations
The agentic LLM formalism abstracts the model as an interactive system embedded within a Markov Decision Process (MDP) or Partially Observable MDP (POMDP) (Mehandru et al., 2023, Wu et al., 2023, Cheng et al., 2024). An LLM-based agent is operationally defined by the quintuple

$$\mathcal{A} = (L, O, M, A, R),$$

where $L$ is the LLM (with inference settings), $O$ is the objective or final goal, $M$ the agent's internal memory, $A$ the set of actions (including tool/API calls), and $R$ the "Rethink" or self-reflection module invoked after each action (Cheng et al., 2024).
The agent’s policy $\pi$ maps the current state $s_t$ (comprising recent observations, memory, and possibly external feedback) to an action $a_t$, either via direct generation or via structured tool invocation (Ren et al., 31 Mar 2025). The core agent loop is:
- Perceive the environment and update state/memory.
- Plan the next action $a_t$ using $\pi$, often with in-context reasoning (Chain-of-Thought, CoT).
- Execute the action (text, API/tool call, communication).
- Receive environment feedback, self-reflect via $R$, and update the memory $M$.
- Repeat until the objective is met.
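The following minimal sketch illustrates one way this perceive–plan–act–reflect loop can be wired up. The `llm`, `env`, and `memory` objects, their methods, and the prompt formats are hypothetical placeholders for exposition, not an interface prescribed by the cited works.

```python
# Minimal perceive-plan-act-reflect loop (illustrative sketch; `llm`, `env`,
# `memory`, and the prompt formats are hypothetical placeholders).
def run_agent(llm, env, memory, objective, max_steps=20):
    observation = env.reset()
    for step in range(max_steps):
        # 1. Perceive: fold the new observation into working memory.
        memory.add(f"obs[{step}]: {observation}")

        # 2. Plan: ask the LLM for the next action, conditioning on the
        #    objective, recent memory, and a CoT-style instruction.
        prompt = (
            f"Objective: {objective}\n"
            f"Memory: {memory.recent()}\n"
            "Think step by step, then output exactly one action."
        )
        action = llm.generate(prompt)

        # 3. Act: execute the action (text, tool call, message) in the environment.
        observation, reward, done = env.step(action)

        # 4. Rethink: self-reflect on the outcome and record the lesson.
        reflection = llm.generate(
            f"Action: {action}\nOutcome: {observation}\n"
            "Briefly note any mistake and how to correct it."
        )
        memory.add(f"reflection[{step}]: {reflection}")

        if done:  # 5. Stop once the objective is met.
            break
    return memory
```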
Theoretical frameworks such as the Unified Mind Model (UMM) (Hu et al., 5 Mar 2025) formalize LLM agents within cognitive architecture, drawing inspiration from the Global Workspace Theory to integrate multiple specialist modules (“unconscious experts”), central processing (planning, broadcasting), and a driver system (motivation, long-term goals).
2. Architectural Patterns and Agent Components
Agentic LLM systems are modular, with distinctive components (Yang et al., 2024, 2505.16120, Ren et al., 31 Mar 2025):
- Interaction Wrapper: Ingests external stimuli (text, tools, sensor data, user input) and formats outgoing actions for the environment.
- Memory Management: Combines short-term (local conversation/history or context window) and long-term (vector databases, knowledge graphs, episodic logs) memory (Hu et al., 5 Mar 2025, Han et al., 1 Jul 2025).
- Reasoning/Planning Module: Employs prompting strategies (CoT, ToT), external symbolic planners (e.g., PDDL-based), or search procedures (e.g., MCTS) to decompose tasks, schedule tool calls, and enforce long-horizon consistency (Cheng et al., 2024).
- Tool Integration: Provides an interface to invoke external APIs, code execution environments, databases, business logic, or even other agents (Yang et al., 2024, Loffredo et al., 14 Mar 2025).
- Self-reflection/Rethink: Introspective routines for evaluating recent output, correcting mistakes, or self-improvement, leveraging in-context learning, explicit reflection modules, or reward signals (Cheng et al., 2024, Hu et al., 5 Mar 2025).
- Action Execution: Dispatches final outputs as text, function calls, hardware commands, or agent-to-agent messages.
- Feedback Loop: Monitors environmental and outcome signals to adapt behavior, update memory, or trigger further reasoning.
These modules are orchestrated within software (“digital sandbox” tool pipelines), physical (sensor–actuator loop), or adaptive hybrid (multimodal, feedback-driven) environments, as detailed in (2505.16120).
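As a concrete but hypothetical illustration of this modularity, the sketch below composes the components listed above into a single agent object. The class, protocol, and method names are assumptions for exposition, not an interface defined in the cited frameworks.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Protocol

class Memory(Protocol):
    """Abstract memory backend (short-term buffer, vector store, etc.)."""
    def add(self, item: str) -> None: ...
    def retrieve(self, query: str, k: int = 5) -> list[str]: ...

@dataclass
class AgentComponents:
    """Hypothetical container mirroring the modules described above."""
    llm: Callable[[str], str]                   # core model: prompt -> completion
    memory: Memory                              # short/long-term memory backend
    tools: dict[str, Callable[..., str]] = field(default_factory=dict)  # tool registry
    reflector: Optional[Callable[[str], str]] = None  # optional Rethink module

    def act(self, observation: str) -> str:
        # Interaction wrapper + memory management: store and retrieve context.
        self.memory.add(observation)
        context = "\n".join(self.memory.retrieve(observation))

        # Reasoning/planning: let the LLM choose a tool or produce a direct answer.
        plan = self.llm(f"Context:\n{context}\nObservation: {observation}\nNext step:")

        # Tool integration: dispatch if the plan names a registered tool.
        for name, tool in self.tools.items():
            if plan.startswith(name):
                result = tool(plan[len(name):].strip())
                plan = f"{plan}\n[tool result] {result}"
                break

        # Self-reflection/feedback loop: optionally critique before emitting the action.
        if self.reflector is not None:
            plan = self.reflector(plan)
        return plan
```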
3. Tool Use, Multi-Agent Orchestration, and Collective Behavior
LLMs as intelligent agents transcend symbol manipulation by integrating with external toolchains and multi-agent systems:
- Tool Invocation: The agent identifies, parameterizes, and executes external tools/APIs (e.g., via function calling, code generation, RESTful calls) (Yang et al., 2024, Loffredo et al., 14 Mar 2025); a minimal function-calling sketch follows this list. These can include information retrieval (RAG), mathematical or statistical computations, hardware controllers, and custom analysis modules. Advanced instantiations recursively treat tools as agents, forming multi-agent graphs (Yang et al., 2024).
- Dynamic Task Decomposition: Coordinating agents employ chain-of-thought to segment complex instructions into sub-tasks and route them to appropriate specialist agents or tools, implementing hierarchical delegation (Talebirad et al., 2023, Yang et al., 2024).
- Coordination and Specialization: Multi-agent systems (MAS) leverage modular roles—creator, planner, executor, supervisor, reviewer—with explicit message passing and coordination protocols. The LaMAS protocol (Yang et al., 2024) stipulates layers for instruction parsing, message exchange, consensus/voting, credit allocation (Shapley-style), experience sharing, privacy, and business incentives.
- Collective Intelligence: Emergent behavior arises from team-based specialization, credit-driven collaboration, and decentralized problem decomposition. Autonomy, proactiveness, reactivity, and social ability are formalized as properties of these agentic ecosystems (Yang et al., 2024, Talebirad et al., 2023).
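The sketch below shows one common function-calling pattern in schematic Python: the LLM is prompted to emit a JSON tool call, which is checked against a small registry and executed, with the result fed back for a final answer. The registry entries, prompt format, and `llm` callable are illustrative assumptions, not APIs from the cited papers.

```python
import json
from typing import Callable

# Hypothetical tool registry: name -> (callable, one-line description).
TOOLS: dict[str, tuple[Callable[..., str], str]] = {
    "search": (lambda query: f"<results for {query!r}>", "Retrieve documents for a query."),
    "calc": (lambda expr: str(eval(expr, {"__builtins__": {}})), "Evaluate an arithmetic expression."),
}

def call_with_tools(llm: Callable[[str], str], task: str) -> str:
    # Describe the available tools so the model can parameterize a call.
    tool_docs = "\n".join(f"- {name}: {desc}" for name, (_, desc) in TOOLS.items())
    prompt = (
        f"Task: {task}\nAvailable tools:\n{tool_docs}\n"
        'Reply with JSON: {"tool": <name>, "args": [<args>]} or {"answer": <text>}.'
    )
    reply = json.loads(llm(prompt))

    if "tool" in reply:
        name, args = reply["tool"], reply.get("args", [])
        if name not in TOOLS:
            return f"error: unknown tool {name!r}"
        result = TOOLS[name][0](*args)          # execute the parameterized tool call
        # Feed the tool result back so the LLM can compose the final answer.
        return llm(f"Task: {task}\nTool {name} returned: {result}\nFinal answer:")
    return reply.get("answer", "")
```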
The AIOS-Agent Ecosystem (Ge et al., 2023) extends this analogy: the LLM acts as OS kernel (planning, scheduling, resource allocation), its context window as memory, the tool suite as peripherals, and agent applications as user-level processes.
4. Domain-Specific Agent Applications and Benchmarks
LLM agents are deployed in a spectrum of domains:
- Scientific Discovery: Scientific agents integrate domain-specific tools (simulators, analysis libraries), external knowledge bases, and bespoke planners. These agents enable hypothesis generation, experiment design, simulation, and data analysis, with explicit protocols for reproducibility, statistical validation, and audit trails (Ren et al., 31 Mar 2025).
- Urban and Industrial Systems: Urban LLM agents interleave multi-modal perception (geospatial, time series, social), structured memory, and spatio-temporal reasoning for urban planning, traffic control, energy optimization, and crisis response (Han et al., 1 Jul 2025). Manufacturing systems orchestrate negotiation and scheduling via prompt-engineered TA/DA agents, reducing makespan (Zhao et al., 2024).
- Data-Analyst Agents: Agentic architectures for data analysis target semantic-aware, multi-modality integration, autonomous pipelining, tool-augmented reasoning, and open-world adaptation, spanning structured, semi-structured, unstructured, and heterogeneous data (Tang et al., 28 Sep 2025).
- Conversational and Social Agents: Integration of Theory of Mind (ToM) modules improves goal-directed dialog and social strategy, as in ToMAgent, where mental-state reasoning and dialogue lookahead reinforce relationship maintenance and long-horizon planning (Hwang et al., 26 Sep 2025).
- Security and Compliance: Multi-agent LLM frameworks (AutoGen-based) proactively detect and mitigate security vulnerabilities by combining policy-driven RAG, dedicated security/business agents, and closed-loop validation for regulatory compliance (e.g., OWASP Top 10) (Fasha et al., 26 Jan 2026).
- Benchmarking: Methodological rigor emerges in agent benchmarks such as SmartPlay (Wu et al., 2023), which models LLMs as agents in POMDP game environments to dissect nine capabilities, including long-text reasoning, planning, rule following, generalization, learning from interaction, error recovery, and spatial reasoning. AgentBench and MLAgentBench provide additional broad-spectrum testbeds.
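As a schematic illustration of how such benchmarks aggregate results, the sketch below runs an agent over episodes tagged with a target capability and reports per-capability success rates. The `agent` and `env` interfaces are assumptions for exposition, not the SmartPlay or AgentBench APIs.

```python
from collections import defaultdict

def evaluate(agent, episodes, max_steps=50):
    """Aggregate success rate per capability across benchmark episodes.

    `episodes` is a list of (env, capability) pairs; `env` is assumed to expose
    reset()/step() and to signal success via `info["success"]` (hypothetical).
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for env, capability in episodes:
        obs = env.reset()
        success = False
        for _ in range(max_steps):
            action = agent.act(obs)
            obs, reward, done, info = env.step(action)
            if done:
                success = bool(info.get("success", reward > 0))
                break
        totals[capability] += 1
        wins[capability] += int(success)
    return {cap: wins[cap] / totals[cap] for cap in totals}
```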
5. Evaluation, Challenges, and Best Practices
Robust evaluation of LLM agents is domain-specific and multi-factorial:
- Task Success and Pipeline Accuracy: Metrics include per-action accuracy, multi-turn robustness, tool-calling correctness, dialogue coherence, planning efficiency, and pipeline-level task accomplishment (Arcadinho et al., 2024, Ren et al., 31 Mar 2025).
- Safety, Trust, and Security: Agentic safeguards span input/output validation (via security agents), policy citation enforcement, context and action audits, red-teaming, and guardrails for high-stakes generation (Fasha et al., 26 Jan 2026).
- Latency and Efficiency: Transformer-based inference latency necessitates model compression, distillation, kernel optimization, and intelligent caching (2505.16120).
- Hallucination Control: Retrieval-augmented generation, schema validation, and multi-turn verification mitigate ungrounded or spurious outputs (2505.16120, Arcadinho et al., 2024); a schema-validation sketch follows this list.
- Memory and Long-Horizon Reasoning: Persistent, structured, and hierarchical memory architectures address context limits and enable knowledge accumulation across task sessions (Hu et al., 5 Mar 2025, Ren et al., 31 Mar 2025).
- Continual Learning and Adaptability: Open-world agents require continual adaptation, online feedback incorporation, and OOD-safe exploration routines (Tang et al., 28 Sep 2025).
- Ethics and Accountability: Deployed agents include audit logs, human-in-the-loop mechanisms, privacy-preserving data handling, explicit value alignment, and periodic human audit for bias or harm (Ren et al., 31 Mar 2025, 2505.16120, Ge et al., 2023).
- Design Guidelines: Modularization, structured prompt engineering, explicit tool orchestration, closed-loop monitoring, privacy/guardrail layers, and metric-driven iterative development undergird reproducible, scalable deployment (2505.16120, Cheng et al., 2024, Ge et al., 2023).
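The following sketch illustrates one simple schema-validation guardrail of the kind referenced above: the LLM's structured output is parsed, checked against a declared schema, and re-prompted with the violations on failure. The schema, retry policy, and `llm` callable are illustrative assumptions, not requirements from the cited works.

```python
import json
from typing import Callable

# Hypothetical output schema: field name -> (expected type, required?).
SCHEMA = {"answer": (str, True), "citations": (list, True), "confidence": (float, False)}

def validate(payload: dict) -> list[str]:
    """Return a list of schema violations (empty means the output is valid)."""
    errors = []
    for field, (ftype, required) in SCHEMA.items():
        if field not in payload:
            if required:
                errors.append(f"missing field {field!r}")
        elif not isinstance(payload[field], ftype):
            errors.append(f"field {field!r} should be {ftype.__name__}")
    return errors

def guarded_generate(llm: Callable[[str], str], prompt: str, max_retries: int = 2) -> dict:
    """Generate, validate against SCHEMA, and re-prompt with errors on failure."""
    request = prompt
    for _ in range(max_retries + 1):
        try:
            payload = json.loads(llm(request))
        except json.JSONDecodeError:
            payload, errors = {}, ["output was not valid JSON"]
        else:
            errors = validate(payload)
        if not errors:
            return payload
        # Repair loop: feed the violations back so the model can correct itself.
        request = f"{prompt}\nYour previous output had errors: {errors}. Return corrected JSON."
    raise ValueError(f"schema validation failed after {max_retries} retries: {errors}")
```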
6. Open Problems and Future Directions
Despite rapid progress, several challenges persist:
- Modular and Scalable Memory: Explicit cross-agent, persistent, and event-driven memory hierarchies for long-horizon reasoning (Ge et al., 2023, Hu et al., 5 Mar 2025).
- Structured Communication Protocols: Efficient, reliable multi-agent language, including DSL and semi-structured messaging, to maintain coherence at scale (Yang et al., 2024, Ge et al., 2023).
- Safe and Trustworthy Tool Integration: Static and dynamic analysis routines for prompt-injected tools, agent sandboxes, and robust runtime auditing (Fasha et al., 26 Jan 2026, Ge et al., 2023).
- Efficient and Generalizable Learning: RL paradigms leveraging code execution, fidelity metrics, and meta-learning for rapid adaptation, modular skill transfer, and improved credit assignment in decentralized setups (Yang et al., 2024).
- Complex Social and Multi-Modal Reasoning: Native ToM, multi-level belief tracking, and direct multimodal input for socially aware, multi-sensory, and physically grounded agents (Hu et al., 5 Mar 2025, Hwang et al., 26 Sep 2025, Han et al., 1 Jul 2025).
- Unified Benchmarking and Evaluation: Development of foundational benchmarks that test coupled tool use, memory, planning, and reasoning across modalities and environments (Wu et al., 2023, Tang et al., 28 Sep 2025).
LLM-based intelligent agents thus synthesize pre-trained neural architectures, modular planning components, tool/collaboration protocols, and adaptive feedback mechanisms within unified frameworks. These systems, evaluated via realistic multi-domain workflows and rigorous agentic benchmarks, promise scalable, generalizable, and socially-aware intelligence, while also requiring principled design for trust, safety, and robustness.