
LLM-based Agentic Systems

Updated 31 December 2025
  • LLM-based agentic systems are frameworks that integrate large language models to autonomously execute context-sensitive decisions across diverse domains.
  • They employ multistep reasoning, secure tool integration, and dynamic memory management to handle complex workflows in areas like healthcare and software engineering.
  • Research focuses include robust evaluation metrics, transparent governance, and risk mitigation strategies to enhance both performance and security.

LLM-based agentic systems are computational frameworks that integrate LLMs as their core reasoning engines, directing autonomous, context-sensitive workflows that span tool usage, memory management, and complex decision processes. These systems are deployed across diverse domains—including software engineering, healthcare, scientific discovery, customer service, and more—where they orchestrate multi-modal perception, tool invocation, planning, and iterative adaptation to achieve specified goals, sometimes under human-in-the-loop supervision but increasingly in multi-role, multi-agent, or fully autonomous configurations. The architecture and operational semantics of LLM-based agentic systems, as well as their security, governance, and evaluation regimes, are subject to ongoing high-stakes research due to the profound flexibility and risks these systems exhibit (Liu et al., 6 Sep 2025, Bluethgen et al., 10 Oct 2025, Raza et al., 4 Jun 2025, Bousetouane, 1 Jan 2025, 2505.16120, Li et al., 13 Oct 2025, Zhao et al., 25 Aug 2025, Guo et al., 10 Oct 2025).

1. Fundamental Architecture and Formalization

The defining feature of LLM-based agentic systems is the placement of an autoregressive LLM at the core of a looped interaction between observations, “reason-style” chains-of-thought, tool or API calls, return consumption, and action selection. A typical agentic cycle can be expressed as:

$$\mathcal{C}_0 \gets \operatorname{Init}(P_U)\ ;\ k\gets 0\ ;\ \mathbf{while}\;\neg Q(\mathcal{C}_k, k)\;\mathbf{do}\ y_{k+1}=a_k(\mathcal{C}_k, g, t)\ ;\ \mathcal{C}_{k+1}=a'_k(\mathcal{C}_k, y_{k+1}, g, t)$$

where $\mathcal{C}_k$ is the agent’s latent context at step $k$, $a_k$ is an action (reasoning, tool-invocation, reflection), $g$ the current goal, and $t$ an external tool or module (Zhao et al., 25 Aug 2025). Architectures vary from single-agent, looped LLM-driven decision engines to multi-agent systems (MAS) where numerous agents, each with role- or tool-specific capabilities and memory, coordinate within hierarchical, centralized, or decentralized frameworks. Agentic architectures typically incorporate:

  • An agent core (LLM-based)
  • Tool integration and dynamic tool selection via schemas or adapters
  • Short-term and long-term memory modules (vector stores, structured note graphs, retrieval-augmented generation)
  • Planning/replanning logic (ReAct, ReWOO, OODA loops)
  • Orchestration protocols (e.g., Model Context Protocol (MCP), Agent-to-Agent (A2A))
  • Security, validation, and guardrail layers (runtime schema validation, explicit guard models, or multi-agent sanctioning)

For example, in radiology, LLM-driven agents interleave “plan–act–observe” cycles with structured tool invocation, maintain context buffers integrating multiple data streams, and adapt decisions on the fly (Bluethgen et al., 10 Oct 2025). In software engineering agentic pipelines, planning, decomposition, self-critique, and tool usage are modularized for task adaptivity and process transparency (Guo et al., 10 Oct 2025).
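The agentic cycle formalized at the start of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an implementation from any cited paper: the names `Context`, `init_context`, `should_stop`, `select_action`, and `update_context` are hypothetical stand-ins for $\mathcal{C}_k$, $\operatorname{Init}$, $Q$, $a_k$, and $a'_k$, the LLM call is replaced by a hard-coded stub policy, and the step counter is incremented explicitly.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Tools = Dict[str, Callable[[str], str]]


@dataclass
class Context:
    """Hypothetical stand-in for the agent context C_k."""
    goal: str
    history: List[str] = field(default_factory=list)


def init_context(user_prompt: str) -> Context:
    # Init(P_U): build C_0 from the user prompt.
    return Context(goal=user_prompt)


def should_stop(ctx: Context, k: int, max_steps: int = 8) -> bool:
    # Q(C_k, k): stop on a final answer or when the step budget is spent.
    return k >= max_steps or any(h.startswith("final:") for h in ctx.history)


def select_action(ctx: Context) -> str:
    # a_k: a stub policy standing in for the LLM (assumption for illustration).
    tool_results = [h for h in ctx.history if h.startswith("tool:")]
    if not tool_results:
        return "call:calculator:2+3"
    return "final:" + tool_results[-1].split(":", 1)[1]


def update_context(ctx: Context, y: str, tools: Tools) -> Context:
    # a'_k: consume y_{k+1}, invoking a tool if the action requests one.
    if y.startswith("call:"):
        _, name, arg = y.split(":", 2)
        ctx.history.append("tool:" + tools[name](arg))
    else:
        ctx.history.append(y)
    return ctx


def run_agent(user_prompt: str, tools: Tools) -> Context:
    ctx = init_context(user_prompt)
    k = 0
    while not should_stop(ctx, k):
        y = select_action(ctx)
        ctx = update_context(ctx, y, tools)
        k += 1
    return ctx


def calculator(expr: str) -> str:
    # Toy tool: adds two integers written as "a+b".
    a, b = expr.split("+")
    return str(int(a) + int(b))


ctx = run_agent("add 2 and 3", {"calculator": calculator})
print(ctx.history)
```

Keeping action selection and context update as separate functions mirrors the two maps in the formula and makes it straightforward to insert a validation or guardrail layer between them.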

2. Key Mechanisms: Reasoning, Tool Use, and Memory

Agentic operation is dominated by three mechanism classes:

a. Multistep Reasoning and Internal Looping:

LLMs combine “chain-of-thought” prompting, reflection, and context-aware planning (e.g., Reason–Act–Observe, OODA loops) instead of producing single-pass outputs or relying purely on static prompts (Bousetouane, 1 Jan 2025, Zhang et al., 3 Sep 2025, Bluethgen et al., 10 Oct 2025, Casella et al., 9 Mar 2025). Loop efficiency, milestone completion, and plan sufficiency metrics are used to quantify agentic reasoning (Bluethgen et al., 10 Oct 2025).

b. Tool Integration and Secure Invocation:

Agentic systems execute external calls—via JSON schemas, model function signatures, or tool-invocation prompts (TIPs)—to calculators, data retrievers, simulators, IDEs, robotic APIs, and more. The Tool Invocation Prompt (TIP), defined as

$$TIP = TIP(d_{tool}, f_{tool}, s_{tool}),$$

includes tool descriptions, invocation format schemas, and guard rules (Liu et al., 6 Sep 2025). TIP security is critical: sophisticated attacks (e.g. remote code execution, parser DoS) can be mounted via prompt injection or manipulation of $d_{tool}$ and $f_{tool}$, exploiting lax schema enforcement or weak guard models. Ensuring robustness requires formal schema validation, layered guard models, independent runtime monitoring, and comprehensive audit logging (Liu et al., 6 Sep 2025).
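As a rough illustration of schema validation for tool calls, the sketch below checks a proposed call against a registered schema before anything is executed. It is an assumption-laden toy, not the mechanism of the cited work: `ToolSpec`, `REGISTRY`, `validate_call`, the `read_file` tool, and the specific guard constants are all hypothetical, and the guard rules are deliberately kept in code rather than in the free-text description so that an edited description cannot weaken them.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ToolSpec:
    description: str          # plays the role of d_tool; treated as untrusted text
    params: Dict[str, type]   # plays the role of f_tool; required names and types


# Guard rules (the role of s_tool), held outside any LLM-editable description.
MAX_STRING_LENGTH = 256
FORBIDDEN_SUBSTRINGS = (";", "&&", "|", "`", "$(")

REGISTRY: Dict[str, ToolSpec] = {
    "read_file": ToolSpec(
        description="Read a text file from the workspace.",
        params={"path": str, "max_bytes": int},
    ),
}


def validate_call(name: str, args: Dict[str, Any]) -> Tuple[bool, str]:
    """Return (accepted, reason) for a proposed tool call."""
    spec = REGISTRY.get(name)
    if spec is None:
        return False, "unknown tool"
    if set(args) != set(spec.params):
        return False, "argument names do not match schema"
    for key, expected in spec.params.items():
        value = args[key]
        # Exact type match, so that e.g. True is not accepted as an int.
        if type(value) is not expected:
            return False, f"argument {key!r} has wrong type"
        if isinstance(value, str):
            if len(value) > MAX_STRING_LENGTH:
                return False, f"argument {key!r} is too long"
            if any(tok in value for tok in FORBIDDEN_SUBSTRINGS):
                return False, f"argument {key!r} contains a forbidden token"
    return True, "ok"


print(validate_call("read_file", {"path": "notes.txt", "max_bytes": 1024}))
print(validate_call("read_file", {"path": "notes.txt; rm -rf /", "max_bytes": 1024}))
```

A substring blocklist is only a placeholder for a real guard layer; in practice it would sit alongside the layered guard models, runtime monitoring, and audit logging mentioned above.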

c. Agentic Memory and Retrieval:

Advanced systems encode observations, intermediate reasoning, and tool returns as “atomic” notes with rich attributes (content, context, tags, embeddings), realizable as Zettelkasten-style graphs in which links between notes are created dynamically and note representations evolve over time (Xu et al., 17 Feb 2025). This supports context-aware reasoning at scale, efficient retrieval, and prompts that reflect a continuously updated understanding of the task. Such approaches are reported to achieve large gains in performance, token efficiency, and fidelity on long-horizon tasks relative to graph-DB or flat episodic memory baselines (Xu et al., 17 Feb 2025).
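A note-based memory of this kind can be sketched as a small data structure. The example below is a simplified illustration, not the cited system: `Note`, `NoteStore`, the link threshold of 0.8, and the hand-written three-dimensional embeddings are assumptions, and only dynamic linking and similarity retrieval are shown (the evolution of existing notes is omitted).

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Note:
    note_id: int
    content: str
    context: str
    tags: Set[str]
    embedding: List[float]
    links: Set[int] = field(default_factory=set)


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class NoteStore:
    def __init__(self, link_threshold: float = 0.8) -> None:
        self.notes: Dict[int, Note] = {}
        self.link_threshold = link_threshold  # assumed value for illustration

    def add(self, content: str, context: str, tags: Iterable[str],
            embedding: List[float]) -> Note:
        note = Note(len(self.notes), content, context, set(tags), list(embedding))
        # Dynamic linking: connect the new note to sufficiently similar ones.
        for other in self.notes.values():
            if cosine(note.embedding, other.embedding) >= self.link_threshold:
                note.links.add(other.note_id)
                other.links.add(note.note_id)
        self.notes[note.note_id] = note
        return note

    def retrieve(self, query_embedding: List[float], k: int = 1) -> List[Note]:
        ranked = sorted(
            self.notes.values(),
            key=lambda n: cosine(query_embedding, n.embedding),
            reverse=True,
        )
        return ranked[:k]


store = NoteStore()
a = store.add("Tool returned HTTP 429", "api call", {"error", "rate-limit"}, [1.0, 0.0, 0.1])
b = store.add("Retry with backoff succeeded", "api call", {"retry"}, [0.9, 0.1, 0.2])
c = store.add("User prefers metric units", "preferences", {"user"}, [0.0, 1.0, 0.0])
print(a.links, b.links, c.links)
print(store.retrieve([1.0, 0.0, 0.0], k=1)[0].content)
```

In a real deployment the embeddings would come from an embedding model and the store would be backed by a vector index; the linear scan here is only adequate for a handful of notes.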

3. Security, Trust, and Governance

LLM-based agentic systems introduce new risks that require TRiSM-driven frameworks (Trust, Risk, and Security Management) (Raza et al., 4 Jun 2025). Salient points include:

  • Attack Surface and Threats: Agentic architectures are exposed to adversarial prompt injection, memory poisoning, tool hijacking (via TIP exploitation), and emergent multi-agent collusion or feedback loops. Attackers may access tool schemas, manipulate tool descriptions, or register malicious tools (e.g., via MCP servers), conducting Denial-of-Service or logic-based RCE attacks (Liu et al., 6 Sep 2025).
  • Governance: Comprehensive oversight must combine audit trails, versioned artifact logs, role-based or attribute-based access controls (e.g., as in SAGA, AAC), and adaptive, human-in-the-loop escalation for sensitive actions (Li et al., 13 Oct 2025, Syros et al., 27 Apr 2025).
  • Access Control Evolution: Static binary allow/deny decisions are supplanted by multi-dimensional, context-sensitive information governance, which scores identity, relationship, scenario, and norm dimensions and then dynamically selects an output transformation (summarization, redaction, paraphrasing) that balances risk against utility (Li et al., 13 Oct 2025).
  • Best Practices:
    • Strict, machine-verified interface schemas (e.g. JSON-Schema enforcement for tool calls)
    • Independent verification of tool schemas before prompt expansion
    • Isolation of critical guard rules outside dynamic, LLM-editable descriptions
    • Runtime monitoring for anomalous invocations or memory accesses
    • Defense-in-depth: multi-model consensus, continuous adversarial testing, and prompt/fuzz testing (Liu et al., 6 Sep 2025, Raza et al., 4 Jun 2025).
  • Liability and Principal-Agent Considerations: Liability attribution must account for both inherent single-agent instability (irreproducibility, prompt idiosyncrasy) and emergent MAS behaviors (failure cascades, agent collusion, oversight breakdown). Auditability, role allocation, and “delegation boundary” specification are critical to clarify responsibility in deployment and litigation (Gabison et al., 4 Apr 2025).
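The context-sensitive access control described in the list above can be illustrated with a small scoring function. Everything specific here is an assumption made for the sketch: the `RequestContext` fields are scored in [0, 1], the combination is an unweighted mean, the thresholds are arbitrary, and the ordering of transformations from least to most restrictive is not taken from the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Per-request scores in [0, 1]; higher means more appropriate to share."""
    identity: float      # how well the requester's identity is established
    relationship: float  # closeness of requester to the data subject
    scenario: float      # fit between the request and the current task
    norm: float          # conformity with applicable sharing norms


def trust_score(ctx: RequestContext) -> float:
    # Unweighted mean of the four dimensions (an assumed aggregation rule).
    return (ctx.identity + ctx.relationship + ctx.scenario + ctx.norm) / 4.0


def choose_transformation(ctx: RequestContext) -> str:
    """Map the aggregate score to an output transformation."""
    score = trust_score(ctx)
    if score >= 0.8:
        return "verbatim"
    if score >= 0.6:
        return "paraphrase"
    if score >= 0.4:
        return "summarize"
    if score >= 0.2:
        return "redact"
    return "deny"


print(choose_transformation(RequestContext(0.9, 0.9, 0.9, 0.9)))
print(choose_transformation(RequestContext(0.5, 0.5, 0.5, 0.5)))
print(choose_transformation(RequestContext(0.1, 0.1, 0.1, 0.1)))
```

The point of the sketch is the shape of the decision: a graded choice among transformations replaces a single allow/deny bit, so a borderline request can still receive a reduced but useful answer.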

4. Evaluation Methodologies and Metrics

Robust evaluation of agentic systems requires metrics at multiple levels (Bluethgen et al., 10 Oct 2025, Pehlke et al., 10 Nov 2025, Raza et al., 4 Jun 2025):

| Level | Metric Examples | Purpose |
|---|---|---|
| Planning | Plan accuracy, omission/insertion rates | Assess the logical structure of agentic plans |
| Execution | Tool-use accuracy, milestone hit rate, loop efficiency | Quantify stepwise reliability and process efficiency |
| Outcome | Task success rate, expert/LLM scoring, pass@k, calibration | Verify substantive output quality and robustness |
| System-level | Efficiency gain (e.g., ΔT), throughput, error cascade impact | Measure end-to-end impact and resilience |
| Multi-agent | Component Synergy Score (CSS), Tool Utilization Efficacy (TUE) | Rate collaboration effectiveness and tool-call reliability |
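A few of the execution- and outcome-level metrics can be computed with short helper functions. In the sketch below, `pass_at_k` uses the commonly used unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$ for $n$ samples of which $c$ are correct; the definitions of `tool_use_accuracy`, `milestone_hit_rate`, and `loop_efficiency` are plausible assumptions for illustration, since the cited papers may define these quantities differently.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb(n - c, k) is 0 when k > n - c, which yields 1.0 as expected.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def tool_use_accuracy(correct_calls: int, total_calls: int) -> float:
    # Assumed definition: fraction of tool calls judged correct.
    return correct_calls / total_calls if total_calls else 0.0


def milestone_hit_rate(reached: int, expected: int) -> float:
    # Assumed definition: fraction of expected milestones actually reached.
    return reached / expected if expected else 0.0


def loop_efficiency(minimal_steps: int, actual_steps: int) -> float:
    # Assumed definition: ratio of the shortest known plan to steps taken.
    return minimal_steps / actual_steps if actual_steps else 0.0


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
print(tool_use_accuracy(9, 12), milestone_hit_rate(3, 4), loop_efficiency(5, 8))
```

With 10 samples and 3 correct, pass@1 equals the raw success fraction of 0.3, while pass@5 rises to roughly 0.917 because any of the five draws may succeed.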

Qualitative evaluation (e.g., human or LLM-as-judge rubrics, case studies) and ablations (with and without memory evolution, self-refinement, reflection, etc.) are key to isolate contributions and diagnose weaknesses (Pehlke et al., 10 Nov 2025, Zhang et al., 3 Sep 2025). Agentic failure attribution is particularly challenging—counterfactual replay and reinforcement-learned tracers (AgenTracer-8B) can be used to pinpoint decisive errors, showing up to 18.18% accuracy gains over proprietary LLMs in multi-agent trace diagnosis (Zhang et al., 3 Sep 2025).
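The general idea behind counterfactual replay can be shown on a toy pipeline: replace one step at a time with a known-good version, rerun the trajectory, and report the earliest replacement that turns failure into success. This is a generic sketch and not the AgenTracer procedure, whose details are not given here; the `Step` type, `run`, `find_decisive_step`, and the three arithmetic "agents" are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A step is (agent name, deterministic transform on the running state).
Step = Tuple[str, Callable[[int], int]]


def run(steps: List[Step], x: int) -> int:
    """Replay a trajectory from the initial state x."""
    for _, fn in steps:
        x = fn(x)
    return x


def find_decisive_step(
    steps: List[Step],
    reference: List[Step],
    x0: int,
    is_success: Callable[[int], bool],
) -> Optional[int]:
    """Return the index of the earliest step whose correction fixes the run."""
    if is_success(run(steps, x0)):
        return None  # nothing to attribute
    for i in range(len(steps)):
        patched = steps[:i] + [reference[i]] + steps[i + 1:]
        if is_success(run(patched, x0)):
            return i
    return None  # no single-step correction is sufficient


reference: List[Step] = [
    ("parser", lambda x: x + 1),
    ("planner", lambda x: x * 2),
    ("executor", lambda x: x + 3),
]
observed: List[Step] = [
    ("parser", lambda x: x + 1),
    ("planner", lambda x: x * 3),  # injected fault
    ("executor", lambda x: x + 3),
]

expected = run(reference, 1)
idx = find_decisive_step(observed, reference, 1, lambda y: y == expected)
print(idx, observed[idx][0] if idx is not None else None)
```

Real agent trajectories are stochastic and lack a ready-made reference step, which is part of why learned tracers are attractive; the sketch assumes determinism and an oracle correction for every step.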

5. Applications and Domain-Specific Patterns

LLM-based agentic architectures are deployed in high-complexity, high-vigilance domains:

  • Software Engineering:

Agentic systems orchestrate planning, code synthesis, iterative testing/self-refinement, multi-agent collaboration, formal verification, and continuous memory updates. Benchmarks span function-level and repository-level tasks (e.g., HumanEval, SWE-Bench), with agentic methods (MAGIS, AutoCodeRover) often achieving 20–30% gains on complex tasks over prompting/fine-tuning (Guo et al., 10 Oct 2025).

  • Healthcare and Radiology:

Agents integrate with DICOMweb, FHIR, and hospital IT networks, coordinate task flows ranging from report drafting to MDT management, invoke image analysis tools, and manage patient-specific context buffers (Bluethgen et al., 10 Oct 2025).

  • Security and Governance:

Security architectures such as SAGA employ provider-mediated agent registration, cryptographic access tokens, and fine-grained contact policies, balancing lifecycle control and delegated computation with minimal runtime overhead (Syros et al., 27 Apr 2025). AegisLLM demonstrates multi-agent cooperative defense, adaptive prompt optimization, and fast runtime response to evolving adversarial threats (Cai et al., 29 Apr 2025).

  • Mathematical Reasoning/Data Generation:

Agentic multi-stage pipelines (e.g., AgenticMath) orchestrate roles in filtering, paraphrasing, solution augmentation, and QA evaluation, enabling the creation of compact, high-quality, domain-diverse datasets that outperform much larger naive collections for supervised fine-tuning (Liu et al., 22 Oct 2025).

  • Social Simulation and Decision Support:

Multi-agent LLM systems (Generative Agents, Adaptive Decision Discourse frameworks) are used to simulate emergent norm formation, collaborative problem-solving, and breadth-first exploration of strategy spaces in multi-stakeholder scenarios (Haase et al., 2 Jun 2025, Dolant et al., 16 Feb 2025).

6. Research Challenges and Frontiers

Major challenges and future research directions include:

7. Synthesis and Best Practices

Practitioners should adhere to the following (explicitly emphasized in the literature):

LLM-based agentic systems constitute a rapidly evolving field where system design, operational governance, evaluation, and secure deployment must be tightly integrated. Success in mission-critical or high-stakes domains is predicated on rigorous application of layered security, modular architecture, robust memory, and explainable, auditable reasoning traces (Liu et al., 6 Sep 2025, Raza et al., 4 Jun 2025, Bluethgen et al., 10 Oct 2025, Zhao et al., 25 Aug 2025, Bousetouane, 1 Jan 2025).
