Foundation Agents: Architectures & Systems

Updated 4 July 2026

Foundation agents are interactive, goal-directed systems that extend static foundation models by integrating perception, planning, and execution for decision-making in dynamic environments.
They use modular architectures combining memory pipelines, planning engines, execution modules, and responsible AI guardrails to handle complex sequential tasks.
Recent advances leverage self-supervised and reinforcement learning along with deployment-time memory adaptations to boost performance across domains like GUI control and web navigation.

Foundation agents are goal-directed systems built on foundation models but extended beyond static knowledge acquisition into interaction with dynamic environments, where they perceive state, interpret goals, plan multi-step behavior, execute actions, observe consequences, recover from errors, and improve through interaction (Liu et al., 2024). In the research literature, the term spans several related emphases: a paradigm for sequential decision making grounded in large-scale interactive pretraining and adaptation (Liu et al., 2024); a software-architecture family organized around memory, planning, execution, reflection, and guardrails (Lu et al., 2023, Zhou et al., 2024, Liu et al., 2024); and a growing class of deployed or deployable systems for web navigation, GUI control, software engineering, persistent personalization, industrial assistance, and multi-agent coordination (Liu et al., 2024, Zhong et al., 13 May 2026, Lei et al., 8 Jun 2026, Henkel et al., 4 May 2026, Liu et al., 22 May 2026). Across these strands, the central shift is from foundation models as predictors or generators to foundation agents as actors whose competence depends not only on pretrained weights, but also on interfaces, memory pipelines, runtime substrates, tool access, and post-training through environment feedback (Liu et al., 2024, Zhong et al., 13 May 2026, Liu et al., 5 Jun 2026).

1. Concept and defining properties

A recurring formulation treats foundation agents as the decision-making analogue of foundation models: broadly capable systems trained or adapted at scale so that they can generalize across tasks, domains, and modalities instead of being optimized only for one narrow environment (Liu et al., 2024). One position paper characterizes them through three properties: a unified representation of the variables involved in the decision process, a unified policy interface across tasks and domains, and interactive decision-making in physical and virtual worlds (Liu et al., 2024). A more operational definition, instantiated in GUI control, describes a foundation agent as a general-purpose agent built on a foundation model family but extended into goal-directed interaction with dynamic environments, capable of perceiving GUI state, interpreting human goals, planning multi-step behavior, executing actions on digital devices, observing consequences, recovering from errors, and improving through interactive training (Liu et al., 2024).

The architectural literature sharpens that intuition by treating autonomy as arising from foundation-model capabilities combined with explicit subsystems for interaction engineering, memory, planning, execution, and responsible-AI controls (Lu et al., 2023). In that view, the user supplies a high-level goal rather than a full action script, and the agent must decompose the goal into manageable tasks, generate strategies, select tools or collaborators, execute actions, and revise plans under feedback (Lu et al., 2023). A closely related taxonomy defines foundation-model-based agents as architected systems whose capabilities are determined by choices about input modality, memory management, tool use, planning engine, workflow, roles, reflection, learning capability, autonomy level, and control of underlying AI technology (Zhou et al., 2024).

This suggests a useful conceptual boundary. A foundation agent is not merely an LLM chatbot, because it is expected to act under environmental feedback rather than only respond to prompts. It is also not identical to a task-specific RL policy, because it is designed to exploit the transfer, multimodality, and broad adaptation associated with foundation models (Liu et al., 2024). A plausible implication is that the term now denotes a systems category rather than a single model class: model weights remain important, but competence increasingly depends on how the model is embedded into memory, interfaces, tools, and verification loops (Zhong et al., 13 May 2026).

2. Architectural anatomy

The reference-architecture literature converges on a modular decomposition. One pattern-oriented architecture for foundation-model-based agents places six major areas inside a single agent: interaction engineering, memory, planning, execution engine, responsible AI plugins, and AI models (Lu et al., 2023). Interaction engineering includes context engineering, goal creation, persona creation, and prompt/response generation; memory is split into short-term and long-term stores with retrieval from long-term to short-term; planning includes single-path or multi-path plan generation plus self-, cross-, or human reflection; the execution engine handles task execution, monitoring, tool and agent selection, and multi-agent cooperation; responsible-AI plugins provide risk assessment, guardrails, black-box recording, explainers, provenance checks, and co-versioning; and the model layer may include external, fine-tuned, sovereign, or multiple foundation models (Lu et al., 2023).
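As a minimal sketch of how these areas might fit together in code, the Python below wires a toy memory, planner, guardrail, tool registry, and black-box log into one agent loop. Every class, field, and function name is hypothetical and chosen for illustration; none is taken from Lu et al. (2023). The `tool:argument` step format is likewise an assumption, and a real planner would call a foundation model rather than a lambda.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    """Short-term and long-term stores with retrieval from long to short."""
    short_term: list[str] = field(default_factory=list)
    long_term: list[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Toy retrieval: long-term entries that share a word with the query.
        words = set(query.lower().split())
        hits = [m for m in self.long_term if words & set(m.lower().split())][:k]
        self.short_term.extend(hits)
        return hits


@dataclass
class Agent:
    memory: Memory
    plan: Callable[[str, list[str]], list[str]]   # planning (single-path here)
    guardrail: Callable[[str], bool]              # responsible-AI plugin
    tools: dict[str, Callable[[str], str]]        # execution engine's tool registry
    log: list[str] = field(default_factory=list)  # black-box recorder

    def run(self, goal: str) -> list[str]:
        context = self.memory.retrieve(goal)
        results = []
        for step in self.plan(goal, context):
            tool_name, _, argument = step.partition(":")
            if not self.guardrail(step) or tool_name not in self.tools:
                self.log.append(f"blocked {step}")
                continue
            results.append(self.tools[tool_name](argument))
            self.log.append(f"ran {step}")
        # Write the outcome back so later goals can retrieve it.
        self.memory.long_term.append(f"{goal} -> {results}")
        return results


agent = Agent(
    memory=Memory(long_term=["user prefers metric units"]),
    plan=lambda goal, ctx: ["echo:" + goal, "delete:everything"],
    guardrail=lambda step: not step.startswith("delete"),
    tools={"echo": lambda s: s.upper(), "delete": lambda s: "gone"},
)
results = agent.run("convert units")
```

In this demo the `echo` step runs, the `delete` step is stopped at the guardrail, and both decisions land in `agent.log`, which is the kind of trace a black-box recorder is meant to preserve.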

A related taxonomy presents similar architecture choices as design options: single- versus multi-modal input, single-model versus multi-model composition, memory format and operations, internal versus external planning engines, centralized versus decentralized workflows, coordinator versus worker roles, single-path versus multi-path planning, and self-, cross-, human-, or environmental reflection (Zhou et al., 2024). The pattern-catalogue literature adds reusable patterns for passive or proactive goal creation, prompt/response optimization, retrieval-augmented generation, one-shot or incremental model querying, single-path or multi-path planning, self-, cross-, or human reflection, voting-, role-, or debate-based cooperation, multimodal guardrails, and tool/agent registries (Liu et al., 2024). Together these works treat foundation-agent design as an architectural discipline rather than an exercise in prompt engineering alone.

Concrete systems instantiate these abstractions differently. AutoGLM, for example, uses a planner that reasons over the user instruction and environment state, a grounder/executor that maps abstract intentions to concrete GUI targets and actions, a perception layer reading textual and visual observations, and an iterative agent loop in which the model acts, observes results, updates context, and continues until success or termination (Liu et al., 2024). The paper emphasizes that planning and grounding are distinct competencies and argues for an “intermediate interface” in which the planner outputs an abstract, semantically meaningful action description while a separate grounder resolves that description into the correct GUI element (Liu et al., 2024). That separation is motivated by different optimization requirements: planning needs flexibility and compositionality, while grounding needs high local accuracy (Liu et al., 2024).
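The planner/grounder split can be illustrated with a small Python sketch. It is not AutoGLM's implementation: the `AbstractAction` type, the rule-based `plan_step`, and the word-overlap `ground` function are all hypothetical stand-ins for components that would be learned models in practice. The point is only that the planner emits a semantic description and a separate component resolves it to a concrete element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractAction:
    """Intermediate interface: what to do, described semantically."""
    verb: str         # e.g. "type" or "click"
    target: str       # natural-language description of the element
    text: str = ""    # payload for typing actions


@dataclass(frozen=True)
class GuiElement:
    element_id: int
    label: str


def plan_step(instruction: str) -> AbstractAction:
    # Stand-in planner (assumption): a real one would reason over the
    # instruction and the current environment state with a model.
    prefix = "search for "
    if instruction.lower().startswith(prefix):
        return AbstractAction("type", "search box", instruction[len(prefix):])
    return AbstractAction("click", "submit button")


def ground(action: AbstractAction, elements: list[GuiElement]) -> GuiElement:
    # Stand-in grounder (assumption): choose the element whose label shares
    # the most words with the planner's target description.
    wanted = set(action.target.lower().split())

    def overlap(element: GuiElement) -> int:
        return len(wanted & set(element.label.lower().split()))

    best = max(elements, key=overlap)
    if overlap(best) == 0:
        raise LookupError(f"no element matches {action.target!r}")
    return best


screen = [GuiElement(1, "Search box"), GuiElement(2, "Submit button")]
action = plan_step("Search for wireless headphones")
element = ground(action, screen)
```

Because the two functions only communicate through `AbstractAction`, either side could be retrained or replaced without touching the other, which mirrors the motivation given for optimizing planning and grounding separately.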

In software engineering, the same systems logic appears under a different name: the “harness.” “AI Harness Engineering” formalizes a runtime substrate surrounding a foundation-model software agent, with eleven responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy auditing, and intervention recording (Zhong et al., 13 May 2026). This reframes software-agent performance as a property of the model–harness–environment system rather than the model alone, and proposes a four-level ladder from minimal support to fully observable, verification-driven episodes (Zhong et al., 13 May 2026). This suggests that the architectural substrate plays a role for software agents analogous to the planner-grounder interface in GUI agents: reliability depends on explicit runtime structure.

3. Learning, adaptation, and self-improvement

The training literature argues that conventional pretraining is insufficient because internet corpora contain little direct evidence of sequential decision making (Liu et al., 2024). One roadmap proposes three stages for foundation agents: large-scale interactive data collection or generation, self-supervised pretraining and downstream adaptation, and knowledge and value alignment with LLMs (Liu et al., 2024). In that framing, trajectory data play the role that web text plays for LLMs, and self-supervised objectives over trajectories can learn reusable structure before downstream optimization (Liu et al., 2024). The same paper lists pretraining objectives such as next action prediction, reward-conditioned action prediction, forward and inverse dynamics prediction, and random masking prediction over trajectories (Liu et al., 2024).
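To make the listed objectives more concrete, the following Python sketch shows how supervised examples for three of them could be derived from a single trajectory. The trajectory format (a list of observation/action string pairs), the `[MASK]` token, and all function names are assumptions made for illustration; the position paper does not prescribe this representation.

```python
import random

# A trajectory as (observation, action) pairs; the format is an assumption.
Trajectory = list[tuple[str, str]]


def next_action_examples(traj: Trajectory):
    """Next action prediction: history up to o_t is the input, a_t the target."""
    examples, history = [], []
    for observation, action in traj:
        history.append(observation)
        examples.append((tuple(history), action))
        history.append(action)
    return examples


def inverse_dynamics_examples(traj: Trajectory):
    """Inverse dynamics: (o_t, o_{t+1}) is the input, a_t the target."""
    return [
        ((traj[i][0], traj[i + 1][0]), traj[i][1])
        for i in range(len(traj) - 1)
    ]


def random_mask(traj: Trajectory, p: float, rng: random.Random):
    """Random masking: hide each token with probability p, keep the targets."""
    tokens = [token for pair in traj for token in pair]
    masked, targets = [], {}
    for index, token in enumerate(tokens):
        if rng.random() < p:
            masked.append("[MASK]")
            targets[index] = token
        else:
            masked.append(token)
    return masked, targets


traj = [("o1", "a1"), ("o2", "a2"), ("o3", "a3")]
nap = next_action_examples(traj)
inv = inverse_dynamics_examples(traj)
masked, targets = random_mask(traj, 0.5, random.Random(0))
```

All three functions reuse the same raw trajectory, which is the sense in which trajectory data can be mined for several self-supervised signals before any downstream reward is used.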

System papers make this agenda concrete. AutoGLM starts from behavior cloning (BC) on approximately 1,000 BC samples provided by VisualAgentBench, which initializes GLM-4-9B at a 22.4% success rate on its web setting, but the paper explicitly argues that behavior cloning alone is costly and conceptually limited because oracle demonstrations do not expose the agent to failure states or teach error recovery well (Liu et al., 2024). To move beyond that regime, the authors introduce self-evolving online curriculum reinforcement learning, instantiated as WebRL, in which failed task instructions are mutated to become more complicated or simpler, filtered by the critic, and rolled out again (Liu et al., 2024). The corresponding stabilization mechanisms are a KL-constrained policy update and actor confidence filtered experience replay (Liu et al., 2024).
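The mutate-then-filter step of such a curriculum can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WebRL's algorithm: the `mutate` and `critic_score` callables, the idea that the critic returns an estimated success probability, and the `low`/`high` thresholds are all hypothetical. The source only says that mutated instructions are filtered by the critic.

```python
from typing import Callable


def next_curriculum(
    failed: list[str],
    mutate: Callable[[str], list[str]],
    critic_score: Callable[[str], float],
    low: float = 0.05,    # assumed cut-off: below this a task looks hopeless
    high: float = 0.75,   # assumed cut-off: above this a task looks already solved
) -> list[str]:
    """Mutate failed instructions and keep variants of intermediate difficulty."""
    kept, seen = [], set(failed)
    for instruction in failed:
        for variant in mutate(instruction):
            if variant in seen:
                continue  # skip duplicates and unchanged instructions
            seen.add(variant)
            if low <= critic_score(variant) <= high:
                kept.append(variant)
    return kept


# Toy stand-ins: one harder and one simpler variant per failed instruction,
# and a critic that treats very short instructions as already solved.
def toy_mutate(instruction: str) -> list[str]:
    return [instruction + " and sort by price", instruction.split(" and ")[0]]


def toy_critic(instruction: str) -> float:
    return 0.9 if len(instruction.split()) <= 4 else 0.4


failed_tasks = ["find a red jacket and add it to the cart"]
new_tasks = next_curriculum(failed_tasks, toy_mutate, toy_critic)
```

In the toy run the simplified variant is discarded as too easy and the harder variant is kept for the next round of rollouts.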

Another line pursues autonomous skill acquisition rather than only curriculum generation. “Proposer-Agent-Evaluator” (PAE) defines a contextual MDP with a hidden task distribution and reward function, then replaces them during training with a context-aware task proposer and a proxy reward model implemented as an autonomous VLM-based evaluator (Zhou et al., 2024). The proposer generates tasks conditioned on website context such as the website name or user demos; the agent attempts those tasks through screenshot-based web interaction with explicit Thought and Action outputs; and the evaluator judges success from the final answer and the last three screenshots, producing sparse 0/1 rewards (Zhou et al., 2024). Policy optimization is performed by Filtered Behavior Cloning over successful online trajectories (Zhou et al., 2024). This is a concrete example of a foundation agent learning from self-generated tasks and model-based rewards in the wild rather than relying solely on fixed human-authored instructions.
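The data-collection half of this loop can be sketched as follows. The Python below is a simplified illustration rather than the PAE implementation: `propose`, `rollout`, and `evaluate` are hypothetical callables standing in for the task proposer, the agent, and the VLM-based evaluator, and the `Trajectory` fields are assumptions. What it preserves from the description is the sparse 0/1 reward computed from the final answer and the last three screenshots, and the filtering of trajectories before behavior cloning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    task: str
    steps: list[str]         # interleaved Thought / Action outputs
    final_answer: str
    screenshots: list[str]   # identifiers of captured screenshots


def collect_filtered_dataset(
    propose: Callable[[str], str],
    rollout: Callable[[str], Trajectory],
    evaluate: Callable[[str, str, list[str]], int],
    context: str,
    n_tasks: int,
) -> list[Trajectory]:
    """Keep only trajectories the evaluator marks as successful (reward 1)."""
    dataset = []
    for _ in range(n_tasks):
        task = propose(context)
        trajectory = rollout(task)
        reward = evaluate(task, trajectory.final_answer, trajectory.screenshots[-3:])
        if reward == 1:
            dataset.append(trajectory)
    return dataset


# Deterministic stand-ins so the sketch runs without any model.
_tasks = iter(["find opening hours", "find phone number", "find address"])


def toy_propose(context: str) -> str:
    return f"{next(_tasks)} on {context}"


def toy_rollout(task: str) -> Trajectory:
    answer = "" if "phone" in task else "done"
    return Trajectory(task, ["Thought: look", "Action: click"], answer,
                      ["s1", "s2", "s3", "s4"])


def toy_evaluate(task: str, answer: str, last_screenshots: list[str]) -> int:
    return 1 if answer and len(last_screenshots) == 3 else 0


dataset = collect_filtered_dataset(toy_propose, toy_rollout, toy_evaluate,
                                   "example.com", 3)
```

The resulting `dataset` would then be used as supervised fine-tuning data for the policy, which is the "filtered" part of Filtered Behavior Cloning; the fine-tuning step itself is omitted here.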

Online self-improvement appears even more explicitly in Continual Harness. That framework treats the harness itself as the adaptation target and allows an embodied agent to refine its own prompt, sub-agents, skills, and memory online within a single run, without environment resets (Karten et al., 11 May 2026). The paper presents this as a reset-free alternative to prompt-optimization methods that require episodic rollouts, and then extends it to a model-harness co-learning loop in which a frontier teacher relabels rollout windows and an open-source model is updated while the environment continues uninterrupted (Karten et al., 11 May 2026). This suggests that future foundation agents may improve not only by changing model parameters, but also by rewriting the runtime scaffold through which they act.

Persistent memory introduces a separate adaptation channel at deployment time. “Deployment-Time Memorization in Foundation-Model Agents” argues that long-lived agents remember users not only through model weights but through runtime memory pipelines, and studies the privacy–utility trade-off as a function of summarization aggressiveness, retrieval breadth, and deletion mode (Lei et al., 8 Jun 2026). The results show that key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall, but also reveal deletion-fidelity failures if raw records are deleted while derived summaries remain recoverable (Lei et al., 8 Jun 2026). This is not a training recipe in the usual sense, but it shows that adaptation in foundation agents increasingly occurs in the deployed memory stack, not only during offline pretraining or RL.
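The deletion-fidelity problem can be reproduced in miniature. The Python sketch below is a toy model, not the paper's pipeline: the `MemoryStore` class, its two deletion modes, and the first-sentence summarizer are hypothetical. It shows how a summary derived from a raw record can keep a fact recoverable after the raw record has been deleted, and how a lossy summary can drop a canary-like detail.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryStore:
    raw: dict[int, str] = field(default_factory=dict)
    summaries: dict[int, str] = field(default_factory=dict)  # derived from raw

    def add(self, record_id: int, text: str,
            summarize: Callable[[str], str]) -> None:
        self.raw[record_id] = text
        self.summaries[record_id] = summarize(text)

    def delete_raw_only(self, record_id: int) -> None:
        # Removes the source record but leaves the derived summary behind.
        self.raw.pop(record_id, None)

    def delete_cascading(self, record_id: int) -> None:
        # Removes the source record and everything derived from it.
        self.raw.pop(record_id, None)
        self.summaries.pop(record_id, None)

    def recoverable(self, needle: str) -> bool:
        stored = list(self.raw.values()) + list(self.summaries.values())
        return any(needle in text for text in stored)


def first_sentence(text: str) -> str:
    # Toy "key-fact" summarizer (assumption): keep only the first sentence.
    return text.split(".")[0]


store = MemoryStore()
store.add(1, "Alice is allergic to peanuts. Her order id is 7731.", first_sentence)

store.delete_raw_only(1)
allergy_after_raw_delete = store.recoverable("peanuts")
order_id_after_raw_delete = store.recoverable("7731")

store.delete_cascading(1)
allergy_after_cascade = store.recoverable("peanuts")
```

After the raw-only deletion the allergy is still recoverable from the summary while the order id is not, and only the cascading deletion removes both. An audit of deletion behavior would therefore need to inspect derived stores, not just the raw log.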

4. Domains, benchmarks, and empirical instantiations

GUI control is one of the clearest real-world substrates for foundation agents because it combines perception, language understanding, long-horizon planning, and concrete action while remaining safer and more tractable than robotics (Liu et al., 2024). AutoGLM reports 55.2% success rate on VAB-WebArena-Lite, improving to 59.1% with a second attempt, and 96.2% on OpenTable evaluation tasks (Liu et al., 2024). In Android control, it reaches 36.2% on AndroidLab and 89.7% on common tasks in popular Chinese apps using AccessibilityService on physical phones (Liu et al., 2024). The same paper reports large gains from its intermediate interface on VAB-WebArena-Lite: GPT-4o (text) improves from 14.3% to 18.1%, GPT-4o (visual) from 18.2% to 27.3%, and gpt-4-vision-preview (visual) from 18.8% to 36.4% (Liu et al., 2024).

VisualAgentBench generalizes the evaluation problem beyond GUIs and explicitly frames large multimodal models as “Visual Foundation Agents” (Liu et al., 2024). It covers embodied, GUI, and visual-design domains through five environments (VAB-OmniGibson, VAB-Minecraft, VAB-Mobile, VAB-WebArena-Lite, and VAB-CSS), for a total of 746 test instances and 4,482 training trajectories, with an average of 11.22 turns per training trajectory (Liu et al., 2024). The benchmark evaluates nine proprietary LMM APIs and eight open models, and reports that the best proprietary model, GPT-4o, achieves a 36.2% average success rate across the five environments (Liu et al., 2024). The same work argues that current models already show nontrivial general visual-agent capability but are still far from deployable, and demonstrates that behavior cloning can lift open models from essentially 0% success under prompting alone to meaningful nonzero performance, with InternVL-2 reaching a 16.0% average success rate after fine-tuning (Liu et al., 2024).

Domain-specific evaluation is also broadening. EcomBench introduces a holistic e-commerce benchmark built from genuine user demands in leading global e-commerce ecosystems and updated quarterly to track changing market and policy conditions (Min et al., 9 Dec 2025). It spans seven categories and three difficulty levels—20% Level 1, 30% Level 2, and 50% Level 3—and evaluates 12 production-facing systems (Min et al., 9 Dec 2025). The paper reports that ChatGPT-5.1 and Gemini DeepResearch score above 90% on Level 1 but only 46% on Level 3, while most other systems remain below 35% on Level 3 (Min et al., 9 Dec 2025). The significance is diagnostic rather than celebratory: current agents can handle many straightforward practical requests, but complex real-world tasks involving compliance, cross-source synthesis, and long-horizon reasoning remain unsolved (Min et al., 9 Dec 2025).

Other benchmarks isolate particular competencies. “Thinking on Maps” studies interactive spatial understanding in symbolic map environments under partial observability and shows that exploration strategy has limited effect once coverage is adequate, while memory representation is decisive (Wei et al., 30 Dec 2025). Structured memories, especially Node-Sequence Memory and Graph Memory, substantially improve performance on structure-intensive tasks such as path planning, and the work reports performance saturation across model versions and scales beyond a certain threshold (Wei et al., 30 Dec 2025). This supports the broader thesis that foundation-agent competence depends on the architecture of exploration, memory, and reasoning, not only on base-model scale.
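Why a structured memory helps with path planning can be seen in a short Python sketch. It is not the paper's Graph Memory: the `GraphMemory` class and its methods are hypothetical, and breadth-first search is simply one standard way to query such a structure. The agent records each traversed connection as an edge, after which a shortest route can be computed directly instead of being re-derived from a flat interaction log.

```python
from collections import deque
from typing import Optional


class GraphMemory:
    """Stores observed connections between places as an undirected graph."""

    def __init__(self) -> None:
        self.adjacency: dict[str, set[str]] = {}

    def observe(self, a: str, b: str) -> None:
        # Record that the agent saw a direct connection between a and b.
        self.adjacency.setdefault(a, set()).add(b)
        self.adjacency.setdefault(b, set()).add(a)

    def shortest_path(self, start: str, goal: str) -> Optional[list[str]]:
        # Breadth-first search over everything observed so far.
        if start not in self.adjacency or goal not in self.adjacency:
            return None
        parent: dict[str, Optional[str]] = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for neighbour in sorted(self.adjacency[node]):
                if neighbour not in parent:
                    parent[neighbour] = node
                    queue.append(neighbour)
        return None


memory = GraphMemory()
for a, b in [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "D")]:
    memory.observe(a, b)
route = memory.shortest_path("A", "D")
```

Here the memory returns the two-hop route through `E` rather than the three-hop route through `B` and `C`, a query that requires no further model reasoning once the graph has been built.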

Applied systems also reveal how the foundation-agent idea is being specialized. Customized FinGPT Search Agents treat financial foundation models and general LLMs as backbones within retrieval-grounded, memory-bearing agents for individuals and institutions (Tian et al., 2024). The individual-user version relies on Retrieval-Augmented Generation over preferred sources, local files, and open web data; the institutional version uses a dynamic MongoDB vector database, BERT embeddings, optional fine-tuning on proprietary data, and air-gapped infrastructure (Tian et al., 2024). Reported results show that the agent wrappers materially improve accuracy, relevance, and response time over base models across financial questions, regulatory data, link retrieval, and FinanceBench-style tasks (Tian et al., 2024). This suggests that domain grounding, memory, and organization-specific retrieval are already standard ingredients in practical foundation agents.

5. Safety, robustness, and deployment governance

The architecture literature treats responsible AI as a first-class systems concern. One reference architecture embeds continuous risk assessment, black-box recording, input and output guardrails, RAG guardrails, execution guardrails, intermediate guardrails, multimodal guardrails, explainers, AIBOM registry, and co-versioning registry directly into the agent design (Lu et al., 2023). A complementary taxonomy of runtime guardrails argues that FM-based agents must be protected not only at the prompt and output boundary, but across goals, context, memory, reasoning, plans, actions, tools, other agents, intermediate results, and final results (Shamsujjoha et al., 2024). It organizes guardrails by motivations such as accuracy, privacy, security, safety, fairness, IP protection, and compliance, and by quality attributes including accuracy, generalizability, customizability, adaptability, traceability, portability, interoperability, and interpretability (Shamsujjoha et al., 2024). The implied “Swiss Cheese” logic is that no single guardrail is sufficient; layered defenses are required across the runtime pipeline (Shamsujjoha et al., 2024).
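The layered logic can be sketched as a set of independent checks attached to different pipeline stages. The Python below is a minimal illustration under stated assumptions: the stage names, the three example guardrails, and the regular expressions are hypothetical and far simpler than production guardrails. The point is that each stage has its own checks, so a problem missed at one layer can still be caught at another.

```python
import re
from typing import Callable, Optional

# A guardrail returns a reason string when it objects, otherwise None.
Guardrail = Callable[[str], Optional[str]]


def no_secrets(text: str) -> Optional[str]:
    # Assumed pattern for something that looks like an API key.
    return "possible API key" if re.search(r"\bsk-[A-Za-z0-9]{8,}\b", text) else None


def no_destructive_shell(text: str) -> Optional[str]:
    return "destructive command" if re.search(r"\brm\s+-rf\b", text) else None


def max_length(limit: int) -> Guardrail:
    def check(text: str) -> Optional[str]:
        return f"longer than {limit} chars" if len(text) > limit else None
    return check


def run_layers(
    stage_guardrails: dict[str, list[Guardrail]],
    artifacts: dict[str, str],
) -> list[tuple[str, str]]:
    """Apply every stage's guardrails to that stage's artifact."""
    violations = []
    for stage, text in artifacts.items():
        for guard in stage_guardrails.get(stage, []):
            reason = guard(text)
            if reason is not None:
                violations.append((stage, reason))
    return violations


guards = {
    "input": [max_length(200)],
    "plan": [no_destructive_shell],
    "output": [no_secrets, max_length(200)],
}
artifacts = {
    "input": "summarise my notes",
    "plan": "run rm -rf /tmp/cache",
    "output": "your key is sk-abcdef123456",
}
violations = run_layers(guards, artifacts)
```

In this example the input passes its only check, yet the plan and output layers each catch a separate problem, which is the behavior the Swiss Cheese framing calls for.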

Deployment-time memory makes these governance issues concrete. Persistent agent memory creates a distinct memorization channel governed by summarization, retrieval, and deletion design choices, and therefore must be evaluated by what it helps agents recall, what it makes extractable, and what it can truly erase (Lei et al., 8 Jun 2026). The introduction of metrics such as Personalization Recall, Adversarial Extraction Rate, Privacy–Utility AUC, and Forgetting Residue Score indicates that memory engineering is becoming a first-class safety and compliance problem rather than an incidental product feature (Lei et al., 8 Jun 2026). A plausible implication is that future foundation-agent audits will need to treat memory stacks the way model cards treat weights: as explicit objects of red-teaming, evaluation, and governance.

Robustness under deployment shift is another emerging concern. “The Sim-to-Real Gap of Foundation Model Agents” formalizes agent evaluation and training mismatch as a classical sim-to-real problem over the four elements of an MDP: observation, action, transition, and reward (Liu et al., 5 Jun 2026). It defines the sim-to-real gap of a policy as $G(\pi) := \psi_s(\pi) - \psi_r(\pi)$ and uses multilingual tool calling as a concrete example of an observation-space gap where semantic intent is correct but parameter values become operationally invalid (Liu et al., 5 Jun 2026). The paper argues that the community should adopt standardized stress tests and classical solutions such as domain randomization rather than treating each deployment failure as a wholly novel LLM phenomenon (Liu et al., 5 Jun 2026). This suggests that reliability for foundation agents will increasingly be analyzed as a structured closed-loop problem rather than only as prompt robustness.
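A toy calculation shows how the gap $G(\pi)$ can be measured for an observation-space shift of the multilingual kind. In this Python sketch, $\psi_s$ and $\psi_r$ are taken to be tool-call success rates on simulated and real query sets, which is an assumption made for illustration; the tool schema, the copy-the-last-word policy, and the queries are likewise hypothetical.

```python
# Hypothetical tool schema: the API only accepts these English city names.
VALID_CITIES = {"munich", "tokyo", "paris"}


def tool_call_succeeds(city_argument: str) -> bool:
    return city_argument.lower() in VALID_CITIES


def policy(query: str) -> str:
    # Toy policy: it identifies the city correctly (the last word) but copies
    # the surface form instead of normalizing it to the schema's language.
    return query.split()[-1]


def success_rate(queries: list[str]) -> float:
    return sum(tool_call_succeeds(policy(q)) for q in queries) / len(queries)


# "Simulation": English-only queries, as in a typical benchmark.
sim_queries = ["weather in Munich", "weather in Tokyo",
               "weather in Paris", "forecast for Paris"]
# "Reality": the same intents expressed in several languages.
real_queries = ["weather in Munich", "Wetter in München",
                "天気 東京", "météo à Paris"]

psi_s = success_rate(sim_queries)
psi_r = success_rate(real_queries)
gap = psi_s - psi_r   # G(pi) = psi_s(pi) - psi_r(pi)
```

The policy extracts the right city in every case, but two of the four real queries yield parameter values the tool rejects, so the measured gap is 0.5 even though semantic intent was never misread.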

In industrial settings, these concerns are already limiting deployment. A systematic review of foundation-model-based agents in industrial automation finds that 75.0% of reported systems are at TRL 4–6 and only 9.1% show deployment-oriented evidence (Henkel et al., 4 May 2026). The most widely reported limitations are lack of generalization, hallucination and output instability, data scarcity, and inference latency (Henkel et al., 4 May 2026). At the same time, the survey finds strong gains over conventional industrial agents in human interaction (+37.1 percentage points), dealing with uncertainty (+35.3 pp), adaptivity (+23.5 pp), and learning (+20.0 pp), but a pronounced deficit in negotiation (−39.0 pp) (Henkel et al., 4 May 2026). This indicates that foundation agents are already valuable as semi-autonomous, interaction-centric industrial assistants, but remain far from broadly trusted autonomous decision-makers (Henkel et al., 4 May 2026).

6. Coordination, infrastructure, and future directions

As agent deployments scale, several papers argue that the bottleneck shifts from isolated capability to coordination. “Foundation Protocol” proposes a graph-first coordination layer in which entities such as agents, humans, tools, resources, institutions, and organizations are nodes, while relationships, memberships, and sessions are edges (Liu et al., 22 May 2026). The protocol defines seven core objects—Entity, Session, Activity, Envelope, Event, Receipt / Settlement, and Provenance—and organizes them into an Entity & Trust Plane, Transport & Routing Plane, Interaction & Organization Plane, Regulation & Oversight Plane, and Configuration & Profiles Plane (Liu et al., 22 May 2026). Economic primitives for metering, receipts, and settlement are part of the same foundation layer as policy, provenance, and audit (Liu et al., 22 May 2026). This reflects a broader shift: once agents browse, purchase, deploy software, and manage systems, coordination itself becomes shared infrastructure.
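The graph-first idea can be illustrated with a small data-structure sketch. The Python below is not the Foundation Protocol's schema: the class names, fields, and the modelling of every edge as a pair of entity identifiers are simplifying assumptions, and only the node and edge categories are taken from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class EntityKind(Enum):
    AGENT = "agent"
    HUMAN = "human"
    TOOL = "tool"
    RESOURCE = "resource"
    INSTITUTION = "institution"
    ORGANIZATION = "organization"


class EdgeKind(Enum):
    RELATIONSHIP = "relationship"
    MEMBERSHIP = "membership"
    SESSION = "session"


@dataclass(frozen=True)
class Entity:
    entity_id: str
    kind: EntityKind


@dataclass(frozen=True)
class Edge:
    kind: EdgeKind
    source: str
    target: str


class CoordinationGraph:
    def __init__(self) -> None:
        self.entities: dict[str, Entity] = {}
        self.edges: list[Edge] = []

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.entity_id] = entity

    def connect(self, kind: EdgeKind, source: str, target: str) -> None:
        # Edges may only link entities that are already registered.
        for entity_id in (source, target):
            if entity_id not in self.entities:
                raise KeyError(f"unknown entity: {entity_id}")
        self.edges.append(Edge(kind, source, target))

    def neighbours(self, entity_id: str, kind: EdgeKind) -> set[str]:
        linked = set()
        for edge in self.edges:
            if edge.kind is kind and entity_id in (edge.source, edge.target):
                linked.add(edge.target if edge.source == entity_id else edge.source)
        return linked


graph = CoordinationGraph()
graph.add_entity(Entity("alice", EntityKind.HUMAN))
graph.add_entity(Entity("buyer-bot", EntityKind.AGENT))
graph.add_entity(Entity("acme", EntityKind.ORGANIZATION))
graph.add_entity(Entity("catalog-api", EntityKind.TOOL))
graph.connect(EdgeKind.MEMBERSHIP, "alice", "acme")
graph.connect(EdgeKind.MEMBERSHIP, "buyer-bot", "acme")
graph.connect(EdgeKind.SESSION, "buyer-bot", "catalog-api")
```

With entities and edges typed this way, questions such as "who belongs to this organization" or "which tools is this agent currently in session with" become graph queries rather than conventions buried in individual agents.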

Software engineering provides an analogous lesson at the runtime level. AI Harness Engineering argues that the relevant question is not whether a foundation model can produce a patch, but whether the model–harness–environment system can produce a verifiably correct, attributed, and maintainable change (Zhong et al., 13 May 2026). The paper operationalizes this through auditable episode packages containing action traces, tool traces, context traces, verification traces, failure-attribution logs, intervention logs, entropy audits, and final outcome records (Zhong et al., 13 May 2026). Continual Harness extends the same systems view to embodied agents by letting the harness itself become the object of online adaptation (Karten et al., 11 May 2026). Taken together, these works suggest that the runtime substrate—interfaces, permissions, memory, verification, event streams, and oversight—may be as important for the next generation of foundation agents as scaling laws were for the last generation of foundation models.
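An auditable episode package can be approximated as a record with one slot per trace type and a completeness check. The Python sketch below takes its field names from the list above, but the types, the convention that `None` means "not recorded" while an empty list means "recorded, nothing to report", and the `missing_components` helper are assumptions rather than the paper's specification.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EpisodePackage:
    # None means the component was never recorded; an empty list means it
    # was recorded and there was nothing to report (an assumed convention).
    action_trace: Optional[list[str]] = None
    tool_trace: Optional[list[str]] = None
    context_trace: Optional[list[str]] = None
    verification_trace: Optional[list[str]] = None
    failure_attribution_log: Optional[list[str]] = None
    intervention_log: Optional[list[str]] = None
    entropy_audit: Optional[list[str]] = None
    final_outcome: Optional[str] = None


def missing_components(package: EpisodePackage) -> list[str]:
    """Names of the components that were never recorded."""
    return [f.name for f in fields(package) if getattr(package, f.name) is None]


def is_auditable(package: EpisodePackage) -> bool:
    return not missing_components(package)


package = EpisodePackage(
    action_trace=["edit utils.py"],
    tool_trace=["pytest -q"],
    verification_trace=["12 passed"],
    failure_attribution_log=[],      # recorded: no failures to attribute
    final_outcome="patch accepted",
)
gaps = missing_components(package)
```

The example episode produced a passing patch, yet it would not count as auditable under this check because its context trace, intervention log, and entropy audit were never captured. That distinction between a correct outcome and a verifiable episode is the one the harness view emphasizes.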

The field’s open problems are correspondingly broad. The position paper on foundation agents calls for theory that connects pretraining and task-specific objectives, generalized multimodal state and action spaces, and control-as-inference or information-theoretic formulations appropriate to broad sequential decision making (Liu et al., 2024). Benchmark papers point toward dynamic, contamination-aware evaluation tied to changing external environments, not only static academic datasets (Min et al., 9 Dec 2025). Architecture papers call for stronger decision models for selecting among memory, planning, coordination, and guardrail patterns (Zhou et al., 2024, Liu et al., 2024). Deployment studies argue for tier-aware deletion, richer benchmarking, and governance mechanisms that treat memory, provenance, and policy as first-class (Lei et al., 8 Jun 2026, Liu et al., 22 May 2026). This suggests that “foundation agents” is becoming the name of a convergent research program: building generally capable acting systems by combining pretrained foundation models with explicit architecture, interactive training, runtime memory, verification, and governable coordination layers (Liu et al., 2024, Liu et al., 2024).