
Cognitive Architectures for Language Agents (2309.02427v3)

Published 5 Sep 2023 in cs.AI, cs.CL, cs.LG, and cs.SC

Abstract: Recent efforts have augmented LLMs with external resources (e.g., the Internet) or internal control flows (e.g., prompt chaining) for tasks requiring grounding or reasoning, leading to a new class of language agents. While these agents have achieved substantial empirical success, we lack a systematic framework to organize existing agents and plan future developments. In this paper, we draw on the rich history of cognitive science and symbolic artificial intelligence to propose Cognitive Architectures for Language Agents (CoALA). CoALA describes a language agent with modular memory components, a structured action space to interact with internal memory and external environments, and a generalized decision-making process to choose actions. We use CoALA to retrospectively survey and organize a large body of recent work, and prospectively identify actionable directions towards more capable agents. Taken together, CoALA contextualizes today's language agents within the broader history of AI and outlines a path towards language-based general intelligence.

The emergence of LLMs augmented with external tools and structured reasoning processes has led to the development of sophisticated language agents. These agents demonstrate enhanced capabilities in tasks requiring grounding, planning, and interaction. However, the rapid proliferation of diverse agent designs necessitates a unifying framework for systematic analysis and future development. The Cognitive Architectures for Language Agents (CoALA) framework (Sumers et al., 2023) addresses this need by drawing inspiration from traditional cognitive architectures in cognitive science and symbolic AI, adapting these concepts for modern LLM-based systems. CoALA provides a conceptual blueprint delineating the core components and processes underlying language agents.

The CoALA Framework

CoALA posits that a language agent can be decomposed into three primary components: a modular memory system, a structured action space, and a generalized decision-making module. This structure mirrors classical cognitive architectures like Soar and ACT-R, which emphasize distinct memory stores, production rules or operators for actions, and a central control mechanism. However, CoALA adapts this blueprint to the specific characteristics of LLM-based agents, where the LLM often serves multiple roles, including controller, planner, and sometimes even a component of the memory system itself.

The core idea is to move beyond monolithic LLM prompting towards modular designs where specific functions are handled by dedicated components, orchestrated by a central decision-making process. This modularity facilitates analysis, comparison, and systematic improvement of agent capabilities. CoALA serves not only as a descriptive tool for existing agents but also as a prescriptive guide for constructing more advanced agents.
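
To make this decomposition concrete, the following is a minimal Python sketch of the three components wired into a decision cycle. It is illustrative only: CoALA is a conceptual framework rather than a concrete API, and every name here (Memory, Action, decision_cycle) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative CoALA-style skeleton; all names are hypothetical.

@dataclass
class Memory:
    working: list[str] = field(default_factory=list)    # active context
    long_term: list[str] = field(default_factory=list)  # persistent store

@dataclass
class Action:
    name: str
    run: Callable[["Memory"], str]  # executes and returns an observation
    external: bool = False          # internal (memory) vs. external (world)

def decision_cycle(memory: Memory, actions: list[Action],
                   policy: Callable[[Memory, list[Action]], Action],
                   max_steps: int = 10) -> None:
    """Repeatedly select an action with the policy, execute it, and
    record the observation in working memory."""
    for _ in range(max_steps):
        action = policy(memory, actions)
        observation = action.run(memory)
        memory.working.append(f"{action.name} -> {observation}")
        if action.name == "finish":
            break
```

The policy argument is where the decision-making module plugs in; the sections below sketch concrete choices for each component.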

Memory Modules

CoALA defines memory as the agent's internal state, encompassing all information available to it at any given time. It distinguishes between different types of memory, analogous to human cognitive models:

  1. Working Memory: This corresponds to the information actively being processed and immediately accessible for decision-making. In LLM-based agents, this is typically implemented using the LLM's context window. Information is loaded into the prompt (e.g., task description, recent conversation history, retrieved documents, intermediate reasoning steps). The size limitation of the context window (its maximum token length $L$) imposes a significant constraint on working memory capacity. Strategies like prompt compression, summarization, and selective retrieval are employed to manage this limited space effectively.
  2. Long-Term Memory: This stores information persistently beyond the current interaction context. It allows agents to retain knowledge, past experiences, and user preferences over extended periods. Common implementations involve external vector databases where information is stored as embeddings. Retrieval mechanisms, often based on semantic similarity search (e.g., using cosine similarity between query embeddings and stored embeddings), allow relevant information to be fetched and loaded into the working memory (prompt) when needed. Other forms include symbolic databases or structured knowledge graphs. Managing consistency, relevance, and retrieval efficiency in long-term memory is a critical implementation challenge.

The interaction between working and long-term memory is crucial. Information flows from long-term to working memory via retrieval operations, and potentially from working memory back to long-term memory via storage or consolidation operations, allowing the agent to learn and adapt over time.
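
As a concrete illustration of this flow, below is a minimal sketch of embedding-based long-term memory with cosine-similarity retrieval. The embed function is a random-projection stand-in for a real sentence-embedding model, and the class layout is an assumption for illustration, not anything the paper prescribes.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: a deterministic random unit vector per text.
    A real system would use a sentence-embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class LongTermMemory:
    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def store(self, text: str) -> None:
        """Consolidate an item from working memory into the store."""
        self.texts.append(text)
        self.vectors.append(embed(text))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Fetch the k most similar items to load into working memory."""
        if not self.texts:
            return []
        q = embed(query)
        sims = np.array([q @ v for v in self.vectors])  # cosine: unit vectors
        return [self.texts[i] for i in np.argsort(sims)[::-1][:k]]
```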

Action Space Formulation

CoALA conceptualizes the agent's capabilities through a structured action space, defining the operations an agent can perform to interact with its environment or modify its internal state. Actions are categorized into:

  1. Internal Actions: These operations manipulate the agent's internal memory components. Examples include:
    • Adding, modifying, or deleting information in working memory (e.g., updating a scratchpad within the prompt).
    • Retrieving information from long-term memory and loading it into working memory.
    • Storing information from working memory into long-term memory.
    • Performing internal reasoning steps, like generating intermediate thoughts or summarizing information (often implemented via specific LLM calls).
    • Updating agent goals or plans stored in memory.
  2. External Actions: These operations allow the agent to interact with the external world beyond its internal memory. Examples include:
    • Calling external tools or APIs (e.g., web search engines, calculators, code interpreters, domain-specific databases).
    • Generating textual responses to communicate with users.
    • Executing code or commands in an external environment (e.g., a terminal or simulated environment).

The action space ($A$) is often predefined but can potentially be learned or expanded over time. In many implementations, the LLM itself is used to parse available actions and generate syntactically correct commands or function calls, often facilitated by techniques like tool-use prompting or specialized decoders. The design of the action space largely determines the agent's capabilities and the domains in which it can operate.
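
One widespread realization of this is to describe the available tools in the prompt and have the LLM emit a structured call that the agent validates and dispatches. The sketch below assumes a JSON action format and a two-tool registry; both are illustrative choices, not part of the CoALA paper.

```python
import json

# Hypothetical tool registry mapping action names to callables.
# eval is unsafe outside a sandbox; it stands in for a real calculator.
TOOLS = {
    "search": lambda query: f"results for {query!r}",  # stand-in web search
    "calculate": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def dispatch(llm_output: str) -> str:
    """Parse an LLM-emitted action such as
    {"tool": "calculate", "arg": "2 + 2"} and execute it. Errors are
    returned as observations so the agent can recover and retry."""
    try:
        call = json.loads(llm_output)
        return TOOLS[call["tool"]](call["arg"])
    except (json.JSONDecodeError, KeyError) as e:
        return f"invalid action: {e}"

print(dispatch('{"tool": "calculate", "arg": "2 + 2"}'))  # -> 4
```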

Decision-Making Mechanisms

The decision-making module is the central controller that selects the next action $a_t$ for the agent to perform at each time step $t$, based on the current memory state $m_t$ and the agent's overall goals $g$. CoALA represents this as a policy $\pi(a_t \mid m_t, g)$. Different implementations employ various strategies:

  1. Fixed Policies: Simple agents might use hardcoded logic or predefined sequences of actions (e.g., fixed prompt chains).
  2. LLM-based Reasoning: This is the dominant approach. The LLM is prompted with the current memory state (including goals, history, and retrieved information) and a description of the available actions; it then reasons about the best course of action and outputs its choice, often with a justification. Techniques like Chain-of-Thought (CoT) prompting encourage step-by-step reasoning before action selection, and the ReAct paradigm (Yao et al., 2022) explicitly interleaves reasoning steps (internal thought generation) with external actions (a minimal sketch follows this list).
  3. Learned Policies: Reinforcement learning (RL) or imitation learning can be used to train a policy network (which could be the LLM itself or a separate model) to select optimal actions based on feedback or expert demonstrations. This allows agents to adapt and improve their decision-making strategies over time, although it typically requires significant training data or interaction experience.
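
The sketch below illustrates LLM-based action selection in the ReAct style: prompt the model with the goal and history, request a thought plus an action, and parse the reply. The prompt template, action syntax, and the call_llm placeholder are all assumptions made for illustration.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API client."""
    raise NotImplementedError("plug in a real LLM client here")

REACT_TEMPLATE = """Goal: {goal}
History:
{history}
Available actions: search[query], calculate[expr], finish[answer]
Respond with exactly two lines:
Thought: <your reasoning>
Action: <one action from the list>"""

def react_step(goal: str, history: list[str]) -> tuple[str, str]:
    """One reason-then-act step: returns (thought, action)."""
    prompt = REACT_TEMPLATE.format(goal=goal, history="\n".join(history))
    reply = call_llm(prompt)
    fields = dict(line.split(": ", 1) for line in reply.splitlines()
                  if ": " in line)
    return fields.get("Thought", ""), fields.get("Action", "")
```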

The decision-making process often involves planning, where the agent forecasts the consequences of potential action sequences to achieve its goals. This can range from simple one-step lookahead driven by the LLM's reasoning to more complex search algorithms or learned planning modules.
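
As a toy illustration of one-step lookahead, the sketch below scores each candidate action by forecasting its outcome and picks the best. Both predict_outcome and score_outcome are hypothetical stand-ins; in practice they might themselves be LLM calls or a learned value function.

```python
def predict_outcome(state: str, action: str) -> str:
    """Placeholder world model: forecast the next state."""
    return f"{state} after {action}"

def score_outcome(outcome: str, goal: str) -> float:
    """Placeholder value function: rate an outcome against the goal."""
    return float(goal in outcome)

def lookahead_policy(state: str, goal: str, candidates: list[str]) -> str:
    """Pick the action whose predicted outcome scores highest."""
    scored = [(score_outcome(predict_outcome(state, a), goal), a)
              for a in candidates]
    return max(scored)[1]
```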

Taxonomy of Language Agents via CoALA

The CoALA framework provides a structured way to analyze and categorize the diverse landscape of existing language agents. By examining how different agents implement the memory, action, and decision-making components, CoALA reveals underlying design patterns and trade-offs:

  • ReAct Agents: Primarily use the LLM's context window as working memory, leverage external tools (web search, lookup) as external actions, and employ LLM-based reasoning (interleaving thought and action) for decision-making. Long-term memory is typically limited.
  • Auto-GPT / BabyAGI Variants: Employ more sophisticated memory management, often using vector stores for long-term task lists and results. They feature complex decision-making loops involving task planning, execution, and result storage, heavily relying on LLM calls for each step. Their action spaces often include code execution and file system operations.
  • Tool-Augmented LLMs (e.g., Toolformer (Schick et al., 2023), Gorilla (Patil et al., 2023)): Focus on expanding the external action space by enabling LLMs to reliably call diverse APIs. Decision-making centers on determining when to call a tool and which tool to call, often involving fine-tuning or specialized prompting. Memory is typically confined to the context window.
  • Retrieval-Augmented Generation (RAG) Systems: Can be viewed as agents focusing on the interplay between working memory (prompt) and long-term memory (retrieval database). The core decision is when and what to retrieve to augment the generation context. Actions are primarily internal (retrieval) and text generation.

Using CoALA highlights differences in memory persistence, action repertoires, and the sophistication of the control flow or reasoning strategies employed across these different agent architectures.

Future Research Directions

CoALA points towards several avenues for advancing language agent capabilities:

  1. Enhanced Memory Systems: Developing more efficient and effective long-term memory mechanisms, including better retrieval algorithms, memory consolidation processes, hierarchical memory structures, and mechanisms for handling memory decay or forgetting irrelevant information. Overcoming the limitations of fixed-size context windows remains crucial.
  2. Richer Action Spaces: Expanding the range and reliability of both internal and external actions. This includes improving tool use, developing more complex internal reasoning primitives, and enabling agents to learn new skills or actions.
  3. Sophisticated Decision-Making: Moving beyond simple LLM-based reasoning loops towards more robust planning, reasoning under uncertainty, and meta-cognitive abilities (e.g., self-reflection, self-correction). Integrating learning, particularly RL, to optimize decision policies based on experience is a key direction.
  4. Unified Learning: Developing architectures where memory, action selection, and skill acquisition are learned end-to-end or co-adapted, rather than relying solely on the pre-trained capabilities of the foundational LLM.
  5. Modularity and Integration: Exploring optimal ways to combine symbolic reasoning components with LLMs within the CoALA structure, leveraging the strengths of both paradigms.

Conclusion

The Cognitive Architectures for Language Agents (CoALA) framework provides a valuable conceptual tool for understanding, organizing, and advancing the development of language agents. By decomposing agents into distinct memory, action, and decision-making components inspired by cognitive science, CoALA facilitates systematic analysis of existing systems and offers a structured approach for designing future agents with enhanced reasoning, planning, and interactive capabilities. It bridges the gap between the empirical successes of recent language agents and the long-standing principles of cognitive architecture research.

Authors (4)
  1. Shunyu Yao
  2. Karthik Narasimhan
  3. Thomas L. Griffiths
  4. Theodore R. Sumers
Citations (114)