
MARS: Metacognitive Self-Improvement Agents

Updated 24 January 2026
  • MARS is a class of generative agent architectures that integrate a metacognitive layer for explicit introspection, self-evaluation, and autonomous learning.
  • They use a dual-process model where fast, intuitive actions are paired with slow, reflective policy revisions to optimize task performance.
  • MARS frameworks extend to multi-agent systems and employ advanced memory management and reflection techniques for robust self-improvement.

Metacognitive Agent Reflective Self-Improvement (MARS) refers to a class of generative agent architectures that possess explicit mechanisms for introspection, self-evaluation, strategy revision, and autonomous learning. MARS frameworks are designed to enable agents to significantly enhance their goal-directed performance by continuously observing, evaluating, and modifying their own cognitive processes, typically through a dedicated metacognitive layer. Inspired by both cognitive psychology (dual-process theory) and computational metacognition, MARS systems operationalize a model of self-improvement that is formal, introspective, and dynamically adaptive across diverse domains and tasks (Toy et al., 2024, Hou et al., 17 Jan 2026, Liu et al., 5 Jun 2025, Zhao et al., 23 May 2025, Ozer et al., 23 Dec 2025, Liang et al., 25 Mar 2025, 0807.4417, Cox et al., 2022, Bilal et al., 20 Apr 2025).

1. Formal Framework and Objectives

A MARS agent is characterized by the augmentation of standard generative policies with a metacognitive module responsible for explicit monitoring and adaptation of reasoning. At each timestep $t$, the agent maintains:

  • State: $s_t$ (environment-agent state)
  • Observation: $o_t \in O$
  • Action: $a_t \sim \pi_{t-1}(s_{t-1}, M_{t-1})$
  • Memory: $M_t$ (multiset of experiences, thoughts, meta-thoughts)
  • Self-evaluation: $e_t = E(s_{t-1}, a_t, M_{t-1})$, with $E: S \times A \times M \rightarrow \mathbb{R}$

The optimization objective is to maximize cumulative self-evaluation,

$\max_\pi \sum_{t=1}^T e_t,$

where $e_t$ measures internally scored progress toward task completion or cognitive goals. The internal self-evaluation is not identical to an extrinsic reward; it is an agent-generated, context-aware signal that drives learning and adaptation (Toy et al., 2024, Liu et al., 5 Jun 2025).

Formally, MARS decomposes into three interacting components (Liu et al., 5 Jun 2025):

  • Metacognitive Knowledge ($K$): structured beliefs over skills, tasks, and strategies,
  • Metacognitive Planning ($P$): selection of learning targets and strategies,
  • Metacognitive Evaluation ($E$): reflection on cognitive/learning outcomes to update $K$ and $P$.
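The $K$/$P$/$E$ decomposition can be sketched as a small data structure plus an update rule. This is a minimal illustrative sketch, not the paper's implementation: the names `MetacognitiveState` and `evaluate_and_update`, the belief-update rule, and the re-planning heuristic are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical sketch of the K/P/E decomposition; all names are illustrative.
@dataclass
class MetacognitiveState:
    knowledge: Dict[str, float] = field(default_factory=dict)  # K: belief scores over strategies
    plan: List[str] = field(default_factory=list)              # P: ordered learning targets

def evaluate_and_update(state: MetacognitiveState,
                        outcome_score: float,
                        strategy: str,
                        lr: float = 0.1) -> MetacognitiveState:
    """E: reflect on an outcome, update the belief in K, then re-rank P."""
    prior = state.knowledge.get(strategy, 0.5)
    # Move the belief toward the observed outcome (simple exponential update).
    state.knowledge[strategy] = prior + lr * (outcome_score - prior)
    # Re-plan: target the least-mastered strategies first.
    state.plan = sorted(state.knowledge, key=state.knowledge.get)
    return state
```

Evaluation feeds back into both knowledge and planning, mirroring the loop the three bullets describe.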

2. Architectural Design: Dual-System and Multi-Agent Extensions

The canonical MARS agent instantiates a dual-process architecture:

  • System 1: Fast, habitual, and intuitive inference, operationalized as the standard LLM-driven policy $\pi(s, M)$. This system executes immediate action selection and basic reasoning.
  • System 2: Slow, deliberative, and reflective, implemented as a metacognitive controller. System 2 monitors System 1 outputs and internal states, triggers introspection when performance is sub-threshold, generates meta-questions (e.g., "How can I improve?"), and revises strategies, prompts, or weights.

Memory management is explicit: each memory $m_i \in M$ carries an embedding $v_i$ and an importance weight. Memory is pruned via top-$K$ importance rules to conform with context window constraints.
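Top-$K$ importance pruning can be sketched in a few lines; the `MemoryItem` fields and the choice of `heapq.nlargest` are assumptions for illustration, not the frameworks' actual data model.

```python
import heapq
from dataclasses import dataclass
from typing import List

# Illustrative memory record: field names are assumptions.
@dataclass
class MemoryItem:
    text: str
    embedding: List[float]  # v_i
    importance: float       # importance weight

def prune_memory(memory: List[MemoryItem], k: int) -> List[MemoryItem]:
    """Keep only the K most important memories to fit the context window."""
    return heapq.nlargest(k, memory, key=lambda m: m.importance)
```

In practice the importance weight might itself be produced by the metacognitive layer (e.g., scoring how often a reflection proved useful).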

Extension to multi-agent MARS arises by distributing introspective functions across a set of specialized subagents—critics, judges, supervisors, debaters—enabling modular role assignments, structured disagreement, and learned aggregation of self-assessment (Ozer et al., 23 Dec 2025, Bilal et al., 20 Apr 2025).

3. Reflective Dynamics and Self-Improvement Cycle

MARS executes a recurrent cycle of action, monitoring, introspection, and policy revision:

  1. Observe: Ingest new observation $o_t$; update $M_t$ with salient experiences.
  2. Act: System 1 generates action $a_t$ using the current policy conditioned on $s_{t-1}$, $M_{t-1}$.
  3. Self-Evaluate: System 2 computes $e_t = E(s_{t-1}, a_t, M_{t-1})$, scoring recent behavior based on internal criteria.
  4. Introspect: If $e_t < \theta$, metacognitive routines engage. The agent generates meta-questions and synthesizes meta-thoughts (e.g., "What went wrong?"), which are appended to $M_t$.
  5. Revise Policy: A new strategy prompt or policy update is synthesized and installed as $\pi_t$.
  6. Stopping/Convergence: The cycle iterates until $e_t$ converges to threshold $\theta$ for $K$ steps or a maximum number of introspections is reached. Convergence is defined by $|e_{t+1} - e_t| < \varepsilon$ with $\varepsilon$ small (Toy et al., 2024).

Policy updates can be cast as minimization of a self-evaluation-weighted loss: $\mathcal{L}(\pi) = -\sum_{t=1}^T \log \pi(a_t \mid s_t, M_t) \cdot e_t$, thus reinforcing actions that yield higher self-evaluation (Toy et al., 2024).
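The loss above can be computed directly from per-step action probabilities and self-evaluations. A minimal numeric sketch (the function name and list-based interface are assumptions; a real implementation would operate on log-probability tensors):

```python
import math
from typing import List

def policy_loss(action_probs: List[float], evaluations: List[float]) -> float:
    """L(pi) = -sum_t log pi(a_t | s_t, M_t) * e_t.

    Steps with high self-evaluation e_t contribute larger negative-log-prob
    terms, so minimizing L pushes probability mass toward well-evaluated actions.
    """
    return -sum(math.log(p) * e for p, e in zip(action_probs, evaluations))
```

For example, raising the probability of a highly self-evaluated action strictly lowers the loss, which is exactly the reinforcement effect described in the text.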

Pseudocode Summary (System 2 Loop):

function MARS_Agent_Step(s_{t-1}, M_{t-1}, π_{t-1}):
    o_t = ObserveEnvironment()
    M_temp = UpdateMemory(M_{t-1}, o_t)
    a_t ~ π_{t-1}(s_{t-1}, M_temp)
    e_t = SelfEvaluate(s_{t-1}, a_t, M_temp)
    
    if e_t < θ:
        Q_meta = GenerateMetaQuestion(M_temp, history)
        meta_thought = LLM("How can I improve given Q_meta and M_temp?")
        M_temp.append(meta_thought)
        strategy_prompt = SynthStrategyPrompt(M_temp, meta_thought)
        π_t = RepromptPolicy(strategy_prompt)
    else:
        π_t = π_{t-1}
    return (a_t, M_temp, π_t)
(Toy et al., 2024)

4. Multi-Agent MARS and Preventing Thought Degeneration

Multi-agent extensions of MARS, typified by MAR (Multi-Agent Reflexion), mitigate the degeneration of self-reflection by incorporating a pool of persona-specific critics and a judge. Each critic analyzes failed trials from distinct methodological perspectives (e.g., skepticism, verification, creativity), producing diverse reflections. The central judge aggregates these into a consensus, yielding more robust and diverse self-improvement updates (Ozer et al., 23 Dec 2025).

MAR Algorithmic Loop:

  • Actor generates solution,
  • Evaluator checks correctness,
  • Persona critics each reflect and generate their perspectives,
  • Judge synthesizes reflections,
  • Actor is prompted with consensus reflection for subsequent attempts.
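The critic/judge portion of this loop can be sketched as follows. This is a hedged illustration, not the MAR implementation: `llm` stands in for any text-completion call, and the persona names are the examples given above.

```python
from typing import Callable, Sequence

def mar_reflection(task: str, failed_attempt: str,
                   llm: Callable[[str], str],
                   personas: Sequence[str] = ("skepticism", "verification", "creativity")) -> str:
    """Collect persona-specific critiques of a failed trial, then have a
    judge synthesize them into one consensus reflection."""
    # Each persona critic reflects on the failure from its own perspective.
    reflections = [
        llm(f"As a critic emphasizing {p}, analyze why this attempt at "
            f"'{task}' failed:\n{failed_attempt}")
        for p in personas
    ]
    # The judge aggregates the diverse critiques into a single reflection
    # that re-prompts the actor on the next trial.
    joined = "\n---\n".join(reflections)
    return llm("Synthesize these critiques into a single consensus reflection:\n" + joined)
```

Structured disagreement enters through the distinct persona prompts; the judge's synthesis step is what prevents any single critic's bias from dominating.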

Empirical benchmarks demonstrate that MAR (multi-agent) yields higher accuracy than single-agent reflexion, with HotPotQA EM increasing to 47% (vs 44% for single-agent reflexion and 32% for vanilla ReAct) (Ozer et al., 23 Dec 2025).

Tabular summary (HotPotQA, trial 5):

Method             EM (%)
Baseline (ReAct)   32.0
Reflexion          44.0
MAR (multi-agent)  47.0

5. Principle-Based and Procedural Metacognitive Reflection

Recent MARS frameworks synthesize human-inspired reflection modalities for efficient self-improvement (Hou et al., 17 Jan 2026):

  • Principle-Based Reflection: Abstracts normative avoidance rules from error clusters, providing explicit "what to avoid" enhancements (concise warnings or "dos/don'ts").
  • Procedural Reflection: Derives stepwise strategies from successful reasoning traces, formulating guides to "how to succeed" (reasoning checkpoints, algorithmic steps).

A single-cycle algorithm processes diagnostic failures, clusters error types, and distills both principle and procedural enhancements, which are incorporated into new prompts. This approach yields state-of-the-art or near state-of-the-art performance across benchmarks (e.g., +6.4 F1 on DROP over the zero-shot baseline; +4.7 points on MMLU over the zero-shot-CoT base) at a fraction of the computational cost of recursive agents (Hou et al., 17 Jan 2026).
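The clustering-and-distillation step can be sketched as below. This is an assumption-laden simplification: in practice an LLM would tag each failure with an error type, whereas here the failures arrive pre-tagged, and `distill_principles` is an invented name.

```python
from collections import Counter
from typing import List, Tuple

def distill_principles(failures: List[Tuple[str, str]],
                       min_count: int = 2) -> List[str]:
    """Cluster failures by error type and emit 'what to avoid' rules
    for any type observed at least min_count times.

    Each failure is a (error_type, example) pair; recurring error types
    become concise avoidance principles for the next prompt.
    """
    clusters = Counter(error_type for error_type, _ in failures)
    return [f"[!] Avoid: {etype}"
            for etype, n in clusters.most_common() if n >= min_count]
```

The `min_count` threshold is what turns isolated mistakes into normative rules only when they recur, matching the "error cluster" framing above.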

Example prompt with principle-based enhancement:

## GUIDANCE
– [!] Don’t confuse enthalpy with internal energy.
– [!] Always check unit consistency (Kelvin vs Celsius).
– [!] When in doubt, re-derive from first principles.

6. Memory Management, Lifelong Learning, and Evaluation

MARS agents employ mechanisms for efficient memory use and knowledge accumulation. For example, memory optimization via Ebbinghaus forgetting curves retains high-utility reflections in short-term memory and demotes less-salient data to long-term storage, balancing context limitations and knowledge persistence (Liang et al., 25 Mar 2025). Lessons distilled from code reasoning or general tasks are periodically condensed and injected for future task context (e.g., MARCO framework (Zhao et al., 23 May 2025)).
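The forgetting-curve mechanism can be illustrated with the classic Ebbinghaus retention form $R(t) = e^{-t/S}$; the tiering rule and threshold below are assumptions for the sketch, not the cited framework's parameters.

```python
import math

def retention(elapsed: float, stability: float) -> float:
    """Ebbinghaus retention R(t) = exp(-t / S).

    Higher-utility reflections are modeled with a larger stability S,
    so they decay more slowly.
    """
    return math.exp(-elapsed / stability)

def tier(elapsed: float, stability: float, threshold: float = 0.5) -> str:
    """Keep well-retained items in short-term memory; demote the rest
    to long-term storage."""
    return "short_term" if retention(elapsed, stability) >= threshold else "long_term"
```

Demotion rather than deletion preserves knowledge persistence while keeping the short-term (in-context) memory within context limits.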

Evaluation combines internal self-evaluation signals (e.g., the trajectory and convergence of $e_t$) with external task benchmarks such as HotPotQA, DROP, and MMLU.

7. Theoretical Generalizations and Open Challenges

MARS agents span a spectrum from extrinsic, human-prescribed meta-loops to intrinsic, agent-driven metacognitive learning. Explicit modeling of metacognitive knowledge, planning, and evaluation enables online adaptation to new tasks and environments, enhances scalability, and reduces reliance on hand-coded curricula (Liu et al., 5 Jun 2025). Theoretical frameworks such as emotion-gradient metacognitive RSI (EG-MRSI) introduce differentiable intrinsic motivation, formal safety constraints, and meaning-density metrics, advancing MARS toward theoretically grounded, open-ended self-improvement (Ando, 12 May 2025).

Challenges and directions include:

  • Optimizing division of metacognitive labor between human and agent,
  • Bootstrapping reliable metacognitive beliefs from unreliable or hallucinated priors,
  • Ensuring safety and reward alignment under autonomous policy evolution,
  • Extending MARS to multi-agent social learning and meta-level planning,
  • Scaling reflection under context and computational constraints (Liu et al., 5 Jun 2025, Hou et al., 17 Jan 2026, Ozer et al., 23 Dec 2025).
