MARS: Metacognitive Self-Improvement Agents
- MARS is a class of generative agent architectures that integrate a metacognitive layer for explicit introspection, self-evaluation, and autonomous learning.
- They use a dual-process model where fast, intuitive actions are paired with slow, reflective policy revisions to optimize task performance.
- MARS frameworks extend to multi-agent systems and employ advanced memory management and reflection techniques for robust self-improvement.
Metacognitive Agent Reflective Self-Improvement (MARS) refers to a class of generative agent architectures that possess explicit mechanisms for introspection, self-evaluation, strategy revision, and autonomous learning. MARS frameworks are designed to enable agents to significantly enhance their goal-directed performance by continuously observing, evaluating, and modifying their own cognitive processes, typically through a dedicated metacognitive layer. Inspired by both cognitive psychology (dual-process theory) and computational metacognition, MARS systems operationalize a model of self-improvement that is formal, introspective, and dynamically adaptive across diverse domains and tasks (Toy et al., 2024, Hou et al., 17 Jan 2026, Liu et al., 5 Jun 2025, Zhao et al., 23 May 2025, Ozer et al., 23 Dec 2025, Liang et al., 25 Mar 2025, 0807.4417, Cox et al., 2022, Bilal et al., 20 Apr 2025).
1. Formal Framework and Objectives
A MARS agent is characterized by the augmentation of standard generative policies with a metacognitive module responsible for explicit monitoring and adaptation of reasoning. At each timestep t, the agent maintains:
- State: s_t (environment-agent state)
- Observations: o_t
- Action: a_t
- Memory: M_t (multiset of experiences, thoughts, meta-thoughts)
- Self-evaluation: e_t = E(s_t, a_t, M_t), with e_t ∈ [0, 1]
The optimization objective is to maximize cumulative self-evaluation,

J(π) = 𝔼_π[ Σ_t e_t ],

where e_t measures internally scored progress toward task completion or cognitive goals. The internal self-evaluation e_t is not identical to an extrinsic reward; it is an agent-generated, context-aware signal that drives learning and adaptation (Toy et al., 2024, Liu et al., 5 Jun 2025).
Formally, MARS decomposes into three interacting components (Liu et al., 5 Jun 2025):
- Metacognitive Knowledge (MK): structured beliefs over skills, tasks, and strategies,
- Metacognitive Planning (MP): selection of learning targets and strategies,
- Metacognitive Evaluation (ME): reflection on cognitive/learning outcomes to update MK and MP.
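The three-component decomposition can be sketched as a minimal data structure. The class names, fields, and the 0.5 revision threshold below are illustrative assumptions for exposition, not part of any cited framework:

```python
from dataclasses import dataclass, field

@dataclass
class MetacognitiveKnowledge:
    """MK: structured beliefs over skills, tasks, and strategies."""
    skill_beliefs: dict = field(default_factory=dict)   # skill name -> estimated competence
    strategy_notes: list = field(default_factory=list)  # accumulated strategy beliefs

@dataclass
class MetacognitiveState:
    """Ties MK, MP, and ME together for a single agent."""
    knowledge: MetacognitiveKnowledge = field(default_factory=MetacognitiveKnowledge)
    current_plan: str = ""        # MP: currently selected learning target / strategy
    last_evaluation: float = 0.0  # ME: score from the most recent reflection

    def evaluate(self, outcome_score: float, lesson: str) -> None:
        """ME: reflect on an outcome, updating MK (notes) and MP (plan)."""
        self.last_evaluation = outcome_score
        self.knowledge.strategy_notes.append(lesson)
        if outcome_score < 0.5:  # poor outcome: revise the learning plan
            self.current_plan = f"revise: {lesson}"
```

The key design point is the feedback edge: ME writes back into both MK and MP, which is what distinguishes this loop from a plain evaluator.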
2. Architectural Design: Dual-System and Multi-Agent Extensions
The canonical MARS agent instantiates a dual-process architecture:
- System 1: Fast, habitual, and intuitive inference, operationalized as the standard LLM-driven policy π. This system executes immediate action selection and basic reasoning.
- System 2: Slow, deliberative, and reflective, implemented as a metacognitive controller. System 2 monitors System 1 outputs and internal states, triggers introspection when performance is sub-threshold, generates meta-questions (e.g., "How can I improve?"), and revises strategies, prompts, or weights.
Memory management is explicit: each memory carries an embedding and an importance weight. Memory is pruned via top-k importance rules to conform to context window constraints.
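A minimal sketch of top-k importance pruning, assuming memories are stored as (importance, text) pairs; the representation and values here are illustrative, not a cited implementation:

```python
import heapq

def prune_memory(memories, k):
    """Keep only the top-k memories by importance weight.

    `memories` is a list of (importance, text) pairs; ties break arbitrarily.
    A real agent would also carry embeddings alongside each entry.
    """
    return heapq.nlargest(k, memories, key=lambda m: m[0])

# Example: a memory store exceeding a context budget of 2 entries.
store = [(0.9, "critical failure analysis"),
         (0.2, "routine observation"),
         (0.7, "useful strategy note")]
pruned = prune_memory(store, k=2)
# The low-importance routine observation is dropped.
```

`heapq.nlargest` runs in O(n log k), which matters when the memory multiset grows much faster than the context budget.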
Extension to multi-agent MARS arises by distributing introspective functions across a set of specialized subagents—critics, judges, supervisors, debaters—enabling modular role assignments, structured disagreement, and learned aggregation of self-assessment (Ozer et al., 23 Dec 2025, Bilal et al., 20 Apr 2025).
3. Reflective Dynamics and Self-Improvement Cycle
MARS executes a recurrent cycle of action, monitoring, introspection, and policy revision:
- Observe: Ingest new observation o_t; update M_t with salient experiences.
- Act: System 1 generates action a_t using the current policy π_{t−1} conditioned on s_{t−1} and M_t.
- Self-Evaluate: System 2 computes e_t = E(s_t, a_t, M_t), scoring recent behavior based on internal criteria.
- Introspect: If e_t < θ, metacognitive routines engage. The agent generates meta-questions and synthesizes meta-thoughts (e.g., "What went wrong?"), which are appended to M_t.
- Revise Policy: A new strategy prompt or policy update is synthesized and installed as π_t.
- Stopping/Convergence: The cycle iterates until e_t remains above the threshold θ for a fixed number of steps or a maximum number of introspections is reached. Convergence is defined by |e_t − e_{t−1}| < ε with small ε (Toy et al., 2024).
Policy updates can be cast as minimization of a self-evaluation-weighted loss,

L(π) = −Σ_t e_t · log π(a_t | s_t, M_t),

thus reinforcing actions that yield higher self-evaluation (Toy et al., 2024).
Pseudocode Summary (System 2 Loop):

```
function MARS_Agent_Step(s_{t-1}, M_{t-1}, π_{t-1}):
    o_t = ObserveEnvironment()
    M_temp = UpdateMemory(M_{t-1}, o_t)
    a_t ∼ π_{t-1}(s_{t-1}, M_temp)
    e_t = SelfEvaluate(s_{t-1}, a_t, M_temp)
    if e_t < θ:
        Q_meta = GenerateMetaQuestion(M_temp, history)
        meta_thought = LLM("How can I improve given Q_meta and M_temp?")
        M_temp.append(meta_thought)
        strategy_prompt = SynthStrategyPrompt(M_temp, meta_thought)
        π_t = RepromptPolicy(strategy_prompt)
    else:
        π_t = π_{t-1}
    return (a_t, M_temp, π_t)
```
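The loop can be made concrete in runnable form. In this sketch the `llm` stub, the threshold value `THETA`, and the string-based policy are stand-ins for real model calls; only the control flow mirrors the pseudocode above:

```python
import random

THETA = 0.6  # self-evaluation threshold θ (assumed value for this sketch)

def llm(prompt):
    """Stand-in for an LLM call; a real system would query a model here."""
    return f"meta-thought: reconsider approach given '{prompt[:40]}'"

def self_evaluate(state, action, memory):
    """Internal, agent-generated score e_t in [0, 1]; stubbed as random here."""
    return random.random()

def mars_agent_step(state, memory, policy, score=None):
    """One MARS step: observe, act, self-evaluate, and introspect if sub-threshold."""
    observation = "obs"                     # ObserveEnvironment()
    memory = memory + [observation]         # UpdateMemory
    action = policy(state, memory)          # a_t ~ π_{t-1}
    e_t = score if score is not None else self_evaluate(state, action, memory)
    if e_t < THETA:                         # sub-threshold: engage System 2
        meta_q = f"How can I improve? (memory size {len(memory)})"
        meta_thought = llm(meta_q)
        memory = memory + [meta_thought]    # append meta-thought to memory
        strategy = f"strategy derived from: {meta_thought}"
        new_policy = lambda s, m, strat=strategy: f"act({strat})"  # reprompted π_t
    else:
        new_policy = policy                 # keep π_{t-1}
    return action, memory, new_policy
```

The `score` parameter exposes e_t for testing; in deployment `self_evaluate` would itself be an LLM judgment rather than a random draw.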
4. Multi-Agent MARS and Preventing Thought Degeneration
Multi-agent extensions of MARS, typified by MAR (Multi-Agent Reflexion), mitigate the degeneration of self-reflection by incorporating a pool of persona-specific critics and a judge. Each critic analyzes failed trials from distinct methodological perspectives (e.g., skepticism, verification, creativity), producing diverse reflections. The central judge aggregates these into a consensus, yielding more robust and diverse self-improvement updates (Ozer et al., 23 Dec 2025).
MAR Algorithmic Loop:
- Actor generates solution,
- Evaluator checks correctness,
- Persona critics each reflect and generate their perspectives,
- Judge synthesizes reflections,
- Actor is prompted with consensus reflection for subsequent attempts.
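The MAR loop admits a minimal executable sketch, with toy actor/evaluator/critic/judge callables standing in for LLM-backed roles; all names and behaviors here are illustrative:

```python
def mar_round(task, actor, evaluator, critics, judge, max_trials=3):
    """One MAR loop: actor -> evaluator -> persona critics -> judge -> retry."""
    reflection = ""
    for trial in range(max_trials):
        solution = actor(task, reflection)
        if evaluator(task, solution):
            return solution, trial
        # Each persona critic reflects on the failure from its own perspective.
        reflections = [critic(task, solution) for critic in critics]
        # The judge synthesizes a consensus reflection for the next attempt.
        reflection = judge(reflections)
    return None, max_trials

# Toy roles: the actor only succeeds once the consensus tells it to "verify".
actor = lambda task, refl: "verified answer" if "verify" in refl else "hasty answer"
evaluator = lambda task, sol: sol == "verified answer"
critics = [lambda t, s: "be skeptical",
           lambda t, s: "verify each step",
           lambda t, s: "try a creative angle"]
judge = lambda refls: " | ".join(refls)

solution, trials = mar_round("question", actor, evaluator, critics, judge)
# -> ("verified answer", 1): success on the second attempt after one reflection round.
```

The point of the structure is that the actor never sees raw critic output, only the judge's synthesis, which is what damps degenerate or contradictory reflections.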
Empirical benchmarks demonstrate that MAR (multi-agent) yields higher accuracy than single-agent reflexion, with HotPotQA EM increasing to 47% (vs 44% for single-agent reflexion and 32% for vanilla ReAct) (Ozer et al., 23 Dec 2025).
Tabular summary (HotPotQA, trial 5):
| Method | EM (%) |
|---|---|
| Baseline (ReAct) | 32.0 |
| Reflexion | 44.0 |
| MAR (multi-agent) | 47.0 |
5. Principle-Based and Procedural Metacognitive Reflection
Recent MARS frameworks synthesize human-inspired reflection modalities for efficient self-improvement (Hou et al., 17 Jan 2026):
- Principle-Based Reflection: Abstracts normative avoidance rules from error clusters, providing explicit "what to avoid" enhancements (concise warnings or "dos/don'ts").
- Procedural Reflection: Derives stepwise strategies from successful reasoning traces, formulating guides to "how to succeed" (reasoning checkpoints, algorithmic steps).
A single-cycle algorithm processes diagnostic failures, clusters error types, and distills both principle and procedural enhancements, which are incorporated into new prompts. This approach yields state-of-the-art or near state-of-the-art performance across benchmarks (e.g., +6.4 F1 points on DROP over the zero-shot baseline; +4.7 points on MMLU over the zero-shot-CoT baseline) at a fraction of the computational cost of recursive agents (Hou et al., 17 Jan 2026).
Example prompt with principle-based enhancement:

```
## GUIDANCE
– [!] Don't confuse enthalpy with internal energy.
– [!] Always check unit consistency (Kelvin vs Celsius).
– [!] When in doubt, re-derive from first principles.
```
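The single-cycle pipeline (collect failures, cluster error types, distill avoidance principles, inject into a prompt) might be sketched as follows; the tag-counting "clustering" is a toy stand-in for LLM-based clustering of free-text diagnoses, and all names are illustrative:

```python
from collections import Counter

def distill_principles(failures, top_n=3):
    """Cluster failure tags and emit 'what to avoid' warnings for a prompt.

    `failures` is a list of (error_type, description) pairs; a real system
    would cluster free-text diagnoses with an LLM rather than exact tags.
    """
    counts = Counter(tag for tag, _ in failures)
    principles = []
    for tag, _ in counts.most_common(top_n):
        desc = next(d for t, d in failures if t == tag)
        principles.append(f"- [!] Avoid {tag}: {desc}")
    return "## GUIDANCE\n" + "\n".join(principles)

failures = [("unit-mismatch", "mixed Kelvin and Celsius"),
            ("unit-mismatch", "dropped a conversion factor"),
            ("sign-error", "flipped enthalpy sign")]
prompt_block = distill_principles(failures)
# Most frequent error cluster ("unit-mismatch") appears first in the guidance.
```

Ordering principles by cluster frequency puts the most common failure mode at the top of the injected guidance, where prompt position matters most.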
6. Memory Management, Lifelong Learning, and Evaluation
MARS agents employ mechanisms for efficient memory use and knowledge accumulation. For example, memory optimization via Ebbinghaus forgetting curves retains high-utility reflections in short-term memory and demotes less-salient data to long-term storage, balancing context limitations and knowledge persistence (Liang et al., 25 Mar 2025). Lessons distilled from code reasoning or general tasks are periodically condensed and injected for future task context (e.g., MARCO framework (Zhao et al., 23 May 2025)).
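The Ebbinghaus-style triage could be sketched with the classical retention curve R = exp(−t/S), where t is elapsed time and S is memory strength; the strength values and the 0.5 threshold below are illustrative assumptions:

```python
import math

def retention(elapsed, strength):
    """Ebbinghaus-style retention R = exp(-t / S)."""
    return math.exp(-elapsed / strength)

def triage_memories(memories, now, threshold=0.5):
    """Split memories into short-term (retained) and long-term (demoted) stores.

    Each memory is (timestamp, strength, text); high-utility reflections are
    assigned a larger strength S, so they decay more slowly.
    """
    short_term, long_term = [], []
    for ts, strength, text in memories:
        if retention(now - ts, strength) >= threshold:
            short_term.append(text)
        else:
            long_term.append(text)
    return short_term, long_term

memories = [(0.0, 10.0, "key reflection: verify units"),  # high utility, slow decay
            (0.0, 1.0, "minor observation")]              # low utility, fast decay
st, lt = triage_memories(memories, now=3.0)
# retention: exp(-0.3) ≈ 0.74 (kept short-term), exp(-3.0) ≈ 0.05 (demoted)
```

Demoted entries are moved, not deleted: long-term storage preserves knowledge while keeping the short-term store inside the context budget.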
Evaluation methodologies encompass:
- Task-based success rates (e.g., survival in complex environments, accuracy in QA/code reasoning),
- Learning curves over repeated scenarios (e.g., monotonic improvement in internal evaluation e_t),
- Component analysis (ablation of meta-reflection, cross-referencing),
- Outcome-based metrics on transfer, adaptation, and long-context or multi-tasking ability (Hou et al., 17 Jan 2026, Liang et al., 25 Mar 2025, Zhao et al., 23 May 2025, Ozer et al., 23 Dec 2025).
7. Theoretical Generalizations and Open Challenges
MARS agents span a continuum from extrinsic, human-prescribed meta-loops to intrinsic, agent-driven metacognitive learning. Explicit modeling of metacognitive knowledge, planning, and evaluation enables online adaptation to new tasks and environments, enhances scalability, and reduces reliance on hand-coded curricula (Liu et al., 5 Jun 2025). Theoretical frameworks such as emotion-gradient metacognitive RSI (EG-MRSI) introduce differentiable intrinsic motivation, formal safety constraints, and meaning-density metrics, advancing MARS toward theoretically grounded, open-ended self-improvement (Ando, 12 May 2025).
Challenges and directions include:
- Optimizing division of metacognitive labor between human and agent,
- Bootstrapping reliable metacognitive beliefs from unreliable or hallucinated priors,
- Ensuring safety and reward alignment under autonomous policy evolution,
- Extending MARS to multi-agent social learning and meta-level planning,
- Scaling reflection under context and computational constraints (Liu et al., 5 Jun 2025, Hou et al., 17 Jan 2026, Ozer et al., 23 Dec 2025).
References:
- (Toy et al., 2024) Metacognition is all you need?
- (Hou et al., 17 Jan 2026) Learn Like Humans: Use Meta-cognitive Reflection for Efficient Self-Improvement
- (Ozer et al., 23 Dec 2025) MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs
- (Liu et al., 5 Jun 2025) Truly Self-Improving Agents Require Intrinsic Metacognitive Learning
- (Zhao et al., 23 May 2025) MARCO: Meta-Reflection with Cross-Referencing for Code Reasoning
- (Liang et al., 25 Mar 2025) MARS: Memory-Enhanced Agents with Reflective Self-improvement
- (Ando, 12 May 2025) Emotion-Gradient Metacognitive RSI (Part I): Theoretical Foundations
- (Cox et al., 2022) Computational Metacognition
- (0807.4417) On Introspection, Metacognitive Control and Augmented Data Mining Live Cycles
- (Bilal et al., 20 Apr 2025) Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey