M3-Agent: Multimodal Cognitive Framework
- M3-Agent is a multimodal system that integrates visual and auditory inputs to build episodic and semantic memory, enabling robust cross-modal reasoning.
- It employs a dual-process architecture that separates continual memory formation from a control process that supports iterative, reinforcement learning–optimized task solving.
- Empirical evaluations on M3-Bench indicate significant accuracy improvements over state-of-the-art prompting agents, emphasizing the benefits of long-horizon memory retention and multi-turn retrieval.
M3-Agent is a multimodal agent framework equipped with long-term memory, capable of performing real-time perception, memory formation, and iterative reasoning. The architecture is designed to emulate aspects of human cognition—specifically, the processes of seeing, listening, remembering, and reasoning—in an autonomous computational agent that operates over extended temporal horizons in dynamic, real-world environments (Long et al., 13 Aug 2025).
1. Architectural Principles
The M3-Agent architecture operates via two parallel modules: the memorization process and the control process.
- Memorization Process: This module accepts a continuous stream of visual and auditory inputs, extracting both fine-grained event representations and high-level semantic knowledge. Episodic memory is constructed from atomic multimodal events (e.g., physical actions, spoken utterances, environmental state changes). Semantic memory abstracts over these, capturing relationships, world knowledge, character preferences, and entities’ roles. External tools such as facial recognition and speaker identification are integrated, supporting robust entity association.
- Control Process: When a task instruction is received, the control process leverages the long-term memory store for iterative, multi-turn reasoning. Specialized search operators (e.g., `search_node`, `search_clip`) enable retrieval of relevant entities, time segments, or contextual knowledge. The process is fully autonomous, with no human in the loop: it iteratively calls search tools and updates the dialogue state, terminating only when an answer is produced.
This separation allows for scalable memory accumulation and principled interaction between perception, memory, and task-solving policies.
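A minimal sketch of the memorization side is given below. The record layout, the `identify_speaker` hook standing in for the external face-recognition and speaker-identification tools, and the co-occurrence-based abstraction are all illustrative assumptions; the paper's actual pipeline uses learned models for these steps.

```python
def extract_episodic(clip, identify_speaker):
    """Turn one raw multimodal clip into an atomic episodic event.

    `identify_speaker` stands in for external tools (face recognition,
    speaker ID) that resolve who acted or spoke; it is an assumption here.
    """
    return {
        "kind": "episodic",
        "entity": identify_speaker(clip),
        "content": clip["transcript"],
        "time": clip["t"],
    }

def abstract_semantic(episodic_events):
    """Derive coarse semantic facts: which entity is associated with what.

    A real system would abstract preferences and relations with an LLM;
    this stub just aggregates each entity's accumulated evidence.
    """
    facts = {}
    for e in episodic_events:
        facts.setdefault(e["entity"], []).append(e["content"])
    return [{"kind": "semantic", "entity": k, "evidence": v}
            for k, v in facts.items()]
```

The key design point carried over from the architecture is that episodic records are written continuously, while semantic records are periodically abstracted over them rather than extracted directly from the stream.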
2. Multimodal Long-Term Memory: Structure and Representation
M3-Agent maintains its memory as a multimodal, entity-centric graph:
- Nodes: Encapsulate entity attributes acquired through diverse modalities (faces, voices, textual utterances).
- Edges: Encode temporal, relational, and semantic connections between entities and events.
Two main memory types are maintained:
- Episodic Memory: Concrete multimodal events. For instance, an entry might reflect “Alice takes coffee and says: ‘I can't go without this in the morning.’”
- Semantic Memory: Abstracted knowledge extracted from episodic data, e.g., “Alice prefers coffee in the morning,” relationships between subjects, roles, and scene-level attributes.
Entity-centric linkage (e.g., connecting a face node and a voice node to the same person) ensures multimodal consistency for cross-modal reasoning, supporting robust retrieval and inference over long time spans.
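A minimal sketch of such an entity-centric graph, assuming simple node and edge records (the paper does not specify a concrete schema, so all names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    modality: str            # e.g. "face", "voice", "entity", "text"
    attributes: dict = field(default_factory=dict)

@dataclass
class MemoryGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src, relation, dst)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def link(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation=None):
        # Follow outgoing edges, optionally filtered by relation type.
        return [d for s, r, d in self.edges
                if s == node_id and (relation is None or r == relation)]

# Cross-modal entity linkage: a face node and a voice node both
# resolve to the same person node, so evidence from either modality
# can be attributed to "Alice".
g = MemoryGraph()
g.add_node(Node("face_1", "face"))
g.add_node(Node("voice_1", "voice"))
g.add_node(Node("person_alice", "entity", {"name": "Alice"}))
g.link("face_1", "same_entity", "person_alice")
g.link("voice_1", "same_entity", "person_alice")
g.link("person_alice", "prefers", "coffee_in_morning")
```

With this linkage, a retrieval that starts from a voice match and one that starts from a face match converge on the same entity node, which is what makes cross-modal questions answerable.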
3. Iterative, Multi-Turn Reasoning and Retrieval
The reasoning strategy is multi-turn and policy-driven:
- The policy model π_θ starts with a system prompt and the task instruction.
- At each turn, the model generates either a `[Search]` action (invoking search tools to query the memory graph) or an `[Answer]` action (terminating the loop).
- Algorithmically, the scheme is:
```
for i = 0 to H-1 do
    τ_i ← π_θ(dialogue history)
    if τ_i indicates "[Search]" then
        memory ← Search(MEM, query)
        append memory + prompt to dialogue history
    else
        break          // "[Answer]" was produced
end for
```
- Each trajectory is scored by an automatic judge (a GPT-4o evaluator); the reinforcement learning update optimizes the policy via DAPO (a PPO variant), maximizing a clipped surrogate objective of the form E[min(r(θ)·Â, clip(r(θ), 1−ε_low, 1+ε_high)·Â)], where r(θ) is the policy probability ratio and Â is the normalized advantage.
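The clipped update can be illustrated numerically. The asymmetric clip bounds below follow DAPO's decoupled-clipping idea, but the specific ε values and the group-normalization scheme are illustrative assumptions, not taken from the paper.

```python
import math

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO/DAPO-style clipped term for one token: min(r*A, clip(r)*A).

    eps_low/eps_high decouple the lower and upper clip bounds, as in
    DAPO; the specific values here are illustrative assumptions.
    """
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)

def normalized_advantages(rewards):
    """Normalized advantages: (r - mean) / std over a sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

The `min` makes the update pessimistic: a ratio far outside the clip band contributes no extra gradient when the advantage is positive, but still incurs the full penalty when it is negative.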
This multi-round retrieval and memory-updating process is critical for solving long-horizon, multi-hop, and cross-modal questions that require accumulating evidence and reasoning beyond any single snapshot or retrieval.
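The loop above can be made concrete as a small Python sketch. The tool names `search_node` and `search_clip` come from the paper, but their signatures, the memory record layout, and the policy interface are all assumptions.

```python
def search_node(memory, query):
    """Retrieve entity/semantic records whose text mentions the query."""
    return [m for m in memory if query.lower() in str(m).lower()]

def search_clip(memory, t_start, t_end):
    """Retrieve episodic events whose timestamp lies in [t_start, t_end]."""
    return [m for m in memory if t_start <= m.get("time", -1) <= t_end]

def reason(policy, memory, instruction, max_turns=8):
    """Multi-turn loop: emit [Search] actions until an [Answer] appears."""
    history = [("task", instruction)]
    for _ in range(max_turns):
        action, payload = policy(history)
        if action == "[Answer]":
            return payload
        results = search_node(memory, payload)   # tool call on the memory
        history.append(("evidence", results))    # update dialogue state
    return None  # turn budget H exhausted without an answer
```

In the trained agent the `policy` callable is the RL-optimized model π_θ conditioned on the full dialogue history; here any function that maps a history to an action tuple will do.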
4. M3-Bench: Long-Video QA Benchmark
To evaluate agent memory and reasoning, M3-Agent is assessed on M3-Bench, comprising:
- M3-Bench-robot: 100 newly recorded real-world videos from a robot's perspective (average length >30 min), annotated for testing memory retention, human understanding, and cross-modal inference.
- M3-Bench-web: 929 curated web-sourced videos across diverse scenarios.
Each video is paired with complex question–answer pairs spanning multi-detail reasoning, multi-hop reasoning, cross-modal reasoning, human understanding, and general knowledge. This benchmark is designed to challenge the agent’s ability to accumulate and utilize long-term memory for task completion.
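Since scoring relies on an automatic judge rather than exact string match, the evaluation reduces to a judged-accuracy computation; the sketch below is a hedged illustration in which `agent_answer` and `judge` (standing in for the GPT-4o evaluator) are hypothetical callables.

```python
def accuracy(items, agent_answer, judge):
    """Fraction of QA items the judge marks correct.

    `items` is a list of (question, gold_answer) pairs; `judge` stands
    in for the GPT-4o evaluator and returns True/False per item. Both
    callables are assumptions, not the benchmark's actual interface.
    """
    correct = sum(judge(q, gold, agent_answer(q)) for q, gold in items)
    return correct / len(items)
```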
5. Quantitative Results and Ablation Studies
Experiments reveal that M3-Agent, when trained with reinforcement learning, surpasses the strongest baselines (prompting agents using Gemini-1.5-Pro and GPT-4o):
- Accuracy Improvements: +6.7% (M3-Bench-robot), +7.7% (M3-Bench-web), +5.3% (VideoMME-long) over the best prompting agent.
- Ablation Findings:
- Removing semantic memory leads to 13–19% accuracy drops on question subsets.
- Disabling multi-turn reasoning or inter-turn instruction further degrades performance.
The improvements are statistically significant and are attributed to both the memory structure and the iterative retrieval process.
6. Practical Principles and Implications
M3-Agent advances the state of the art in several key dimensions:
- Entity-centric, multimodal long-term memory: Enables deep contextual understanding and robust cross-modal alignment, analogous to human cognitive faculties.
- Iterative, RL-optimized reasoning: Efficiently extracts evidence from large, temporally extended memory stores, scaling to real-world durations.
- Integrated training and evaluation protocol: The combination of M3-Bench and an automatic evaluator provides rigorous assessment of reasoning capacity in realistic settings.
These properties suggest that M3-Agent is well-suited for deployment in applications requiring real-time, context-aware perception and decision-making over extended periods, such as service robotics, digital personal assistants, and autonomous monitoring where consistency and memory robustness are indispensable.
7. Summary Table: Benchmarks and Accuracy
| Benchmark | Agent Type | Accuracy Improvement |
|---|---|---|
| M3-Bench-robot | Reinforced M3-Agent | +6.7% |
| M3-Bench-web | Reinforced M3-Agent | +7.7% |
| VideoMME-long | Reinforced M3-Agent | +5.3% |
All improvements are compared to the strongest prompting baselines (Gemini-1.5-Pro, GPT-4o).
M3-Agent sets a precedent in multimodal agent architectures with long-term memory and iterative reasoning, demonstrating empirically that human-like memory formation and retrieval mechanisms substantially enhance performance in complex, temporally extended question answering tasks (Long et al., 13 Aug 2025).