M3-Agent: Multimodal Cognitive Framework
- M3-Agent is a multimodal system that integrates visual and auditory inputs to build episodic and semantic memory, enabling robust cross-modal reasoning.
- It employs a dual-process architecture that separates continual memory formation from a control process that supports iterative, reinforcement learning–optimized task solving.
- Empirical evaluations on M3-Bench indicate significant accuracy improvements over state-of-the-art prompting agents, emphasizing the benefits of long-horizon memory retention and multi-turn retrieval.
M3-Agent is a multimodal agent framework equipped with long-term memory, capable of performing real-time perception, memory formation, and iterative reasoning. The architecture is designed to emulate aspects of human cognition—specifically, the processes of seeing, listening, remembering, and reasoning—in an autonomous computational agent that operates over extended temporal horizons in dynamic, real-world environments (Long et al., 13 Aug 2025).
1. Architectural Principles
The M3-Agent architecture operates via two parallel modules: the memorization process and the control process.
- Memorization Process: This module accepts a continuous stream of visual and auditory inputs, extracting both fine-grained event representations and high-level semantic knowledge. Episodic memory is constructed from atomic multimodal events (e.g., physical actions, spoken utterances, environmental state changes). Semantic memory abstracts over these, capturing relationships, world knowledge, character preferences, and entities’ roles. External tools such as facial recognition and speaker identification are integrated, supporting robust entity association.
- Control Process: When a task instruction is received, the control process leverages the long-term memory store for iterative, multi-turn reasoning. Specialized search operators (e.g., `search_node`, `search_clip`) enable retrieval of relevant entities, time segments, or contextual knowledge. The process is fully autonomous, with no human in the loop: it iteratively calls search tools and updates the dialogue state, terminating only when an answer is produced.
This separation allows for scalable memory accumulation and principled interaction between perception, memory, and task-solving policies.
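A minimal sketch of the memorization side is given below. The record layout, the `identify_speaker` hook standing in for the external face-recognition and speaker-identification tools, and the co-occurrence-based abstraction are all illustrative assumptions; the paper's actual pipeline uses learned models for these steps.

```python
def extract_episodic(clip, identify_speaker):
    """Turn one raw multimodal clip into an atomic episodic event.

    `identify_speaker` stands in for external tools (face recognition,
    speaker ID) that resolve who acted or spoke; it is an assumption here.
    """
    return {
        "kind": "episodic",
        "entity": identify_speaker(clip),
        "content": clip["transcript"],
        "time": clip["t"],
    }

def abstract_semantic(episodic_events):
    """Derive coarse semantic facts: which entity is associated with what.

    A real system would abstract preferences and relations with an LLM;
    this stub just aggregates each entity's accumulated evidence.
    """
    facts = {}
    for e in episodic_events:
        facts.setdefault(e["entity"], []).append(e["content"])
    return [{"kind": "semantic", "entity": k, "evidence": v}
            for k, v in facts.items()]
```

The key design point carried over from the architecture is that episodic records are written continuously, while semantic records are periodically abstracted over them rather than extracted directly from the stream.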
2. Multimodal Long-Term Memory: Structure and Representation
M3-Agent maintains its memory as a multimodal, entity-centric graph:
- Nodes: Encapsulate entity attributes acquired through diverse modalities (faces, voices, textual utterances).
- Edges: Encode temporal, relational, and semantic connections between entities and events.
Two main memory types are maintained:
- Episodic Memory: Concrete multimodal events. For instance, an entry might reflect “Alice takes coffee and says: ‘I can't go without this in the morning.’”
- Semantic Memory: Abstracted knowledge extracted from episodic data, e.g., “Alice prefers coffee in the morning,” relationships between subjects, roles, and scene-level attributes.
Entity-centric linkage (e.g., connecting a face node and a voice node to the same person) ensures multimodal consistency for cross-modal reasoning, supporting robust retrieval and inference over long time spans.
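A minimal sketch of such an entity-centric graph, assuming simple node and edge records (the paper does not specify a concrete schema, so all names here are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    modality: str            # e.g. "face", "voice", "entity", "text"
    attributes: dict = field(default_factory=dict)

@dataclass
class MemoryGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src, relation, dst)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def link(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation=None):
        # Follow outgoing edges, optionally filtered by relation type.
        return [d for s, r, d in self.edges
                if s == node_id and (relation is None or r == relation)]

# Cross-modal entity linkage: a face node and a voice node both
# resolve to the same person node, so evidence from either modality
# can be attributed to "Alice".
g = MemoryGraph()
g.add_node(Node("face_1", "face"))
g.add_node(Node("voice_1", "voice"))
g.add_node(Node("person_alice", "entity", {"name": "Alice"}))
g.link("face_1", "same_entity", "person_alice")
g.link("voice_1", "same_entity", "person_alice")
g.link("person_alice", "prefers", "coffee_in_morning")
```

With this linkage, a retrieval that starts from a voice match and one that starts from a face match converge on the same entity node, which is what makes cross-modal questions answerable.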
3. Iterative, Multi-Turn Reasoning and Retrieval
The reasoning strategy is multi-turn and policy-driven:
- The policy model π_θ starts with a system prompt and the task instruction.
- At each turn, the model generates either a `[Search]` action (invoking search tools to query the memory graph) or an `[Answer]` action (terminating the loop).
- Algorithmically, the scheme is:
```
for i = 0 to H-1 do
    τ_i ← π_θ(dialogue history)
    if τ_i indicates "[Search]" then
        memory ← Search(MEM, query)
        append memory + prompt to dialogue history
    else
        break          // "[Answer]" was produced
end for
```
- Each trajectory is scored by an automatic judge (a GPT-4o evaluator); the reinforcement learning update optimizes the policy via DAPO (a PPO variant), maximizing a clipped surrogate objective of the form E[min(r(θ)·Â, clip(r(θ), 1−ε_low, 1+ε_high)·Â)], where r(θ) is the policy probability ratio and Â is the normalized advantage.
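The clipped update can be illustrated numerically. The asymmetric clip bounds below follow DAPO's decoupled-clipping idea, but the specific ε values and the group-normalization scheme are illustrative assumptions, not taken from the paper.

```python
import math

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO/DAPO-style clipped term for one token: min(r*A, clip(r)*A).

    eps_low/eps_high decouple the lower and upper clip bounds, as in
    DAPO; the specific values here are illustrative assumptions.
    """
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)

def normalized_advantages(rewards):
    """Normalized advantages: (r - mean) / std over a sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

The `min` makes the update pessimistic: a ratio far outside the clip band contributes no extra gradient when the advantage is positive, but still incurs the full penalty when it is negative.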
This multi-round retrieval and memory-updating process is critical for solving long-horizon, multi-hop, and cross-modal questions that require accumulating evidence and reasoning beyond any single snapshot or retrieval.
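The loop above can be made concrete as a small Python sketch. The tool names `search_node` and `search_clip` come from the paper, but their signatures, the memory record layout, and the policy interface are all assumptions.

```python
def search_node(memory, query):
    """Retrieve entity/semantic records whose text mentions the query."""
    return [m for m in memory if query.lower() in str(m).lower()]

def search_clip(memory, t_start, t_end):
    """Retrieve episodic events whose timestamp lies in [t_start, t_end]."""
    return [m for m in memory if t_start <= m.get("time", -1) <= t_end]

def reason(policy, memory, instruction, max_turns=8):
    """Multi-turn loop: emit [Search] actions until an [Answer] appears."""
    history = [("task", instruction)]
    for _ in range(max_turns):
        action, payload = policy(history)
        if action == "[Answer]":
            return payload
        results = search_node(memory, payload)   # tool call on the memory
        history.append(("evidence", results))    # update dialogue state
    return None  # turn budget H exhausted without an answer
```

In the trained agent the `policy` callable is the RL-optimized model π_θ conditioned on the full dialogue history; here any function that maps a history to an action tuple will do.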
4. M3-Bench: Long-Video QA Benchmark
To evaluate agent memory and reasoning, M3-Agent is assessed on M3-Bench, comprising:
- M3-Bench-robot: 100 newly recorded real-world videos from a robot's perspective (average length >30 min), annotated for testing memory retention, human understanding, and cross-modal inference.
- M3-Bench-web: 929 curated web-sourced videos across diverse scenarios.
Each video is paired with complex question–answer pairs spanning multi-detail reasoning, multi-hop reasoning, cross-modal reasoning, human understanding, and general knowledge. This benchmark is designed to challenge the agent’s ability to accumulate and utilize long-term memory for task completion.
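Since scoring relies on an automatic judge rather than exact string match, the evaluation reduces to a judged-accuracy computation; the sketch below is a hedged illustration in which `agent_answer` and `judge` (standing in for the GPT-4o evaluator) are hypothetical callables.

```python
def accuracy(items, agent_answer, judge):
    """Fraction of QA items the judge marks correct.

    `items` is a list of (question, gold_answer) pairs; `judge` stands
    in for the GPT-4o evaluator and returns True/False per item. Both
    callables are assumptions, not the benchmark's actual interface.
    """
    correct = sum(judge(q, gold, agent_answer(q)) for q, gold in items)
    return correct / len(items)
```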
5. Quantitative Results and Ablation Studies
Experiments reveal that M3-Agent, when trained with reinforcement learning, surpasses the strongest baselines (prompting agents using Gemini-1.5-Pro and GPT-4o):
- Accuracy Improvements: +6.7% (M3-Bench-robot), +7.7% (M3-Bench-web), +5.3% (VideoMME-long) over the best prompting agent.
- Ablation Findings:
- Removing semantic memory leads to 13–19% accuracy drops on question subsets.
- Disabling multi-turn reasoning or inter-turn instruction further degrades performance.
The improvements are statistically significant and are attributed to both the memory structure and the iterative retrieval process.
6. Practical Principles and Implications
M3-Agent advances the state of the art in several key dimensions:
- Entity-centric, multimodal long-term memory: Enables deep contextual understanding and robust cross-modal alignment, analogous to human cognitive faculties.
- Iterative, RL-optimized reasoning: Efficiently extracts evidence from large, temporally extended memory stores, scaling to real-world durations.
- Integrated training and evaluation protocol: The combination of M3-Bench and an automatic evaluator provides rigorous assessment of reasoning capacity in realistic settings.
These properties suggest that M3-Agent is well-suited for deployment in applications requiring real-time, context-aware perception and decision-making over extended periods, such as service robotics, digital personal assistants, and autonomous monitoring where consistency and memory robustness are indispensable.
7. Summary Table: Benchmarks and Accuracy
| Benchmark | Agent Type | Accuracy Improvement |
|---|---|---|
| M3-Bench-robot | Reinforced M3-Agent | +6.7% |
| M3-Bench-web | Reinforced M3-Agent | +7.7% |
| VideoMME-long | Reinforced M3-Agent | +5.3% |
All improvements are compared to the strongest prompting baselines (Gemini-1.5-Pro, GPT-4o).
M3-Agent sets a precedent in multimodal agent architectures with long-term memory and iterative reasoning, demonstrating empirically that human-like memory formation and retrieval mechanisms substantially enhance performance in complex, temporally extended question answering tasks (Long et al., 13 Aug 2025).