Unified multimodal memory storage and cross-modal retrieval for agents

Develop a unified multimodal memory storage framework for LLM-driven agents that supports text, images, audio, and video, and design cross-modal retrieval and reasoning mechanisms that operate over this unified store to enable seamless multimodal memory utilization.

Background

The paper introduces agent skills as modular capability extensions that encapsulate instructions, scripts, and resources into reusable units, enabling general-purpose agents to acquire domain-specific expertise at runtime. While early efforts propose standardized memory operation languages and shared knowledge platforms to promote portability and interoperability, the authors emphasize that the field remains nascent and faces several critical challenges.

A central unresolved issue is the lack of a unified storage framework that can natively support multimodal information across text, images, audio, and video, coupled with the need for cross-modal retrieval and reasoning mechanisms. Current memory systems are predominantly designed for textual modalities, which limits seamless integration, retrieval, and utilization of multimodal memories in complex, real-world environments.
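One common way to realize such a unified store, not prescribed by the paper, is to project every modality into a single shared embedding space (as CLIP-style encoders do for text and images) and retrieve across modalities by vector similarity. The sketch below assumes embeddings are produced elsewhere by per-modality encoders; the class name `UnifiedMemoryStore` and its interface are illustrative, not from the source.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so dot products equal cosine similarity.
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

class UnifiedMemoryStore:
    """Toy unified multimodal memory: every item, regardless of modality,
    lives as a vector in one shared embedding space plus metadata."""

    def __init__(self):
        self.items = []  # list of (unit_vector, modality, payload)

    def add(self, embedding, modality, payload):
        self.items.append((_normalize(embedding), modality, payload))

    def retrieve(self, query_embedding, k=3, modalities=None):
        """Cross-modal retrieval: rank all stored items (optionally filtered
        by modality) against the query vector by cosine similarity."""
        q = _normalize(query_embedding)
        scored = [
            (sum(a * b for a, b in zip(vec, q)), modality, payload)
            for vec, modality, payload in self.items
            if modalities is None or modality in modalities
        ]
        scored.sort(key=lambda t: -t[0])
        return scored[:k]

# Usage with hand-made 4-d embeddings standing in for real encoder outputs:
store = UnifiedMemoryStore()
store.add([1.0, 0.0, 0.0, 0.0], "text", "note about a red car")
store.add([0.9, 0.2, 0.0, 0.0], "image", "photo of a red car")
store.add([0.0, 0.0, 1.0, 0.0], "audio", "engine sound clip")
hits = store.retrieve([1.0, 0.0, 0.0, 0.0], k=2)
```

A textual query here surfaces both a text and an image memory because they are near each other in the shared space; this is the "seamless integration" property the open question asks for, and the hard part in practice is training encoders so that semantically related items across modalities actually land close together.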

References

Although these works demonstrate the potential of this research direction, the field remains in its nascent stage, with numerous critical challenges awaiting further investigation:

(1) Unified storage and representation of multimodal information: Current memory systems are predominantly designed for textual modalities. How to construct a unified storage framework that supports multimodal information encompassing text, images, audio, and video, while simultaneously designing cross-modal retrieval and reasoning mechanisms, remains an open research question.

(2) Cross-agent skill transfer and adaptation mechanisms: Different agent architectures, such as those built upon distinct foundation models, exhibit variations in capability characteristics and interface specifications. Designing a universal skill description language, along with an adaptation layer that enables skill modules to be seamlessly transferred and reused across heterogeneous agents, constitutes a critical challenge for realizing a genuine skill-sharing ecosystem.
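To make the second challenge concrete, a universal skill description could be a model-agnostic record, with per-architecture adapters translating it into each agent family's native tool schema. The sketch below is purely illustrative: `SkillDescriptor`, `adapt_for_agent`, and the single OpenAI-style target are assumptions, not an interface from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class SkillDescriptor:
    """Model-agnostic skill description: what the skill does and the
    interface it expects, independent of any agent framework."""
    name: str
    description: str
    inputs: dict   # parameter name -> JSON-schema type name, e.g. "string"
    outputs: dict  # result field name -> type name
    resources: list = field(default_factory=list)  # scripts, files, prompts

def adapt_for_agent(skill: SkillDescriptor, interface: str) -> dict:
    """Adaptation layer: translate the universal descriptor into the
    function-calling schema of a specific agent family."""
    if interface == "openai_tools":  # OpenAI-style tool/function schema
        return {
            "type": "function",
            "function": {
                "name": skill.name,
                "description": skill.description,
                "parameters": {
                    "type": "object",
                    "properties": {k: {"type": v} for k, v in skill.inputs.items()},
                    "required": list(skill.inputs),
                },
            },
        }
    raise NotImplementedError(f"no adapter for interface {interface!r}")

# One descriptor, reusable across agents via per-interface adapters:
skill = SkillDescriptor(
    name="search_memory",
    description="Search the agent's multimodal memory store",
    inputs={"query": "string"},
    outputs={"results": "array"},
)
schema = adapt_for_agent(skill, "openai_tools")
```

The design choice mirrors compiler backends: the descriptor is the portable intermediate form, and each adapter handles one target's interface conventions, so adding support for a new agent architecture means writing one new adapter rather than rewriting every skill.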

AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents  (2512.23343 - Liang et al., 29 Dec 2025) in Section 7.2 (Agent Skills)