Evaluating Long-Term Interactive Memory in Chat Assistants: A Detailed Examination
The paper introduces LongMemEval, a benchmark designed to evaluate the long-term memory capabilities of chat assistants. It assesses five crucial abilities: information extraction, cross-session reasoning, temporal reasoning, knowledge updates, and abstention, which together represent the core components expected of long-term memory systems in conversational AI.
Key Components of the Benchmark
LongMemEval consists of 500 high-quality questions organized around these five core memory abilities. The questions are embedded in simulated user-assistant chat histories with extensible context lengths; the benchmark ships in several configurations, with contexts reaching up to 1.5 million tokens. Preliminary results show a significant challenge for current memory systems, with long-context LLMs experiencing as much as a 60% accuracy drop depending on the configuration used.
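To make the setup concrete, the sketch below shows one plausible way to represent a single benchmark instance: a question with its answer, the haystack of timestamped chat sessions it is embedded in, and pointers to the evidence sessions. The field names are illustrative assumptions for exposition, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class ChatTurn:
    """One user or assistant message inside a session."""
    role: str          # "user" or "assistant"
    content: str


@dataclass
class LongMemoryInstance:
    """Illustrative structure for a single benchmark instance.

    Field names are assumptions for exposition; the released dataset
    may use different keys.
    """
    question_id: str
    question_type: str                  # e.g. "cross-session", "temporal-reasoning",
                                        # "knowledge-update", "abstention"
    question: str
    answer: str
    haystack_sessions: list[list[ChatTurn]] = field(default_factory=list)
    haystack_dates: list[str] = field(default_factory=list)   # per-session timestamps
    evidence_session_ids: list[int] = field(default_factory=list)
```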
Notable Findings and Memory System Analysis
Using LongMemEval, the paper identifies significant performance gaps in existing memory-augmented chat assistants. Commercial solutions and state-of-the-art LLMs show noticeable deficiencies, especially on tasks that require synthesizing information across multiple sessions or integrating temporal and updated knowledge into the reasoning process.
The evaluation results indicate that, despite recent advances, the major obstacle for current systems is reliably retrieving and integrating long-term information, which is crucial for a personalized user experience. Existing systems often struggle with dynamically changing information and fail to accurately track and incorporate evolving user knowledge.
Proposed Optimizations for Memory-Augmented Systems
The paper proposes a unified framework for memory-augmented chat assistants, structured around three stages—indexing, retrieval, and reading. Key innovations include:
- Session Decomposition: Storing interactions as individual rounds rather than whole sessions, giving the index finer-grained units for retrieval (see the indexing and retrieval sketch after this list).
- Fact-Augmented Key Expansion: Augmenting each round's index keys with user facts extracted from it, enabling more targeted retrieval of memory.
- Time-Aware Query and Retrieval: Using temporal metadata attached to queries and memories to narrow the retrieval scope for temporal reasoning questions.
- Advanced Reading Strategies: Applying techniques such as Chain-of-Note, which drafts a brief note on each retrieved item before answering, and structured prompt formats that improve the extraction and reasoning stages (see the reading sketch after this list).
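As a rough illustration of how the first three ideas could fit together, the sketch below indexes each round under both its raw text and its extracted user facts, attaches the session timestamp as metadata, and filters candidates by a time range before similarity ranking. It is a minimal sketch under assumed interfaces: `embed` is a toy bag-of-words stand-in for a dense encoder, and the fact list is assumed to come from a separate extraction step; none of these names are the paper's reference implementation.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class MemoryEntry:
    keys: list[str]    # the round text plus extracted user facts (expanded keys)
    value: str         # the full round, returned to the reading stage
    timestamp: date    # session-level temporal metadata


def embed(text: str) -> dict[str, float]:
    """Toy bag-of-words embedding; a real system would use a dense encoder."""
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def index_round(round_text: str, user_facts: list[str], session_date: date) -> MemoryEntry:
    """Session decomposition + fact-augmented key expansion:
    one entry per round, keyed by the round itself and by each extracted fact."""
    return MemoryEntry(keys=[round_text, *user_facts], value=round_text, timestamp=session_date)


def retrieve(query: str, memory: list[MemoryEntry], k: int = 3,
             time_range: tuple[date, date] | None = None) -> list[str]:
    """Time-aware retrieval: narrow the scope by timestamp, then rank each entry
    by its best-matching key."""
    q = embed(query)
    candidates = [m for m in memory
                  if time_range is None or time_range[0] <= m.timestamp <= time_range[1]]
    scored = [(max(cosine(q, embed(key)) for key in m.keys), m.value) for m in candidates]
    return [value for _, value in sorted(scored, reverse=True)[:k]]
```

In practice, a query-side step would decide when to supply a `time_range` (for questions like "what did I do last May?"); otherwise retrieval falls back to plain similarity search over the expanded keys.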
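For the reading stage, a Chain-of-Note style prompt asks the model to jot a short note on each retrieved excerpt before producing an answer, which also gives it a natural place to abstain. The wording below is an illustrative approximation, not the paper's exact template.

```python
def build_reading_prompt(question: str, retrieved_rounds: list[str]) -> str:
    """Chain-of-Note style reading prompt: per-item notes first, then an answer.
    The wording is illustrative, not the paper's exact template."""
    items = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(retrieved_rounds))
    return (
        "You are answering a question about a user's past conversations.\n"
        f"Retrieved history excerpts:\n{items}\n\n"
        "First, write a brief note for each excerpt stating whether it is relevant "
        "to the question and what it contributes (including any dates).\n"
        "Then, using only the relevant notes, answer the question. "
        "If the history does not contain the answer, say you do not know.\n\n"
        f"Question: {question}\nNotes:\n"
    )
```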
These developments aim to increase both the effectiveness of long-term memory retrieval and downstream task performance. Practical implementations of these strategies improve recall by 4% and accuracy by up to 11% on temporal reasoning tasks.
Implications and Future Directions
The research presents a comprehensive benchmark that not only serves as a tool for evaluating and training AI systems but also marks a significant step toward understanding the complex requirements of long-term interaction in conversational applications. By providing holistic coverage of memory abilities, LongMemEval facilitates the development and testing of more advanced AI systems equipped to handle personalized conversation over extended periods.
The findings and innovations in this paper underline the need for continued exploration of efficient memory mechanisms that can maintain user context over long periods, motivating new lines of research into scalable memory architectures and integration strategies. Future work points toward highly personalized, context-aware, and memory-efficient conversational agents that can operate reliably in dynamic real-world scenarios. The public release of the benchmark promises to foster further progress and contribute to the evolution of conversational AI with robust long-term memory.