
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory (2410.10813v1)

Published 14 Oct 2024 in cs.CL

Abstract: Recent LLM-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into four design choices across the indexing, retrieval, and reading stages. Built upon key experimental insights, we propose several memory designs including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experiment results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.

Evaluating Long-Term Interactive Memory in Chat Assistants: A Detailed Examination

The paper introduces LongMemEval, a benchmark designed to evaluate the long-term memory capabilities of chat assistants. The benchmark assesses five crucial abilities, namely information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention, which collectively represent the core components of the long-term memory systems desired in conversational AI.

Key Components of the Benchmark

LongMemEval consists of 500 high-quality questions organized around the five core memory abilities. The questions are embedded in simulated user-assistant chat histories with extensible context lengths, and the benchmark offers configurations with contexts reaching up to 1.5 million tokens. Preliminary results show a significant challenge for current memory systems, with long-context LLMs experiencing as much as a 60% accuracy drop depending on the configuration used.
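
To make the task format concrete, the following is a minimal sketch of how a LongMemEval-style instance might be consumed by a naive long-context baseline. The field names (haystack_sessions, question, answer, question_date) and the ask_model callable are illustrative assumptions, not the benchmark's released schema.

```python
# A minimal sketch (not the released benchmark code) of evaluating a model
# on a LongMemEval-style instance. Field names and ask_model are assumptions.

def build_history_prompt(sessions):
    """Flatten timestamped chat sessions into one long context string."""
    lines = []
    for session in sessions:
        lines.append(f"[Session on {session['date']}]")
        for turn in session["turns"]:
            lines.append(f"{turn['role']}: {turn['content']}")
    return "\n".join(lines)

def evaluate_long_context(instances, ask_model):
    """Feed the entire chat history plus the question to the model and score
    answers with a simple substring match (a stand-in for the paper's
    LLM-based answer judging)."""
    correct = 0
    for ex in instances:
        context = build_history_prompt(ex["haystack_sessions"])
        prompt = (
            f"{context}\n\n"
            f"Current date: {ex['question_date']}\n"
            f"Question: {ex['question']}\nAnswer:"
        )
        prediction = ask_model(prompt)
        correct += int(ex["answer"].lower() in prediction.lower())
    return correct / len(instances)
```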

Notable Findings and Memory System Analysis

Through the use of LongMemEval, the paper identifies significant performance gaps in existing memory-augmented chat assistant systems. Commercial solutions and state-of-the-art LLMs exhibit noticeable deficiencies, especially on tasks that require synthesizing information across multiple sessions or integrating temporal and updated knowledge into the reasoning process.

The evaluation results indicate that, despite recent advances, the major obstacle for current systems is the unreliable retrieval and integration of long-term information, which is crucial for a personalized user experience. Existing systems often struggle with dynamically changing information and fail to accurately track and incorporate evolving user knowledge.

Proposed Optimizations for Memory-Augmented Systems

The paper proposes a unified framework for memory-augmented chat assistants, structured around three stages: indexing, retrieval, and reading. Key innovations include:

  1. Session Decomposition: Storing interactions as rounds rather than sessions to improve granularity and retrieval efficiency.
  2. Fact-Augmented Key Expansion: Leveraging extracted user facts to enhance indexing, aiding in a more targeted retrieval of memory.
  3. Time-Aware Query Expansion: Using temporal metadata to narrow the retrieval scope for questions that require temporal reasoning.
  4. Advanced Reading Strategies: Applying techniques such as Chain-of-Note, which processes retrieved information step by step, together with structured prompt formats to improve extraction and reasoning.

These optimizations aim to improve both the effectiveness of long-term memory retrieval and downstream task performance. Practical implementations of these strategies increase recall by around 4% and improve accuracy on temporal reasoning tasks by up to 11%.
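
To make these stages concrete, the sketch below shows one way the three-stage pipeline could be assembled: rounds are indexed with fact-augmented keys, retrieval is restricted by an inferred time window, and reading follows a Chain-of-Note-style two-step prompt. The helpers embed, extract_facts, parse_time_range, and llm are hypothetical placeholders; the code is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def index_history(sessions, embed, extract_facts):
    """Session decomposition + fact-augmented key expansion: store each round
    as a value, keyed by its text plus facts extracted from it."""
    keys, values = [], []
    for session in sessions:
        for rnd in session["turns"]:
            text = f"{rnd['role']}: {rnd['content']}"
            facts = extract_facts(text)  # hypothetical fact-extraction prompt
            keys.append(embed(text + " " + " ".join(facts)))
            values.append({"text": text, "date": session["date"]})
    return np.stack(keys), values

def retrieve(question, question_date, keys, values, embed, parse_time_range, k=10):
    """Time-aware query expansion: restrict the search scope to the inferred
    time window, then rank the remaining rounds by embedding similarity."""
    window = parse_time_range(question, question_date)  # (start, end) or None
    candidates = [i for i, v in enumerate(values)
                  if window is None or window[0] <= v["date"] <= window[1]]
    scores = keys[candidates] @ embed(question)
    ranked = sorted(zip(scores.tolist(), candidates), reverse=True)[:k]
    return [values[i] for _, i in ranked]

def answer(question, retrieved, llm):
    """Chain-of-Note-style reading: take notes on each retrieved round first,
    then answer the question from the notes."""
    notes = "\n".join(
        llm(f"Summarize what this round says about the user:\n{r['text']}")
        for r in retrieved
    )
    return llm(f"Notes:\n{notes}\n\nQuestion: {question}\nAnswer:")
```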

Implications and Future Directions

The research presents a comprehensive benchmark that not only serves as a tool for evaluating and training AI systems but also marks a significant step toward understanding the complex requirements of long-term interaction in conversational applications. By providing holistic coverage of memory abilities, LongMemEval facilitates the development and testing of more advanced AI systems equipped to handle personalized conversation over extended periods.

The findings and innovations in this paper underline the need for continued exploration of efficient memory mechanisms that can maintain user context over long periods, motivating new lines of research on scalable memory architectures and integration strategies. Future work points toward highly personalized, context-aware, and memory-efficient conversational agents that operate reliably in dynamic real-world settings. The public release of the benchmark promises to foster further progress and contribute to the evolution of conversational AI with robust long-term memory.

References (48)
  1. Phi-3 technical report: A highly capable language model locally on your phone. CoRR, abs/2404.14219, 2024. doi: 10.48550/ARXIV.2404.14219. URL https://doi.org/10.48550/arXiv.2404.14219.
  2. Make your LLM fully utilize the context. CoRR, abs/2404.16811, 2024. doi: 10.48550/ARXIV.2404.16811. URL https://doi.org/10.48550/arXiv.2404.16811.
  3. Longformer: The long-document transformer. arXiv:2004.05150, 2020.
  4. Walking down the memory maze: Beyond context limit through interactive reading. CoRR, abs/2310.05029, 2023a. doi: 10.48550/ARXIV.2310.05029. URL https://doi.org/10.48550/arXiv.2310.05029.
  5. Dense X retrieval: What retrieval granularity should we use? CoRR, abs/2312.06648, 2023b. doi: 10.48550/ARXIV.2312.06648. URL https://doi.org/10.48550/arXiv.2312.06648.
  6. Adapting language models to compress contexts. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp.  3829–3846. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.EMNLP-MAIN.232. URL https://doi.org/10.18653/v1/2023.emnlp-main.232.
  7. Coze. Memory overview guide. https://www.coze.com/docs/guides/memory_overview?_lang=en, 2024. Accessed: September 15, 2024.
  8. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
  9. Perltqa: A personal long-term memory dataset for memory classification, retrieval, and synthesis in question answering. CoRR, abs/2402.16288, 2024. doi: 10.48550/ARXIV.2402.16288. URL https://doi.org/10.48550/arXiv.2402.16288.
  10. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  11. Improving retrieval of short texts through document expansion. In William R. Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson (eds.), The 35th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, 2012, pp.  911–920. ACM, 2012. doi: 10.1145/2348283.2348405. URL https://doi.org/10.1145/2348283.2348405.
  12. Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=TaAqeo7lUh.
  13. Hipporag: Neurobiologically inspired long-term memory for large language models. CoRR, abs/2405.14831, 2024. doi: 10.48550/ARXIV.2405.14831. URL https://doi.org/10.48550/arXiv.2405.14831.
  14. Unsupervised dense information retrieval with contrastive learning. Trans. Mach. Learn. Res., 2022, 2022. URL https://openreview.net/forum?id=jKN1pXi7b0.
  15. Llmlingua: Compressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp.  13358–13376. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.EMNLP-MAIN.825. URL https://doi.org/10.18653/v1/2023.emnlp-main.825.
  16. Gregory Kamradt. Needle in a haystack - pressure testing llms. GitHub, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack.
  17. Dialsim: A real-time simulator for evaluating long-term dialogue understanding of conversational agents, 2024. URL https://arxiv.org/abs/2406.13144.
  18. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
  19. Hello again! llm-powered personalized agent for long-term dialogue. CoRR, abs/2406.05925, 2024. doi: 10.48550/ARXIV.2406.05925. URL https://doi.org/10.48550/arXiv.2406.05925.
  20. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URL https://doi.org/10.1162/tacl_a_00638.
  21. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  2511–2522, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.153. URL https://aclanthology.org/2023.emnlp-main.153.
  22. Evaluating very long-term conversational memory of LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  13851–13870, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.747. URL https://aclanthology.org/2024.acl-long.747.
  23. Microsoft. Announcing microsoft copilot, your everyday ai companion, 2023. URL https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/. Accessed: September 15, 2024.
  24. Learning to compress prompts with gist tokens. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/3d77c6dcc7f143aa2154e7f4d5e22d68-Abstract-Conference.html.
  25. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148.
  26. OpenAI. Chatgpt, 2022. URL https://chat.openai.com/chat. Accessed: September 15, 2024.
  27. OpenAI. Memory and new controls for chatgpt. https://openai.com/index/memory-and-new-controls-for-chatgpt/, 2024. Accessed: September 15, 2024.
  28. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, 2009. doi: 10.1561/1500000019. URL https://doi.org/10.1561/1500000019.
  29. RAPTOR: recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=GN921JHCRw.
  30. Large language models can be easily distracted by irrelevant context. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  31210–31227. PMLR, 2023. URL https://proceedings.mlr.press/v202/shi23a.html.
  31. REPLUG: retrieval-augmented black-box language models. In Kevin Duh, Helena Gómez-Adorno, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pp.  8371–8384. Association for Computational Linguistics, 2024a. doi: 10.18653/V1/2024.NAACL-LONG.463. URL https://doi.org/10.18653/v1/2024.naacl-long.463.
  32. REPLUG: Retrieval-augmented black-box language models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.  8371–8384, Mexico City, Mexico, June 2024b. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.463. URL https://aclanthology.org/2024.naacl-long.463.
  33. Language model information retrieval with document expansion. In Robert C. Moore, Jeff Bilmes, Jennifer Chu-Carroll, and Mark Sanderson (eds.), Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp.  407–414, New York City, USA, June 2006. Association for Computational Linguistics. URL https://aclanthology.org/N06-1052.
  34. Augmenting language models with long-term memory. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/ebd82705f44793b6f9ade5a669d0f0bf-Abstract-Conference.html.
  35. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 36, 2024.
  36. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
  37. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022.
  38. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  6268–6278, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.385. URL https://aclanthology.org/2023.emnlp-main.385.
  39. RECOMP: improving retrieval-augmented lms with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=mlJLVigNHp.
  40. Beyond goldfish memory: Long-term open-domain conversation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  5180–5197, Dublin, Ireland, May 2022a. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.356. URL https://aclanthology.org/2022.acl-long.356.
  41. Long time no see! open-domain conversation with long-term persona memory. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp.  2639–2650, Dublin, Ireland, May 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.207. URL https://aclanthology.org/2022.findings-acl.207.
  42. Did you read the instructions? rethinking the effectiveness of task definitions in instruction learning. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  3063–3079. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.ACL-LONG.172. URL https://doi.org/10.18653/v1/2023.acl-long.172.
  43. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023.
  44. Dun Zhang. STELLA EN 1.5B v5. https://huggingface.co/dunzhang/stella_en_1.5B_v5, 2023. Accessed: September 15, 2024.
  45. Cognitive kernel: An open-source agent system towards generalist autopilots, 2024. URL https://arxiv.org/abs/2409.10277.
  46. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
  47. Memorybank: Enhancing large language models with long-term memory. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan (eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, pp.  19724–19731. AAAI Press, 2024. doi: 10.1609/AAAI.V38I17.29946. URL https://doi.org/10.1609/aaai.v38i17.29946.
  48. Training language models with memory augmentation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  5657–5673, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.382. URL https://aclanthology.org/2022.emnlp-main.382.
Authors (6)
  1. Di Wu (477 papers)
  2. Hongwei Wang (150 papers)
  3. Wenhao Yu (139 papers)
  4. Yuwei Zhang (48 papers)
  5. Kai-Wei Chang (292 papers)
  6. Dong Yu (328 papers)