Evaluating Very Long-Term Conversational Memory of LLM Agents (2402.17753v1)

Published 27 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Existing works on long-term open-domain dialogues focus on evaluating model responses within contexts spanning no more than five chat sessions. Despite advancements in long-context LLMs and retrieval augmented generation (RAG) techniques, their efficacy in very long-term dialogues remains unexplored. To address this research gap, we introduce a machine-human pipeline to generate high-quality, very long-term dialogues by leveraging LLM-based agent architectures and grounding their dialogues on personas and temporal event graphs. Moreover, we equip each agent with the capability of sharing and reacting to images. The generated conversations are verified and edited by human annotators for long-range consistency and grounding to the event graphs. Using this pipeline, we collect LoCoMo, a dataset of very long-term conversations, each encompassing 300 turns and 9K tokens on avg., over up to 35 sessions. Based on LoCoMo, we present a comprehensive evaluation benchmark to measure long-term memory in models, encompassing question answering, event summarization, and multi-modal dialogue generation tasks. Our experimental results indicate that LLMs exhibit challenges in understanding lengthy conversations and comprehending long-range temporal and causal dynamics within dialogues. Employing strategies like long-context LLMs or RAG can offer improvements but these models still substantially lag behind human performance.

Evaluating Long-Term Memory Capabilities in LLMs through Extensive Conversational Analysis

Introduction

LLMs have demonstrated remarkable capabilities in generating human-like text across a range of applications. However, their effectiveness in handling very long-term dialogues remains relatively unexplored. To bridge this gap, we leverage LLM-based agents to generate and analyze very long-term conversations. Through the introduction of the LoCoMo dataset, which consists of dialogues far exceeding the length and complexity of those previously studied, we establish a comprehensive benchmark for evaluating the long-term memory of conversational AI.

The LoCoMo Dataset

The LoCoMo dataset is unique in its depth and breadth, comprising 50 dialogues that average roughly 300 turns and 9,000 tokens each, spread across up to 35 sessions. Unlike existing conversational datasets, LoCoMo incorporates a multi-modal dimension with image sharing and reaction mechanisms, providing a richer context for dialogue. The dataset is generated through a novel machine-human pipeline that ensures high quality, long-range consistency, and grounding to predefined personas and temporal event graphs. These conversations closely emulate real-world interactions, making them a potent resource for studying very long-term memory in conversational agents.
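
The pipeline is described here only at a high level, but a rough sketch conveys the idea: each agent holds a persona and a temporal event graph, and every generated turn is conditioned on that persona, the events known up to the session date, and a memory of earlier sessions. The data structures, the call_llm placeholder, and the session and turn settings in the Python sketch below are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass, field


@dataclass
class Event:
    date: str                                            # ISO date, e.g. "2023-05-14"
    description: str                                     # e.g. "adopted a rescue dog"
    caused_by: list[str] = field(default_factory=list)   # ids of earlier events, if any


@dataclass
class Agent:
    name: str
    persona: str                                         # short persona statement
    events: list[Event]                                  # temporal event graph for this agent
    memory: list[str] = field(default_factory=list)      # summaries of earlier sessions


def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API; replace with a real client.
    return f"[turn conditioned on: {prompt[:60]}...]"


def run_session(a: Agent, b: Agent, session_date: str, turns: int = 6) -> list[str]:
    # Generate one chat session, grounding each turn on persona, known events, and memory.
    dialogue: list[str] = []
    for i in range(turns):
        speaker = a if i % 2 == 0 else b
        known_events = [e.description for e in speaker.events if e.date <= session_date]
        prompt = (
            f"Persona: {speaker.persona}\n"
            f"Events so far: {known_events}\n"
            f"Memory of earlier sessions: {speaker.memory[-3:]}\n"
            f"Dialogue so far: {dialogue[-4:]}\n"
            f"Reply as {speaker.name} on {session_date}:"
        )
        dialogue.append(f"{speaker.name}: {call_llm(prompt)}")
    # Both agents store a short summary of the session as long-term memory.
    for agent in (a, b):
        agent.memory.append(f"{session_date}: talked for {len(dialogue)} turns")
    return dialogue

In the actual pipeline, the generated sessions are additionally verified and edited by human annotators for long-range consistency and grounding to the event graphs; the sketch omits that step.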

Evaluation Framework

Our evaluation framework introduces three distinct tasks designed to test different facets of long-term memory and understanding within conversational models:

  1. Question Answering Task: This task assesses the model's ability to recall and integrate information across dialogues. It spans five reasoning categories: single-hop, multi-hop, temporal, open-domain knowledge, and adversarial questions (a minimal scoring sketch follows this list).
  2. Event Summarization Task: This evaluates the model's capacity to comprehend and summarize the causal and temporal dynamics depicted within the conversational event graphs.
  3. Multi-modal Dialogue Generation Task: This measures the model's proficiency in leveraging past dialogues and related context to generate consistent and relevant responses, also considering multi-modality (text and images).
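
As a concrete illustration of the question-answering task, the Python sketch below scores predicted answers with token-level F1 and aggregates the scores per reasoning category. The example schema (question, answer, category, conversation) and the choice of token-level F1 are assumptions made for demonstration; the released benchmark defines the exact answer formats and metrics.

from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    # Token-overlap F1, a common metric for short-answer QA.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(examples: list[dict], answer_fn) -> dict[str, float]:
    # Average F1 per reasoning category (single-hop, multi-hop, temporal, ...).
    per_category: dict[str, list[float]] = {}
    for ex in examples:  # assumed keys: question, answer, category, conversation
        pred = answer_fn(ex["conversation"], ex["question"])
        per_category.setdefault(ex["category"], []).append(token_f1(pred, ex["answer"]))
    return {cat: sum(scores) / len(scores) for cat, scores in per_category.items()}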

Experimental Findings

Our experimental analysis reveals several insights into the current state of LLMs in comprehending and remembering information over long dialogues. While long-context LLMs and RAG strategies show promise, particularly in improving QA performance, they still substantially fall short of human-level understanding, especially in tasks requiring sophisticated temporal reasoning and the integration of complex dialogue history. Key findings include:

  • Long-context LLMs and RAG offer improvements in QA tasks but lag significantly in areas such as adversarial questioning and event graph summarization (a minimal retrieval sketch follows this list).
  • Base LLMs struggle with maintaining consistency over lengthy dialogues, often failing to correctly utilize their context.
  • Incorporating elements from the multi-modal dialogues enhances conversational agents' ability to produce more relevant and consistent outputs.
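
To make the retrieval-augmented setup mentioned in the findings concrete, the sketch below retrieves the most relevant past dialogue turns before prompting a model. The TF-IDF retriever and the call_llm placeholder are stand-ins chosen for simplicity; the paper's RAG configuration may use different retrievers and retrieval granularities.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API.
    return "[model answer]"


def rag_answer(turns: list[str], question: str, top_k: int = 5) -> str:
    # Rank all past turns against the question and keep only the top-k as context.
    vectorizer = TfidfVectorizer()
    turn_vectors = vectorizer.fit_transform(turns)
    question_vector = vectorizer.transform([question])
    scores = cosine_similarity(question_vector, turn_vectors)[0]
    retrieved = [turns[i] for i in scores.argsort()[::-1][:top_k]]
    prompt = (
        "Answer the question using only the retrieved dialogue turns.\n"
        f"Retrieved turns: {retrieved}\n"
        f"Question: {question}"
    )
    return call_llm(prompt)

This function could also serve as the answer_fn in the evaluation sketch above, making it straightforward to compare retrieval-augmented prompting against feeding the full conversation to a long-context model.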

Future Directions

The research underscores the need for further advancements in LLMs to effectively model and understand the intricacies of very long-term conversational memory. Future developments may focus on enhancing contextual understanding and the integration of multi-modal data. Additionally, exploring methods to improve the robustness of conversational agents against adversarial inputs and to better capture temporal and causal relationships in dialogues could be fruitful avenues.

Conclusion

Our paper pushes the boundary of current conversational AI research by focusing on very long-term dialogues and introducing the LoCoMo dataset as a benchmark for evaluating the long-term memory capabilities of LLMs. The findings highlight significant challenges in modeling extensive conversational contexts and point towards the necessity for novel methods that can effectively manage and utilize long-term conversational memories.

Authors (6)
  1. Adyasha Maharana
  2. Dong-Ho Lee
  3. Sergey Tulyakov
  4. Mohit Bansal
  5. Francesco Barbieri
  6. Yuwei Fang