
LoCoMo Dataset: Multi-Modal Long-Term Dialogue

Updated 19 January 2026
  • LoCoMo is a large-scale multi-modal corpus crafted to evaluate long-term memory and reasoning in LLM agents using extended, session-based dialogues.
  • It is constructed via a dual machine–human pipeline that generates detailed persona profiles, temporal event graphs, and multi-session interactions ensuring factual and dialogic consistency.
  • The dataset supports benchmarking on tasks such as question answering, event summarization, and multi-modal dialogue generation, evaluated with metrics like F1, ROUGE, and BLEU.

The LoCoMo dataset is a large-scale, multi-modal corpus specifically constructed for the evaluation of long-term memory and reasoning capabilities in LLM agents. Centered on very long-term, multi-session dialogues enriched with structured event graphs and natural image sharing, LoCoMo provides a comprehensive benchmark suite covering question answering, event summarization, and multi-modal dialogue generation. The dataset is constructed using an LLM-driven, human-verified pipeline to ensure both broad coverage of conversational phenomena and fine-grained factual consistency, and is accompanied by task-specific metrics, reference annotations, and integration scripts for efficient research use (Maharana et al., 2024).

1. Construction Pipeline

LoCoMo is produced using a two-stage machine–human process. First, persona statements and temporal event graphs are generated by machine. Each persona starts as a brief seed (4–5 sentences) sourced from MSC (Xu et al., 2022), which GPT-3.5-turbo expands into a detailed persona profile (name, age, objectives, relationships, habits). GPT-3 (text-davinci-003) then generates a temporal event graph $G$ of up to 25 causally linked life events distributed across a 6–12 month timeline, with timestamps and directed causal edges.
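
The event graph described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' representation: the class `Event`, its field names, and the helper `validate_event_graph` are all hypothetical, and the only constraints encoded are the ones stated in the text (at most 25 events, timestamps, directed causal edges).

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Event:
    # All field names are illustrative assumptions, not the dataset schema.
    event_id: str
    when: date
    description: str
    caused_by: list[str] = field(default_factory=list)  # ids of causing events


def validate_event_graph(events: list[Event], max_events: int = 25) -> bool:
    """Check graph size, unique ids, and that no cause is dated after its effect."""
    if len(events) > max_events:
        return False
    by_id = {e.event_id: e for e in events}
    if len(by_id) != len(events):  # duplicate ids
        return False
    for e in events:
        for cause_id in e.caused_by:
            cause = by_id.get(cause_id)
            if cause is None or cause.when > e.when:
                return False
    return True


graph = [
    Event("e1", date(2023, 1, 10), "Starts a pottery class"),
    Event("e2", date(2023, 3, 2), "Sells a first bowl at a local fair", caused_by=["e1"]),
]
print(validate_event_graph(graph))
```

A check like this only covers the structural properties named in the text; whether an edge is a plausible cause is a judgment the pipeline leaves to the LLM and the human annotators.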

Conversation dialogues are generated by paired LLM agents (gpt-3.5-turbo), each assigned a distinct persona and associated event graph. The agents interact over multiple sessions, employing dedicated memory representations:

  • Short-term memory ($H_s$): After each session $k$, a summary $w_k$ is constructed from the raw turns $h_k$, conditioned on the prior summary.
  • Long-term memory ($H_\ell$): Each conversational turn $h_{k,j}$ is abstracted to an “assertive fact” observation $o_{k,j}$ and cross-referenced with dialogue context and event graph structure.

At each session, the agent’s response is conditioned not only on its persona and recent history but also on session summaries, retrieved observations, current session context, and temporally relevant events from $G$.
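
The two memory stores can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: `AgentMemory`, `end_session`, and the two callables are hypothetical names, and `toy_summarize` and `toy_observe` are trivial stand-ins for the LLM calls that would produce the summary and the observations.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentMemory:
    summaries: list[str] = field(default_factory=list)     # short-term: one summary per session
    observations: list[str] = field(default_factory=list)  # long-term: one observation per turn

    def end_session(
        self,
        turns: list[str],
        summarize: Callable[[str, list[str]], str],
        observe: Callable[[str], str],
    ) -> None:
        """Update both stores after a session, conditioning the summary on the prior one."""
        prior = self.summaries[-1] if self.summaries else ""
        self.summaries.append(summarize(prior, turns))
        self.observations.extend(observe(t) for t in turns)


# Stand-ins for LLM calls (assumptions for illustration only).
def toy_summarize(prior: str, turns: list[str]) -> str:
    return f"{prior} | {len(turns)} turns".strip(" |")


def toy_observe(turn: str) -> str:
    return "FACT: " + turn


memory = AgentMemory()
memory.end_session(["A: I adopted a dog."], toy_summarize, toy_observe)
memory.end_session(["B: How is the dog?", "A: She is great."], toy_summarize, toy_observe)
print(memory.summaries[-1])
print(len(memory.observations))
```

Passing the summarizer and observer in as callables keeps the memory bookkeeping separate from whichever model produces the text.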

A multi-modal module enables agents to share and react to images; it relies on captioning (via BLIP-2), keyword extraction, and image retrieval through web crawling. Human annotators complete the process by reviewing dialogic consistency, image relevance, and fidelity of event graph grounding. Their edits address inconsistencies, irrelevant or mismatched images, and incomplete event alignment. Approximately 15% of turns and 19% of generated images are amended or excluded during this phase (Maharana et al., 2024).

2. Dataset Characteristics

LoCoMo comprises 50 validated dialogues, each spanning an average of 19.3 sessions ($S_{avg}$), with approximately 15.8 turns per session ($j_{avg}$), leading to over 300 turns per dialogue ($T_{avg} \approx 304.9$) and an average dialogue length of 9,209 tokens. Each dialogue integrates roughly 32.3 images ($I_{avg}$), with both image-sharing and image-reaction turns distributed throughout the conversation.
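
As a quick consistency check on these figures, and assuming the average turn count per dialogue is approximated by the product of the two reported averages:

```latex
\[
T_{avg} \approx S_{avg} \times j_{avg} = 19.3 \times 15.8 = 304.94 \approx 304.9
\]
```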

The question-answering benchmark encompasses 7,512 questions, categorized as follows:

QA Type        Number    Fraction
Single-hop     2,705     36.0%
Multi-hop      1,104     14.6%
Temporal       1,547     20.6%
Open-domain      285      3.9%
Adversarial    1,871     24.9%

Event summarization references an average of 24.2 ground-truth events per dialogue, with comprehensive summaries averaging 896.5 tokens (Maharana et al., 2024).

3. Annotation, Verification, and Quality Control

Human annotation in LoCoMo is multi-faceted:

  • Long-range consistency: Annotators edit or remove dialogue turns that violate established personal facts or introduce session-inconsistent state changes.
  • Image alignment: Non-germane, speculative, or contextually mismatched images are replaced or excised; explicit image reactions are added if missing.
  • Event grounding: All conversationally referenced life events are matched to event graph entries; unused events are purged or retrofitted if discovered in dialogue.

The annotation process aligns the depicted life narratives with the underlying structured data. Approximately 15% of turns undergo some form of edit, mostly factual corrections and coherence fixes (Maharana et al., 2024).

4. Multi-Modal Integration

LoCoMo’s dialogues interleave rich visual stimuli throughout multi-turn textual exchanges. Image-sharing is driven by agent intention and caption-based retrieval, with recipients employing BLIP-2 for captioning and subsequent dialogic response generation. On average, each dialogue incorporates over 32 images, substantially increasing the complexity and authenticity of long-term conversational dynamics.

For annotation and evaluation, all images are paired with descriptive captions, which are used to substitute images during QA and event summarization tasks, addressing potential reproducibility or evaluation noise. Multi-modal dialogue generation evaluation requires models to generate both text and contextually appropriate image selections or descriptions, with multi-modal semantic alignment measured via the MMRelevance metric (Maharana et al., 2024).
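
Caption substitution for the text-only tasks might look like the sketch below. The dictionary keys (`speaker`, `text`, `caption`) and the bracketed rendering are assumptions chosen for illustration, not the dataset's actual schema or the official preprocessing.

```python
def render_turn_as_text(turn: dict) -> str:
    """Render one dialogue turn as plain text, replacing any image with its caption."""
    text = f"{turn['speaker']}: {turn['text']}"
    caption = turn.get("caption")  # hypothetical field name
    if caption:
        text += f" [shares image: {caption}]"
    return text


turns = [
    {"speaker": "A", "text": "Look what I made!", "caption": "a blue ceramic bowl"},
    {"speaker": "B", "text": "That looks great."},
]
context = "\n".join(render_turn_as_text(t) for t in turns)
print(context)
```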

5. Benchmark Tasks and Evaluation Metrics

LoCoMo offers three principal evaluation axes:

  • Long-term Question Answering (QA): Tasked with answering questions about the full dialogue, models are evaluated via normalized token-level overlap F1 between generated and gold answers. QA questions span single-hop, multi-hop, temporal, open-domain, and adversarial categories.
  • Event Summarization: For a given time interval, systems must extract and order life events as referenced in the conversation. Evaluation uses ROUGE-1/2/L lexical overlap and FactScore, where both system and reference are atomized into factual units.
  • Multi-Modal Dialogue Generation: Given context up to turn $t$, systems generate the next turn (text plus optionally an image). BLEU-1/2, ROUGE-L, and MMRelevance metrics are used.
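
The QA metric above, normalized token-level overlap F1, can be sketched as follows. The normalization steps here (lowercasing, stripping punctuation and English articles) follow the common SQuAD-style convention and are an assumption; the official evaluation script may normalize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("She adopted a dog in May", "adopted a dog"), 3))
```

In this example the prediction has five normalized tokens and the gold answer two, both of which are matched, so precision is 0.4, recall is 1.0, and F1 is about 0.571.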

Unlike many datasets, LoCoMo does not provide separate train/validation/test splits. All 50 validated dialogues, 7,512 QA pairs, and event summaries are designated for evaluation; synthetic data subsets are available for training multi-modal models (Maharana et al., 2024).

6. Dataset Access, Format, and Tools

The dataset is distributed in structured JSON format (one file per dialogue), containing dialogue metadata, agent personas, temporal event graphs, sessions with turn-level detail (speaker, text, optional image references, captions), and explicit memory representations (summaries, long-term observations).

  • Access: Released under CC BY-NC 4.0 at https://snap-research.github.io/locomo.
  • Documentation and APIs: Python APIs, loading scripts, and example notebooks are provided as part of the codebase, facilitating integration into research workflows.
  • Evaluation Scripts: Benchmarking tools are available for all major tasks with standardized metrics (Maharana et al., 2024).
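
Reading a dialogue in the JSON layout described above might look like the sketch below. Every key used here (`sessions`, `turns`, `speaker`, `text`, `caption`) is a placeholder assumption; the released files should be consulted for the real field names, and the inline `sample` string stands in for a dialogue file.

```python
import json

# A tiny stand-in for one dialogue file; the keys are hypothetical.
sample = json.dumps({
    "sessions": [
        {"turns": [
            {"speaker": "A", "text": "Look at my new bike!", "caption": "a red road bike"},
            {"speaker": "B", "text": "Nice, when did you get it?"},
        ]},
    ],
})


def iter_turns(dialogue: dict):
    """Yield (session_index, turn_index, turn) for every turn in one dialogue."""
    for k, session in enumerate(dialogue["sessions"], start=1):
        for j, turn in enumerate(session["turns"], start=1):
            yield k, j, turn


dialogue = json.loads(sample)
turns = list(iter_turns(dialogue))
image_turns = [turn for _, _, turn in turns if "caption" in turn]
print(len(turns), len(image_turns))
```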

7. Empirical Insights and Research Guidance

Experimental results using LoCoMo demonstrate clear challenges for both short- and long-context LLMs on very long-form memory tasks:

  • LLMs with 4K-token windows achieve 17–32% F1 on QA, versus a human baseline of 87.9%.
  • Long-context variants (GPT-3.5-16K) attain modest gains (about 37.8% F1) but remain highly susceptible to adversarial queries (2.1% F1).
  • Retrieval-Augmented Generation (RAG) using “observations” as memory units yields the best QA performance (41.4% F1 at top-5).
  • Event summarization is most accurately performed by GPT-4-turbo (FactScore: 51.6% precision, 41.8% recall).
  • Multi-modal dialogue generation benefits measurably from the integration of retrieved observations (+1.6 BLEU-1, +1.2 MMRelevance).

Recommended best practices include leveraging retrieval over abstracted “observations,” combining short-term summaries with event graph structure for more coherent narrative maintenance, and holistic evaluation across QA, summarization, and multi-modal tasks to capture the full spectrum of long-term agent memory. LoCoMo is positioned as a resource for both benchmarking and fine-tuning memory-augmented LLMs and for designing new long-range memory architectures (Maharana et al., 2024).
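
Retrieval over observations can be illustrated with a toy top-k ranker. This sketch swaps in a simple bag-of-words cosine score as a stand-in for whatever retriever is actually used; the function names and example observations are hypothetical, and only the default of k = 5 mirrors the top-5 setting reported above.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, observations: list[str], k: int = 5) -> list[str]:
    """Return the k observations most similar to the question."""
    q = bow(question)
    ranked = sorted(observations, key=lambda o: cosine(q, bow(o)), reverse=True)
    return ranked[:k]


observations = [
    "Maria adopted a dog in May.",
    "Sam started a pottery class.",
    "Maria ran a marathon in October.",
]
top = retrieve("When did Maria adopt a dog?", observations, k=1)
print(top[0])
```

The retrieved observations would then be placed in the model's prompt ahead of the question, in place of the full dialogue history.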
