
TReMu: LoCoMo-Derived Multi-Session Dialogues

Updated 18 January 2026
  • The paper introduces TReMu, a framework that integrates explicit memory management and neuro-symbolic temporal reasoning to enhance long-term dialogue performance.
  • It employs persona grounding, temporal event graphs, and retrieval-augmented architectures to manage lengthy, multi-modal conversations effectively.
  • The benchmark tasks demonstrate improved QA accuracy and coherent event summarization, setting a new standard for temporal reasoning in dialogue systems.

LoCoMo-derived Multi-Session Dialogues (TReMu)

LoCoMo-derived Multi-Session Dialogues (TReMu) designate a suite of frameworks and benchmarks targeting the assessment and improvement of long-term, multi-session, and multimodal memory and temporal reasoning in LLM-driven dialogue agents. Leveraging the LoCoMo dataset—comprising dialogues with extensive chronological depth, grounded in explicit persona and event narrative structures—TReMu systems formalize both generative methodology and neuro-symbolic reasoning for robust temporal question answering, event summarization, and image-grounded conversation generation (Maharana et al., 2024, Ge et al., 3 Feb 2025).

1. Generation and Structure of LoCoMo Dialogues

LoCoMo dialogues are produced through a machine-human collaborative pipeline that generates dense, multi-session interactions between dual LLM-based agents. Each agent, initialized with a base LLM (e.g., GPT-3.5-turbo), is systematically conditioned on two major grounding elements:

  • Persona Grounding Module: An initial 4–5-sentence persona "seed" (from sources such as MSC) is expanded by the LLM into a comprehensive persona statement, encoding demographic, behavioral, and relationship traits.
  • Temporal Event Graph (G): Given the persona, LLMs generate a sequence of time-stamped events (e_1, …, e_k) spanning 6–12 months and connect them via a directed causal structure. These event graphs enforce narrative causal coherence and provide temporal anchors.
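The persona and event-graph grounding above can be rendered as a small data structure. This is a minimal illustration only; the class layout, field names, and `add_event` helper are assumptions for exposition, not the paper's implementation:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Event:
    """A time-stamped narrative event in the temporal event graph G."""
    eid: str
    description: str
    when: date
    causes: list = field(default_factory=list)  # eids of causally linked later events

@dataclass
class TemporalEventGraph:
    persona: str                       # expanded persona statement
    events: dict = field(default_factory=dict)

    def add_event(self, event: Event):
        self.events[event.eid] = event

    def chronological(self):
        """Events sorted by timestamp: the temporal anchors for dialogue sessions."""
        return sorted(self.events.values(), key=lambda e: e.when)

g = TemporalEventGraph(persona="Avid hiker who recently adopted a rescue dog.")
g.add_event(Event("e1", "adopts a rescue dog", date(2023, 3, 4)))
g.add_event(Event("e2", "takes the dog on a first trail hike", date(2023, 4, 12)))
g.events["e1"].causes.append("e2")  # e1 causally precedes e2
print([e.eid for e in g.chronological()])  # → ['e1', 'e2']
```

The directed `causes` edges are what lets the generation pipeline enforce causal coherence: a session grounded on e2 can presuppose e1 has already been narrated.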

Dialogues span up to 35 sessions per conversation (mean ≈19.3), with each session averaging 15.8 turns, yielding long sequences (mean ≈305 turns, ≈9K tokens). Agent turns are further enriched through image-sharing and image-reaction behaviors: captions are generated, keywords are extracted, web images are retrieved and shared, and reactions to received images are produced using vision-LLMs (e.g., BLIP-2). Human annotators perform post-generation verification—removing misaligned images (~19%), correcting inconsistencies (~15% of turns), and pruning unused events—to ensure long-range consistency and authentic event grounding (Maharana et al., 2024).

2. Explicit Memory and Reflect-and-Respond Architecture

TReMu prescribes a principled architecture for explicit memory management, critical for fidelity in long-term mixed-initiative dialogues:

  • Short-Term Memory (H_s): Summarizes each session k into a compact session summary w_k.
  • Long-Term Memory (H_l): Stores granular atomic "observations" o_{k,j} for each dialogue turn.
  • Retrieval and Conditioning: For each new turn, agents condition generation on the present session history, the most recent summary, relevant atomic observations retrieved from memory, and temporal events in G occurring since the prior session.

This design enables conditioning on fine-grained and temporally relevant facts, outstripping sliding window or naïve summarization strategies in preserving causality and local coherence over extended interactions (Maharana et al., 2024).
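The two-tier layout above can be sketched in a few lines. The word-overlap scorer here is a deliberately toy stand-in for whatever dense retriever a production system would use; the function names and example texts are illustrative assumptions:

```python
import re

# Sketch of the explicit memory layout: per-session summaries (H_s) and
# per-turn atomic observations (H_l), with a toy word-overlap retriever.
short_term = {}   # H_s: session k -> summary w_k
long_term = []    # H_l: (session k, turn j, observation o_{k,j})

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def store_session(k, summary, observations):
    short_term[k] = summary
    long_term.extend((k, j, obs) for j, obs in enumerate(observations))

def retrieve(query, top_n=2):
    """Rank atomic observations by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(long_term, key=lambda item: len(q & tokens(item[2])),
                    reverse=True)
    return [obs for _, _, obs in ranked[:top_n]]

store_session(1, "They discussed adopting a dog.",
              ["User adopted a rescue dog named Max.",
               "User lives near a hiking trail."])
store_session(2, "They planned a hiking trip.",
              ["User plans to hike with Max in April."])

print(retrieve("When did the user adopt the dog?")[0])
# → User adopted a rescue dog named Max.
```

The point of the split is that a new turn can be conditioned on coarse summaries for global context plus a handful of fine-grained observations for the facts at hand, rather than an undifferentiated window.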

3. Benchmark Tasks and Evaluation Protocols

TReMu formalizes three interlocked benchmarks, each targeting a distinct facet of long-term memory and reasoning:

Task | Model Input Modalities | Principal Metrics
Question Answering (QA) | Dialogue, Memory | F1, Recall@k
Event Summarization | Dialogue, Memory | FactScore, ROUGE-1/2/L
Multi-Modal Generation | Dialogue, Image | BLEU-1/2, ROUGE-L, MMRelevance

QA Taxonomy

QA is decomposed into single-hop, multi-hop, temporal, open-domain, and adversarial (unanswerable) reasoning. Performance is measured via token-overlap F1 between system and reference answers. For retrieval-augmented settings, Recall@k is computed with respect to relevant context chunk retrieval.
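Token-overlap F1 between system and reference answers can be computed as follows (the standard SQuAD-style formulation; plain whitespace tokenization is a simplifying assumption here):

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-overlap F1 between two answer strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = Counter(pred) & Counter(ref)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("on 4 march 2023", "4 march 2023"))  # → 0.857...
```

For adversarial (unanswerable) items, a correct system response is an abstention, so precision/recall over the abstain decision is reported separately (see the results in Section 6).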

Event Extraction and Graph Recovery

Event summarization entails reconstructing the underlying event graph G from the full dialogue. FactScore (decomposing summary and reference graphs into minimal factual units) is the primary metric; ROUGE provides surface overlap assessment.
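In spirit, FactScore reduces to the fraction of a summary's atomic facts that are supported by the reference. The toy version below uses exact string matching as the support check; real FactScore uses an LLM to decompose text into atomic facts and verify each one, so treat this purely as a sketch of the metric's shape:

```python
def fact_score(summary_facts, reference_facts):
    """Fraction of atomic summary facts supported by the reference.

    Exact string matching stands in for LLM-based fact verification.
    """
    if not summary_facts:
        return 0.0
    reference = set(reference_facts)
    supported = sum(f in reference for f in summary_facts)
    return supported / len(summary_facts)

summary = ["Alice adopted a dog in March.", "Alice moved to Lisbon."]
reference = ["Alice adopted a dog in March.", "Alice hiked in April."]
print(fact_score(summary, reference))  # → 0.5
```

Because it scores at the level of minimal factual units, a summary that hallucinates one event among many is penalized proportionally rather than failing outright, which suits graph-recovery evaluation better than surface ROUGE.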

Multimodal Grounding

For image-grounded dialogue, the generated next turn (text+image) is evaluated against human references using BLEU, ROUGE-L, and MMRelevance—a metric tailored to multimodal relevance (Maharana et al., 2024).

4. LoCoMo-Derived Temporal Reasoning Benchmark Construction

The TReMu framework is extended in (Ge et al., 3 Feb 2025) to produce a LoCoMo-derived temporal reasoning benchmark. Its construction follows a principled multi-stage approach:

  1. Temporal Event Extraction: Employs LLMs (GPT-4o) to identify all time-tagged events (δ) in each session.
  2. Temporal Linking: Aggregates cross-session events referencing the same entities or themes.
  3. QA Generation: Constructs multi-choice questions over events, spanning three canonical types:
    • Temporal Anchoring: "On what date did e occur?"
    • Temporal Precedence: "Which occurred first, e_1 or e_2?"
    • Temporal Interval: "How many days between e_1 and e_2?"
  Unanswerable questions are systematically included via cases with ambiguous or insufficient evidence.
  4. Manual Curation: Human annotators correct ambiguous or erroneous items and validate distractor quality.
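The interval-question construction can be sketched as follows. The distractor scheme (off-by-a-few-days perturbations) and the option format are illustrative assumptions, not the paper's exact procedure:

```python
import random
from datetime import date

def interval_question(e1, d1, e2, d2, seed=0):
    """Build a multi-choice 'how many days between' question with distractors."""
    gold = abs((d2 - d1).days)
    rng = random.Random(seed)
    options = {gold}
    # Distractors: plausible near-miss day counts around the gold answer.
    while len(options) < 4:
        options.add(max(0, gold + rng.choice([-7, -3, -1, 1, 3, 7])))
    options = sorted(options)
    return {
        "question": f"How many days between '{e1}' and '{e2}'?",
        "options": options,
        "answer_index": options.index(gold),
    }

q = interval_question("adopts a rescue dog", date(2023, 3, 4),
                      "first trail hike", date(2023, 4, 12))
print(q["question"], q["options"], "gold:", q["options"][q["answer_index"]])
```

Anchoring and precedence questions follow the same pattern, with distractors drawn from nearby dates or from swapped event orderings; the manual curation pass then filters items whose distractors are accidentally valid.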

The resulting dataset comprises 600 QAs (264 anchoring, 102 precedence, 234 interval; 112 of these are unanswerable), doubling the temporal reasoning depth compared to original LoCoMo (Ge et al., 3 Feb 2025).

5. Neuro-Symbolic Temporal Reasoning Methodology

TReMu integrates a neuro-symbolic approach to temporal reasoning, combining LLM-based memory retrieval, code generation, and symbolic execution:

  1. Time-Aware Memorization: Each session is summarized into a timeline entry, an explicit set of (event, inferred date) tuples, transforming temporally ambiguous mentions ("last Monday") into resolved absolute dates via date-inference mapping anchored on session timestamps.
  2. Memory Pool Construction: The agent's memory pool M aggregates all such timeline events across sessions.
  3. Retrieval and Code Generation: For a given temporal question q, LLMs retrieve the K most relevant (e, τ) pairs and generate Python code to compute answers using standard libraries (datetime, dateutil).
  4. Symbolic Execution: The generated code is executed to yield intermediate computations (e.g., date spans, ordinal comparisons), informing the final answer selection.

Sample code patterns include temporal interval calculations, week/month boundary determination, and chronological ordering. Constraints and reasoning rules are enforced via function calls (e.g., relativedelta, weekRange) and in-context code exemplars (Ge et al., 3 Feb 2025).
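A representative instance of the kind of code the pipeline generates and then executes symbolically might look like the following. The retrieved events and inferred dates are illustrative; only standard-library `datetime` is used here, though the paper also cites `dateutil`:

```python
from datetime import date

# Retrieved (event, inferred date) pairs from the memory pool M (illustrative).
e1 = ("started a pottery class", date(2023, 5, 8))
e2 = ("first gallery showing",   date(2023, 8, 21))

# Temporal interval: exact day count between the two events.
days_between = (e2[1] - e1[1]).days

# Temporal precedence: which event occurred first.
first = e1[0] if e1[1] < e2[1] else e2[0]

# Temporal anchoring: ISO calendar week of an event (week-boundary reasoning).
iso_year, iso_week, _ = e2[1].isocalendar()

print(f"{days_between} days apart; '{first}' occurred first; "
      f"e2 falls in ISO week {iso_week} of {iso_year}")
```

Executing such code deterministically, rather than asking the LLM to do date arithmetic in natural language, is precisely what drives the accuracy gains reported in the next section.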

6. Experimental Results and Comparative Analysis

Extensive benchmarking across standard and advanced prompting paradigms demonstrates clear advantages for the TReMu methodology:

  • Baseline Performance: Standard prompting (SP) and Chain-of-Thought (CoT) show limited accuracy on temporal QA (e.g., GPT-4o: 29.8–61.7% across task types).
  • MemoChat and Timeline Summarization: Memory-augmented models with time-aware summarization improve results (MemoChat + CoT, Timeline + CoT).
  • Full TReMu Pipeline: Introducing neuro-symbolic code reasoning elevates temporal QA accuracy on GPT-4o to 77.7%, more than doubling standard prompting (29.8%). For unanswerable questions, TReMu attains a precision of 55.5% and F1 of 64.4%, outperforming all baselines.
  • Robustness: Execution failure rates for generated code are approximately 6% on GPT-4o (higher for smaller models). Regeneration loops mitigate syntax/runtime errors.
Method | TA Acc. (%) | TP Acc. (%) | TI Acc. (%) | Overall (%) | Unans. Prec. | Unans. Rec. | Unans. F1
SP (no CoT) | 18.2 | 58.8 | 30.3 | 29.8 | 46.9 | 13.4 | 20.8
CoT | 67.8 | 74.5 | 49.1 | 61.7 | 42.6 | 43.8 | 43.2
MemoChat+CoT | 51.1 | 49.0 | 26.5 | 41.7 | 24.8 | 81.3 | 38.0
Timeline+CoT | 83.3 | 78.4 | 58.6 | 71.5 | 48.5 | 58.0 | 52.8
TReMu (full) | 84.5 | 81.4 | 68.4 | 77.7 | 55.5 | 76.8 | 64.4

(TA = Temporal Anchoring, TP = Temporal Precedence, TI = Temporal Interval.)

These results generalize across strong and moderate LLMs (GPT-4o-mini, GPT-3.5-Turbo). Qualitative error analysis highlights TReMu’s capacity to disambiguate temporal anchors and reduce confusion over intervals, outperforming language-only approaches particularly when date inference is required (Ge et al., 3 Feb 2025).

7. Implications, Best Practices, and Limitations

Key methodological advances distilled from LoCoMo/TReMu include:

  • Explicitly structured memory—distinguishing between session summaries and atomic observations—yields superior retrieval and content grounding compared to windowed or undifferentiated contexts.
  • Persona/event graph grounding ensures consistent agent behaviors and temporal-causal coherence.
  • Retrieval-augmented, symbolic reasoning (code execution) mitigates error-prone natural language inference and improves handling of temporal expressions.
  • Human-in-the-loop correction remains essential for repairing long-range inconsistencies and multimodal mismatches.
  • Specialized evaluation metrics (FactScore, MMRelevance), attuned to factual and multimodal coherence, better measure long-term conversational fidelity.

Principal limitations include the benchmark's focus on multiple-choice QA; extension to generative QA would increase realism. Residual code synthesis failures and intrinsic ambiguity in natural dialogues persist. Extending reasoning to richer temporal logics (beyond before/after/interval) represents an open research direction (Maharana et al., 2024, Ge et al., 3 Feb 2025).

Overall, the LoCoMo-derived TReMu framework provides a principled testbed and architectural paradigm for probing and advancing very long-term conversational and temporal memory in LLM agents.
