Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs
This paper addresses a significant challenge for LLMs: handling information that evolves over time. The work is timely because LLMs are widely deployed in applications that require up-to-date information, yet the dynamic nature of real-world knowledge is often neglected in their evaluation.
Key Contributions
This paper makes several noteworthy contributions:
- Introduction of Temporal Benchmarks: The paper introduces two benchmarks, Temporal Wiki and Unified Clark, which are critical for evaluating how well LLMs perform in scenarios where knowledge isn't static. Temporal Wiki is derived from historical snapshots of Wikipedia pages, focusing on factual information that changes over time. Unified Clark, on the other hand, aggregates timestamped news articles, providing a simulation of real-world, temporal information accumulation.
- Analysis of LLM Struggles with Temporal Conflicts: The authors demonstrate that traditional LLMs face difficulties when confronted with temporal conflicts, such as outdated facts mixed with current data. This can lead to unreliable outputs, where the model may revert to outdated parametric knowledge despite the presence of more current information.
- Proposal of a Novel Framework: To address these issues, the paper proposes a lightweight framework that incrementally builds an external memory from source documents. This approach allows models to retrieve and reason over temporally filtered, relevant information, improving both accuracy and reliability in question answering tasks, particularly those requiring complex reasoning across time-varying data.
Methodological Approach
The proposed framework centers on a structured, externalizable memory that requires no model re-training, sidestepping the constraints of in-context learning (ICL) and continual re-training. It allows LLMs to navigate temporal information by maintaining and querying an organized set of facts that reflects the most current knowledge state.
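The paper does not prescribe a concrete data structure for this memory, but the core idea of maintaining facts alongside the time at which they became true can be pictured with a minimal sketch. The class and method names (`TemporalFactStore`, `add`, `as_of`) are illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict

class TemporalFactStore:
    """Illustrative external memory: facts keyed by (subject, relation),
    each stored with the timestamp at which it became true."""

    def __init__(self):
        # (subject, relation) -> list of (timestamp, value), kept sorted
        self._facts = defaultdict(list)

    def add(self, subject, relation, value, timestamp):
        entries = self._facts[(subject, relation)]
        entries.append((timestamp, value))
        entries.sort()  # chronological order enables as-of lookups

    def as_of(self, subject, relation, timestamp):
        """Return the most recent value known at `timestamp`, or None.

        ISO-style date strings compare correctly lexicographically,
        so a linear scan over the sorted entries suffices here.
        """
        best = None
        for ts, value in self._facts[(subject, relation)]:
            if ts <= timestamp:
                best = value
        return best

store = TemporalFactStore()
store.add("France", "head of state", "Francois Hollande", "2012-05")
store.add("France", "head of state", "Emmanuel Macron", "2017-05")
print(store.as_of("France", "head of state", "2016-01"))  # Francois Hollande
print(store.as_of("France", "head of state", "2020-01"))  # Emmanuel Macron
```

Filtering by an as-of timestamp is what lets the model retrieve the fact that was current at question time instead of falling back on stale parametric knowledge.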
Datasets and Evaluation Methods: The Temporal Wiki benchmark draws from the TempLAMA dataset, refocusing it to capture temporal knowledge dynamics from Wikipedia snapshots. Unified Clark, in turn, builds on the ERASE dataset and challenges models with news articles containing temporally bound facts.
Three models (Llama 3.1 70B, Llama 3 8B, and Mistral 7B) were evaluated on both benchmarks under standard in-context learning, retrieval-augmented generation (RAG), and the proposed knowledge organization (KO) approach. The results show a clear advantage for the KO strategy over ICL and RAG, particularly in scenarios with complex temporal data.
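A comparison like this is typically scored by answer accuracy on timestamped questions. As a rough sketch of such a harness, the following treats each strategy as an interchangeable answering function; the names (`exact_match_accuracy`, the stub) and the exact-match metric are assumptions for illustration, not details taken from the paper:

```python
def exact_match_accuracy(answer_fn, examples):
    """Score a QA strategy by exact-match accuracy on timestamped questions.

    `answer_fn(question, as_of)` stands in for any strategy under test
    (e.g. ICL, RAG, or KO); `examples` is a list of
    (question, as_of_timestamp, gold_answer) triples.
    """
    correct = sum(
        answer_fn(q, as_of).strip().lower() == gold.strip().lower()
        for q, as_of, gold in examples
    )
    return correct / len(examples)

# Toy stub standing in for a model-plus-retrieval pipeline.
facts = {("capital of X?", "2020"): "Alpha", ("capital of X?", "2023"): "Beta"}
stub = lambda q, ts: facts.get((q, ts), "")

examples = [("capital of X?", "2020", "Alpha"), ("capital of X?", "2023", "Beta")]
print(exact_match_accuracy(stub, examples))  # 1.0
```

Holding the question set fixed and swapping only `answer_fn` is what makes the ICL vs. RAG vs. KO comparison apples-to-apples.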
Implications and Future Perspectives
The implications of this research are significant for both academic and application domains of AI:
- Improved Temporal Reasoning: By demonstrating how structured memory can enhance the handling of evolving data, the paper sets a promising direction for improving temporal reasoning in AI systems.
- Practical Applications: The method could benefit fields such as finance and journalism, where rapidly changing information demands an up-to-date understanding.
- Theoretical Advancement: The paper contributes to the theoretical understanding of LLM behavior in non-static environments, promoting the integration of dynamic knowledge management strategies within AI architectures.
Future work could extend this framework to broader settings, such as multi-modal data combining images and text, or cross-lingual temporal reasoning. Exploring how such frameworks scale to larger and more heterogeneous datasets would also shed light on their adaptability and efficiency.
Overall, this work represents a methodical step forward in adapting LLMs to meet the demands of real-world applications where knowledge is not static but continuously in flux.