
TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models (2204.14211v3)

Published 29 Apr 2022 in cs.CL

Abstract: Language Models (LMs) become outdated as the world changes; they often fail to perform tasks requiring recent factual information which was absent or different during training, a phenomenon called temporal misalignment. This is especially a challenging problem because the research community still lacks a coherent dataset for assessing the adaptability of LMs to frequently-updated knowledge corpus such as Wikipedia. To this end, we introduce TemporalWiki, a lifelong benchmark for ever-evolving LMs that utilizes the difference between consecutive snapshots of English Wikipedia and English Wikidata for training and evaluation, respectively. The benchmark hence allows researchers to periodically track an LM's ability to retain previous knowledge and acquire updated/new knowledge at each point in time. We also find that training an LM on the diff data through continual learning methods achieves similar or better perplexity than on the entire snapshot in our benchmark with 12 times less computational cost, which verifies that factual knowledge in LMs can be safely updated with minimal training data via continual learning. The dataset and the code are available at https://github.com/joeljang/temporalwiki.

Authors (8)
  1. Joel Jang
  2. Seonghyeon Ye
  3. Changho Lee
  4. Sohee Yang
  5. Joongbo Shin
  6. Janghoon Han
  7. Gyeonghun Kim
  8. Minjoon Seo
Citations (85)

Summary

TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models

The paper addresses a central problem in natural language processing: language models (LMs) suffer from temporal misalignment when the factual information they were trained on becomes outdated, making them less reliable in real-world applications where knowledge changes frequently. To tackle this, the authors introduce TemporalWiki, a benchmark for continually pretraining and evaluating LMs that is built from the differences between consecutive snapshots of English Wikipedia and English Wikidata.

Overview and Methodology

TemporalWiki rests on a simple premise: rather than retraining models on entire static snapshots, it uses 'TWiki-Diffsets', the computed differences between consecutive monthly Wikipedia snapshots, as the training data for refreshing model knowledge. By training only on added or modified content, the approach sharply reduces computational cost while matching or improving the quality of knowledge updates. Evaluation uses 'TWiki-Probes', factual statements split into 'Unchanged' and 'Changed' subsets that measure a model's retention of prior knowledge (stability) and acquisition of new knowledge (plasticity), respectively.
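
To make the two probe categories concrete, here is a minimal sketch of how 'Unchanged' and 'Changed' instances could be derived by comparing (subject, relation, object) triples from two Wikidata snapshots. The entities and values are illustrative placeholders; the actual pipeline applies additional alignment and quality filtering.

```python
# Hypothetical triples from two consecutive Wikidata snapshots.
old_triples = {
    ("France", "capital", "Paris"),
    ("Twitter", "chief executive officer", "Jack Dorsey"),
}
new_triples = {
    ("France", "capital", "Paris"),                           # identical in both
    ("Twitter", "chief executive officer", "Parag Agrawal"),  # object updated
    ("Gangnam District", "population", "534,103"),            # newly added fact
}

unchanged = new_triples & old_triples  # facts kept as-is -> stability probes
changed = new_triples - old_triples    # updated or new facts -> plasticity probes

print("Unchanged:", sorted(unchanged))
print("Changed:", sorted(changed))
```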

The pipeline is fully automatic. TWiki-Diffsets are generated by comparing consecutive Wikipedia snapshots at the sentence level and keeping only added or modified sentences, while TWiki-Probes are constructed from Wikidata and aligned with the Wikipedia changes to ensure quality. Because no human annotation is required, the benchmark can be regenerated each time a new snapshot is released; a sketch of the diff step follows.
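
Below is a minimal sketch of sentence-level diff extraction between two article versions, assuming plain-text snapshots and a naive sentence splitter. The function name extract_diffset is hypothetical; the official implementation lives at https://github.com/joeljang/temporalwiki.

```python
import difflib

def extract_diffset(old_article: str, new_article: str) -> list[str]:
    """Return sentences that are new or modified in the newer snapshot."""
    # Naive split on periods; a real pipeline would use proper sentence tokenization.
    old_sents = [s.strip() for s in old_article.split(".") if s.strip()]
    new_sents = [s.strip() for s in new_article.split(".") if s.strip()]

    matcher = difflib.SequenceMatcher(a=old_sents, b=new_sents)
    diffset = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):  # keep only added or edited sentences
            diffset.extend(new_sents[j1:j2])
    return diffset

old = "Paris is the capital of France. Its mayor is Anne Hidalgo."
new = "Paris is the capital of France. Its mayor is Anne Hidalgo. It will host the 2024 Olympics."
print(extract_diffset(old, new))  # ['It will host the 2024 Olympics']
```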

Experimental Findings

The paper compares baseline models continually pretrained on entire Wikipedia snapshots against models trained only on TWiki-Diffsets. Training solely on the changed data yields similar or better perplexity on updated knowledge at roughly 12 times lower computational cost, while also mitigating catastrophic forgetting of unchanged facts.
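
Here is a minimal sketch of the probe-based measurement, assuming a HuggingFace-style causal LM: perplexity is computed separately on 'Unchanged' and 'Changed' statements to quantify stability and plasticity. The gpt2 checkpoint and the probe sentences are placeholders rather than the paper's exact setup.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean per-token cross-entropy
    return math.exp(loss.item())

unchanged = ["The Eiffel Tower is located in Paris."]                   # stability probes
changed = ["The chief executive officer of Twitter is Parag Agrawal."]  # plasticity probes

for name, probes in [("Unchanged", unchanged), ("Changed", changed)]:
    avg = sum(perplexity(p) for p in probes) / len(probes)
    print(f"{name}: average perplexity {avg:.2f}")
```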

Among the continual learning strategies evaluated, which include regularization-based RecAdam and rehearsal-based Mix-review, parameter-expansion techniques such as K-Adapter and LoRA struck the best balance between retaining old knowledge and acquiring new knowledge. K-Adapter in particular performed strongly under temporal misalignment, making it a promising strategy for updating deployed LMs.
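
As a concrete illustration of the parameter-expansion idea, here is a minimal LoRA-style layer in PyTorch: the pretrained weights are frozen and only a low-rank update is trained, for example on diff data. This is a generic sketch of the technique, not the paper's exact configuration; LoRALinear and its hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the original weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank correction: W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 768))  # only A and B receive gradients during training
```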

Implications and Future Directions

TemporalWiki provides a replicable, automated benchmark for tracking the temporally evolving knowledge of language models. It demonstrates that continual learning on data deltas, rather than on full snapshots, conserves computational resources, a priority for modern AI systems, and that continual learning methods can meaningfully reduce forgetting, offering concrete guidance for designing adaptive LMs that keep pace with changing real-world data.

The research suggests several practical directions for future work: building more adaptive and efficient language models, developing further mitigations for forgetting, and constructing refined benchmarks from other dynamic knowledge bases in the style of Wikipedia and Wikidata. TemporalWiki also opens avenues to study the trade-off between update frequency and the granularity of data changes, and to investigate how models should handle deleted or deprecated knowledge, a challenge the authors leave to subsequent work.

More broadly, the paper speaks to any AI system that must learn iteratively, whether a voice assistant, a recommendation system, or a real-time data interpretation engine. By foregrounding model adaptability and computational efficiency, TemporalWiki marks a concrete step toward language models that update continually and stay relevant.