
TWiki-Diffsets: Lightweight Temporal Updates

Updated 13 January 2026
  • TWiki-Diffsets are minimal text deltas from monthly Wikipedia snapshots that capture new or modified content for efficient language model updates.
  • They enable continual pretraining by focusing on recent changes, achieving roughly 30% lower perplexity on updated text compared to full data retraining.
  • Diffset-based updates are 10–12× faster than full snapshot updates, with methods like K-Adapter mitigating catastrophic forgetting.

TWiki-Diffsets are monthly, minimal text deltas extracted from consecutive English Wikipedia snapshots, specifically designed as a lightweight corpus for continual pretraining of large LMs. Introduced in "TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models" (Jang et al., 2022), TWiki-Diffsets provide an efficient mechanism for injecting up-to-date world knowledge into LMs at drastically reduced computational cost, with robust empirical performance. TWiki-Diffsets form the foundation of the TemporalWiki benchmark, enabling systematic tracking of an LM's ability to acquire and retain evolving factual knowledge over time.

1. Formal Definition and Representation

Let $S_t$ represent the full text of English Wikipedia at time $t$, and $S_{t+1}$ the subsequent monthly snapshot. The TWiki-Diffset for each interval is defined as the set of sentences that are either newly introduced or modified in $S_{t+1}$ relative to $S_t$; sentences deleted from $S_t$ are disregarded, reflecting a knowledge-updating objective rather than knowledge removal.

Articles are indexed by a unique identifier $i$. For article $i$, denote its contents at times $t$ and $t+1$ as $\text{text}_t(i)$ and $\text{text}_{t+1}(i)$, respectively. The per-article diff is formulated as:

$$\Delta \text{text}_{t \rightarrow t+1}(i) = \begin{cases} \text{text}_{t+1}(i), & \text{if } i \text{ is new in } S_{t+1} \\ \{\, s \in \text{text}_{t+1}(i) \mid s \text{ does not appear verbatim in } \text{text}_t(i) \,\}, & \text{otherwise} \end{cases}$$

The global TWiki-Diffset for the interval is then the union over all articles:

$$\Delta S_{t \rightarrow t+1} = \bigcup_i \Delta \text{text}_{t \rightarrow t+1}(i)$$
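The per-article rule can be read directly as a filter over sentences. The following Python sketch illustrates it; the regex-based sentence splitter and the `article_diff` helper are illustrative assumptions, not the exact segmentation used in the original extraction pipeline.

```python
# Minimal sketch of the per-article diff defined in the cases equation above.
# The naive regex-based sentence splitter is an assumption for illustration.
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence segmentation (placeholder for the real splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def article_diff(old_text: str | None, new_text: str) -> list[str]:
    """Return Delta-text_{t->t+1}(i): all sentences of the new revision if
    the article is new, otherwise only sentences that do not appear
    verbatim in the old revision."""
    new_sentences = split_sentences(new_text)
    if old_text is None:                      # article i is new in S_{t+1}
        return new_sentences
    old_sentences = set(split_sentences(old_text))
    return [s for s in new_sentences if s not in old_sentences]
```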

A parallel evaluation dataset, "TWiki-Probes," is constructed from consecutive Wikidata knowledge-graph dumps $(WD_t, WD_{t+1})$. Knowledge triples $(s, r, o)$ are labeled "Changed" if $o$ is new or altered, and "Unchanged" otherwise, subject to stringent alignment and heuristic filtering criteria.
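The Changed/Unchanged labeling can be sketched as a comparison of the two dumps. Representing each dump as a mapping from (subject, relation) pairs to objects is an assumption for illustration; the real dumps are large RDF/JSON exports.

```python
# Label Wikidata triples as Changed or Unchanged across two dumps.
# Each dump is assumed (for illustration) to be a dict mapping
# (subject, relation) -> object.

def label_probes(wd_t: dict, wd_t1: dict) -> tuple[list, list]:
    changed, unchanged = [], []
    for (s, r), o in wd_t1.items():
        old_o = wd_t.get((s, r))
        if old_o is None or old_o != o:      # object is new or altered
            changed.append((s, r, o))
        else:                                # same object in both dumps
            unchanged.append((s, r, o))
    return changed, unchanged
```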

2. Data Processing Pipeline and Corpus Statistics

Extraction and Storage

The TWiki-Diffset extraction algorithm operates by iterating through all articles in $S_{t+1}$:

  • If an article’s id does not exist in $S_t$, the entire article is appended.
  • If the article exists, paragraphs are compared sequentially and only changed or new sentences are retained (see the extraction sketch after this list).
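A corpus-level sketch of that loop is given below, reusing the hypothetical `article_diff` helper from Section 1 and assuming each snapshot is available as a mapping from article id to text; the flat-file output mirrors the storage format described next.

```python
# Build a monthly TWiki-Diffset as a flat text file.
# `snapshot_t` / `snapshot_t1` are assumed to be dicts {article_id: text};
# `article_diff` is the illustrative helper sketched in Section 1.

def build_diffset(snapshot_t: dict, snapshot_t1: dict, out_path: str) -> None:
    with open(out_path, "w", encoding="utf-8") as out:
        for article_id, new_text in snapshot_t1.items():
            old_text = snapshot_t.get(article_id)   # None if the article is new
            for sentence in article_diff(old_text, new_text):
                out.write(sentence + "\n")
```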

Each monthly TWiki-Diffset is stored as a flat text file. Empirical statistics for four representative intervals in 2021 (August–December) are summarized below:

| Interval (2021) | Articles in $\Delta S$ (K) | Tokens in $\Delta S$ (M) | Full snapshot $S_t$ size (B tokens) |
|---|---|---|---|
| 08 → 09 | 299 | 346 | 4.6 |
| 09 → 10 | 314 | 362 | 4.7 |
| 10 → 11 | 329 | 376 | 4.7 |
| 11 → 12 | 314 | 369 | 4.7 |

Each diffset typically contains ~300K articles and ~347M tokens, a small fraction of the complete corpus (~6.3M articles and ~4.6–4.7B tokens per snapshot).

Probe Construction and Filtering

After initial extraction, TWiki-Probes undergo multiple steps:

  • Initial triple categorization yields roughly 1.2M Changed and 0.5M Unchanged triples per month.
  • Alignment and filtering reduce the set to ~2–3K Changed and ~7–10K Unchanged examples.
  • Heuristic constraints (e.g., object length of at most 5 words, frequency caps, substring-overlap avoidance) ensure probe quality; a filtering sketch follows this list.
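The sketch below illustrates such heuristic filtering; the specific frequency-cap value and the subject–object substring check are assumptions for the example rather than the paper's exact rules.

```python
# Illustrative heuristic filtering of (subject, relation, object) probes.
# The 5-word object limit follows the text above; the frequency cap and the
# substring-overlap check are assumptions made for this example.
from collections import Counter

def filter_probes(triples: list[tuple[str, str, str]],
                  max_object_words: int = 5,
                  max_per_object: int = 10) -> list[tuple[str, str, str]]:
    object_counts = Counter()
    kept = []
    for s, r, o in triples:
        if len(o.split()) > max_object_words:                 # object too long
            continue
        if s.lower() in o.lower() or o.lower() in s.lower():  # substring overlap
            continue
        if object_counts[o] >= max_per_object:                # frequency cap
            continue
        object_counts[o] += 1
        kept.append((s, r, o))
    return kept
```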

3. Continual Learning Protocols

Continual learning with TWiki-Diffsets is operationalized as follows (protocol from Section 4.1 of Jang et al., 2022):

  • Base Model: GPT-2 Large (774M parameters), continually pretrained on data up to the August 2021 snapshot ("Initial").
  • Full Update: continue pretraining Initial on the entire next snapshot $S_{t+1}$ (one epoch; ~4.6B tokens, ~140K global steps; ~24 h on 8×V100 GPUs).
  • Diff Update: continue pretraining Initial on $\Delta S_{t \rightarrow t+1}$ only (~347M tokens, ~12K steps; ~2.5 h).
  • Optimization: batch size 64, sequence length 512, peak learning rate $1 \times 10^{-4}$ with a one-cycle schedule [Smith, 2018], and cross-entropy loss:

$$L(\theta; D) = -\sum_{w \in D} \log p_{\theta}(w \mid \text{context})$$

Perplexity is computed as

$$\mathrm{PPL}(D) = \exp\big( L(\theta; D) / |D| \big)$$
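A minimal PyTorch-style sketch of a single continual-pretraining update and the corresponding perplexity computation is given below. It assumes a Hugging Face GPT-2 Large checkpoint and data loaders that yield pre-tokenized 512-token batches; the data-loading details are placeholders rather than the original training code.

```python
# Sketch: one continual-pretraining pass over a diffset, plus corpus perplexity.
# `diff_loader` / `eval_loader` are assumed to yield pre-tokenized LongTensor
# batches of shape (64, 512); their construction is omitted here.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2-large").cuda()
optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = OneCycleLR(optimizer, max_lr=1e-4, total_steps=12_000)  # ~12K Diff steps

def diff_update(diff_loader):
    model.train()
    for batch in diff_loader:                             # batch: (64, 512) token ids
        batch = batch.cuda()
        loss = model(input_ids=batch, labels=batch).loss  # cross-entropy LM loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

@torch.no_grad()
def perplexity(eval_loader) -> float:
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for batch in eval_loader:
        batch = batch.cuda()
        loss = model(input_ids=batch, labels=batch).loss  # mean per-token NLL
        total_nll += loss.item() * batch.numel()          # approximate token count
        total_tokens += batch.numel()
    return math.exp(total_nll / total_tokens)             # PPL = exp(L / |D|)
```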

Three continual-learning algorithmic variants are applied to the Diff protocol:

  • RecAdam: Regularization-based update.
  • Mix-review: Rehearsal using August 2021 data (a data-mixing sketch follows this list).
  • Parameter-expansion methods: LoRA and K-Adapter.
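As a simple illustration of the rehearsal idea behind Mix-review, the sketch below interleaves a sample of the original August 2021 corpus with the current diffset; the 10% mixing ratio and uniform sampling are assumptions for the example, not the published hyperparameters.

```python
# Illustrative rehearsal mixing: combine diffset text with a sample of the
# original (August 2021) corpus. The 10% mix ratio is an assumption chosen
# for this example, not the published Mix-review setting.
import random

def mix_review(diff_sentences: list[str],
               old_sentences: list[str],
               mix_ratio: float = 0.1,
               seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    n_old = int(len(diff_sentences) * mix_ratio)
    rehearsal = rng.sample(old_sentences, min(n_old, len(old_sentences)))
    mixed = diff_sentences + rehearsal
    rng.shuffle(mixed)
    return mixed
```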

4. Experimental Outcomes

Intrinsic Perplexity

Proper-noun perplexity on the Diff corpus ($\Delta S$) reveals:

  • Diff protocol achieves ~30% lower perplexity than Full on changed text, indicating enhanced efficiency in acquiring new information.
  • On unchanged text (“Non-Diff”), Diff protocol exhibits rising perplexity (catastrophic forgetting) over time, whereas Full remains stable.
  • Continual-learning methods (especially Mix-review and K-Adapter) effectively mitigate forgetting; Non-Diff perplexity increases are less severe.

Extrinsic Probe Evaluation

Zero-shot perplexity results on TWiki-Probes (Table 3 of Jang et al., 2022) indicate:

| Protocol | Avg. PPL (Unchanged / Changed) | Update Time (h) |
|---|---|---|
| Initial | 375–405 | — |
| Full | 370–413 | ~24 |
| Diff | 346–416 | ~2.5 |
| RecAdam / Mix-review / LoRA | 306–388 | 2–6 |
| K-Adapter | 319–360 | ~2 |

Diff training is particularly strong on Changed probes but performance degrades on Unchanged over time. RecAdam, Mix-review, LoRA, and especially K-Adapter provide improved stability-plasticity trade-off and temporal robustness, as confirmed by modest PPL increases when evaluating on non-aligned months.
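A sketch of the zero-shot probe evaluation is shown below, assuming each probe is serialized as a plain "subject relation object" string before computing its perplexity; the space-joined serialization is an illustrative assumption.

```python
# Zero-shot perplexity of a single probe serialized as "subject relation object".
# `model` is the continually pretrained LM (e.g., loaded as in the earlier sketch);
# the serialization format is an assumption made for illustration.
import math
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")

@torch.no_grad()
def probe_perplexity(model, subject: str, relation: str, obj: str) -> float:
    text = f"{subject} {relation} {obj}"
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(input_ids=ids, labels=ids).loss   # mean per-token NLL
    return math.exp(loss.item())
```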

Computational Analysis

Diff-based continual learning is 10–12× faster than full snapshot updates (2–2.5 h vs. ~24 h per update on the same hardware), consistent with the ~13× reduction in training tokens (~347M vs. ~4.6B); parameter-efficient algorithms (LoRA, K-Adapter) match these speedups.

5. Advantages, Limitations, and Open Directions

Strengths

  • TWiki-Diffsets enable drastic computational savings, with updates running roughly 10–12× faster than retraining on a full snapshot.
  • Efficient plasticity: focuses model learning on genuinely new/updated facts.
  • Supports flexible integration of continual-learning techniques (e.g., rehearsal, parameter expansion) to mitigate catastrophic forgetting.
  • Fully automated and produced monthly, with no manual annotation required.

Limitations and Challenges

  • Deletions of outdated or incorrect facts are not addressed; strategies for negative updates remain underexplored.
  • Not all Wikipedia/Wikidata changes correspond to real-world fact alterations, introducing noise into diffsets.
  • TWiki-Probes, being synthetic S–R–O triples, produce high zero-shot PPL; further natural-language evaluation methods (e.g., QA or targeted light-tuning) are desirable for fine-grained knowledge retention assessment.
  • Adapters (LoRA/K-Adapter) cause parameter growth over time, posing challenges for long-term scalability and optimal update-frequency trade-offs.

A plausible implication is that continual training with minimal diffsets could become a practical paradigm for maintaining temporally aligned, ever-evolving LMs, provided that future work addresses negative updates and improved evaluation protocols (Jang et al., 2022).

6. Research Significance and Future Perspectives

TWiki-Diffsets represent a scalable strategy for perpetual LM adaptation to an evolving knowledge base, paving the way for models resilient to temporal misalignment and catastrophic forgetting. The accompanying benchmarks and corpus extraction pipelines facilitate reproducible, granular evaluation of both stability and plasticity in dynamic knowledge environments.

Their deployment suggests broader applicability of delta-based continual learning beyond Wikipedia, contingent on further research into negative updates, naturalistic probe design, and long-term model compression.

References

Jang, J., et al. (2022). TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models. In Proceedings of EMNLP 2022.
