FreshWiki Dataset Overview

Updated 27 July 2025
  • The FreshWiki Dataset is a collection of Wikipedia articles filtered for recency, quality, and structural coherence.
  • It provides both plain-text dumps and graph-structured data, supporting research in long-form generation, network analysis, and spatio-temporal event tracking.
  • The dataset comes with comprehensive evaluation metrics and benchmarking protocols, facilitating advancements in automated text generation and anomaly detection.

The FreshWiki Dataset refers to several rigorously curated resources for Wikipedia-centric research, each engineered to address data needs in structural analysis, user studies, generative modeling, and quantitative information extraction. The name “FreshWiki” is most prominently associated with recent benchmarks of high-quality, recently created or revised Wikipedia articles, deployed to advance automated long-form text generation and its evaluation. Separately, FreshWiki-style data resources have been structured as property graphs for network science and spatio-temporal analysis. This entry surveys these variants as formalized in contemporary research.

1. Structural Organization and Curation Principles

Several FreshWiki datasets are distinguished by deliberate filtering for recency, article quality, and internal organizational structure (Shao et al., 22 Feb 2024, Liang et al., 8 Oct 2024, Gu et al., 2 Mar 2025):

  • Recency Criterion: Only pages created or substantially revised after the cutoff date of modern LLM pretraining are included, minimizing data leakage and ensuring up-to-date reference.
  • Quality Filtering: Articles are required to meet strict thresholds—commonly, a minimum of B-class according to the ORES (Objective Revision Evaluation Service) assessment, which confines the dataset to approximately the top 3% of Wikipedia articles in terms of human revision quality.
  • Structural Coherence: List-type articles and single-section pages are excluded, resulting in a corpus with multi-level sub-sections and clear headings, robust enough for benchmarking hierarchical content planning.

A representative curation pipeline (FreshWiki-2024) selects, for each month (e.g., February 2022–September 2023), the most-edited Wikipedia articles, filtering for length (typically >1,000 words) and multi-section structure. The extracted corpus is sometimes further pruned (e.g., to 100 samples in STORM or ~98 in recent LLM planning experiments) for rapid evaluation and human assessment (Liang et al., 8 Oct 2024).
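
A minimal sketch of such a filtering step, assuming article metadata has already been extracted into dictionaries; the field names (ores_class, edit_count, etc.), cutoff date, and thresholds are illustrative placeholders rather than the actual FreshWiki pipeline configuration:

```python
from datetime import date

PRETRAIN_CUTOFF = date(2022, 2, 1)   # assumed LLM pretraining cutoff
MIN_WORDS = 1000                     # length threshold mentioned above
ACCEPTED_ORES = {"B", "GA", "FA"}    # B-class or better under ORES

def keep_article(meta: dict) -> bool:
    """Apply the recency, quality, and structure filters described above."""
    recent = meta["last_major_revision"] >= PRETRAIN_CUTOFF
    quality = meta["ores_class"] in ACCEPTED_ORES
    long_enough = meta["word_count"] > MIN_WORDS
    # Exclude list pages and single-section pages; require nested sub-sections.
    structured = (not meta["is_list"]) and len(meta["sections"]) > 1
    return recent and quality and long_enough and structured

def monthly_top_edited(articles: list[dict], k: int = 20) -> list[dict]:
    """Select the k most-edited qualifying articles for one month."""
    kept = [a for a in articles if keep_article(a)]
    return sorted(kept, key=lambda a: a["edit_count"], reverse=True)[:k]
```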

Additionally, graph-structured versions (as in the original FreshWiki toolkit (Aspert et al., 2019)) model Wikipedia at the mesoscopic scale, organizing data with property graphs of article and category nodes linked via links_to and belongs_to relationships. Temporal components ("pagecounts") are attached separately to enable dynamic studies.
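
A minimal sketch of this mesoscopic property-graph organization, using networkx in place of the actual storage backend; the node names, attribute keys, and example pagecount values are placeholders:

```python
import networkx as nx

# Directed multigraph with Article and Category nodes and typed edges.
G = nx.MultiDiGraph()

G.add_node("ArticleA", kind="Article")
G.add_node("ArticleB", kind="Article")
G.add_node("CategoryX", kind="Category")

# links_to: hyperlink between articles (redirects resolved to canonical pages)
G.add_edge("ArticleA", "ArticleB", key="links_to")
# belongs_to: article-to-category (or subcategory-to-category) membership
G.add_edge("ArticleA", "CategoryX", key="belongs_to")

# Temporal pagecounts are attached separately, keyed by (page, timestamp).
pagecounts = {("ArticleA", "2023-09-01T12:00"): 1843}

# Example traversal: all categories ArticleA belongs to.
categories = [v for _, v, k in G.out_edges("ArticleA", keys=True) if k == "belongs_to"]
```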

2. Data Content and Formats

Article-Centric Datasets

The core FreshWiki datasets comprise plain-text dumps of filtered Wikipedia articles, stripped of tables and non-textual markup but retaining the full section structure. For benchmarking LLM planning and generation, each sample includes the following (a schematic record is sketched after the statistics below):

  • Full article text
  • Section/subsection headings
  • (Optionally) a minimal prompt (e.g., “Generate a comprehensive Wikipedia page about [topic]”)
  • Reference/metadata: edit histories, revision IDs

Statistical profiles drawn from typical test splits reveal average section counts (e.g., ~8.4 sections/article), word limits (e.g., capped at 3,000 for evaluation efficiency), and high reference density.
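
Putting these fields together, a single article-centric sample might be represented as in the following sketch; the class and field names are hypothetical, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class FreshWikiSample:
    """Hypothetical layout of one article-centric FreshWiki sample."""
    title: str
    full_text: str                      # plain text, tables and markup stripped
    headings: list[str]                 # section/subsection headings, in order
    prompt: str = ""                    # e.g. "Generate a comprehensive Wikipedia page about [topic]"
    revision_ids: list[int] = field(default_factory=list)  # edit-history metadata
```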

Graph-Structured Data

In the FreshWiki property graph instantiation (Aspert et al., 2019):

  • Nodes: Wikipedia article and category pages (redirects are resolved to canonical pages).
  • Edges: links_to (hyperlink between articles) and belongs_to (article-to-category/subcategory).
  • Temporal Data: Hourly/daily viewership (“pagecounts”) indexed as key-value pairs (page ID, timestamp).
  • Thresholding: Only pages surpassing 100 daily visits are included in the time series, drastically reducing dataset size while preserving signal.

The graph can be deployed via Neo4j for the topology and a NoSQL store for time series data, enabling rapid subgraph extraction and temporal analyses.
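
A minimal sketch of this deployment pattern, assuming a local Neo4j instance holds the topology and a simple key-value store stands in for the NoSQL time-series backend; the node label, connection details, and function names are assumptions:

```python
from neo4j import GraphDatabase

# Placeholder connection details; the real deployment follows the toolkit's
# own setup guidelines.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def linked_articles(title: str) -> list[str]:
    """Return articles directly reachable from `title` via links_to edges."""
    query = (
        "MATCH (a:Article {title: $title})-[:links_to]->(b:Article) "
        "RETURN b.title AS title"
    )
    with driver.session() as session:
        return [record["title"] for record in session.run(query, title=title)]

def pagecounts_for(page_id: str, timestamps: list[str], kv_store: dict) -> list[int]:
    """Fetch hourly view counts from a (page ID, timestamp)-keyed store."""
    return [kv_store.get((page_id, ts), 0) for ts in timestamps]
```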

Quantitative Annotation

The Wiki-Quantities and Wiki-Measurements subsets (Göpfert et al., 18 Mar 2025) annotate quantity spans and their context (entity, property, qualifiers) via automated parsing of {{convert}} templates and alignment with Wikidata facts, supporting numeric information extraction tasks.
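
A rough sketch of the kind of template parsing involved, using a plain regular expression instead of the Lua-based pipeline described in the paper; the extraction heuristics are illustrative only and cover just the simplest {{convert}} form:

```python
import re

CONVERT_RE = re.compile(r"\{\{convert\|([^{}]+)\}\}", re.IGNORECASE)

def extract_quantities(wikitext: str) -> list[tuple[str, str]]:
    """Pull (value, unit) pairs out of simple {{convert|value|unit|...}} templates."""
    quantities = []
    for match in CONVERT_RE.finditer(wikitext):
        parts = match.group(1).split("|")
        if len(parts) >= 2:
            quantities.append((parts[0].strip(), parts[1].strip()))
    return quantities

# extract_quantities("The bridge is {{convert|1991|m}} long.") -> [("1991", "m")]
```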

3. Applications and Benchmarking Use

Long-Form Generation and LLM Evaluation

The FreshWiki corpus is fundamental in benchmarking retrieval-augmented, plan-guided, and multi-agent long-form document generation frameworks (Shao et al., 22 Feb 2024, Liang et al., 8 Oct 2024, Gu et al., 2 Mar 2025). State-of-the-art methods relying on FreshWiki include:

  • STORM (Shao et al., 22 Feb 2024): Evaluates outline generation by comparing predicted structure against FreshWiki references using heading soft recall, entity recall (via Sentence-BERT similarity), and overall ROUGE scores. Expert Wikipedia editors rate outputs on coverage, organization, and verifiability.
  • RAPID (Gu et al., 2 Mar 2025): Benchmarks attribute-constrained retrieval and plan-guided article composition, measuring gains in outline recall, coverage (interest, organization, relevance), and factuality (using FactAlign F1@300).
  • LLM Planning (Auxiliary Task) (Liang et al., 8 Oct 2024): Fine-tunes LLMs to generate intermediate summaries/outlines prior to article output, yielding +2.5% ROUGE-Lsum improvements and strong SxS human evaluation ratios (e.g., 3.6:1 win/loss over baselines).

A direct implication is that the FreshWiki article set, tailored for high structure and recency, exposes both the limitations of LLMs on long-form, knowledge-intensive writing and the effect of architectural innovations in planning or retrieval.

Graph-Based and Temporal Analysis

The property graph FreshWiki resource facilitates network science, temporal event tracking, and anomaly detection. For example, the combination of network (hyperlink/category) topology and hourly viewership data supports:

  • Real-time event monitoring (e.g., airline incidents) via traffic anomaly detection, as sketched after this list
  • Spatio-temporal mining to identify event-related clusters in the hyperlink graph
  • Studies of collective online memory and spreading phenomena across the Wikipedia network (Aspert et al., 2019)
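
A minimal sketch of traffic-based anomaly detection over the hourly pagecounts, using a simple rolling z-score; the window length and threshold are arbitrary illustrative choices, not parameters from the cited work:

```python
import numpy as np

def traffic_anomalies(views: np.ndarray, window: int = 24, z_thresh: float = 4.0) -> list[int]:
    """Flag hours whose view count deviates strongly from the recent baseline."""
    anomalies = []
    for t in range(window, len(views)):
        baseline = views[t - window:t]
        mu, sigma = baseline.mean(), baseline.std()
        if sigma > 0 and (views[t] - mu) / sigma > z_thresh:
            anomalies.append(t)
    return anomalies

# Flagged spikes can then be cross-referenced with the article's hyperlink and
# category neighbourhood to identify event-related clusters.
```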

4. Technical Methodology and Data Access

Curation and Storage Pipelines

  • Article Filtering: Multi-stage selection by edit count, ORES quality, structural markers, and word length; explicit exclusion of low-structure and list articles.
  • Graph Construction: Wikipedia dumps are parsed into property graphs (articles/categories, with redirects resolved), enabling efficient BFS-based subgraph extraction and category traversals.
  • Temporal Data Storage: Viewership is indexed in a NoSQL database under (page ID, timestamp) composite keys, retaining only pages deemed relevant by the view threshold $\sum_t v_i(t) > 100$; a storage sketch follows this list.
  • Quantitative Extraction: For Wiki-Quantities/Measurements, templates such as {{convert}} are parsed with Lua scripts; entity-value-property quadruples are aligned to Wikidata using rule-based and spaCy NLP pipelines, with approximate value matching tolerating a mean absolute percentage error of up to 3%. Validation samples report up to 100% precision for raw span annotation and 84–94% for structured measurement extraction.
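
A minimal sketch of the composite-key storage and view-threshold filtering described above, with an in-memory dictionary standing in for the NoSQL backend:

```python
from collections import defaultdict

def build_pagecount_store(raw_counts: list[tuple[str, str, int]], threshold: int = 100) -> dict:
    """Index hourly views under (page_id, timestamp) keys, keeping only pages
    whose summed views exceed the threshold (sum_t v_i(t) > 100)."""
    totals = defaultdict(int)
    for page_id, _, views in raw_counts:
        totals[page_id] += views

    store = {}
    for page_id, timestamp, views in raw_counts:
        if totals[page_id] > threshold:
            store[(page_id, timestamp)] = views
    return store
```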

Access and Deployment

  • Ad Hoc Querying: Graph-based datasets are accessed via custom Python libraries or Neo4j queries; time series via NoSQL interfaces.
  • Researcher Deployment: The full toolkit and guidelines are available for local/cloud instance setup (https://lts2.epfl.ch/Datasets/Wikipedia/ (Aspert et al., 2019)).
  • Open Source and Reproducibility: Quantitative extraction code and Snakemake workflows are published to promote reproducibility and extension to newer Wikipedia/Wikidata versions (Göpfert et al., 18 Mar 2025).

5. Evaluation Protocols and Metrics

Evaluation of models using FreshWiki focuses on both structure (outline and hierarchy) and content (factual fidelity, organization); a sketch of the heading soft recall computation follows the table:

| Metric | Description | Usage Context |
|---|---|---|
| Heading Soft Recall | Cosine similarity (Sentence-BERT) between predicted and gold headings; union score formula | Outline eval (Shao et al., 22 Feb 2024) |
| ROUGE-1 / ROUGE-L | n-gram/sequence overlap with the ground-truth article | Article eval (Liang et al., 8 Oct 2024) |
| Entity Recall | Coverage of gold named entities in the generated output (FLAIR NER) | Article eval (Gu et al., 2 Mar 2025) |
| Human SxS | Expert Wikipedia editor side-by-side comparison for interest, organization, verifiability | Manual eval (Liang et al., 8 Oct 2024) |
| FactAlign F1@300 | Precision/recall of factual claims per 300 words, computed with an alignment tool | Factuality (Gu et al., 2 Mar 2025) |
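
A minimal sketch of one way to compute heading soft recall with Sentence-BERT embeddings: each gold heading is matched to its most similar predicted heading and the similarities are averaged. This approximates, rather than reproduces, the union score formula of the cited work, and the model name is an assumption:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is an assumption

def heading_soft_recall(gold: list[str], predicted: list[str]) -> float:
    """Average best cosine similarity of each gold heading against the predicted outline."""
    if not gold or not predicted:
        return 0.0
    g = model.encode(gold, normalize_embeddings=True)
    p = model.encode(predicted, normalize_embeddings=True)
    sims = g @ p.T                      # cosine similarities (unit-normalized vectors)
    return float(np.mean(sims.max(axis=1)))

# heading_soft_recall(["History", "Reception"], ["Background", "Critical reception"])
```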

Notable results include a +2.5% increase in ROUGE-Lsum from the auxiliary planning task and a 3.6:1 human-evaluator win/loss ratio for planning-augmented LLMs versus baseline generation (Liang et al., 8 Oct 2024).

6. Limitations and Considerations

Several constraints are documented:

  • Article Set Bias: By focusing on highly revised, high-quality recent articles, FreshWiki benchmarks reflect a small elite fraction of Wikipedia that might not generalize to noisier or low-edit domains.
  • Template and Extraction Limits: Wiki-Quantities primarily captures quantities expressed via {{convert}}; units like percentages and complex expressions are excluded, and template parsing sometimes generates artifacts or omits information (Göpfert et al., 18 Mar 2025).
  • Query Depth/Resource Limits: Very large or deeply nested subgraph queries in the property graph version can be resource intensive (potentially hours and large memory loads) (Aspert et al., 2019).
  • Annotation Noise: Distant supervision in Wiki-Measurements induces label noise, particularly for implicit properties or approximate matching strategies (MAPE up to 3%).
  • Language and Modality Coverage: While the pipeline is language agnostic in methodology, primary application and validation are for English Wikipedia; multimodal (e.g., table, image) content is generally omitted.

A plausible implication is that while FreshWiki offers a high-quality and reproducible benchmark for advanced generative and analytical tasks on Wikipedia, generalizing results to the full, heterogeneous Wikipedia or alternative domains may require further corpus extension or adaptation.

7. Significance and Impact

The emergence of FreshWiki datasets marks a methodological advance in Wikipedia modeling, from graph-theoretic analysis and spatio-temporal studies to precise benchmarking for the next generation of knowledge-grounded LLMs. Their dual focus on real-world structure (via property graphs and view time series) and on recency-constrained, high-quality textual samples underpins critical progress in areas including:

  • Testing factuality and hallucination resistance in long-form LLMs
  • Developing robust, explainable, and plan-aware automatic writing systems
  • Advancing event analytics and collective memory research in online knowledge bases
  • Enabling reproducible, regularly updated Wikipedia data access for open scientific inquiry

As a result, FreshWiki datasets are now positioned as authoritative reference corpora for text generation, knowledge extraction, and network dynamical studies across computational linguistics, AI, and digital humanities.