WikiRAG: Wikipedia-based RAG Pipeline

Updated 5 December 2025
  • The WikiRAG framework disambiguates entities and retrieves structured Wikipedia data using graph, text, and vector indexing for robust multi-entity question answering.
  • It employs schema induction and table-based reasoning with LLMs to extract, normalize, and reason over information, achieving up to 29.6% improved aggregation accuracy.
  • Its modular, declarative design enables rapid experiments by swapping hybrid retrievers and LLM components, ensuring scalable, fact-grounded outputs.

A Wikipedia-based Retrieval-Augmented Generation (WikiRAG) pipeline is a structured information retrieval and reasoning framework that leverages Wikipedia as its principal knowledge source for question answering and document generation. It disambiguates entity references, retrieves relevant content over a Wikipedia graph or corpus, structures information into relational or tabular forms, and applies LLM-based or symbolic reasoning to synthesize coherent, fact-grounded answers or comprehensive article drafts. WikiRAG encompasses both multi-entity QA configurations and large-scale automated article generation, incorporating modular architectures, hybrid retrieval models, and explicit evaluation of factual correctness and faithfulness.

1. Architectural Foundations and System Stages

WikiRAG instantiates a workflow in which user or system queries are parsed, translated into graph-based retrieval specifications, and executed to gather pertinent subsets of Wikipedia knowledge. For multi-entity QA, the pipeline stages include:

  • Query Understanding: Modern LLMs (e.g., GPT-4) classify the question type $t \in T$ (aggregation, superlative, compositional, etc.), identify seed entities $V_1$ and desired properties $P_1$, and output a rough SPARQL or retrieval skeleton (Lin et al., 3 Mar 2025).
  • Wikipedia Graph Retrieval: Wikipedia and Wikidata are modeled as a knowledge graph $G = (V, E, P)$. Explicit subgraph retrieval leverages a triple store (e.g., Blazegraph) in conjunction with a text or vector index (ElasticSearch, FAISS). Subgraph extraction targets nodes $V_{sub}$ within $k$ hops of $V_1$ and their linking relations.
  • Table Generation (SQA): A schema $S = \{col_1: type_1, \ldots, col_m: type_m\}$ is induced by an LLM, and each entity is mapped to a row via property extraction from the retrieved graph/text.
  • Table-based Reasoning: Prompted with $(S, T_\mathrm{sample}, Q)$, an LLM (e.g., GPT-4) generates logical plans such as SQL queries and outputs final answers. SQL execution produces interpretable, exact results.
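
Concretely, these stages compose into a short orchestration loop. The following is a minimal sketch with toy stand-ins for each stage; every function name and returned value here is illustrative, not an API from the cited work:

```python
from dataclasses import dataclass

@dataclass
class QuerySpec:
    q_type: str            # question type t in T, e.g. "aggregation"
    seeds: list[str]       # seed entities V_1
    properties: list[str]  # desired properties P_1

def classify_question(question: str) -> QuerySpec:
    # Stage 1 (query understanding): normally an LLM call.
    return QuerySpec("aggregation", ["Turing Award"], ["nationality"])

def retrieve_subgraph(seeds: list[str], k_hops: int) -> list[dict]:
    # Stage 2 (graph retrieval): normally SPARQL plus text/vector indices.
    return [{"name": "Geoffrey Hinton", "nationality": "Canadian"},
            {"name": "Yann LeCun", "nationality": "French"}]

def fill_table(rows: list[dict], schema: list[str]) -> list[dict]:
    # Stage 3 (table generation): map each entity to a schema-compliant row.
    return [{col: row.get(col, "") for col in schema} for row in rows]

def answer(question: str) -> int:
    spec = classify_question(question)
    rows = retrieve_subgraph(spec.seeds, k_hops=2)
    table = fill_table(rows, ["name"] + spec.properties)
    # Stage 4 (table reasoning): an LLM would emit SQL here; see Section 3.
    return sum(row["nationality"] == "Canadian" for row in table)

print(answer("How many Turing Award winners are Canadian?"))  # -> 1
```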

For Wikipedia text generation (as in WIKIGENBENCH), the pipeline encompasses retrieval of web citations, hierarchical outline planning, and generation phases (Zhang et al., 28 Feb 2024).

2. WikiRAG Knowledge Representation and Retrieval

WikiRAG formalizes Wikipedia as a typed graph:

  • Nodes ($V$): Wikipedia articles or Wikidata entities (e.g., “Turing Award”).
  • Edges ($E$): Typed relations $(v_i, v_j, r)$ (e.g., $(\text{Alan Turing}, \text{United Kingdom}, \text{nationality})$).
  • Properties ($P$): Key-value attributes on nodes or edges (e.g., $P(\text{Alan Turing}).\text{birthDate} = \texttt{"1912-06-23"}$).
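
As a minimal sketch, this typed graph can be held in memory with plain dataclasses; the field names below are assumptions for illustration, not the storage layout of the cited systems (which use a triple store plus text and vector indices, as described next):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    properties: dict[str, str] = field(default_factory=dict)  # P on nodes

@dataclass
class TypedGraph:
    nodes: dict[str, Node] = field(default_factory=dict)             # V
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # E as (v_i, v_j, r)

g = TypedGraph()
g.nodes["Alan Turing"] = Node("Alan Turing", {"birthDate": "1912-06-23"})
g.nodes["United Kingdom"] = Node("United Kingdom")
g.edges.append(("Alan Turing", "United Kingdom", "nationality"))
```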

Retrieval employs a combination of:

  • Text Indexing: ElasticSearch over article text and abstracts.
  • Vector Indexing: Dense retrieval using FAISS (e.g., SBERT embeddings).
  • Graph Indexing: SPARQL endpoints for subgraph and property-level lookup.

Scoring integrates textual relevance with graph proximity:

$$\text{Score}(v) = \alpha \cdot \text{sim}_\text{text}(Q, \text{text}(v)) + \beta \cdot \exp(-\text{dist}_G(v, V_1)), \quad \alpha + \beta = 1$$

Typical settings are $\alpha = 0.7$, $\beta = 0.3$ (Lin et al., 3 Mar 2025).
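
A direct transcription of this scoring rule, assuming the text similarity and hop distance have already been computed by the indices above:

```python
import math

def hybrid_score(text_sim: float, hop_dist: int,
                 alpha: float = 0.7, beta: float = 0.3) -> float:
    """Textual relevance blended with exponentially decayed graph proximity."""
    return alpha * text_sim + beta * math.exp(-hop_dist)

# A node one hop from a seed entity, with cosine similarity 0.8 to the query:
print(round(hybrid_score(0.8, 1), 3))  # -> 0.67
```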

3. Structured Extraction and Table Reasoning

Entity extraction is decoupled from reasoning via a two-stage approach:

  • Schema Induction: The schema $S$ is induced from the question via prompting. For example, for “How many Turing Award winners are Canadian?”: $S = [\text{name: STRING}, \text{nationality: STRING}, \text{award\_year: INT}]$.
  • Table Filling: For each node $v \in V_\mathrm{sub}$, document content is parsed and mapped to a schema-compliant row by an LLM. Normalization procedures canonicalize values, mapping synonyms and multi-word variants to standard forms.
  • Table Reasoning: LLMs are prompted with the schema, partial table samples, the original question, and chain-of-thought cues, producing pseudo-SQL:

```sql
SELECT COUNT(name) FROM T WHERE nationality = 'Canadian';
```

    Execution yields exact answers and ensures reasoning transparency. Chain-of-thought templates prompt the LLM to enumerate logical steps.
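
As a minimal sketch of this execution step, the filled table can be loaded into an in-memory SQLite database and the LLM-emitted SQL run verbatim; the rows below are illustrative. Executing the plan outside the LLM is what makes the aggregation exact and auditable.

```python
import sqlite3

rows = [  # schema-compliant rows produced by the table-filling stage
    ("Geoffrey Hinton", "Canadian", 2018),
    ("Yoshua Bengio", "Canadian", 2018),
    ("Yann LeCun", "French", 2018),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (name TEXT, nationality TEXT, award_year INT)")
con.executemany("INSERT INTO T VALUES (?, ?, ?)", rows)

# Execute the LLM-generated pseudo-SQL exactly as emitted.
sql = "SELECT COUNT(name) FROM T WHERE nationality = 'Canadian'"
print(con.execute(sql).fetchone()[0])  # -> 2
```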

This structured RAG, as in SRAG, enables a clean retrieval-reasoning decoupling, lowering token usage and improving aggregation accuracy by 29.6% over text-only RAG in MEQA scenarios (Lin et al., 3 Mar 2025).

4. Declarative Pipeline Implementation and Modularity

WikiRAG architectures support pipeline decomposability and interchangeability of components:

  • Declarative Construction: PyTerrier-based pipelines define retrievers (BM25, SPLADE, DPR), hybridize scores, and chain rerankers (MonoT5, DuoT5), culminating in LLM-based generation via reader modules (Macdonald et al., 12 Jun 2025).
  • Operator Notation: Pipelines such as

```python
hybrid = bm25 + splade + dpr
wikirq = (hybrid >> rerank >> Concatenator() >> gpt_reader)
```

    allow fast swapping of retrievers and rerankers for experiments or production.
  • End-to-End Evaluation: Integrated metrics (EM, F1, BLEU-4) measure answer correctness and lexical overlap; simplified versions of EM and F1 are sketched after this list. End-to-end scripts orchestrate streaming Wikipedia indexing, retrieval, reranker fusion, answer generation, and dataset-based evaluation (e.g., Natural Questions).
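
The answer-level metrics are easy to state precisely. The sketch below implements EM and token-level F1 with a simplified normalization (standard SQuAD-style normalization additionally strips articles and handles punctuation differently):

```python
import re
from collections import Counter

def _tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())  # simplified normalization

def exact_match(pred: str, gold: str) -> float:
    return float(_tokens(pred) == _tokens(gold))

def token_f1(pred: str, gold: str) -> float:
    p, g = Counter(_tokens(pred)), Counter(_tokens(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Beatles", "the beatles!"))    # -> 1.0
print(round(token_f1("John Lennon", "Lennon"), 2))   # -> 0.67
```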

This modularity allows researchers to test alternative retrieval or reasoning components without full system retraining.

5. Specialized Graph and Domain Extensions

Extending RAG with domain-specific graph structures enhances retrieval precision:

  • Material Science (G-RAG): The pipeline parses PDFs (text, tables, figures), extracts entities (MatIDs), grounds them to Wikipedia, stores them as nodes in a Neo4j-like graph, and retrieves Wikipedia passages based on MatID queries (Mostafa et al., 21 Nov 2024).
  • Graph Database Fusion: Relations are typed edges, e.g. (“Iron”, “hasComposition”, “Fe”) with embedding support for semantic matching. Subgraphs are selected per query, and context is constructed by concatenating node and edge fields.
  • Faithfulness Enforcement: Only facts present in the retrieval context $C$ are eligible for generation, with subsequent faithfulness verification. G-RAG outperforms naive and graph RAG baselines on correctness (3.90 vs. 2.43 and 3.30) and on faithfulness (0.90 vs. 0.70 for naive RAG), while remaining comparable on relevancy (see the table below; a naive faithfulness check is sketched after it).
| System | Correctness | Faithfulness | Relevancy |
|---|---|---|---|
| Naive RAG | 2.43 ± 1.51 | 0.70 ± 0.48 | 0.39 ± 0.28 |
| Graph RAG | 3.30 ± 2.00 | 0.90 ± 0.32 | 0.18 ± 0.26 |
| G-RAG | 3.90 ± 1.10 | 0.90 ± 0.32 | 0.34 ± 0.32 |

(Scores from (Mostafa et al., 21 Nov 2024), Table 4, 10 queries, thresholded)
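
The faithfulness-enforcement step above can be approximated with a naive lexical filter, shown purely as an illustration; the word-length cutoff and overlap threshold below are assumptions, not parameters from the paper:

```python
def supported(sentence: str, context: str, threshold: float = 0.8) -> bool:
    """Keep a sentence only if most of its content words appear in context C."""
    words = {w for w in sentence.lower().split() if len(w) > 3}
    if not words:
        return True
    return sum(w in context.lower() for w in words) / len(words) >= threshold

context = "Iron has the composition Fe and a melting point of 1538 degrees Celsius."
draft = [
    "Iron has the composition Fe.",
    "Iron was first refined on Mars in 1891.",  # unsupported by the context
]
print([s for s in draft if supported(s, context)])  # keeps only the first
```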

A plausible implication is that extending WikiRAG with explicit graph schemas and domain adaptation improves precision and interpretability in scientific use cases.

6. Automated Wikipedia Article Generation and Hierarchical RAG

WikiRAG enables automatic generation of Wikipedia-style articles for new events:

  • Problem Definition (Zhang et al., 28 Feb 2024): Input comprises a set of retrieved web documents $C_E = \{C_1, \ldots, C_L\}$, and output is a structured article $W = \{(T_n, S_{n,m}, R_{n,m})\}$ with sections, sentences, and supporting citations.
  • Stage-wise Generation (a skeleton of this loop is sketched after the results table below):
    • Retrieval: Retrieve $C_E$ via BM25, TF–IDF, or dense models.
    • Outline Planning: (Hierarchical RAG) Generate section titles.
    • Section-wise Generation: For each $T_n$, re-retrieve and generate $S_{n,*}$ with citations.
  • Evaluation: Faithfulness (citation recall, precision), ROUGE-L, Infobox QA Score (IB Score), and GPT-4–based structural metrics.
  • Empirical Results:
    • Hierarchical RAG (RPRR) improves ROUGE-L from 17.81 to 22.26 and IB Score from 10.73 to 22.29 compared to direct RAG (RR), and boosts citation metrics by roughly 10 points, though about 50% of generated content remains unsupported by citations (an effective hallucination rate near 50%).
    • Sparse retrievers outperform dense models on rare entities.
| Model | ROUGE-L | IB Score | Citation Rate | Length (words) |
|---|---|---|---|---|
| RR | 17.81 | 10.73 | 42.09 | 579 |
| RPRR | 22.26 | 22.29 | 50.96 | 1,991 |

(Key results from (Zhang et al., 28 Feb 2024), Table A)
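
The stage-wise pipeline referenced above reduces to a short loop. This is a skeletal sketch with toy stand-ins for the retriever and generator; all helpers are hypothetical placeholders, not APIs from the cited work:

```python
def retrieve(query: str, k: int = 3) -> list[str]:
    # Stand-in for BM25 / TF-IDF / dense retrieval over web documents.
    return [f"document about {query!r} #{i}" for i in range(k)]

def plan_outline(event: str, docs: list[str]) -> list[str]:
    # Stand-in for LLM outline planning from the initial retrieval.
    return ["Background", "Timeline", "Reactions"]

def write_section(title: str, docs: list[str]) -> str:
    # Stand-in for LLM section generation with inline citations.
    return f"== {title} ==\n(text grounded in {len(docs)} retrieved citations)"

def generate_article(event: str) -> str:
    docs = retrieve(event)                                   # initial retrieval
    outline = plan_outline(event, docs)                      # hierarchical planning
    sections = [write_section(t, retrieve(f"{event} {t}"))   # per-section
                for t in outline]                            # re-retrieval + generation
    return "\n\n".join(sections)

print(generate_article("2024 solar eclipse"))
```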

This suggests that hierarchical decomposition is essential for long-form Wikipedia generation, especially under context window constraints.

7. Practical Implementation, Limitations, and Extensions

Implementation guidelines emphasize:

  • Indexing: Wikipedia should be pre-parsed for both text and structured triple storage, with adjunct indices for k-hop traversal (Lin et al., 3 Mar 2025, Macdonald et al., 12 Jun 2025).
  • Scalability: Subgraphs are pruned by score or cardinality (a minimal pruning sketch follows this list), and LLM extraction is batched with output control to constrain cost.
  • Modularity: WikiRAG architecture is extensible to other corpora (e.g., PubMed KB), other graph/query backends (GraphQL, SQL), and alternative LLM backends (Lin et al., 3 Mar 2025).
  • Limitations: Persistent errors include semantic drift in property identification, occasional row omissions (~0.1%), and challenges in normalizing string values and multi-word synonyms (Lin et al., 3 Mar 2025).
  • Extensibility: Declarative pipelines and clear type boundaries facilitate experimentation, hybrid retrieval, and dynamic adjustments to changing knowledge graphs (Macdonald et al., 12 Jun 2025).
  • Future Enhancements: Further cited improvements include integrating NLI-based verifiers, expanding context length, and joint training of retriever and generator modules (Zhang et al., 28 Feb 2024).
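
A minimal rendering of the score/cardinality pruning mentioned under Scalability; the threshold and cap values are assumptions for illustration, and the scores would come from the hybrid scoring rule of Section 2:

```python
def prune_subgraph(scored_nodes: dict[str, float],
                   min_score: float = 0.3, max_nodes: int = 200) -> list[str]:
    """Drop low-scoring nodes, then cap the subgraph's cardinality."""
    kept = sorted((v for v in scored_nodes if scored_nodes[v] >= min_score),
                  key=scored_nodes.get, reverse=True)
    return kept[:max_nodes]

scores = {"Alan Turing": 0.67, "Bletchley Park": 0.41, "Pilot ACE": 0.12}
print(prune_subgraph(scores))  # -> ['Alan Turing', 'Bletchley Park']
```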

In summary, WikiRAG generalizes RAG principles by emphasizing strong structure in entity, relation, and context representation, leveraging the Wikipedia graph, and supporting modular, extensible architectures for precise, scalable, and interpretable question answering and content generation (Lin et al., 3 Mar 2025, Macdonald et al., 12 Jun 2025, Mostafa et al., 21 Nov 2024, Zhang et al., 28 Feb 2024).
