LiveSearchBench: Temporal QA Benchmark
- LiveSearchBench is a dynamic benchmark that evaluates large language models on question answering tasks requiring fresh, retrieval-based knowledge.
- The benchmark employs a systematic four-stage pipeline, including differential extraction, high-quality candidate filtering, hierarchical question synthesis across L1 to L3, and SPARQL validation to ensure answer uniqueness.
- Empirical results demonstrate that retrieval-augmented methods significantly outperform direct prompting, highlighting the importance of evidence acquisition for temporal reasoning.
LiveSearchBench is a fully automated, dynamically regenerable benchmark designed to evaluate LLMs on question answering (QA) tasks that strictly require retrieval and reasoning over newly introduced knowledge, as opposed to static, memorization-centric evaluation. By systematically constructing queries from temporal deltas between Wikidata snapshots and enforcing unique, verifiable answers via SPARQL validation, LiveSearchBench provides a principled methodology for assessing models on facts that unequivocally post-date any model’s pretraining corpus, ensuring temporally grounded and retrieval-dependent QA evaluation.
1. Formalization of Temporal Deltas and Data Filtering
The core of LiveSearchBench is the precise definition and utilization of the "delta" ($\Delta$) between two temporal Wikidata knowledge-graph snapshots. Let $G_t$ denote the set of subject–predicate–object triples extracted from a Wikidata snapshot at time $t$. The delta capturing new and updated knowledge between $t_0$ and $t_1$ is defined as:

$$\Delta_{t_0 \to t_1} \;=\; \underbrace{\bigl(G_{t_1} \setminus G_{t_0}\bigr)}_{\text{insertions}} \;\cup\; \underbrace{\bigl\{(s, p, o) \in G_{t_1} \,:\, \exists\, o' \neq o,\ (s, p, o') \in G_{t_0}\bigr\}}_{\text{updates}}$$

Here, $G_{t_1} \setminus G_{t_0}$ corresponds to insertions (entirely new triples), while the second set captures updated objects for pre-existing $(s, p)$ pairs. Every triple in $\Delta_{t_0 \to t_1}$ introduces knowledge absent at $t_0$.
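As a concrete illustration, the following minimal sketch (not the benchmark's code; identifiers and the prior object value are invented for the example) computes the delta between two toy snapshots represented as Python sets of triples:

```python
# Minimal sketch: triples are (subject, predicate, object) tuples,
# and each snapshot graph is a plain Python set of such tuples.

def compute_delta(G0: set, G1: set) -> set:
    """Return Δ: triples new at t1 plus triples whose object changed for an existing (s, p) pair."""
    sp_in_G0 = {(s, p) for (s, p, _) in G0}
    changed = G1 - G0                                   # anything not present verbatim at t0
    insertions = {t for t in changed if (t[0], t[1]) not in sp_in_G0}
    updates = {t for t in changed if (t[0], t[1]) in sp_in_G0}
    return insertions | updates                         # Δ = insertions ∪ updates

# Toy snapshots with hypothetical identifiers; the prior country value is invented for illustration.
G0 = {("ICLR2026", "country", "SomePriorValue")}
G1 = {("ICLR2026", "country", "Brazil"),                # updated object for an existing (s, p)
      ("NewEntity", "director", "SomeDirector")}        # entirely new triple
print(sorted(compute_delta(G0, G1)))
```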
To ensure data suitability for question synthesis, LiveSearchBench applies the following filters:
- Relation Filtering: Trivial or formatting predicates (e.g., P31 and other metadata properties) are excluded via a comprehensive block-list.
- Entity Quality: Entities must have multilingual labels or aliases and robust descriptive metadata; deprecation, disambiguation, or severe ambiguity serve as exclusion criteria.
- Statement Validity & De-duplication: Deprecated or rank-0 statements are removed, and near-duplicate assertions are merged using normalized keys or statement IDs.
This pipeline produces a high-quality, noise-reduced pool of temporally novel triples.
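A minimal sketch of this filter stage appears below; the block-list contents, metadata fields, and helper names are illustrative assumptions rather than the benchmark's actual implementation.

```python
# Hypothetical candidate filter; the predicate block-list and metadata fields are
# placeholders standing in for the benchmark's actual rules.

BLOCKED_PREDICATES = {"P31"}  # instance-of and similar formatting/metadata properties

def keep_triple(triple, entity_meta: dict, statement: dict) -> bool:
    s, p, o = triple
    if p in BLOCKED_PREDICATES:                              # relation block-list
        return False
    meta = entity_meta.get(s, {})
    if not meta.get("labels") or not meta.get("description"):
        return False                                         # entity quality: labels + description
    if meta.get("deprecated") or meta.get("disambiguation"):
        return False
    if statement.get("rank") in ("deprecated", 0):           # statement validity (deprecated / rank-0)
        return False
    return True

def filter_triples(delta, entity_meta: dict, statements: dict) -> list:
    seen, kept = set(), []
    for t in delta:
        key = (t[0], t[1], str(t[2]).strip().lower())        # normalized key for de-duplication
        if key in seen or not keep_triple(t, entity_meta, statements.get(t, {})):
            continue
        seen.add(key)
        kept.append(t)
    return kept
```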
2. Automated Generation Pipeline and Reasoning Complexity
LiveSearchBench employs a four-stage pipeline to create QA instances:
- Differential Extraction: Extracts $G_{t_0}$ and $G_{t_1}$, computes $\Delta_{t_0 \to t_1}$.
- High-quality Candidate Filtering: Filters $\Delta_{t_0 \to t_1}$ as described above.
- Hierarchical Question Synthesis: Constructs natural-language questions at three predefined reasoning levels:
- L1 (Single-hop): Direct queries over a triple $(a, r, b) \in \Delta$; accepted only when $b$ is the unique object for the pair $(a, r)$ in $G_{t_1}$. E.g., “In which country will the ICLR 2026 conference be held?” for the triple (ICLR2026, country, Brazil).
- L2 (Multi-constraint/Compositional): Intersects anchor triples to ensure exactly one unique answer. Answer sets $A_i = \{x : (a_i, r_i, x) \in G_{t_1}\}$ are formed for the anchor pairs $(a_i, r_i)$, and the question is accepted only if $\left|\bigcap_i A_i\right| = 1$ (a concrete sketch follows the pipeline pseudocode below). Example: “Which football player has played for Real Madrid, Juventus, and Al Nassr?”
- L3 (Fuzzing + Extra Hop): Broadens constraints (“fuzzing”) and adds another relational path; e.g., replacing an entity with a category ("Saudi Arabian club") and introducing an additional hop, enforcing uniqueness across all constraints.
- Final Rendering and SPARQL Validation: Transforms questions using latest labels and validates answer uniqueness via SPARQL queries. Snapshots and SPARQL results are logged for exact traceability.
The pipeline’s pseudocode is as follows:
```python
def generate_benchmark(dump_T0, dump_T1):
    G0, G1 = extract_triples(dump_T0), extract_triples(dump_T1)
    delta = (G1 - G0) | updates(G0, G1)          # Δ = insertions ∪ updates
    candidates = filter_triples(delta)
    instances = []
    for t in candidates:
        for level in ["L1", "L2", "L3"]:
            q = synthesize_question(t, G1, level)
            if q and validate_sparql(q.sparql, G1):
                instances.append(q)
                break
    return instances
```
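To make the L2 uniqueness constraint concrete, the sketch below intersects the answer sets of several anchor (entity, relation) pairs and accepts the question only when exactly one candidate satisfies all constraints; the identifiers and relation name are hypothetical.

```python
# Hypothetical L2 check: form A_i = {x : (a_i, r_i, x) in G1} for each anchor
# and require the intersection to contain exactly one entity.

def l2_unique_answer(G1: set, anchors: list):
    """anchors = [(a1, r1), (a2, r2), ...]; return the unique shared object, else None."""
    answer_sets = [{o for (s, p, o) in G1 if s == a and p == r} for a, r in anchors]
    common = set.intersection(*answer_sets) if answer_sets else set()
    return next(iter(common)) if len(common) == 1 else None

# Toy graph with made-up identifiers mirroring the football example above.
G1 = {("RealMadrid", "hasPlayer", "PlayerX"),
      ("Juventus",   "hasPlayer", "PlayerX"),
      ("Juventus",   "hasPlayer", "PlayerY"),
      ("AlNassr",    "hasPlayer", "PlayerX")}
anchors = [("RealMadrid", "hasPlayer"), ("Juventus", "hasPlayer"), ("AlNassr", "hasPlayer")]
print(l2_unique_answer(G1, anchors))  # "PlayerX": the only entity satisfying every constraint
```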
3. SPARQL-based Validation and Answer Uniqueness
Every generated question is automatically associated with a SPARQL query that ensures a unique, verifiable answer in $G_{t_1}$:
- L1:
```sparql
SELECT ?b
WHERE { wd:Q_a wdt:P_r ?b }
LIMIT 2
```
- L2:
```sparql
SELECT ?x
WHERE { { wd:Q_a1 wdt:P_r1 ?x } UNION { wd:Q_a2 wdt:P_r2 ?x } }
GROUP BY ?x
HAVING (COUNT(?x) = 2)
```
- L3:
```sparql
SELECT ?x
WHERE {
  { wd:Q_a1 wdt:P_r1 ?x FILTER(fuzz1(?x)) }
  UNION
  { wd:Q_a2 wdt:P_r2 ?x FILTER(fuzz2(?x)) }
  UNION
  { ?x wdt:P_r3 wd:Q_c }
}
GROUP BY ?x
HAVING (COUNT(?x) = 3)
```
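As a rough illustration of the validation step, the sketch below runs a query against the public Wikidata SPARQL endpoint (https://query.wikidata.org/sparql) and accepts it only if exactly one binding comes back. It is an assumed implementation using `requests` and the standard JSON results format, not the benchmark's released validator, and the demo query is a generic Wikidata example unrelated to any delta.

```python
import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def has_unique_answer(sparql_query: str) -> bool:
    """Return True iff the query yields exactly one result row."""
    resp = requests.get(
        WIKIDATA_ENDPOINT,
        params={"query": sparql_query, "format": "json"},
        headers={"User-Agent": "livesearchbench-validation-sketch/0.1 (example)"},
        timeout=30,
    )
    resp.raise_for_status()
    return len(resp.json()["results"]["bindings"]) == 1

# Generic demo in the L1 template's shape: Q42 (Douglas Adams) has a single
# P27 (country of citizenship) value, so LIMIT 2 returns exactly one row.
demo_query = "SELECT ?b WHERE { wd:Q42 wdt:P27 ?b } LIMIT 2"
print(has_unique_answer(demo_query))  # True
```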
4. Experimental Design and Evaluation Metrics
LiveSearchBench assessments use distinct temporal splits, e.g.,
| Dataset | Snapshot Range | L1 | L2 | L3 |
|---|---|---|---|---|
| LiveSearchBench-2021 | Sep 2021 → Dec 2021 | 150 | 100 | 50 |
| LiveSearchBench-2025 | May 2025 → Aug 2025 | 150 | 100 | 50 |
Model types evaluated:
- Vanilla Prompting: Direct Answer and Chain-of-Thought methods on instruction-tuned Llama 3.2-3B, Qwen 2.5-3B, 7B, and 14B.
- Retrieval-augmented: RAG, Search-o1, Search-R1 (RL-trained base & instruction-tuned), SSRL (self-search RL).
Metrics:
- Exact Match (EM): Fraction of questions for which the normalized predicted answer exactly matches the gold answer.
- Pass@k: For parametric inference, the rate at which the correct answer appears among the top-k sampled outputs.
- Retrieval Recall@k: Fraction of questions where the gold answer is among the top-k retrieved documents.
- Recency Gap: The drop in EM between the 2021 and 2025 splits for the same method, i.e., $\mathrm{EM}_{2021} - \mathrm{EM}_{2025}$.
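The sketch below shows how these metrics could be computed over raw model outputs; the normalization routine and data layout are assumptions for illustration, not the benchmark's released scoring code.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace (a common EM convention)."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def pass_at_k(samples: list, gold: str, k: int) -> bool:
    """True if any of the top-k sampled outputs matches the gold answer."""
    return any(exact_match(s, gold) for s in samples[:k])

def recall_at_k(retrieved_docs: list, gold: str, k: int) -> bool:
    """True if the gold answer string appears in any of the top-k retrieved documents."""
    return any(normalize(gold) in normalize(doc) for doc in retrieved_docs[:k])

def recency_gap(em_2021: float, em_2025: float) -> float:
    """EM drop between the older and newer split for the same method."""
    return em_2021 - em_2025

# Toy illustration: strict EM rejects an answer padded with extra words.
preds = [("Brazil", "Brazil"), ("It will be held in Brazil", "Brazil")]
print(sum(exact_match(p, g) for p, g in preds) / len(preds))  # 0.5
```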
5. Empirical Findings and Analytical Insights
Key results from systematic evaluation include:
- Recency Gap: All methods show a significant decrease in EM from the 2021 to the 2025 split, and the drop is most pronounced at the multi-hop levels (L2 and L3).
- Retrieval-augmented Superiority: Methods such as RAG, Search-o1, Search-R1, and SSRL yield an absolute +15–20 point gain in EM on the 2025 batch relative to prompting, corresponding to 200–300% relative gains. On 2021, the gain is only +5–10 points. This underscores retrieval's indispensability for facts outside the pretraining window.
- Instruction Tuning and Model Scale: Instruction-tuned Search-R1 outperforms base Search-R1 by 5–8 points on 2025 data. Scaling model size (3B → 7B → 14B) consistently increases EM (e.g., Qwen 2.5-14B RAG: 52.8% on 2021, 26.6% on 2025), but does not eliminate the recency gap—top 14B instruction-tuned Search-R1 achieves 28.4% EM on 2025, nearly 25 points below 2021.
- Single- vs. Multi-hop Reasoning: Accuracy consistently declines from L1 to L3 in both splits (e.g., Qwen 14B-RAG, 2025: L1 = 34.7%, L2 = 27.0%, L3 = 18.0%). This reflects increased failure rates in multi-hop settings, suggesting heightened vulnerability to stale or misleading evidence in compositional reasoning scenarios.
6. Conceptual Significance and Future Directions
LiveSearchBench reorients QA evaluation from static, memorization-focused paradigms to dynamic, retrieval-centered tasks. By enforcing verified, temporally grounded answers and leveraging continual delta-based regeneration, it provides the following conceptual advances:
- Ambiguity Elimination: Every question is anchored in a SPARQL proof that exactly one answer exists, eliminating answer ambiguity.
- Temporal Grounding: Queries are intrinsically tied to the knowledge state at $t_1$, ensuring validity with respect to information recency.
- Evidence Acquisition Requirement: The benchmark structure enforces genuine evidence retrieval, not just parametric recall.
This suggests that LiveSearchBench provides a systematic, low-touch platform for longitudinal evaluation of LLMs on temporally dynamic knowledge, crucial for research in retrieval-augmented, RL-trained, and lifelong learning models.
A plausible implication is the permanent shift toward benchmarks that can regenerate and maintain relevance as world knowledge and LLM architectures evolve. As a consequence, LiveSearchBench is positioned to become a foundational asset for research into dynamic knowledge integration, time-sensitive reasoning, and real-time QA over evolving information landscapes.