
LiveSearchBench: Temporal QA Benchmark

Updated 9 November 2025
  • LiveSearchBench is a dynamic benchmark that evaluates large language models on question answering tasks requiring fresh, retrieval-based knowledge.
  • The benchmark employs a systematic four-stage pipeline, including differential extraction, high-quality candidate filtering, hierarchical question synthesis across L1 to L3, and SPARQL validation to ensure answer uniqueness.
  • Empirical results demonstrate that retrieval-augmented methods significantly outperform direct prompting, highlighting the importance of evidence acquisition for temporal reasoning.

LiveSearchBench is a fully automated, dynamically regenerable benchmark designed to evaluate LLMs on question answering (QA) tasks that strictly require retrieval and reasoning over newly introduced knowledge, as opposed to static, memorization-centric evaluation. By systematically constructing queries from temporal deltas between Wikidata snapshots and enforcing unique, verifiable answers via SPARQL validation, LiveSearchBench provides a principled methodology for assessing models on facts that unequivocally post-date any model’s pretraining corpus, ensuring temporally grounded and retrieval-dependent QA evaluation.

1. Formalization of Temporal Deltas and Data Filtering

The core of LiveSearchBench is the precise definition and utilization of the "delta" ($\Delta$) between two temporal Wikidata knowledge graph snapshots. Let $G_T$ denote the set of subject–predicate–object triples extracted from a Wikidata snapshot at time $T$. The delta capturing new and updated knowledge between $T_0 < T_1$ is defined as:

$$\Delta^+ = G_{T_1} \setminus G_{T_0}$$

$$\Delta^\circ = \{ (s, r, o_2) \in G_{T_1} : (s, r, o_1) \in G_{T_0},\ o_1 \neq o_2 \}$$

$$\Delta = \Delta^+ \cup \Delta^\circ$$

Here, $\Delta^+$ corresponds to insertions (entirely new triples), while $\Delta^\circ$ captures updated objects for pre-existing $(s, r)$ pairs. Every triple in $\Delta$ introduces knowledge absent at $T_0$.
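To make the delta construction concrete, the following minimal Python sketch (not the authors' code) computes $\Delta^+$, $\Delta^\circ$, and their union, assuming each snapshot is an in-memory set of (subject, relation, object) tuples:

def compute_delta(G0, G1):
    """Return the delta between snapshots G0 (time T0) and G1 (time T1).

    Both snapshots are assumed to be sets of (subject, relation, object) tuples.
    """
    delta_plus = G1 - G0  # insertions: triples absent at T0
    sr_at_T0 = {(s, r) for (s, r, _) in G0}
    # updated objects: the (s, r) pair existed at T0 but now maps to a different object
    delta_circ = {(s, r, o) for (s, r, o) in G1
                  if (s, r) in sr_at_T0 and (s, r, o) not in G0}
    return delta_plus | delta_circ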

To ensure data suitability for question synthesis, LiveSearchBench applies the following filters:

  • Relation Filtering: Trivial and formatting predicates (e.g., P31 and other metadata properties) are excluded via a comprehensive block-list.
  • Entity Quality: Entities must have multilingual labels or aliases and robust descriptive metadata; deprecation, disambiguation, or severe ambiguity serve as exclusion criteria.
  • Statement Validity & De-duplication: Deprecated or rank-0 statements are removed, and near-duplicate assertions are merged using normalized keys or statement IDs.

This pipeline produces a high-quality, noise-reduced pool of temporally novel triples.
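A simplified illustration of this filtering step is sketched below; the blocked property IDs and the shape of the entity metadata are hypothetical placeholders rather than the benchmark's actual lists:

# Hypothetical block-list; the benchmark uses a comprehensive curated list.
BLOCKED_PREDICATES = {"P31", "P373", "P18"}

def filter_triples(delta, entity_meta):
    """Keep only triples over substantive relations and well-described entities."""
    kept = set()
    for (s, r, o) in delta:
        if r in BLOCKED_PREDICATES:
            continue  # relation filtering: drop trivial/formatting predicates
        meta = entity_meta.get(s, {})
        if not meta.get("labels") or meta.get("deprecated") or meta.get("disambiguation"):
            continue  # entity quality: require labels, exclude deprecated/ambiguous items
        kept.add((s, r, o))
    return kept  # statement-level de-duplication is omitted for brevity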

2. Automated Generation Pipeline and Reasoning Complexity

LiveSearchBench employs a four-stage pipeline to create QA instances:

  1. Differential Extraction: Extracts GT0G_{T_0} and GT1G_{T_1}, computes Δ\Delta.
  2. High-quality Candidate Filtering: Filters Δ\Delta as described above.
  3. Hierarchical Question Synthesis: Constructs natural-language questions at three predefined reasoning levels:
    • L1 (Single-hop): Direct queries over a triple $(a, r, ?)$; accepted only when $|\{ x : (a, r, x) \in G_{T_1} \}| = 1$. E.g., “In which country will the ICLR 2026 conference be held?” for the triple (ICLR2026, country, Brazil).
    • L2 (Multi-constraint/Compositional): Intersects anchor triples so that exactly one answer satisfies all constraints. Candidate sets $S_1, S_2, \dots$ are formed from entities $x$ satisfying each anchor constraint, and an instance is accepted only if $|S_1 \cap S_2| = 1$. Example: “Which football player has played for Real Madrid, Juventus, and Al Nassr?”
    • L3 (Fuzzing + Extra Hop): Broadens constraints (“fuzzing”) and adds another relational path; e.g., replacing an entity with a category ("Saudi Arabian club") and introducing an additional hop, enforcing uniqueness across all constraints.
  4. Final Rendering and SPARQL Validation: Transforms questions using latest labels and validates answer uniqueness via SPARQL queries. Snapshots and SPARQL results are logged for exact traceability.

The pipeline’s pseudocode is as follows:

def generate_benchmark(dump_T0, dump_T1):
    # Extract (subject, relation, object) triple sets from the two Wikidata snapshots.
    G0, G1 = extract_triples(dump_T0), extract_triples(dump_T1)
    # Delta = insertions (G1 \ G0) united with updated objects for existing (s, r) pairs.
    delta = (G1 - G0) | updates(G0, G1)
    # Drop trivial predicates, low-quality entities, and deprecated/duplicate statements.
    candidates = filter_triples(delta)
    instances = []
    for t in candidates:
        for level in ["L1", "L2", "L3"]:
            q = synthesize_question(t, G1, level)
            # Keep the first level whose question passes SPARQL uniqueness validation.
            if q and validate_sparql(q.sparql, G1):
                instances.append(q)
                break
    return instances
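As one concrete instance of synthesize_question, the L2 level can be sketched as follows. This is a simplified illustration, assuming anchors are (entity, relation) pairs drawn from the delta, candidate sets are intersected in memory, and render is a hypothetical templating function:

def synthesize_l2(anchor1, anchor2, G1, render):
    """Build an L2 question from two anchor (entity, relation) constraints."""
    a1, r1 = anchor1
    a2, r2 = anchor2
    S1 = {o for (s, r, o) in G1 if s == a1 and r == r1}
    S2 = {o for (s, r, o) in G1 if s == a2 and r == r2}
    answer = S1 & S2
    if len(answer) != 1:  # uniqueness requirement |S1 ∩ S2| = 1
        return None
    return render(a1, r1, a2, r2, answer=next(iter(answer)))

The UNION / GROUP BY / HAVING pattern in the L2 validation query of Section 3 performs the same intersection directly against the knowledge graph, with uniqueness then checked on the number of returned rows.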

3. SPARQL-based Validation and Answer Uniqueness

Every generated question is automatically associated with a SPARQL query that ensures a unique, verifiable answer in $G_{T_1}$:

  • L1:

SELECT ?b
WHERE { wd:Q_a wdt:P_r ?b }
LIMIT 2
Accepted only if exactly one row is returned.

  • L2:

SELECT ?x
WHERE { { wd:Q_a1 wdt:P_r1 ?x } UNION { wd:Q_a2 wdt:P_r2 ?x } }
GROUP BY ?x
HAVING (COUNT(?x)=2)

  • L3:

SELECT ?x
WHERE {
  { wd:Q_a1 wdt:P_r1 ?x FILTER(fuzz1(?x)) }
  UNION
  { wd:Q_a2 wdt:P_r2 ?x FILTER(fuzz2(?x)) }
  UNION
  { ?x wdt:P_r3 wd:Q_c }
}
GROUP BY ?x
HAVING (COUNT(?x)=3)
This methodology enforces that every question is anchored in the most current state of Wikidata and is unambiguous with respect to the answer set cardinality.
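A minimal validation sketch is shown below. It assumes a SPARQL endpoint serving the frozen $G_{T_1}$ snapshot; the public Wikidata endpoint URL appears here only as an illustration, since the benchmark validates against its own pinned snapshots:

import requests

# Illustrative endpoint; in the benchmark, queries run against the pinned T1 snapshot.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def has_unique_answer(query: str) -> bool:
    """Accept a generated question only if its validation query returns exactly one row."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "livesearchbench-validation-sketch/0.1"},
        timeout=60,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return len(rows) == 1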

4. Experimental Design and Evaluation Metrics

LiveSearchBench assessments use distinct temporal splits, e.g.,

| Dataset | Snapshot Range | L1 | L2 | L3 |
| --- | --- | --- | --- | --- |
| LiveSearchBench-2021 | Sep 2021 → Dec 2021 | 150 | 100 | 50 |
| LiveSearchBench-2025 | May 2025 → Aug 2025 | 150 | 100 | 50 |

Model types evaluated:

  • Vanilla Prompting: Direct Answer and Chain-of-Thought methods on instruction-tuned Llama 3.2-3B, Qwen 2.5-3B, 7B, and 14B.
  • Retrieval-augmented: RAG, Search-o1, Search-R1 (RL-trained base & instruction-tuned), SSRL (self-search RL).

Metrics:

  • Exact Match (EM): $\mathrm{EM} = \frac{1}{N}\sum_i \mathbf{1}\{\hat{y}_i = y_i\}$
  • Pass@$k$: For parametric inference, the rate at which the correct answer appears in the top-$k$ outputs.
  • Retrieval Recall@$k$: Fraction of questions where the gold answer is among the top-$k$ retrieved documents.
  • Recency Gap: $\mathrm{RecencyGap} = \mathrm{EM}_{2021} - \mathrm{EM}_{2025}$
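These metrics reduce to simple aggregate functions; the sketch below assumes string-normalized answers and is an illustration rather than the official scorer:

def exact_match(preds, golds):
    """EM: fraction of predictions that exactly equal the gold answer."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def pass_at_k(sampled_answers, golds, k):
    """Pass@k: fraction of questions whose gold answer appears among the top-k sampled outputs."""
    return sum(g in samples[:k] for samples, g in zip(sampled_answers, golds)) / len(golds)

def retrieval_recall_at_k(retrieved_docs, golds, k):
    """Recall@k: gold answer string occurs in at least one of the top-k retrieved documents."""
    return sum(any(g in doc for doc in docs[:k])
               for docs, g in zip(retrieved_docs, golds)) / len(golds)

def recency_gap(em_2021, em_2025):
    """Recency gap: EM on the 2021 split minus EM on the 2025 split."""
    return em_2021 - em_2025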

5. Empirical Findings and Analytical Insights

Key results from systematic evaluation include:

  • Recency Gap: All methods show a significant decrease in EM from the 2021 to the 2025 batch. For 3B models/methods: $\mathrm{EM}_{2021} = 28.8\%$, $\mathrm{EM}_{2025} = 12.2\%$, a gap of approximately 16.6 points. The drop is most pronounced on multi-hop levels (L2: $-14.1$, L3: $-7.5$).
  • Retrieval-augmented Superiority: Methods such as RAG, Search-o1, Search-R1, and SSRL yield an absolute +15–20 point gain in EM on the 2025 batch relative to prompting, corresponding to 200–300% relative gains. On 2021, the gain is only +5–10 points. This underscores retrieval's indispensability for facts outside the pretraining window.
  • Instruction Tuning and Model Scale: Instruction-tuned Search-R1 outperforms base Search-R1 by 5–8 points on 2025 data. Scaling model size (3B → 7B → 14B) consistently increases EM (e.g., Qwen 2.5-14B RAG: 52.8% on 2021, 26.6% on 2025), but does not eliminate the recency gap—top 14B instruction-tuned Search-R1 achieves 28.4% EM on 2025, nearly 25 points below 2021.
  • Single- vs. Multi-hop Reasoning: Accuracy consistently declines from L1 to L3 in both splits (e.g., Qwen 14B-RAG, 2025: L1 = 34.7%, L2 = 27.0%, L3 = 18.0%). This reflects increased failure rates in multi-hop settings, suggesting heightened vulnerability to stale or misleading evidence in compositional reasoning scenarios.

6. Conceptual Significance and Future Directions

LiveSearchBench reorients QA evaluation from static, memorization-focused paradigms to dynamic, retrieval-centered tasks. By enforcing verified, temporally grounded answers and leveraging continual delta-based regeneration, it provides the following conceptual advances:

  • Ambiguity Elimination: Every question is anchored in a SPARQL COUNT=1 proof, reducing ambiguity to zero.
  • Temporal Grounding: Queries are intrinsically tied to the knowledge state at $T_1$, ensuring validity with respect to information recency.
  • Evidence Acquisition Requirement: The benchmark structure enforces genuine evidence retrieval, not just parametric recall.

This suggests that LiveSearchBench provides a systematic, low-touch platform for longitudinal evaluation of LLMs on temporally dynamic knowledge, crucial for research in retrieval-augmented, RL-trained, and lifelong learning models.

A plausible implication is the permanent shift toward benchmarks that can regenerate and maintain relevance as world knowledge and LLM architectures evolve. As a consequence, LiveSearchBench is positioned to become a foundational asset for research into dynamic knowledge integration, time-sensitive reasoning, and real-time QA over evolving information landscapes.
