LiveSearchBench: Temporal QA Benchmark
- LiveSearchBench is a dynamic benchmark that evaluates large language models on question answering tasks requiring fresh, retrieval-based knowledge.
- The benchmark employs a systematic four-stage pipeline, including differential extraction, high-quality candidate filtering, hierarchical question synthesis across L1 to L3, and SPARQL validation to ensure answer uniqueness.
- Empirical results demonstrate that retrieval-augmented methods significantly outperform direct prompting, highlighting the importance of evidence acquisition for temporal reasoning.
LiveSearchBench is a fully automated, dynamically regenerable benchmark designed to evaluate LLMs on question answering (QA) tasks that strictly require retrieval and reasoning over newly introduced knowledge, as opposed to static, memorization-centric evaluation. By systematically constructing queries from temporal deltas between Wikidata snapshots and enforcing unique, verifiable answers via SPARQL validation, LiveSearchBench provides a principled methodology for assessing models on facts that unequivocally post-date any model’s pretraining corpus, ensuring temporally grounded and retrieval-dependent QA evaluation.
1. Formalization of Temporal Deltas and Data Filtering
The core of LiveSearchBench is the precise definition and utilization of the "delta" ($\Delta$) between two temporal Wikidata knowledge-graph snapshots. Let $G_t$ denote the set of subject–predicate–object triples extracted from a Wikidata snapshot at time $t$. The delta capturing new and updated knowledge between $t_0$ and $t_1$ is defined as:

$$\Delta_{t_0 \to t_1} \;=\; \underbrace{\bigl(G_{t_1} \setminus G_{t_0}\bigr)}_{\text{insertions}} \;\cup\; \underbrace{\bigl\{(s, p, o) \in G_{t_1} \,:\, \exists\, o' \neq o,\ (s, p, o') \in G_{t_0}\bigr\}}_{\text{updates}}$$

Here, $G_{t_1} \setminus G_{t_0}$ corresponds to insertions (entirely new triples), while the second set captures updated objects for pre-existing $(s, p)$ pairs. Every triple in $\Delta_{t_0 \to t_1}$ introduces knowledge absent at $t_0$.
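As a concrete illustration, the following minimal sketch (not the benchmark's code; identifiers and the prior object value are invented for the example) computes the delta between two toy snapshots represented as Python sets of triples:

```python
# Minimal sketch: triples are (subject, predicate, object) tuples,
# and each snapshot graph is a plain Python set of such tuples.

def compute_delta(G0: set, G1: set) -> set:
    """Return Δ: triples new at t1 plus triples whose object changed for an existing (s, p) pair."""
    sp_in_G0 = {(s, p) for (s, p, _) in G0}
    changed = G1 - G0                                   # anything not present verbatim at t0
    insertions = {t for t in changed if (t[0], t[1]) not in sp_in_G0}
    updates = {t for t in changed if (t[0], t[1]) in sp_in_G0}
    return insertions | updates                         # Δ = insertions ∪ updates

# Toy snapshots with hypothetical identifiers; the prior country value is invented for illustration.
G0 = {("ICLR2026", "country", "SomePriorValue")}
G1 = {("ICLR2026", "country", "Brazil"),                # updated object for an existing (s, p)
      ("NewEntity", "director", "SomeDirector")}        # entirely new triple
print(sorted(compute_delta(G0, G1)))
```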
To ensure data suitability for question synthesis, LiveSearchBench applies the following filters:
- Relation Filtering: Trivial or formatting predicates (e.g., P31 and other metadata properties) are excluded via a comprehensive block-list.
- Entity Quality: Entities must have multilingual labels or aliases and robust descriptive metadata; deprecation, disambiguation, or severe ambiguity serve as exclusion criteria.
- Statement Validity & De-duplication: Deprecated or rank-0 statements are removed, and near-duplicate assertions are merged using normalized keys or statement IDs.
This pipeline produces a high-quality, noise-reduced pool of temporally novel triples.
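A minimal sketch of this filter stage appears below; the block-list contents, metadata fields, and helper names are illustrative assumptions rather than the benchmark's actual implementation.

```python
# Hypothetical candidate filter; the predicate block-list and metadata fields are
# placeholders standing in for the benchmark's actual rules.

BLOCKED_PREDICATES = {"P31"}  # instance-of and similar formatting/metadata properties

def keep_triple(triple, entity_meta: dict, statement: dict) -> bool:
    s, p, o = triple
    if p in BLOCKED_PREDICATES:                              # relation block-list
        return False
    meta = entity_meta.get(s, {})
    if not meta.get("labels") or not meta.get("description"):
        return False                                         # entity quality: labels + description
    if meta.get("deprecated") or meta.get("disambiguation"):
        return False
    if statement.get("rank") in ("deprecated", 0):           # statement validity (deprecated / rank-0)
        return False
    return True

def filter_triples(delta, entity_meta: dict, statements: dict) -> list:
    seen, kept = set(), []
    for t in delta:
        key = (t[0], t[1], str(t[2]).strip().lower())        # normalized key for de-duplication
        if key in seen or not keep_triple(t, entity_meta, statements.get(t, {})):
            continue
        seen.add(key)
        kept.append(t)
    return kept
```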
2. Automated Generation Pipeline and Reasoning Complexity
LiveSearchBench employs a four-stage pipeline to create QA instances:
- Differential Extraction: Extracts $G_{t_0}$ and $G_{t_1}$, computes $\Delta_{t_0 \to t_1}$.
- High-quality Candidate Filtering: Filters $\Delta_{t_0 \to t_1}$ as described above.
- Hierarchical Question Synthesis: Constructs natural-language questions at three predefined reasoning levels:
- L1 (Single-hop): Direct queries over a triple $(a, r, b) \in \Delta$; accepted only when $b$ is the unique object for the pair $(a, r)$ in $G_{t_1}$. E.g., “In which country will the ICLR 2026 conference be held?” for the triple (ICLR2026, country, Brazil).
- L2 (Multi-constraint/Compositional): Intersects anchor triples to ensure exactly one unique answer. Answer sets $A_i = \{x : (a_i, r_i, x) \in G_{t_1}\}$ are formed for the anchor pairs $(a_i, r_i)$, and the question is accepted only if $\left|\bigcap_i A_i\right| = 1$ (a concrete sketch follows the pipeline pseudocode below). Example: “Which football player has played for Real Madrid, Juventus, and Al Nassr?”
- L3 (Fuzzing + Extra Hop): Broadens constraints (“fuzzing”) and adds another relational path; e.g., replacing an entity with a category ("Saudi Arabian club") and introducing an additional hop, enforcing uniqueness across all constraints.
- Final Rendering and SPARQL Validation: Transforms questions using latest labels and validates answer uniqueness via SPARQL queries. Snapshots and SPARQL results are logged for exact traceability.
The pipeline’s pseudocode is as follows:
```python
def generate_benchmark(dump_T0, dump_T1):
    G0, G1 = extract_triples(dump_T0), extract_triples(dump_T1)
    delta = (G1 - G0) | updates(G0, G1)          # Δ = insertions ∪ updates
    candidates = filter_triples(delta)
    instances = []
    for t in candidates:
        for level in ["L1", "L2", "L3"]:
            q = synthesize_question(t, G1, level)
            if q and validate_sparql(q.sparql, G1):
                instances.append(q)
                break
    return instances
```
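To make the L2 uniqueness constraint concrete, the sketch below intersects the answer sets of several anchor (entity, relation) pairs and accepts the question only when exactly one candidate satisfies all constraints; the identifiers and relation name are hypothetical.

```python
# Hypothetical L2 check: form A_i = {x : (a_i, r_i, x) in G1} for each anchor
# and require the intersection to contain exactly one entity.

def l2_unique_answer(G1: set, anchors: list):
    """anchors = [(a1, r1), (a2, r2), ...]; return the unique shared object, else None."""
    answer_sets = [{o for (s, p, o) in G1 if s == a and p == r} for a, r in anchors]
    common = set.intersection(*answer_sets) if answer_sets else set()
    return next(iter(common)) if len(common) == 1 else None

# Toy graph with made-up identifiers mirroring the football example above.
G1 = {("RealMadrid", "hasPlayer", "PlayerX"),
      ("Juventus",   "hasPlayer", "PlayerX"),
      ("Juventus",   "hasPlayer", "PlayerY"),
      ("AlNassr",    "hasPlayer", "PlayerX")}
anchors = [("RealMadrid", "hasPlayer"), ("Juventus", "hasPlayer"), ("AlNassr", "hasPlayer")]
print(l2_unique_answer(G1, anchors))  # "PlayerX": the only entity satisfying every constraint
```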
3. SPARQL-based Validation and Answer Uniqueness
Every generated question is automatically associated with a SPARQL query that ensures a unique, verifiable answer in $G_{t_1}$:
- L1:
```sparql
SELECT ?b
WHERE { wd:Q_a wdt:P_r ?b }
LIMIT 2
```
- L2:
```sparql
SELECT ?x
WHERE { { wd:Q_a1 wdt:P_r1 ?x } UNION { wd:Q_a2 wdt:P_r2 ?x } }
GROUP BY ?x
HAVING (COUNT(?x) = 2)
```
- L3:
```sparql
SELECT ?x
WHERE {
  { wd:Q_a1 wdt:P_r1 ?x FILTER(fuzz1(?x)) }
  UNION
  { wd:Q_a2 wdt:P_r2 ?x FILTER(fuzz2(?x)) }
  UNION
  { ?x wdt:P_r3 wd:Q_c }
}
GROUP BY ?x
HAVING (COUNT(?x) = 3)
```
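As a rough illustration of the validation step, the sketch below runs a query against the public Wikidata SPARQL endpoint (https://query.wikidata.org/sparql) and accepts it only if exactly one binding comes back. It is an assumed implementation using `requests` and the standard JSON results format, not the benchmark's released validator, and the demo query is a generic Wikidata example unrelated to any delta.

```python
import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def has_unique_answer(sparql_query: str) -> bool:
    """Return True iff the query yields exactly one result row."""
    resp = requests.get(
        WIKIDATA_ENDPOINT,
        params={"query": sparql_query, "format": "json"},
        headers={"User-Agent": "livesearchbench-validation-sketch/0.1 (example)"},
        timeout=30,
    )
    resp.raise_for_status()
    return len(resp.json()["results"]["bindings"]) == 1

# Generic demo in the L1 template's shape: Q42 (Douglas Adams) has a single
# P27 (country of citizenship) value, so LIMIT 2 returns exactly one row.
demo_query = "SELECT ?b WHERE { wd:Q42 wdt:P27 ?b } LIMIT 2"
print(has_unique_answer(demo_query))  # True
```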
4. Experimental Design and Evaluation Metrics
LiveSearchBench assessments use distinct temporal splits, e.g.,
| Dataset | Snapshot Range | L1 | L2 | L3 |
|---|---|---|---|---|
| LiveSearchBench-2021 | Sep 2021 → Dec 2021 | 150 | 100 | 50 |
| LiveSearchBench-2025 | May 2025 → Aug 2025 | 150 | 100 | 50 |
Model types evaluated:
- Vanilla Prompting: Direct Answer and Chain-of-Thought methods on instruction-tuned Llama 3.2-3B, Qwen 2.5-3B, 7B, and 14B.
- Retrieval-augmented: RAG, Search-o1, Search-R1 (RL-trained base & instruction-tuned), SSRL (self-search RL).
Metrics:
- Exact Match (EM): Fraction of questions for which the normalized predicted answer exactly matches the gold answer.
- Pass@k: For parametric inference, the rate at which the correct answer appears among the top-k sampled outputs.
- Retrieval Recall@k: Fraction of questions where the gold answer is among the top-k retrieved documents.
- Recency Gap: The drop in EM between the 2021 and 2025 splits for the same method, i.e., $\mathrm{EM}_{2021} - \mathrm{EM}_{2025}$.
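The sketch below shows how these metrics could be computed over raw model outputs; the normalization routine and data layout are assumptions for illustration, not the benchmark's released scoring code.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace (a common EM convention)."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

def pass_at_k(samples: list, gold: str, k: int) -> bool:
    """True if any of the top-k sampled outputs matches the gold answer."""
    return any(exact_match(s, gold) for s in samples[:k])

def recall_at_k(retrieved_docs: list, gold: str, k: int) -> bool:
    """True if the gold answer string appears in any of the top-k retrieved documents."""
    return any(normalize(gold) in normalize(doc) for doc in retrieved_docs[:k])

def recency_gap(em_2021: float, em_2025: float) -> float:
    """EM drop between the older and newer split for the same method."""
    return em_2021 - em_2025

# Toy illustration: strict EM rejects an answer padded with extra words.
preds = [("Brazil", "Brazil"), ("It will be held in Brazil", "Brazil")]
print(sum(exact_match(p, g) for p, g in preds) / len(preds))  # 0.5
```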
5. Empirical Findings and Analytical Insights
Key results from systematic evaluation include:
- Recency Gap: All methods show a significant decrease in EM from the 2021 to the 2025 split, and the drop is most pronounced at the multi-hop levels (L2 and L3).
- Retrieval-augmented Superiority: Methods such as RAG, Search-o1, Search-R1, and SSRL yield an absolute +15–20 point gain in EM on the 2025 batch relative to prompting, corresponding to 200–300% relative gains. On 2021, the gain is only +5–10 points. This underscores retrieval's indispensability for facts outside the pretraining window.
- Instruction Tuning and Model Scale: Instruction-tuned Search-R1 outperforms base Search-R1 by 5–8 points on 2025 data. Scaling model size (3B → 7B → 14B) consistently increases EM (e.g., Qwen 2.5-14B RAG: 52.8% on 2021, 26.6% on 2025), but does not eliminate the recency gap—top 14B instruction-tuned Search-R1 achieves 28.4% EM on 2025, nearly 25 points below 2021.
- Single- vs. Multi-hop Reasoning: Accuracy consistently declines from L1 to L3 in both splits (e.g., Qwen 14B-RAG, 2025: L1 = 34.7%, L2 = 27.0%, L3 = 18.0%). This reflects increased failure rates in multi-hop settings, suggesting heightened vulnerability to stale or misleading evidence in compositional reasoning scenarios.
6. Conceptual Significance and Future Directions
LiveSearchBench reorients QA evaluation from static, memorization-focused paradigms to dynamic, retrieval-centered tasks. By enforcing verified, temporally grounded answers and leveraging continual delta-based regeneration, it provides the following conceptual advances:
- Ambiguity Elimination: Every question is anchored in a SPARQL proof that exactly one answer exists, eliminating answer ambiguity.
- Temporal Grounding: Queries are intrinsically tied to the knowledge state at $t_1$, ensuring validity with respect to information recency.
- Evidence Acquisition Requirement: The benchmark structure enforces genuine evidence retrieval, not just parametric recall.
This suggests that LiveSearchBench provides a systematic, low-touch platform for longitudinal evaluation of LLMs on temporally dynamic knowledge, crucial for research in retrieval-augmented, RL-trained, and lifelong learning models.
A plausible implication is the permanent shift toward benchmarks that can regenerate and maintain relevance as world knowledge and LLM architectures evolve. As a consequence, LiveSearchBench is positioned to become a foundational asset for research into dynamic knowledge integration, time-sensitive reasoning, and real-time QA over evolving information landscapes.