DSEBench: Explainable Dataset Search Benchmark

Updated 27 October 2025
  • DSEBench is a comprehensive benchmark for explainable dataset search that combines keyword queries with example-based similarity to retrieve datasets and justify selections using field-level explanations.
  • It leverages a mix of human annotations and LLM-generated labels through rigorous filtering and pooling protocols to create a high-quality test collection with graded relevance.
  • The benchmark supports evaluation of sparse, dense, and LLM-based retrieval models, enabling precise measurement of both retrieval performance and explanation quality.

DSEBench is a test collection and benchmark specifically designed for the evaluation of explainable dataset search with examples, a task that generalizes traditional dataset search by requiring both the matching of candidate datasets to a combination of keyword queries and example datasets and the identification of dataset fields that justify their retrieval. DSEBench provides high-quality dataset-level and field-level annotations, enabling rigorous evaluation of retrieval and explanation models, and establishes a comprehensive suite of baselines across sparse, dense, and LLM-based retrieval and explanation paradigms (Shi et al., 20 Oct 2025).

1. Foundations of Dataset Search with Examples (DSE)

Traditional dataset search operates in two main paradigms: keyword-driven search, where datasets are retrieved based on textual queries, and similarity-driven search, where datasets similar to a provided example are identified. DSEBench is built around the more general framework of Dataset Search with Examples (DSE), where the input is a tuple $(q, D_t)$ with $q$ a textual query and $D_t$ a set of example (target) datasets. The system's objective is to retrieve a ranked list $D_c$ of candidate datasets that satisfy both query relevance and similarity constraints with respect to $D_t$. The task is further extended to an "explainable" setting: for each candidate $d \in D_c$, the system must highlight a subset $F_d$ of dataset fields, with $F_d \subseteq \{\text{title, description, tags, author, summary}\}$, which constitutes the explanation for the match.

Formally, the task is defined as:

  • Retrieval: given $(q, D_t)$, return a ranked list $D_c$ whose datasets $d \in D_c$ are relevant to $q$ and similar to $D_t$.
  • Explanation: for each $d \in D_c$, provide $F_d$ such that $F_d \subseteq \{\text{title, description, tags, author, summary}\}$.

The graded relevance of each candidate triple $(q, D_t, d)$ combines orthogonal evaluations of query relevance and target similarity:

$$\text{Grade}(q, d, D_t) = \text{Rel}(q, d) \times \text{Sim}(d, D_t),$$

where $\text{Rel}(q, d), \text{Sim}(d, D_t) \in \{0, 1, 2\}$, leading to four possible grades: $\{0, 1, 2, 4\}$.
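As a concrete illustration, the composite grade follows directly from the two per-axis judgments. The minimal sketch below assumes the relevance and similarity labels are already available as integers in {0, 1, 2}:

```python
def grade(rel: int, sim: int) -> int:
    """Composite graded relevance for a (query, examples, candidate) triple.

    rel: query relevance Rel(q, d) in {0, 1, 2}
    sim: example similarity Sim(d, D_t) in {0, 1, 2}
    Returns Rel * Sim, i.e. one of {0, 1, 2, 4}.
    """
    assert rel in (0, 1, 2) and sim in (0, 1, 2)
    return rel * sim

# A candidate is only rewarded when it satisfies both axes:
assert grade(2, 2) == 4   # highly relevant to the query and highly similar to the examples
assert grade(2, 0) == 0   # relevant to the query but unrelated to the example datasets
```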

2. Construction of the DSEBench Collection

DSEBench is constructed by adapting and augmenting the NTCIR dataset search test collection. Each dataset is represented across five fields: title, description, tags, author, and a summary. The summary field is algorithmically generated using text summarization and schema extraction, employing BART-large-cnn models for textual files and heuristic extraction for tabular headers. The design accommodates heterogeneous data sources (PDF, CSV, HTML, XML, JSON, RDF), with file format-dependent preprocessing pipelines.
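The paper does not reproduce its preprocessing code here, but a minimal sketch of how the summary field could be generated might look as follows, assuming the Hugging Face transformers summarization pipeline with the facebook/bart-large-cnn checkpoint for textual files and a simple header-extraction heuristic for tabular files; the truncation length and generation parameters are illustrative assumptions.

```python
import csv
from transformers import pipeline

# Abstractive summarizer for textual content (model choice follows the paper;
# the pipeline usage and parameters are illustrative, not the authors' code).
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_text(text: str, max_len: int = 130) -> str:
    """Summarize the textual content of a dataset file."""
    result = summarizer(text[:4000], max_length=max_len, min_length=30, do_sample=False)
    return result[0]["summary_text"]

def summarize_table(csv_path: str) -> str:
    """Heuristic schema extraction for tabular files: describe the header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    return "Columns: " + ", ".join(header)
```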

Test cases in DSEBench are derived from 141 highly relevant NTCIR query–dataset pairs, supplemented by 1,434 partially relevant triples. To extend the training set, the authors generate a large volume of synthetic queries via a T5-base model fine-tuned on the manually labeled data, applying a rigorous filtering process to eliminate trivial or irrelevant samples. For annotation, a pooling protocol retrieves candidate datasets using multiple retrieval models. Human annotators provide dataset- and field-level judgments for approximately 7,415 test triples, while a large LLM (GLM-3-Turbo) with heuristic quality controls generates annotations for 274,293 training triples.
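The pooling step is straightforward to reproduce in outline: candidates returned by several retrieval systems are merged into a single judgment pool per test case. The sketch below captures that general protocol under assumed interfaces (retriever callables and a fixed pool depth), not the authors' implementation.

```python
from typing import Callable, Dict, List

def build_pool(query: str,
               example_ids: List[str],
               retrievers: Dict[str, Callable[[str, List[str], int], List[str]]],
               depth: int = 20) -> List[str]:
    """Union of the top-`depth` candidates returned by each retrieval model.

    Each retriever maps (query, example_dataset_ids, depth) to a ranked list of
    candidate dataset ids; the pooled, deduplicated set is what annotators judge.
    """
    pool: List[str] = []
    seen = set()
    for name, retrieve in retrievers.items():
        for dataset_id in retrieve(query, example_ids, depth):
            if dataset_id not in seen:
                seen.add(dataset_id)
                pool.append(dataset_id)
    return pool
```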

3. Explainable Retrieval and Field-Level Annotation

Explainable DSE, as instantiated in DSEBench, requires not only identifying which datasets are returned but also articulating which fields—among title, description, tags, author, and summary—are responsible for their retrieval. This second dimension enables both interpretability at the decision level and evaluation of explanation quality via field-level F1 matching to gold-standard annotations.

Candidate explanations $F_d$ may be informed by direct query overlap, target–candidate semantic similarity, or both. The collection and its methodology allow for evaluation of explainers by comparing predicted $F_d$ sets to human annotations, supporting approaches such as feature ablation, surrogate post hoc explanation (e.g., LIME, SHAP), and prompt-based LLM explainers.
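Explanation quality then reduces to set overlap between predicted and annotated field sets. A minimal field-level F1 computation is sketched below; the handling of the empty–empty case is an assumed convention, not necessarily the benchmark's.

```python
FIELDS = {"title", "description", "tags", "author", "summary"}

def field_f1(predicted: set, gold: set) -> float:
    """F1 between a predicted explanation field set F_d and the gold annotation."""
    predicted, gold = predicted & FIELDS, gold & FIELDS
    if not predicted and not gold:
        return 1.0  # assumed convention: two empty sets count as a perfect match
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Example: predicting {title, tags} against gold {title, description} gives F1 = 0.5.
print(field_f1({"title", "tags"}, {"title", "description"}))
```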

4. Baseline Models and Evaluation Protocols

DSEBench provides robust baselines in three major categories:

  • First-stage retrieval: Baselines include sparse vector models (BM25, TF-IDF), dense embedding models (BGE, GTE), and supervised dense retrieval approaches (DPR, ColBERTv2, coCondenser). A relevance-feedback approach (Rocchio) is also analyzed; a minimal sparse first-stage sketch follows this list.
  • Re-ranking models: BGE-reranker, Stella, SFR, and LLM-based multi-layer re-rankers are implemented, showing notable improvements when fine-tuned with a mixture of human- and LLM-provided labels.
  • Explanation models: Field-level explanations are produced via feature ablation, LIME, SHAP, zero-shot LLM prompts, and few-shot prompt templates.
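The sketch below shows one way a sparse first-stage retriever could fuse query relevance with example similarity, using the rank_bm25 package; the interpolation weight and the use of concatenated example-dataset metadata as a second "query" are illustrative assumptions rather than the benchmark's own configuration.

```python
from rank_bm25 import BM25Okapi

def tokenize(text: str):
    return text.lower().split()

def dse_first_stage(query, example_texts, corpus_texts, alpha: float = 0.5, k: int = 10):
    """Rank candidate datasets by a weighted mix of query and example BM25 scores."""
    bm25 = BM25Okapi([tokenize(t) for t in corpus_texts])

    query_scores = bm25.get_scores(tokenize(query))
    # Treat the concatenated example-dataset metadata as an expanded query.
    example_scores = bm25.get_scores(tokenize(" ".join(example_texts)))

    fused = [alpha * q + (1 - alpha) * e for q, e in zip(query_scores, example_scores)]
    ranked = sorted(range(len(corpus_texts)), key=lambda i: fused[i], reverse=True)
    return ranked[:k]
```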

Evaluation metrics include MAP@5, MAP@10, NDCG@5, NDCG@10 for retrieval, and F1 score for field-level explanation matching. The unique graded relevance labels based on relevance × similarity enable more nuanced assessment than binary or categorical labels.
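The graded labels plug directly into standard ranked-retrieval metrics. A minimal NDCG@k over the {0, 1, 2, 4} grades is sketched below, assuming the conventional formulation with the grade used directly as gain; the exact gain and discount variant used by the benchmark may differ.

```python
import math

def dcg_at_k(grades, k):
    """Discounted cumulative gain over the top-k graded judgments (0, 1, 2, or 4)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades[:k]))

def ndcg_at_k(ranked_grades, pool_grades, k):
    """NDCG@k: DCG of the system ranking normalized by the ideal ranking's DCG."""
    ideal = dcg_at_k(sorted(pool_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0

# Example: a system ranking whose judged grades are [4, 0, 2, 1], drawn from a
# pool whose ideal ordering would be [4, 2, 1, 0].
print(ndcg_at_k([4, 0, 2, 1], [4, 0, 2, 1], k=5))
```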

5. Applications and Implications

DSEBench enables research and development of dataset search engines and retrieval models that leverage both keyword-based and example-based search paradigms, a capability essential for discovery in open data, academic portals, and data repositories. The field-level explainability facilitates transparency and user trust, supporting contexts in which provenance and justification for search results are critical (e.g., governmental and scientific data access). The test collection’s construction procedure and annotation methodologies provide a foundation for further research into explainable IR, combined retrieval-task evaluation, and structured query/model fusion.

Furthermore, the comprehensive labeling and pooling protocol underpinning DSEBench, spanning both highly curated human annotations and filtered LLM-generated labels, enable the benchmarking of supervised and semi-supervised models—even when training data is sparse for specialized domains.

6. Technical Formulations and Model Integration

The key formal components are the field-explanation subset definition and the composite graded scoring function. Retrieval models must integrate two axes—query relevance and example similarity—into ranking logic, while explanation models must solve a multi-label classification over the candidate's fields. This setting fosters developments in fusion architectures (e.g., bi-encoders and cross-encoders that attend to heterogeneous input types), field-aware scoring, and joint optimization of ranking and explanation loss.
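One way such joint optimization could be framed is a shared encoder with a ranking head and a five-way multi-label field head. The PyTorch sketch below is an illustrative architecture under that assumption, not a model reported in the paper; the loss weighting and margin are placeholder values.

```python
import torch
import torch.nn as nn

class JointDSEModel(nn.Module):
    """Shared encoder with a ranking score head and a field-explanation head."""

    def __init__(self, input_dim: int = 768, num_fields: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim * 3, 512), nn.ReLU())
        self.rank_head = nn.Linear(512, 1)             # scalar relevance score
        self.field_head = nn.Linear(512, num_fields)   # logits over the 5 fields

    def forward(self, q_emb, examples_emb, cand_emb):
        # q_emb, examples_emb (pooled over D_t), cand_emb: (batch, input_dim)
        h = self.encoder(torch.cat([q_emb, examples_emb, cand_emb], dim=-1))
        return self.rank_head(h).squeeze(-1), self.field_head(h)

def joint_loss(pos_score, neg_score, field_logits, field_labels, lam: float = 0.5):
    """Margin ranking loss on candidate scores plus BCE on field-level labels."""
    rank_loss = nn.functional.margin_ranking_loss(
        pos_score, neg_score, torch.ones_like(pos_score), margin=1.0)
    expl_loss = nn.functional.binary_cross_entropy_with_logits(field_logits, field_labels)
    return rank_loss + lam * expl_loss
```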

The DSEBench methodology and design—data generation, candidate pooling, human/LLM annotation, and separation of retrieval and explanation evaluation—offer an empirically rigorous framework for future research on explainable dataset search, model introspection, and structured IR. Continual advances in LLM-based retrieval and surrogate explainers can be directly measured with DSEBench’s annotated benchmark, setting a de facto standard for the field’s explainability subtask.

7. Impact and Future Directions

By enabling precise measurement of both retrieval and explainability on dataset search with examples, DSEBench fills a critical methodological gap. It lends itself to benchmarking multi-modality fusion models, domain adaptation methods, and user-centric explainability frameworks. The layered annotation—dataset-level and field-level, manual and LLM-driven—supports a wide array of experimental setups for both supervised and zero/few-shot regimes.

A plausible implication is that DSEBench, with its focus on richly structured metadata, will drive the development of models that are not only performant but also transparent, supporting both research needs and real-world applications where interpretability is paramount.
