Dataset Search with Examples (DSE)
- Dataset Search with Examples (DSE) is a retrieval framework that integrates keyword queries and exemplar datasets to capture both explicit needs and implicit semantic preferences.
- It combines ad hoc search and similarity-based discovery to improve dataset matching where incomplete metadata or complex requirements are common.
- Explainable DSE additionally provides field-level justifications, enhancing transparency and aiding data curation, integration, and auditing in research workflows.
Dataset Search with Examples (DSE) refers to a generalized framework for dataset retrieval that enables users to specify both textual queries and one or more example datasets as input, combining the strengths of ad hoc (keyword-based) search and similarity-based retrieval. This paradigm supports more nuanced information needs, allowing users to articulate requirements by providing queries that describe their tasks alongside exemplar datasets that express implicit semantic, structural, or domain-specific preferences. The DSE task has recently been extended to explainable DSE, where systems are also required to identify and justify, at the field level, the evidence underlying relevance and similarity judgments—a critical step toward transparent and user-understandable dataset discovery.
1. Definition and Motivation
Dataset Search with Examples (DSE) formally encompasses two previously distinct dataset search paradigms:
- Keyword-based retrieval: Returns datasets matching a textual query describing an information need.
- Similarity-based discovery: Finds datasets resembling a provided target (e.g., in schema, content, or metadata).
In DSE, the user specifies both a natural language query q and one or more example (target) datasets T, enabling the system to jointly consider explicit (query-based) and implicit (example-driven) criteria for dataset relevance. The task is formally defined over triples (q, T, d), where d is a candidate dataset that must be ranked by its combined relevance to q and similarity to T. Explainable DSE further extends this by requiring systems to identify the metadata and content fields of d that serve as justification for its selection, i.e., which aspects of d are responsible for its ranking with respect to the user's articulated and exemplified needs (Shi et al., 20 Oct 2025).
This combined paradigm supports realistic dataset exploration workflows, where users are often unable to formulate all requirements upfront but can provide illustrative datasets and iterative queries that capture desired properties.
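The following sketch illustrates the task structure implied by this definition; the data classes, field choices, and the weighted score combination are illustrative assumptions rather than the method of any specific system (DSEBench itself combines graded judgments multiplicatively, as described below).

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A dataset described by the metadata/content fields used in DSEBench-style corpora."""
    title: str
    description: str
    tags: list[str] = field(default_factory=list)
    author: str = ""
    summary: str = ""


@dataclass
class DSERequest:
    """A DSE request: a keyword query q plus one or more example (target) datasets T."""
    text: str                 # explicit information need (q)
    examples: list[Dataset]   # implicit preferences expressed by examples (T)


def rank_candidates(request: DSERequest, candidates: list[Dataset],
                    relevance, similarity, alpha: float = 0.5):
    """Rank each candidate d by combining relevance(q, d) with similarity(T, d).

    `relevance` and `similarity` are caller-supplied scoring functions; the
    weighted sum is an illustrative combination choice. An explainable DSE
    system would additionally return the fields of d that justify the score.
    """
    scored = []
    for d in candidates:
        rel = relevance(request.text, d)
        sim = max(similarity(ex, d) for ex in request.examples)
        scored.append((alpha * rel + (1.0 - alpha) * sim, d))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```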
2. DSEBench: Test Collection and Annotation Protocol
DSEBench is a publicly available benchmark designed to facilitate evaluation and development of systems for both DSE and explainable DSE (Shi et al., 20 Oct 2025). Its key components are:
- Corpus: The English subset of the NTCIR dataset collection, containing 46,000+ datasets with metadata and content fields.
- Test Cases: 141 evaluation cases are constructed by pairing each query with at least one manually curated highly relevant target dataset.
- Training Cases: 5,700+ generated pairs (query, target) using a fine-tuned T5-base model for synthetic data augmentation, in addition to original NTCIR annotations.
- Annotation Pool: For each test case (q, T), pooled candidate datasets d are drawn from four retrieval models (BM25, TF-IDF, and two dense embedding models) and annotated by human experts for:
- Query relevance: How well d matches the textual query q.
- Target similarity: How similar d is to the example dataset(s) in T.
- Field-level explanations: For partially or highly relevant/similar candidates, annotators mark which fields (title, description, tags, author, summary) provide the primary evidence for their assessment.
Annotations are collected on a graded scale: query relevance and target similarity each receive an ordinal grade, and the overall relevance score is defined as their product, i.e., overall(d) = rel(q, d) × sim(T, d).
Thus, only datasets that are both highly relevant and highly similar can achieve the maximum score. For supervised training, more than 300,000 LLM-annotated triples are synthesized using GLM-3-Turbo, then filtered heuristically.
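As a worked illustration, assuming for concreteness that both judgments use grades in {0, 1, 2} (the exact scale is defined by the benchmark), the combination behaves as follows:

```python
def overall_relevance(query_grade: int, target_grade: int) -> int:
    """Overall score = query-relevance grade x target-similarity grade (assumed 0-2 grades)."""
    return query_grade * target_grade


assert overall_relevance(2, 2) == 4  # highly relevant AND highly similar -> maximum score
assert overall_relevance(2, 0) == 0  # relevant to the query but dissimilar to the examples
assert overall_relevance(1, 1) == 1  # partially relevant and partially similar
```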
3. Baseline Retrieval, Reranking, and Explanation Methods
Extensive baselines are evaluated for both the retrieval and explainability dimensions:
- Retrieval Models:
- Unsupervised: BM25, TF-IDF (sparse), and dense models based on pre-trained text embeddings (BGE and GTE).
- Supervised: Dense retrievers (DPR, ColBERTv2, coCondenser) fine-tuned on DSEBench data split into five annotator folds.
- Relevance Feedback: The Rocchio algorithm adapts the representation of the query q by incorporating the example datasets in T as positive pseudo-relevance feedback.
- Reranking Models:
- Classic rerankers include Stella, SFR, and the BGE-reranker, which are fine-tuned on DSEBench.
- LLM-based reranking methods are explored in zero-shot, one-shot, and “multi-layer” grouping setups. The multi-layer approach divides candidate sets into subgroups to aid the LLM in focusing on shorter contexts and performs iterative reranking.
- Explanation Methods:
- Feature ablation: Each field is systematically omitted, and the impact on the retrieval score is measured; fields whose removal significantly affects the score are attributed as indicators (see the sketch after this list).
- Surrogate Model Explanations: LIME and SHAP are adapted to explain dense model predictions by training local surrogate models over perturbed field combinations. SHAP consistently outperforms other methods, especially for target similarity explanations.
- LLM-based: Zero- and few-shot explanation generation using GLM-4-Air, with few-shot prompting showing the highest field-level F1-scores, particularly in identifying primary fields such as “description.”
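To make the feature-ablation strategy concrete, the following is a minimal sketch of field-level attribution by ablation; the `score` callable, the field list, and the drop threshold are illustrative placeholders rather than DSEBench's exact configuration.

```python
FIELDS = ["title", "description", "tags", "author", "summary"]


def ablation_explanation(candidate: dict, score, threshold: float = 0.05):
    """Attribute a candidate's retrieval score to individual fields by ablation.

    `candidate` maps field names to text; `score` is any callable that scores a
    field->text mapping against the current (query, examples) pair, e.g. a dense
    retriever wrapped in a closure. Fields whose removal reduces the score by
    more than `threshold` are returned as explanation indicators.
    """
    full_score = score(candidate)
    indicators = []
    for f in FIELDS:
        ablated = {k: v for k, v in candidate.items() if k != f}
        drop = full_score - score(ablated)
        if drop > threshold:
            indicators.append((f, drop))
    # Most influential fields first.
    return sorted(indicators, key=lambda pair: pair[1], reverse=True)
```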
Aggregate evaluation metrics include MAP@5, NDCG@5, and Recall@5 for retrieval, and field-level F1 for explainability.
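As a concrete reference for the explanation metric, field-level F1 can be computed as the overlap between system-predicted and human-annotated indicator fields; the snippet below is a straightforward sketch, not the benchmark's official scorer.

```python
def field_level_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 between predicted indicator fields and human-annotated indicator fields."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Example: system flags {"description", "tags"}, annotators marked {"description", "title"}.
print(field_level_f1({"description", "tags"}, {"description", "title"}))  # 0.5
```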
4. Empirical Findings and Model Insights
Key empirical results from the DSEBench paper (Shi et al., 20 Oct 2025):
- Fine-tuned dense retrievers (coCondenser, ColBERTv2, and DPR) demonstrate superior performance to both sparse baselines and off-the-shelf dense models, with MAP@5 exceeding 0.13 on the fine-tuned split.
- Supervised rerankers consistently outperform classical rankers, with the BGE-reranker achieving the highest accuracy when fine-tuned on human-labeled splits.
- For explainability, SHAP yields robust attribution for target similarity, while LLM-based few-shot explanations are most effective for query relevance indicators.
- The “description” field emerges as the most frequently and consistently marked indicator in field-level explanations, across both human and automated annotators.
- LLMs (GLM-3-Turbo, T5-base, GLM-4-Plus) are successfully leveraged for both large-scale annotation and as black-box rerankers/explainers, demonstrating scalable weak supervision and strong performance in field attribution tasks.
5. Technical Implementation and Evaluation Protocol
The technical design enforces several protocol decisions:
- Pseudo-documents are constructed by concatenating all fields of a dataset. For candidate ranking, the query q is "expanded" with the pseudo-documents of the target datasets in T (see the sketch after this list).
- For dense retrieval, cosine similarity between the query-plus-example vector and candidate dataset representations is used.
- Hyperparameters for dense model fine-tuning (e.g., learning rate, batch size of 8 or 16, input truncation at 512 tokens) are selected by grid search.
- The pooling strategy restricts the annotation effort by focusing on top-ranked candidates from diverse retrievers.
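A minimal sketch of this protocol, assuming a generic sentence-embedding backend with a Sentence-Transformers-style `encode` method; the model identifier is a placeholder, and the simple concatenation-based expansion follows the description above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

FIELDS = ["title", "description", "tags", "author", "summary"]


def pseudo_document(dataset: dict) -> str:
    """Concatenate all metadata/content fields of a dataset into one pseudo-document."""
    return " ".join(str(dataset.get(f, "")) for f in FIELDS)


def rank(query: str, examples: list[dict], candidates: list[dict],
         model_name: str = "BAAI/bge-base-en-v1.5"):
    """Expand the query with the example datasets' pseudo-documents and rank
    candidates by cosine similarity, following the dense-retrieval protocol above."""
    model = SentenceTransformer(model_name)
    expanded_query = " ".join([query] + [pseudo_document(e) for e in examples])
    q_vec = model.encode(expanded_query, normalize_embeddings=True)
    c_vecs = model.encode([pseudo_document(c) for c in candidates],
                          normalize_embeddings=True)
    # With normalized embeddings, the dot product equals cosine similarity.
    scores = c_vecs @ q_vec
    order = np.argsort(-scores)
    return [(candidates[i], float(scores[i])) for i in order]
```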
6. Significance and Directions for Future Research
DSEBench establishes a general-purpose, high-resolution resource for both system development and evaluation in DSE and explainable DSE. By jointly considering user queries and example datasets, and by requiring field-level explanation, the benchmark invites advances in:
- Hybrid retrieval models that integrate both explicit query- and implicit example-based information.
- Interpretable and explainable IR, particularly the development of field- or feature-attribution methods suited for dataset search scenarios.
- Large-scale synthetic annotation using LLMs to enable supervised learning and rapid prototyping.
- More expressive evaluation metrics capturing not just relevance, but the fidelity and interpretability of explanation.
The paradigm addresses persistent constraints in dataset search, including insufficient context in keyword queries, incomplete or poor-quality metadata, and lack of transparency in ranking. By spanning both retrieval and explanation, DSEBench supports research toward systems that maximize utility, trust, and user control in data-centric workflows.
7. Practical Applications and Implications
Explainable DSE supports several advanced use cases:
- Data curation: Facilitates matching of complex data needs—e.g., when extending or augmenting an analysis by specifying a prototypical input dataset.
- Data integration: Enables the discovery of candidate datasets that align in content or schema with a given example, supporting multi-source data construction.
- Transparency and auditing: Field-level explanations enable users to audit the evidence underlying dataset recommendations, essential for scientific reproducibility and regulatory compliance.
- Hybrid interactive search: Systems can expose which metadata or content fields are driving results, guiding users in query reformulation and selection.
This generalizes and enhances application patterns from keyword-based portals and schema search engines, fostering richer, explainable dataset discovery workflows.
In summary, Dataset Search with Examples (DSE) and explainable DSE represent a matured paradigm in dataset retrieval that combines user queries with example datasets and demands transparency in result justification. The DSEBench benchmark provides the necessary test collection and empirical foundation for reference and advances in this area, supporting rigorous evaluation of both retrieval effectiveness and explainability at the field level (Shi et al., 20 Oct 2025).