
Deep Research Gym API Overview

Updated 7 April 2026
  • Deep Research Gym API is a tool that enables researchers to evaluate semantic parsing models using the DBLP-QuAD dataset on scholarly knowledge graphs.
  • It employs a template-driven approach to generate validated natural-language questions paired with executable SPARQL queries for comprehensive benchmarking.
  • The API supports model evaluation with metrics such as exact-match accuracy and Answer F1; reported baselines show marked gains when scaling from T5-Small to T5-Base.

The Deep Research Gym API is a tool designed to support the research community's efforts in building and evaluating semantic-parsing-based question answering systems over scholarly knowledge graphs, specifically leveraging the DBLP-QuAD dataset—a large, expertly curated collection of question–answer pairs constructed atop the DBLP Scholarly Knowledge Graph. The API enables programmatic interaction with this resource, providing data well-suited for benchmarking knowledge-graph question answering (KGQA) models in the bibliographic domain of computer science (Banerjee et al., 2023).

1. Dataset Foundation: ScholarQA-CS (DBLP-QuAD)

Central to the Deep Research Gym API's utility is its association with the DBLP-QuAD dataset. DBLP-QuAD, also referred to as ScholarQA-CS, comprises 10,000 question–answer pairs, each mapped to an executable SPARQL query. The dataset's construction adheres to a two-stage, template-driven approach analogous to the OVERNIGHT framework. Specifically, domain experts authored 98 "template tuples," which include SPARQL query templates, multiple semantically equivalent natural-language templates, explicit entity placeholders (such as [CREATOR_NAME] and [VENUE]), and annotation flags for temporal or evaluation-specific usage. These templates were then instantiated over randomly sampled two-hop subgraphs from the DBLP RDF dump, with string-level augmentations to promote naturalistic variation (e.g., use of different name variants and abbreviation patterns).
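The two-stage template instantiation described above can be sketched in a few lines of Python. The template tuple below is purely illustrative: the SPARQL shape, predicate IRI, and helper names are assumptions for demonstration, not the dataset's published schema.

```python
# Hypothetical sketch of a DBLP-QuAD-style "template tuple": one parameterized
# SPARQL query, several paraphrased natural-language templates, and explicit
# entity placeholders such as [CREATOR_NAME] and [VENUE].
TEMPLATE = {
    "sparql": (
        "SELECT ?p WHERE { "
        "?p <https://dblp.org/rdf/schema#authoredBy> <[CREATOR_IRI]> . "
        "?p <https://dblp.org/rdf/schema#publishedIn> \"[VENUE]\" }"
    ),
    "questions": [
        "Which papers did [CREATOR_NAME] publish at [VENUE]?",
        "List the [VENUE] publications of [CREATOR_NAME].",
    ],
}

def instantiate(template, bindings):
    """Fill every [PLACEHOLDER] in the SPARQL and question templates."""
    def fill(text):
        for slot, value in bindings.items():
            text = text.replace(f"[{slot}]", value)
        return text
    return {
        "sparql": fill(template["sparql"]),
        "questions": [fill(q) for q in template["questions"]],
    }

# Entity values would be drawn from a sampled two-hop subgraph; these are mocks.
instance = instantiate(TEMPLATE, {
    "CREATOR_IRI": "https://dblp.org/pid/h/ExampleAuthor",
    "CREATOR_NAME": "Jane Q. Example",
    "VENUE": "EMNLP",
})
print(instance["questions"][0])  # Which papers did Jane Q. Example publish at EMNLP?
```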

The instantiation pipeline enforced executable correctness by discarding any (question, SPARQL) pairs that failed to yield a nonempty result on a local Virtuoso SPARQL endpoint. This step ensured that all provided answers are validated against the underlying graph (Banerjee et al., 2023).
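The executable-correctness filter can be sketched as follows. `run_query` is an injected callable standing in for a real SPARQL client (e.g., one pointed at a local Virtuoso endpoint); all names here are illustrative, not part of a published API.

```python
# Keep a candidate (question, SPARQL) pair only if its query executes and
# returns a nonempty result set, mirroring the Virtuoso-endpoint check.
def keep_pair(question, sparql, run_query):
    """Return True iff the SPARQL query executes and yields >= 1 binding."""
    try:
        bindings = run_query(sparql)  # expected: list of result rows
    except Exception:
        return False                  # query failed to execute -> discard
    return len(bindings) > 0

# Stub endpoint for demonstration; a real pipeline would query Virtuoso.
def stub_endpoint(sparql):
    if "authoredBy" in sparql:
        return [{"p": "https://dblp.org/rec/conf/example/Paper23"}]
    return []

print(keep_pair("Which papers did X write?",
                "SELECT ?p WHERE { ?p dblp:authoredBy ?a }",
                stub_endpoint))  # True
```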

2. Structural Schema and Data Organization

The knowledge graph underpinning the Deep Research Gym API centers on two main RDF classes: Person (creators, authors) and Publication (conference and journal articles). The associated predicates provide bidirectional links (e.g., authoredBy, authorOf), literal metadata (title, year, affiliation), publication venues, and auxiliary identifiers such as DOIs, ORCID iDs, and Wikidata IDs.

The DBLP-QuAD dataset is systematically organized into ten query types, each evenly divided between creator-centric and publication-centric variants, as outlined below:

Query Type                Semantic Focus           Example Category
Single Fact               Entity attribute         Publication year
Multiple Facts (joins)    Combined relations       Author and venue
Boolean                   Existence/relation test  Author has papers
Negation                  Absence/anti-query       Not affiliated
Double Negation           Nested exclusion         Not (not authored)
Double Intent             Multi-slot query         Author and affiliation
Union                     Multi-entity join        Authored by X or Y
Count                     Numeric aggregation      Number of papers
Superlative/Comparative   Ranking/extent           Earliest/most papers
Disambiguation            Clarifying ambiguity     Author with affiliation
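To make two of these query types concrete, the snippet below holds illustrative SPARQL for a Boolean and a Count question. The query shapes and the example author IRI are hypothetical; only the DBLP schema namespace and the authoredBy predicate are drawn from the text above.

```python
# Illustrative SPARQL shapes for two query types from the table above.
# The author IRI is a placeholder, not a real DBLP identifier.
BOOLEAN_QUERY = """
PREFIX dblp: <https://dblp.org/rdf/schema#>
ASK { ?paper dblp:authoredBy <https://dblp.org/pid/h/ExampleAuthor> }
"""

COUNT_QUERY = """
PREFIX dblp: <https://dblp.org/rdf/schema#>
SELECT (COUNT(DISTINCT ?paper) AS ?n)
WHERE { ?paper dblp:authoredBy <https://dblp.org/pid/h/ExampleAuthor> }
"""

print("ASK" in BOOLEAN_QUERY, "COUNT" in COUNT_QUERY)  # True True
```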

Each question instance provides two natural-language formulations, an executable SPARQL query, explicit lists of the KG entities and predicates used, and the resulting answer in JSON format. Data splits follow a 70%/10%/20% train/validation/test partition, with an explicit control ensuring that roughly a fifth of validation and test instances are structurally novel with respect to the training templates. Under this split, validation and test sets contain approximately 81–82% i.i.d. instances, 13–15% compositional instances, and 3–4% zero-shot instances, classified by template exposure (Banerjee et al., 2023).
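The shape of a single instance can be sketched as a JSON object carrying the fields just listed. All field names and values below are illustrative; the released dataset defines the authoritative schema.

```python
import json

# Hypothetical shape of one DBLP-QuAD instance: paired paraphrases, a SPARQL
# query, entity/predicate lists, and the answer in SPARQL-results JSON form.
instance = {
    "id": "Q0001",
    "query_type": "SINGLE_FACT",
    "question": "In which year was the paper 'An Example Title' published?",
    "paraphrased_question": "What is the publication year of 'An Example Title'?",
    "query": {"sparql": "SELECT ?y WHERE { ?p dblp:yearOfPublication ?y }"},
    "entities": ["https://dblp.org/rec/conf/example/Paper23"],
    "relations": ["https://dblp.org/rdf/schema#yearOfPublication"],
    "answer": {"head": {"vars": ["y"]},
               "results": {"bindings": [{"y": {"value": "2020"}}]}},
}

# Instances round-trip cleanly as JSON, as in the dataset files.
serialized = json.dumps(instance, indent=2)
assert json.loads(serialized) == instance
```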

3. Semantic Parsing and Evaluation Protocol

Evaluation on the DBLP-QuAD dataset within the Deep Research Gym context is framed as a semantic parsing challenge: given a natural-language question (plus, optionally, linked entity/predicate URIs), the system must produce the corresponding executable SPARQL query. Two principal evaluation metrics are defined:

  • Exact-match accuracy: Proportion of generated SPARQL queries that match the gold-standard query token-for-token (whitespace-normalized).
  • Answer F1: For each test instance, both predicted and reference SPARQL queries are executed against the local DBLP KG. Precision, recall, and their harmonic mean F1 are computed over the resulting answer sets:

\mathrm{F1} = \frac{2 \cdot P \cdot R}{P + R}
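Both metrics can be sketched in a few lines: exact match over whitespace-normalized query strings, and Answer F1 over the answer sets obtained by executing the predicted and gold queries (execution is stubbed here with precomputed sets).

```python
def exact_match(pred_sparql, gold_sparql):
    """Token-for-token equality after whitespace normalization."""
    normalize = lambda s: " ".join(s.split())
    return normalize(pred_sparql) == normalize(gold_sparql)

def answer_f1(pred_answers, gold_answers):
    """Harmonic mean of precision and recall over two answer sets."""
    pred, gold = set(pred_answers), set(gold_answers)
    if not pred and not gold:
        return 1.0  # both queries return empty results: treat as a match
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("SELECT  ?x WHERE { ?x ?p ?o }",
                  "SELECT ?x WHERE { ?x ?p ?o }"))          # True
print(round(answer_f1({"a", "b", "c"}, {"b", "c", "d"}), 3))  # 0.667
```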

Baselines based on T5-Small and T5-Base provide initial benchmarks. On the held-out test set, results were as follows:

Model      Exact-match   Answer F1
T5-Small   0.638         0.721
T5-Base    0.813         0.868

These outcomes demonstrate a marked improvement in both syntactic and semantic fidelity using the larger T5-Base model (Banerjee et al., 2023).

4. Methodology: Template Generation and Augmentation

The core methodology underpinning Deep Research Gym API interactions is the template-driven generation pipeline. Each template tuple is manually authored to specify a parameterized SPARQL query, a suite of natural-language question variants, and the included KG entity and predicate slots. Entity instantiation draws from random two-hop subgraphs to achieve representative, context-rich questions.

Literal fields are post-processed using string-level augmentations:

  • Name variants (e.g., "Smith, John William" vs. "J. W. Smith")
  • Full vs. abbreviated venue names (e.g., "European Conference..." vs. the corresponding acronym)
  • Numeric vs. textual numerals (e.g., "5" vs. "five")
  • Full addresses vs. institution names for affiliations
  • Partial paper titles to assist in disambiguation
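Two of the augmentations above can be sketched with simple heuristics. These are simplified illustrations, not the dataset's actual augmentation code.

```python
import re

def name_variants(full_name):
    """'Smith, John William' -> ['John William Smith', 'J. W. Smith']"""
    last, _, firsts = full_name.partition(", ")
    parts = firsts.split()
    natural = " ".join(parts + [last])
    initials = " ".join(f"{p[0]}." for p in parts) + f" {last}"
    return [natural, initials]

# Numeric vs. textual numerals for small standalone digits.
NUM_WORDS = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five"}

def textual_numeral(text):
    """Replace standalone digits 1-5 with their word form ('5' -> 'five')."""
    return re.sub(r"\b([1-5])\b", lambda m: NUM_WORDS[m.group(1)], text)

print(name_variants("Smith, John William"))   # ['John William Smith', 'J. W. Smith']
print(textual_numeral("papers in 5 venues"))  # papers in five venues
```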

Templates reserved for validation and test phases ensure the ability to probe generalization. Each candidate pair is filtered through a local SPARQL endpoint to guarantee executable, nonempty answers (Banerjee et al., 2023).

5. Application Domains and Evaluation Use Cases

The Deep Research Gym API, via the DBLP-QuAD corpus, serves several investigational functions:

  • KGQA semantic parsing benchmarks: The controlled dataset structure allows for the rigorous evaluation of parsers on a diverse suite of compositional and reasoning tasks, including temporal, aggregation, and disjunctive queries.
  • Generalization analysis: Withheld templates and paraphrases facilitate studies of zero-shot and compositional generalization in neural and symbolic QA models.
  • Entity/linking system training: The inclusion of multiple name forms and paraphrased questions enables the development of joint entity linking and query parsing approaches.
  • Scholarly knowledge exploration: The KG's rich schema provides testbeds for domain-specific retrieval and bibliometric analytics.

This suggests the API's most significant contributions are in standardized evaluation and in facilitating robust development of models for domain-centric semantic parsing and QA (Banerjee et al., 2023).

6. Limitations and Reproducibility Considerations

Several known limitations define current boundaries of applicability:

  • All natural-language questions derive from machine instantiations of a finite set of human-written templates, authored by only two individuals. This restricts the diversity of user intents and reduces ecological validity.
  • Bibliographic entity linking is constrained by the use of literal titles, as abbreviated names (for example, "GPT-3," "BERT") are not linked, complicating robust entity resolution in practical deployments.
  • Structural leakage is present, as approximately 80% of the templates employed for validation and testing also appear in the training set; this is partially offset by the inclusion of novel paraphrases and surface-level augmentations.
  • Reproducibility is hampered by the absence of a public SPARQL endpoint for the live DBLP KG; users are required to load the DBLP RDF dump locally to reproduce the full semantic parsing and answer retrieval process.

A plausible implication is that, despite its comprehensive scope, the Deep Research Gym API's present architecture emphasizes controlled, template-driven evaluation over open-domain QA or dynamic real-world interactions (Banerjee et al., 2023).

References

Banerjee, D., Awale, S., Usbeck, R., & Biemann, C. (2023). DBLP-QuAD: A Question Answering Dataset over the DBLP Scholarly Knowledge Graph.
