DeepScholar-base: Automated Scholarly Synthesis

Updated 29 August 2025
  • DeepScholar-base is a modular, automated research synthesis pipeline that integrates iterative query generation, semantic filtering, and document aggregation to produce coherent literature reviews.
  • It employs LOTUS semantic operators for filtering and ranking candidate documents, ensuring high relevance and verifiability of retrieved scholarly content.
  • Evaluated on metrics like knowledge synthesis, retrieval quality, and claim verifiability, DeepScholar-base sets a rigorous benchmark for AI-driven research assistants.

DeepScholar-base is a modular reference pipeline designed for generative research synthesis—specifically to autonomously generate long-form, citation-rich related work sections by retrieving and synthesizing knowledge from live scholarly sources. Developed as part of the DeepScholar-bench project (Patel et al., 27 Aug 2025), DeepScholar-base defines an automated system for query generation, scholarly retrieval, semantic filtering, and aggregation to address the complexities of real-world research synthesis. Its evaluation framework assesses multidimensional performance—covering knowledge synthesis, retrieval quality, and verifiability—setting a holistic and challenging benchmark for AI-driven research assistants.

1. System Architecture and Methodology

The DeepScholar-base pipeline is structured as a sequence of modular components, each corresponding to a core phase in generative research synthesis (a minimal end-to-end sketch follows the list):

  1. Query Generation: An LLM generates a set of iterative, diverse search queries from an input (typically the abstract of a target paper). Prompts are structured to elicit 1–10 keyword queries to maximize coverage of the target research topic.
  2. Retrieval: Each generated query is executed against a scholarly search API (in current evaluation, the ArXiv API) to obtain a batch of candidate documents.
  3. Filtering and Ranking: LOTUS semantic operators are employed to assess and rank the candidate abstracts:
    • Sem-Filter determines topical relevance by filtering papers whose abstracts align with the information needs expressed by the query.
    • Sem-TopK ranks and selects the top-k most relevant candidates.
  4. Aggregation and Synthesis: The Sem-Agg operator fuses the filtered papers, synthesizing their content into a logically organized, citation-rich related work section with inline references.
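
As a rough illustration of how these four stages compose, the sketch below wires them together in Python, assuming the arxiv client library for retrieval and LOTUS's pandas accessors (sem_filter, sem_topk, sem_agg) for stages 3 and 4. The prompts, helper names (generate_queries, search_arxiv, synthesize_related_work), model choice, and operator arguments are assumptions for illustration, not the exact DeepScholar-base configuration.

```python
from typing import Callable

import arxiv          # pip install arxiv
import pandas as pd
import lotus          # pip install lotus-ai
from lotus.models import LM

# Assumed LOTUS configuration; the model choice is illustrative.
lotus.settings.configure(lm=LM(model="gpt-4o"))

def generate_queries(abstract: str, ask: Callable[[str], str], max_queries: int = 10) -> list[str]:
    """Stage 1: elicit up to `max_queries` short keyword queries (prompt is illustrative)."""
    prompt = (f"Propose up to {max_queries} short keyword search queries, one per line, "
              f"that together cover the topic of this abstract:\n{abstract}")
    return [line.strip("-• ").strip() for line in ask(prompt).splitlines() if line.strip()]

def search_arxiv(query: str, limit: int = 20) -> pd.DataFrame:
    """Stage 2: fetch candidate papers for one query via the arXiv API client library."""
    results = arxiv.Client().results(arxiv.Search(query=query, max_results=limit))
    return pd.DataFrame(
        [{"title": r.title, "abstract": r.summary, "url": r.entry_id} for r in results]
    )

def synthesize_related_work(abstract: str, ask: Callable[[str], str], k: int = 15) -> pd.DataFrame:
    candidates = pd.concat(
        [search_arxiv(q) for q in generate_queries(abstract, ask)], ignore_index=True
    ).drop_duplicates(subset="url")

    # Stage 3: semantic filtering, then top-k ranking over candidate abstracts.
    relevant = candidates.sem_filter(
        "{abstract} is topically relevant to the following work: " + abstract
    )
    top_k = relevant.sem_topk(
        "Which {abstract} is most relevant to the following work: " + abstract, K=k
    )

    # Stage 4: aggregate the survivors into one citation-rich related-work section.
    return top_k.sem_agg(
        "Write a coherent related-work section with inline citations to each {title}, "
        "synthesizing the content of every {abstract}."
    )
```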

The pipeline explicitly decomposes synthesis into distinct, auditable steps. By integrating iterative query reasoning with retrieval and semantic filtering, DeepScholar-base operationalizes complex, multi-source aggregation—addressing the fundamental requirement of both breadth and depth in literature reviews.

2. Evaluation Framework and Metrics

DeepScholar-base is evaluated using DeepScholar-bench (Patel et al., 27 Aug 2025), a benchmark that emphasizes the holistic demands of generative research synthesis. The evaluation decomposes system outputs across three principal axes—each measured by multiple, automated metrics:

a. Knowledge Synthesis

  • Organization: Assessed via LLM-as-a-judge pairwise comparisons, focusing on logical structure, coherence, and readability of the generated section.
  • Nugget Coverage: Measures the presence of essential, atomic information nuggets, which are automatically extracted from human-written exemplars using LLM-based nuggetization procedures (a minimal scoring sketch follows below).
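
As a rough illustration, nugget coverage reduces to the fraction of exemplar nuggets that the generated section supports. The sketch below assumes the nuggets have already been extracted and that an `entails` judgment (e.g., an LLM- or NLI-based check) is available; both the function name and its backing model are hypothetical stand-ins for the benchmark's exact nuggetization tooling.

```python
from typing import Callable, Sequence

def nugget_coverage(nuggets: Sequence[str], generated_section: str,
                    entails: Callable[[str, str], bool]) -> float:
    """Fraction of exemplar nuggets (atomic facts mined from the human-written
    related-work section) that are supported by the generated section.
    `entails(nugget, text)` is a hypothetical LLM- or NLI-based judgment."""
    if not nuggets:
        return 0.0
    return sum(entails(n, generated_section) for n in nuggets) / len(nuggets)
```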

b. Retrieval Quality

  • Relevance Rate (RR): Average graded relevance of retrieved references, using the formula:

\mathrm{RR}(S) = \frac{1}{2|S|} \sum_{s \in S} \mathrm{Rel}(s)

where Rel(s) is a 0–2 relevance grade.

  • Reference Coverage (RC): Proportion of key references present in system output, relative to those marked as vital in human exemplars:

\mathrm{RC}(S, E) = \frac{1}{|E|} \sum_{s \in S} \mathbb{I}[s \in E]

  • Document Importance (DI): Compares the median citation count of retrieved sources to that of the human reference set (a computational sketch of all three retrieval metrics follows after the formula):

\mathrm{DI}(S, S^*) = \min\left\{ \frac{\mathrm{median}(\{\mathrm{num\_cites}(x) : x \in S\})}{\mathrm{median}(\{\mathrm{num\_cites}(x^*) : x^* \in S^*\})},\; 1 \right\}
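
Under these definitions, the three retrieval-quality metrics reduce to a few lines of arithmetic. The sketch below assumes relevance grades in {0, 1, 2}, reference identity via a shared key such as an arXiv ID, and per-document citation counts; these input conventions are illustrative assumptions.

```python
import statistics
from typing import Sequence, Set

def relevance_rate(grades: Sequence[int]) -> float:
    """RR: mean graded relevance of retrieved references, rescaled from 0-2 to [0, 1]."""
    return sum(grades) / (2 * len(grades)) if grades else 0.0

def reference_coverage(retrieved: Set[str], vital: Set[str]) -> float:
    """RC: fraction of the exemplar's vital references that appear in the system output."""
    return len(retrieved & vital) / len(vital) if vital else 0.0

def document_importance(system_cites: Sequence[int], human_cites: Sequence[int]) -> float:
    """DI: ratio of median citation counts (system vs. human reference set), capped at 1."""
    human_median = statistics.median(human_cites)
    if human_median == 0:
        return 1.0  # guard against division by zero when the human set has no citations
    return min(statistics.median(system_cites) / human_median, 1.0)
```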

c. Verifiability

  • Citation Precision: Fraction of citations that support their attached claims.
  • Claim Coverage: Fraction of claims in the synthesized text that are verifiably supported by cited sources, computed with a sliding window (e.g., w = 1 allows support from a citation in a neighboring sentence); a minimal sliding-window sketch follows below.
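
The sliding-window rule can be made concrete as follows: a claim in sentence i counts as supported if any citation attached to sentences i-w through i+w entails it. The sketch assumes sentence-level claims, per-sentence citation keys, source texts keyed by citation, and a generic `supports` judgment; all of these interfaces are illustrative rather than the benchmark's exact tooling.

```python
from typing import Callable, Mapping, Sequence

def claim_coverage(sentences: Sequence[str],
                   citations: Sequence[Sequence[str]],    # citation keys attached to each sentence
                   sources: Mapping[str, str],            # citation key -> cited source text
                   supports: Callable[[str, str], bool],  # hypothetical entailment judgment
                   w: int = 1) -> float:
    """Fraction of sentences whose claim is supported by a citation within +/- w sentences."""
    if not sentences:
        return 0.0
    covered = 0
    for i, claim in enumerate(sentences):
        window = range(max(0, i - w), min(len(sentences), i + w + 1))
        keys = {key for j in window for key in citations[j]}
        if any(supports(sources[key], claim) for key in keys if key in sources):
            covered += 1
    return covered / len(sentences)
```

Citation precision is the complementary check: iterate over citations rather than claims and count the fraction whose attached claim they actually support.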

This multidimensional evaluation contrasts with prior narrow benchmarks and emphasizes both the breadth (retrieval, diversity) and depth (fact verification, logical integration) of research synthesis.

3. Performance and Benchmarking Results

In systematic evaluations against open-source systems (DeepResearcher, STORM, OpenScholar), proprietary LLM-powered search AIs, and commercial systems such as OpenAI DeepResearch, DeepScholar-base establishes a competitive or superior baseline:

  • DeepScholar-base, coupled with models such as GPT-4.1 or Claude-opus-4, yields comparable or higher scores in both organization and nugget coverage (knowledge synthesis), and consistently demonstrates stronger performance in verifiability metrics—up to 6.3× higher than commercial baselines in claim support.
  • Retrieval quality (relevance rate, reference coverage, document importance) is at least on par with, and sometimes exceeds, that of alternative systems.
  • All systems, including DeepScholar-base, currently achieve an aggregate score below 19% across all seven core metrics, reflecting the unresolved complexity of long-form, multi-source synthesis.

The incorporation of LOTUS semantic operators for filtering and aggregation is credited with boosting verifiability and overall robustness relative to architectures that rely solely on LLM generation and heuristic citation.

4. Technical Specifics and Implementation

The implementation combines the LOTUS operator suite with prompt-driven query generation and a controlled retrieval source:

  • Sem-Filter, Sem-TopK, and Sem-Agg operators for semantic filtering, ranking, and synthesis.
  • Iterative, prompt-driven query generation (examples provided in the evaluation supplement), constraining queries to simple, focused forms for coverage and precision.
  • Control over retrieval source and reproducibility: the ArXiv API is used to ensure consistency across evaluation runs (a minimal retrieval sketch follows this list).
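
For reproducible retrieval, queries can be sent directly to the public arXiv export endpoint, which returns an Atom feed. The sketch below uses that documented endpoint with feedparser; the field selection, result limit, and search prefix (`all:`) are illustrative choices, not necessarily those used in the reported evaluations.

```python
import urllib.parse
import urllib.request

import feedparser  # pip install feedparser; parses the Atom feed returned by arXiv

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_search(query: str, max_results: int = 20) -> list[dict]:
    """Fetch candidate papers for one keyword query from the arXiv export API."""
    params = urllib.parse.urlencode({
        "search_query": f"all:{query}",   # search across all metadata fields
        "start": 0,
        "max_results": max_results,
    })
    with urllib.request.urlopen(f"{ARXIV_API}?{params}") as resp:
        feed = feedparser.parse(resp.read())
    return [
        {"id": entry.id, "title": entry.title, "abstract": entry.summary}
        for entry in feed.entries
    ]
```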

The benchmarking framework combines LLM-as-a-judge methodologies for organization and nugget coverage, together with automated entailment- and citation-extraction tools for claim support assessment. Key steps—such as nuggetization, claim-citation linkage, and reference grading—operate at scale and with minimal manual intervention, yielding a live, non-static evaluation infrastructure.

5. Challenges, Limitations, and Open Problems

Despite its robust design and competitive results, DeepScholar-base and the evaluated state of the art face several unresolved challenges:

  • Low Reference and Nugget Coverage: No system consistently retrieves all critical references or captures all core factual nuggets included by human experts (frequently <0.40–0.45 for both metrics).
  • Document Importance Deficit: Median citation counts of retrieved references are often lower than those in human-curated related work sections, indicating under-selection of seminal papers.
  • Verifiability Gaps: Automated evaluation of claim support—while rigorous—may underestimate support compared to expert judgment, particularly for complex or synthesized claims spanning multiple sources.
  • Overall Performance Ceiling: The 19% upper bound across metrics underscores persistent challenges in integrating accurate, diverse retrieval with coherent, fully supported synthesis.

A plausible implication is that improvements in retrieval algorithms (including better query decomposition, semantic expansion, and domain-specific ranking), as well as more refined LLM prompting or hybrid integration, are needed to narrow the performance gap relative to human research synthesis.

6. Future Directions

The DeepScholar-base framework points to several avenues for progress:

  • Enhanced Retrieval: Ablation studies suggest that integrating more comprehensive source coverage and advanced ranking strategies directly impacts downstream synthesis and coverage.
  • Reinforced Verifiability: Finer granularity in claim-citation linkage and improved entailment extraction may strengthen the reliability of citation support metrics.
  • Improved Semantic Operators: Iterative advances in semantic filtering, document aggregation, and query generation are expected to drive gains in both knowledge integration and citation quality.
  • Benchmark Extensions: The pipeline's modular nature allows for extension to alternative corpora, embedding domain-aware heuristics, or coupling with additional external knowledge graphs or APIs.

The evaluation findings position holistic, automated benchmarks like DeepScholar-bench—and reference pipelines like DeepScholar-base—as essential scaffolds for advancing generative research synthesis systems in academic contexts.

7. Summary Table: System Components and Purposes

| Pipeline Stage   | Primary Operation             | LOTUS Operator / Module |
|------------------|-------------------------------|-------------------------|
| Query Generation | LLM-driven query creation     | N/A                     |
| Retrieval        | Scholarly search API calls    | N/A                     |
| Filtering        | Semantic relevance assessment | Sem-Filter, Sem-TopK    |
| Aggregation      | Document synthesis, citation  | Sem-Agg                 |

These components, integrated in DeepScholar-base, collectively automate the process of generating verifiable, citation-rich related work summaries—a foundational operation for AI-driven research assistants evaluated on live, non-static scholarly corpora.
