ScholarQA: Scholarly Question Answering
- ScholarQA is a scholarly question answering system that integrates structured bibliographic knowledge graphs with unstructured full-text research papers for precise multi-hop reasoning.
- It employs hybrid retrieval-augmented generation pipelines combining sparse and dense retrieval, cross-encoder re-ranking, and LLM-based synthesis with strict attribution.
- Benchmark datasets like Hybrid-SQuAD and QALD-2024 validate its performance using metrics such as EM, F1 score, and nDCG, highlighting improvements in factual accuracy and retrieval robustness.
ScholarQA (SQA) addresses question answering in the scholarly domain, encompassing techniques, benchmarks, and systems designed to reason over heterogeneous data sources such as bibliographic knowledge graphs (KGs), full-text corpora, and metadata. SQA spans both open-domain scientific QA over the literature and highly structured queries over scholarly KGs, increasingly relying on hybrid retrieval, generation, and attribution mechanisms built on LLMs and retrieval-augmented generation (RAG) pipelines.
1. Problem Scope and Challenges
Scholarly QA diverges from generic open-domain QA because relevant information is fragmented across structured KGs (e.g., DBLP, SemOpenAlex) and unstructured texts (such as millions of full-text research papers and Wikipedia articles). Traditional QA systems targeting only KG triples cannot extract biographical or discursive information, while purely text-based models fail to capture precise bibliometric facts or structured relationships. Realistic scholarly queries (“main research focus of author X,” “citation metrics of institution Y”) frequently require cross-source traversal, entity disambiguation, and multi-hop reasoning, so QA models must integrate KG precision with broad text coverage (Taffa et al., 2024, Fondi et al., 2024).
2. Benchmark Datasets and Task Formulations
The evaluation of SQA systems builds upon benchmarks that enforce reasoning across both KGs and unstructured text:
- Hybrid-SQuAD (“Hybrid Scholarly Question Answering Dataset”) contains 10,581 question-answer pairs, generated by LLMs with structured sources from DBLP, SemOpenAlex, and Wikipedia. Its design ensures that answers often require bridging multiple data types (e.g., KG→Text, KG→KG→Text). The dataset captures a balanced distribution of bibliometric, biographical, organizational, and research-focused questions, with a mean question length of ~14 tokens and short span answers. LLM-driven generation was used for scalability, with human spot checks for quality. Traversal statistics highlight the dominance of KG→Text paths (55.6% of test cases), underscoring the necessity of integrating heterogeneous data (Taffa et al., 2024).
- QALD-2024 Scholarly Hybrid QA (as in (Fondi et al., 2024)) and other scenario-based datasets define rigorous test regimes with precise entity linking, structured intent labeling, and explicit evaluation of multi-type questions (counts, affiliations, biographies).
- Scenario-Based Multiple-Choice SQA (JEEVES architecture) expands SQA to multiple-choice scenarios, combining scenario contextualization with evidence retrieval from large external corpora and modeling enriched option structures [S; Q; O_i], i.e., the scenario, question, and each candidate option concatenated (Huang et al., 2021).
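To make the traversal-path idea concrete, a Hybrid-SQuAD-style record might look like the following; the field names and schema are illustrative assumptions, not the dataset's actual format.

```python
# Illustrative (hypothetical) record schema for a hybrid scholarly QA pair,
# modeled on the Hybrid-SQuAD description above. Field names are assumptions.
qa_pair = {
    "question": "What is the main research focus of the author of paper P?",
    "answer": "machine translation",              # short span answer
    "traversal": ["KG", "Text"],                  # a KG->Text bridge question
    "sources": {
        "KG": ["DBLP", "SemOpenAlex"],            # structured triples
        "Text": ["Wikipedia"],                    # unstructured passages
    },
}

def is_hybrid(pair):
    """A pair is 'hybrid' if its answer path crosses data types (e.g. KG->Text)."""
    return len(set(pair["traversal"])) > 1

print(is_hybrid(qa_pair))  # True for a KG->Text question
```

Roughly 55.6% of Hybrid-SQuAD test cases follow such a KG→Text path, which is why single-source systems underperform on it.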
3. System Architectures
3.1 Hybrid RAG Pipelines
Modern SQA systems employ retrieval-augmented generation (RAG) pipelines that jointly leverage dense and sparse retrieval, KG querying, cross-encoder and LLM-based re-ranking, and structured answer generation:
- Ai2 Scholar QA: Integrates over 100M abstracts (Semantic Scholar) and 11.7M full-text papers (Vespa cluster, 285.6M passages, max 480 tokens). The retrieval stack combines sparse (BM25) and dense (embedding) indexing, fused via weighted ensemble ranking of the two score lists. Up to 276 top passages are selected, re-ranked with mxbai-rerank-large-v1, and funneled into a multi-step generative LLM workflow. The answer module executes quote extraction, thematic outlining, section-wise synthesis, and tabularization, inserting fine-grained inline citations ([PaperID, ¶n]) for every factual claim. Attribution is systematically enforced and validated with an entailment model (GPT-4o judge), with citation precision/recall audited using protocols such as ALCE (Singh et al., 15 Apr 2025).
- Hybrid-SQuAD Baseline: Applies a three-stage pipeline: (a) sub-question extraction and SPARQL linking; (b) hybrid KG + dense text retrieval (FAISS, top-k cosine similarity); (c) answer synthesis that concatenates KG triples (as natural language spans) and text passages in a prompt for LLM generation (Taffa et al., 2024).
- SPARQL–LLM Hybrid Systems: Use divide-and-conquer routines to classify entity/intent, route queries to the appropriate KG (DBLP, SemOpenAlex) or text endpoint, execute parameterized SPARQL, and, for context-heavy cases, apply extractive QA (e.g., BERT-base-cased-SQuAD2 with top-N textual and KG-derived context). Aggregation and answer selection follow, using simple confidence ensembles between structured and unstructured sources (Fondi et al., 2024).
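The divide-and-conquer routing step in the SPARQL–LLM hybrids can be sketched as a simple intent classifier that sends a question to a KG endpoint, a text endpoint, or both; the keyword cues and three-way routing below are illustrative assumptions, not the systems' actual classifiers.

```python
# Minimal sketch of divide-and-conquer query routing: classify the question's
# intent, then route to parameterized SPARQL over a KG (DBLP, SemOpenAlex) or
# to extractive QA over text. Cue lists are toy stand-ins for a real classifier.

BIBLIOMETRIC_CUES = ("how many", "count of", "citations", "h-index")
BIOGRAPHIC_CUES = ("born", "biography", "education", "career")

def route(question: str) -> str:
    q = question.lower()
    if any(cue in q for cue in BIBLIOMETRIC_CUES):
        return "kg"      # precise counts/relations -> SPARQL endpoint
    if any(cue in q for cue in BIOGRAPHIC_CUES):
        return "text"    # discursive info -> extractive QA over passages
    return "hybrid"      # ambiguous -> query both, ensemble by confidence
```

In the real systems, the "hybrid" branch is where the confidence-based ensemble between structured and unstructured answers comes into play.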
3.2 Scenario-Based Models
- JEEVES: Joint retriever-reader architecture for scenario-based QA. The retriever is trained with implicit supervision on QA labels via a novel word-weighting network; retrieved paragraphs are scored with both sparse BM25 and word salience; reader fuses k top paragraphs via intra/inter-paragraph dual attention and self-attention mechanisms. Both retriever and reader participate in a joint end-to-end loss function, enhancing retrieval relevance and answer grounding (Huang et al., 2021).
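Retrieval scoring in the spirit of JEEVES — sparse term matching modulated by word salience — can be sketched as BM25 with per-term salience weights; the fixed salience dictionary here is a toy stand-in for the paper's learned word-weighting network.

```python
# BM25 term scores modulated by per-word salience weights (toy stand-in for
# JEEVES's word-weighting network, which learns weights from QA supervision).
import math
from collections import Counter

def bm25_weighted(query, paragraph, corpus, salience, k1=1.2, b=0.75):
    """query: list of tokens; paragraph: list of tokens; corpus: list of
    tokenized paragraphs; salience: dict mapping token -> learned weight."""
    n_docs = len(corpus)
    avgdl = sum(len(p) for p in corpus) / n_docs
    tf = Counter(paragraph)
    score = 0.0
    for t in query:
        df = sum(1 for p in corpus if t in p)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        num = tf[t] * (k1 + 1)
        den = tf[t] + k1 * (1 - b + b * len(paragraph) / avgdl)
        # Salience reweights each term's BM25 contribution.
        score += salience.get(t, 1.0) * idf * num / den
    return score
```

In JEEVES proper, the salience weights are trained end-to-end with the reader, so retrieval relevance and answer grounding improve jointly.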
4. Attribution, Validation, and Evaluation
4.1 Attribution Strategies
Attribution is central to SQA, particularly in generative models:
- LLMs are instructed to append inline citations “[PaperID, ¶n]” to every factual claim, and to cite “LLM Memory” if introducing unsupported information (Singh et al., 15 Apr 2025).
- Post-generation validation uses entailment models to classify each (claim, citation) pair as “Attributable” or not; citation precision and recall are calculated as per the ALCE protocol (Singh et al., 15 Apr 2025).
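The entailment-based citation audit can be sketched as follows; the substring-matching `entails` stub stands in for the GPT-4o judge, and the precision/recall definitions follow the ALCE-style description above.

```python
# ALCE-style citation audit: each (claim, cited passage) pair is judged by an
# entailment model. Citation precision = fraction of citations that support
# their claim; citation recall = fraction of claims with at least one
# supporting citation.

def entails(passage: str, claim: str) -> bool:
    # Stub: real systems call an NLI model or LLM judge here.
    return claim.lower() in passage.lower()

def citation_scores(claims):
    """claims: list of (claim_text, [cited_passages])."""
    cited = [(c, p) for c, ps in claims for p in ps]
    supported_cites = sum(entails(p, c) for c, p in cited)
    covered_claims = sum(any(entails(p, c) for p in ps) for c, ps in claims)
    precision = supported_cites / len(cited) if cited else 0.0
    recall = covered_claims / len(claims) if claims else 0.0
    return precision, recall
```

A claim attributed to "LLM Memory" contributes no citations, so it lowers citation recall rather than precision — making unverifiable content visible in the audit.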
4.2 Evaluation Metrics
SQA systems are evaluated across retrieval, reranking, and generation:
| Metric | Definition/Scope |
|---|---|
| Exact Match (EM) | Fraction of answers exactly matching the gold span |
| F1 Score | Harmonic mean of token-level precision and recall against the gold span |
| nDCG@10 | DCG@10 / IDCG@10, where DCG@10 = Σ_{i=1}^{10} (2^{rel_i} − 1) / log₂(i + 1); rel_i: graded relevance |
| MRR | Mean Reciprocal Rank, averaged over queries |
| RubricsScore, TotalScore | Human-composed rubrics (60% question-specific, 40% global), plus bonuses for answer length, expertise, citations (Singh et al., 15 Apr 2025) |
| Citation Precision/Recall | As defined above, using entailment-based audit |
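For reference, the span-level and ranking metrics in the table admit short implementations; answer normalization here is simplified relative to the official SQuAD evaluation script.

```python
# Minimal reference implementations of EM, token-level F1, and MRR.

def exact_match(pred: str, gold: str) -> float:
    """1.0 iff prediction matches the gold span (simplified normalization)."""
    return float(pred.strip().lower() == gold.strip().lower())

def f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def mrr(ranks):
    """ranks: 1-based rank of the first relevant result per query (None = miss)."""
    return sum(1 / r for r in ranks if r) / len(ranks)
```

For example, `f1("deep learning models", "deep learning")` yields 0.8 (precision 2/3, recall 1).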
- Hybrid-SQuAD Baseline: RAG+ChatGPT-3.5 achieves EM=69.65%, F1=74.91%; best KG-only and text-only systems are 15–37 points lower (Taffa et al., 2024).
- Ai2 Scholar QA: Outperforms all baselines on ScholarQA-CS (Rubrics: 58.0, Total: 61.9, Citation P/R: 48.1), with cross-encoder reranking nDCG@10=0.927, MRR=0.975 (Singh et al., 15 Apr 2025).
- Divide-and-Conquer Hybrid: Combined SPARQL–LLM pipeline surpasses both SPARQL-only and LLM-only on QALD-2024 (EM: 61.8%, F1: 72.4%) (Fondi et al., 2024).
5. Implementation, Accessibility, and Reproducibility
- Open-Source Frameworks: ai2-scholar-qa publicly releases all pipeline modules (retrieval, reranking, generation), evaluation scripts, and index-building utilities. Web applications (Typescript + React) provide interactive QA, with full documentation on data ingestion, API usage, and reproduction of benchmark results (Singh et al., 15 Apr 2025).
- Dataset Creation: Hybrid-SQuAD generation employs automated LLM prompting from LLM-parsed KGs and Wikipedia, with strict post-processing and attribution tracing for QA pairs (Taffa et al., 2024).
6. Limitations, Design Insights, and Future Directions
SQA systems confront multiple technical and computational barriers:
- Retrieval bottlenecks: Real-time SPARQL performance and index scalability remain limiting factors for multi-source retrieval (Fondi et al., 2024).
- Entity Linking and Multi-Hop Reasoning: Current approaches are sensitive to ambiguous author names and multi-hop KG→KG→Text traversals; these remain dominant error sources (Taffa et al., 2024).
- Attribution Robustness: Quote extraction, enforced attribution, and section-wise synthesis measurably reduce hallucinations but do not eliminate them entirely; abstention (“LLM Memory”) is used to signal unverifiable claims (Singh et al., 15 Apr 2025).
- Generalizability: Most SQA systems are concentrated on English, with limited evaluation for multilingual, non-traditional, or student-centered tutoring queries (Taffa et al., 2024).
Key design insights include: hybrid sparse+dense retrieval with tuned weighting (α=0.6) optimizes the recall/precision tradeoff; cross-encoder reranking outperforms bi-encoder baselines; and multi-step prompting with forced thematic sections enhances answer structure and user experience (Singh et al., 15 Apr 2025). Potential advances involve end-to-end fine-tuning, dynamic KG indexing, robust entity linking via neural methods, and extension to non-English corpora (Taffa et al., 2024, Fondi et al., 2024).
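The tuned sparse+dense weighting (α=0.6) can be sketched as a convex combination of normalized score lists; the min-max normalization and the assignment of α to the dense score are assumptions here, not details reported by the cited systems.

```python
# Sketch of weighted sparse+dense score fusion: min-max normalize each score
# list, then take a convex combination with weight alpha on the dense scores.

def fuse(sparse: dict, dense: dict, alpha: float = 0.6) -> dict:
    """sparse/dense: dicts mapping doc_id -> retrieval score."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}
    s, d = norm(sparse), norm(dense)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in set(s) | set(d)}
```

Tuning α trades the lexical precision of sparse matching against the semantic recall of dense retrieval, which is the tradeoff the design insight above refers to.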
7. Representative Systems and Comparative Results
| System | Architecture | Main Data Sources | Key Metrics |
|---|---|---|---|
| Ai2 Scholar QA | Full RAG, hybrid retrieval, LLM gen., explicit attribution | S2ORC full-text, Semantic Scholar | Rubrics 58.0, nDCG@10 0.927 |
| Hybrid-SQuAD Baseline | Hybrid RAG (SPARQL + dense text) | DBLP, SemOpenAlex, Wikipedia | EM 69.65%, F1 74.91% |
| SPARQL–LLM Hybrid | Divide-and-conquer, extractive QA | DBLP, SemOpenAlex, Wikipedia | EM 61.8%, F1 72.4% (QALD-2024) |
| JEEVES | Joint retriever-reader | Scenario, large corpus | 2–6% absolute accuracy gain |
These systems collectively demonstrate the superiority of hybrid, multi-source RAG models over strict KG- or text-only approaches, and illustrate the necessity of structured attribution and fine-grained evaluation for advancing SQA.
References: (Singh et al., 15 Apr 2025, Taffa et al., 2024, Fondi et al., 2024, Huang et al., 2021)