Literature-based Question Answering

Updated 24 June 2026

Literature-based Question Answering is a framework that uses automated systems to extract answers from scientific and technical documents by integrating retrieval, extraction, and reasoning techniques.
LBQA systems leverage classical information retrieval, transformer-based extractive models, and multimodal analysis to handle complex document layouts including tables and figures.
Advanced architectures combine dense retrieval, neural reranking, and tool integration to enhance accuracy and enable interdisciplinary applications across biomedical, scholarly, and narrative domains.

Literature-based Question Answering (LBQA) denotes automated systems and workflows designed to answer natural-language questions using scientific, technical, or narrative documents as their primary source of evidence. Unlike general-domain QA—which often exploits shallow web text or encyclopedic corpora—LBQA must robustly handle heterogeneous document structures, domain-specific notations, data tables, figures, and complex discourse. The field encompasses both end-user QA over scholarly or clinical literature and Q&A pair generation for research comprehension or systematic review facilitation. The following sections synthesize state-of-the-art LBQA methodologies, their technical challenges, evaluation strategies, empirical results, and key research frontiers.

1. Core Methodologies and Workflows

LBQA systems span multiple evidence assembly and reasoning paradigms, unified by their aim to extract or generate answers directly grounded in scientific or literary source material rather than pre-annotated closed-domain datasets.

1.1 Classical IR and Extractive QA Pipelines

The foundational architecture consists of document retrieval via TF–IDF/BM25, passage ranking, and answer extraction modules. In biomedical QA, weighted neural retrieval using word embeddings with in-domain idf or question-adjusted idf improves over uniform or document-weighted schemes by emphasizing domain-relevant terms. On BioASQ, for example, the “CD_q” variant yields MAP=0.377 and F₁@10=0.434, outperforming deep interaction models such as DRMM and Match Pyramid (Galkó et al., 2018).
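As a rough illustration of the retrieval stage in such a pipeline, the following sketch scores documents with Okapi BM25 in plain Python. The toy corpus, the whitespace tokenization, and the parameter defaults (k1=1.5, b=0.75) are assumptions made for the example; it does not reproduce the idf-weighted embedding scheme of Galkó et al.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed idf; the "1 +" inside the log keeps the weight non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical sentences, whitespace-tokenized).
docs = [
    "brca1 mutation increases breast cancer risk".split(),
    "perovskite solar cells show high efficiency".split(),
    "tp53 mutation is common in cancer".split(),
]
query = "brca1 cancer risk".split()
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In a full pipeline the top-ranked documents would be split into passages and handed to the answer extraction module.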

1.2 Supervised Extractive QA Over Scholarly Documents

End-to-end pipelines for clinical or materials science LBQA are built atop pretrained transformer models (BERT, SciBERT, MatBERT, MatSciBERT) fine-tuned for extractive QA tasks. For property value extraction from perovskite materials literature, MatSciBERT achieves F1=63.9, precision=65.4, recall=64.4, surpassing both rule-based CDE2 and baseline LLMs (Sipilä et al., 2024). CLINIQA introduces UMLS-driven semantic indexing, SVM-based question/document/focus classification, and cosine similarity for phrase/semantic tag vectors, with post-ranking improving mean reciprocal rank to 0.7 at top-1 (Zahid et al., 2018).
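Fine-tuned extractive readers of this kind typically emit one start logit and one end logit per token, and the answer is the span with the highest combined score. The sketch below shows that decoding step only; the logits are invented for illustration rather than produced by any of the models named above, and the length cap is an assumed parameter.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Hypothetical tokens and logits for a property-value question.
tokens = ["the", "band", "gap", "is", "1.55", "eV", "at", "room", "temperature"]
start_logits = [0.1, 0.2, 0.3, 0.1, 4.0, 0.5, 0.1, 0.2, 0.1]
end_logits = [0.1, 0.1, 0.4, 0.2, 1.0, 3.5, 0.2, 0.1, 0.3]

s, e, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])
```

Constraining the end index to follow the start index is what prevents the decoder from returning an empty or reversed span.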

1.3 Table-Structured and Visual Document QA

Significant information in the sciences is embedded in tables and not amenable to plain-text extraction. Recent work evaluates three approaches: (1) OCR pipelines (Tesseract, Pix2tex) with text/LaTeX structure fed to LLM QA modules, (2) vision-LLMs (Donut, Pix2Struct, multimodal GPT-4), and (3) explicit table structure recognition using transformer-based object detection and cell-level OCR, followed by structured QA over serialized tables. In RF-EMF extractive QA, multimodal GPT-4 achieves EM=84%, F1=88%; table-structure–aware pipelines reach F1=80%, IoU=0.65 on table bounding boxes, outperforming baseline OCR-only methods (Kim et al., 26 Aug 2025).
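The IoU figure reported for table bounding boxes is the ratio of the overlap area to the union area of a predicted and a reference box. A minimal sketch, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max) tuples with made-up coordinates:

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)   # hypothetical detected table region
reference = (5, 0, 15, 10)   # hypothetical annotated table region
iou = box_iou(predicted, reference)  # intersection 50, union 150
```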

1.4 Hybrid, Retrieval-Augmented, and Tool-Integrated QA

Emergent LBQA architectures combine dense retrieval, external tool calls, and staged reasoning over literature and structured knowledge. BioHarness employs dual-view dense retrieval (an embedding-based “as-is” view plus a hypothetical-abstract rewrite), neural reranking, and substrate-aware cascade control. The controller returns the answer when the evidence is high-confidence and grounded, and otherwise escalates to REPL-style tool integration (e.g., gene resolver, atlas measurement). This raises pooled QA benchmark scores from 65.9 to 71.0 (Xiao et al., 17 Jun 2026).
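The cascade idea can be sketched as a small controller that tries the retrieval path first and escalates only when the draft answer is not both confident and grounded. This is an illustrative sketch and not the BioHarness implementation: the `Draft` fields, the 0.8 threshold, and the `toy_reader` and `toy_tool` stand-ins are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    answer: str
    confidence: float  # hypothetical calibrated score in [0, 1]
    grounded: bool     # True if the answer is supported by a retrieved passage

def cascade_answer(question, retrieve_and_answer, tool_fallback, threshold=0.8):
    """Return (answer, route): retrieval path first, tools only on escalation."""
    draft = retrieve_and_answer(question)
    if draft.grounded and draft.confidence >= threshold:
        return draft.answer, "retrieval"
    # Low confidence or ungrounded evidence: escalate to the tool path.
    return tool_fallback(question, draft), "tool"

# Hypothetical stand-ins for a retrieval-based reader and a gene-alias resolver.
def toy_reader(question):
    if "alias" in question:
        return Draft(answer="unknown", confidence=0.3, grounded=False)
    return Draft(answer="tumor protein p53", confidence=0.93, grounded=True)

def toy_tool(question, draft):
    return "resolved via tool: P53 -> TP53"

direct = cascade_answer("What does TP53 encode?", toy_reader, toy_tool)
escalated = cascade_answer("Which gene has the alias P53?", toy_reader, toy_tool)
```

Passing the draft into the fallback lets the tool path reuse whatever partial evidence the retrieval stage already gathered.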

1.5 Question Answering via Knowledge Graphs and QG Pipelines

LBQA may generate Q&A pairs to explicate scientific contributions through two main strategies: (a) salient paragraph selection with LLM-based question generation (e.g., GPT-3.5), followed by answer extraction, or (b) fine-tuned entity-relation extraction (REBEL) to build knowledge graphs, with triplets ranked by TF-IDF centrality, PageRank, and semantic relevance. KG-based QA generation is favored for capturing novelty and crucial relationships not evident from single-document local evidence (Azarbonyad et al., 18 Jul 2025).
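To make the PageRank part of triplet ranking concrete, the sketch below runs power-iteration PageRank over the entity graph induced by a few triplets and orders the triplets by the combined rank of their head and tail. The triplets are invented, and summing the two node ranks is an assumed scoring rule; the TF-IDF and semantic relevance signals mentioned above are left out.

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over directed (head, tail) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for head, tail in edges:
        out[head].append(tail)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for tail in out[n]:
                    new[tail] += share
            else:
                # Dangling node: spread its mass uniformly over all nodes.
                share = damping * rank[n] / len(nodes)
                for m in nodes:
                    new[m] += share
        rank = new
    return rank

# Hypothetical (head, relation, tail) triplets extracted from a paper.
triplets = [
    ("model_a", "uses", "attention"),
    ("model_b", "uses", "attention"),
    ("attention", "improves", "accuracy"),
    ("model_a", "trained_on", "corpus_x"),
]
rank = pagerank([(h, t) for h, _, t in triplets])
# Assumed scoring rule: a triplet is as salient as its two entities combined.
scored = sorted(triplets, key=lambda tr: rank[tr[0]] + rank[tr[2]], reverse=True)
```

The highest-scoring triplets would then be the candidates turned into question-answer pairs.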

1.6 Long-Context and Literary QA

For narrative or literary texts, systems such as BookQA (context retrieval plus multi-hop key-value memory reasoning) and dedicated long-context LLM benchmarks (e.g., LiteraryQA, LittiChoQA) push LBQA into domains requiring sustained discourse, paraphrase, and multi-paragraph reasoning. LiteraryQA benchmarks 7 LLMs with context windows up to 1M tokens; best open models approach ROUGE-L≈0.415, EM≈0.201, while LLM-as-a-Judge evaluations show higher correlation with human judgment than n-gram metrics (Bonomo et al., 15 Oct 2025). LittiChoQA provides over 270k non-factoid QA in 17 Indic languages, evaluating multilingual LLMs and context-shortening strategies (Khandelwal et al., 6 Jan 2026).

2. Evaluation Metrics and Benchmarking

LBQA evaluation employs both extractive and semantic measures, tailored to the answer format and domain:

Span-based metrics: Exact Match (EM), token-level F1: $\mathrm{F1} = \frac{2 \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$ (as in SQuAD, BioASQ, CLINIQA, etc.) (Galkó et al., 2018, Zahid et al., 2018, Sipilä et al., 2024).
Table QA metrics: Table bounding box IoU and mean average precision (for structure detection) (Kim et al., 26 Aug 2025).
Ranking metrics: Mean Reciprocal Rank (MRR), Precision@k, Hit@5 (in multi-entity complex QA such as LiCQA) (Saha et al., 25 Feb 2026).
Semantic similarity: STS MuTe, METEOR, ROUGE-n/L, BERTScore on narrative QA (Bonomo et al., 15 Oct 2025, Khandelwal et al., 6 Jan 2026).
Human/LLM-as-Judge Rubrics: Annotator Likert scoring, “Excellent/Average/Poor” (by domain expert or API model) (Bonomo et al., 15 Oct 2025, Dayarathne et al., 5 Nov 2025).
Specialized metrics: User effort (words read), answer completeness/relevance/fluency (domain SME scoring in KG-based pipeline) (Azarbonyad et al., 18 Jul 2025).
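The span-based metrics in the list above can be computed in a few lines. This sketch follows the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); individual benchmarks may normalize differently, so the details are assumptions rather than any one benchmark's official scorer.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(p == g)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(p), overlap / len(g)
    return 2 * prec * rec / (prec + rec)

em = exact_match("The BRCA1 gene", "brca1 gene.")
f1 = token_f1("BRCA1 gene mutation", "BRCA1 gene")  # precision 2/3, recall 1
```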

LLM-as-a-Judge has emerged as the most reliable metric for semantic adequacy and informativeness in long-context/narrative LBQA, with Kendall’s τ up to 0.69 when summaries are provided as auxiliary references (Bonomo et al., 15 Oct 2025). N-gram overlap metrics (EM, F1, ROUGE-L) often exhibit poor to moderate correlation (τ ≈ 0.03–0.44) with human judgment in narrative settings.
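Kendall's τ measures how often a metric orders pairs of answers the same way human raters do. The sketch below computes the simple τ-a variant, which ignores ties; the cited work may use a tie-corrected variant, and the score lists are invented.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs; tied pairs count as neither."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

human_scores = [1, 2, 3, 4, 5]    # hypothetical human ratings of five answers
metric_scores = [1, 3, 2, 4, 5]   # hypothetical automatic scores for the same answers
tau = kendall_tau_a(human_scores, metric_scores)  # 9 concordant, 1 discordant pair
```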

3. Practical Implementations and Case Studies

Biomedical and Clinical LBQA

CLINIQA integrates UMLS semantic parsing, SVM/KNN-based evidence detection/classification, and heuristic line/sentence scoring for PubMed abstract retrieval; final answer ranking uses focus-concept detection. It achieved recall ≈ 0.85 and MRR = 0.7 for pancreatic cancer queries (Zahid et al., 2018). BioHarness, through staged cascade reasoning, addresses substrate mismatches (e.g., gene alias ambiguity, technical factoid normalization) by integrating PubMed retrieval, knowledge bases, and biological atlas APIs, yielding +5.1 pooled F1 absolute gain over strong retrieval-only baselines in large multi-format biomedical benchmarks (Xiao et al., 17 Jun 2026).

Scholarly Metadata and Linked Data QA

A hybrid SPARQL + LLM QA system for QALD-2024 queries routes entity- and relation-centric questions to SPARQL query templates and complex/personal queries to BERT extractive QA, achieving EM=0.33, F1=0.38 on DBLP, SemOpenAlex, and Wikipedia-derived scholarly QA (Fondi et al., 2024).

Structured and Visual Scientific Document QA

QA over scientific tables leverages transformer-based detection, structure recognition, and OCR fusion. For extractive outcomes on table-rich RF-EMF documents, preserving cell and row/column integrity substantially improves EM and recall. Failure modes include OCR symbol misrecognition, poor segmentation of non-trivial layouts, and multimodal model sensitivity to image quality (Kim et al., 26 Aug 2025).
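One way to preserve row and column integrity when handing a recognized table to a text-only QA model is to serialize each row with its column headers attached. The format below is an assumption for illustration, not the serialization used by Kim et al., and the table values are made up.

```python
def serialize_table(header, rows):
    """Emit one 'column: value' record per row so each cell keeps its column label."""
    lines = []
    for r, row in enumerate(rows, start=1):
        cells = "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        lines.append(f"row {r} | {cells}")
    return "\n".join(lines)

# Hypothetical exposure table after structure recognition and cell-level OCR.
header = ["Frequency (MHz)", "SAR (W/kg)", "Exposure (h)"]
rows = [
    ["900", "1.6", "2"],
    ["1800", "0.8", "4"],
]
serialized = serialize_table(header, rows)
```

Because every value carries its header, a question such as "What SAR was used at 1800 MHz?" can be answered from a single serialized line without relying on column position.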

Retrieval-Augmented Generation (RAG) for Scholarly QA

RAG architectures, as in contemporary computer science literature QA, use SPECTER (SciBERT-derived) embeddings, FAISS vector retrieval, and LLMs (GPT-3.5, Mistral-7B, Falcon-7B, etc.) to answer both binary and open questions. With properly engineered prompts and domain-specific retrieval, GPT-3.5+RAG achieves accuracy=0.90, with Mistral-7B+RAG as the top open-source model. Latency and inference cost differ sharply by deployment (API vs. local), with open-source quantized models lagging in throughput (Dayarathne et al., 5 Nov 2025).
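The retrieve-then-prompt core of such a RAG system can be sketched without any external library. Here a brute-force cosine search stands in for FAISS, three-dimensional toy vectors stand in for SPECTER embeddings, and the prompt wording is an assumption; the LLM call itself is omitted.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query_vec, index, k=2):
    """Brute-force nearest neighbours by cosine similarity."""
    return sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)[:k]

def build_prompt(question, passages):
    """Assemble a grounded prompt with numbered context passages."""
    context = "\n".join(f"[{i}] {p['text']}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical passages with toy 3-d embeddings.
index = [
    {"id": "A", "vec": [0.9, 0.1, 0.1], "text": "Transformers rely on self-attention."},
    {"id": "B", "vec": [0.0, 1.0, 0.0], "text": "B-trees keep keys in sorted order."},
    {"id": "C", "vec": [0.5, 0.0, 0.9], "text": "Attention cost grows with sequence length."},
]
question = "How do transformers process sequences?"
query_vec = [1.0, 0.0, 0.2]  # stand-in for the embedded question

hits = top_k(query_vec, index, k=2)
prompt = build_prompt(question, hits)
```

The resulting prompt string is what would be sent to the generator model, whether an API-hosted or a locally deployed one.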

4. Complex, Multi-Document, and Long-Context QA

Answer synthesis for questions requiring aggregation across documents or long evidence chains necessitates advanced retrieval, candidate scoring, and reasoning:

Complex QA:

LiCQA retrieves top-k documents, extracts candidate entities by NER (type-matched), computes their semantic similarity (InferSent/Sentence-BERT), aggregates scores, and combines with document frequency. Its unsupervised, hyperparameter-light architecture achieves MRR=0.432, P@1=0.293 on complex Wikipedia/GoogleTrends benchmarks and runs more than 8x faster than graph-based and neural QA baselines (Saha et al., 25 Feb 2026).
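The aggregation step can be illustrated with a small sketch that pools per-document candidate scores and weights them by how many retrieved documents mention each candidate. The combination rule (mean similarity times document support) is an assumption and not the published LiCQA formula; the candidates and similarity values, taken here to be similarities to the question, are invented. A reciprocal-rank helper shows how MRR is derived from the ranking.

```python
from collections import defaultdict

def rank_candidates(per_doc_candidates):
    """Rank entities by mean similarity weighted by the share of documents mentioning them."""
    sim_sum = defaultdict(float)
    doc_freq = defaultdict(int)
    for doc in per_doc_candidates:
        # Keep the best similarity per entity within one document.
        best_in_doc = {}
        for entity, sim in doc:
            best_in_doc[entity] = max(sim, best_in_doc.get(entity, 0.0))
        for entity, sim in best_in_doc.items():
            sim_sum[entity] += sim
            doc_freq[entity] += 1
    n_docs = len(per_doc_candidates)
    scores = {}
    for entity in sim_sum:
        mean_sim = sim_sum[entity] / doc_freq[entity]
        support = doc_freq[entity] / n_docs
        scores[entity] = mean_sim * support  # assumed combination rule
    return sorted(scores, key=scores.get, reverse=True), scores

def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct answer, or 0 if none is found."""
    for position, entity in enumerate(ranked, start=1):
        if entity in gold:
            return 1.0 / position
    return 0.0

# Hypothetical type-matched candidates with similarity scores, one list per document.
retrieved = [
    [("Marie Curie", 0.9), ("Pierre Curie", 0.6)],
    [("Marie Curie", 0.8)],
    [("Pierre Curie", 0.7), ("Albert Einstein", 0.95)],
]
ranked, scores = rank_candidates(retrieved)
rr = reciprocal_rank(ranked, {"Pierre Curie"})
```

Averaging `reciprocal_rank` over all questions in a benchmark gives the MRR; P@1 is the fraction of questions whose top-ranked candidate is correct.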

Long-context QA in Literature:

BookQA deploys BERT-based retrieval (fine-tuned for binary passage relevance) and multi-hop key-value memory reasoning for character-centric “Who” questions. Despite advances (P@1≈18.7%, MRR≈0.376), strong dependence on context retrieval and the need for improved commonsense and multi-hop narrative inference persist (Angelidis et al., 2019). LiteraryQA and LittiChoQA further extend these evaluation setups to full novels and low-resource languages with tens/hundreds of thousands of QA pairs (Bonomo et al., 15 Oct 2025, Khandelwal et al., 6 Jan 2026).

5. Interactive and Generative QA Paradigms

Interactive LBQA includes pipelines that support iterative user-driven clarification, follow-up questions, and conversational context maintenance (Biancofiore et al., 2022). Techniques span from stateless disambiguation and follow-up suggestion to stateful dialog with explicit or implicit coreference/state tracking via BERT/transformer backbone models. End-to-end systems such as AnswerQuest integrate extractive QA with neural question generation for multi-paragraph documents, exploiting shared normalization and answer verification filters without joint optimization. Metrics include EM, F1, and QA-verified fluency/answerability (Roemmele et al., 2021).

QA pair generation for scientific comprehension can exploit both CCQG (salient paragraph+LLM QG) and KG-based pipelines (REBEL ER extraction, triplet saliency measures, and SME/LLM rubric evaluation), delivering high relevance, specificity, and factuality in both computer science and life science research (Azarbonyad et al., 18 Jul 2025).

6. Open Challenges and Prospective Directions

Challenges in LBQA are persistent across domains:

Table structure and notation: Accurate handling of merged/nested cells, mathematical symbols, and dense formatting necessitates explicit structure detection and domain-adaptive OCR (future work: transformer models with joint layout and decoding, uncertainty calibration) (Kim et al., 26 Aug 2025).
Entity normalization and substrate integration: Resolving gene/chemical aliases, knowledge ID mapping, and accessing structured atlas data demands explicit tools and controller logic, as in BioHarness (Xiao et al., 17 Jun 2026).
Evaluation alignment: Traditional n-gram metrics diverge from human semantic judgments on long-context or abstractive QA; LLM-as-a-Judge and semantic STS measures are now preferred, though they are expensive and less interpretable (Bonomo et al., 15 Oct 2025, Khandelwal et al., 6 Jan 2026).
Cross-document synthesis and multi-hop inference: Solutions require unsupervised or minimally-supervised coordination of retrieval, candidate extraction, and reasoning (e.g., LiCQA, REPL-style reasoning in BioHarness), with continuing bottlenecks in recall and robust entity linking (Saha et al., 25 Feb 2026, Xiao et al., 17 Jun 2026).
Scalability, throughput, and resource adaptation: Efficient context window management, quantized/accelerated inference, and continual model/version refresh are central to deploying LBQA at scale (Dayarathne et al., 5 Nov 2025, Khandelwal et al., 6 Jan 2026).
Hallucination and verification: Hallucination in generative LLMs, especially in zero/few-shot and non-English settings, remains a central concern, with post-hoc NLI-based detectors only partially mitigating the issue (Azarbonyad et al., 18 Jul 2025, Sipilä et al., 2024).

Future research is focused on end-to-end differentiable control, joint retrieval-generation, task-specific pretraining (especially for scientific narrative coherence and table understanding), fine-grained uncertainty estimation, and extension of QA datasets to low-resource languages and underrepresented scientific domains (Khandelwal et al., 6 Jan 2026, Bonomo et al., 15 Oct 2025, Kim et al., 26 Aug 2025).

In summary, literature-based question answering has evolved from index-driven passage extraction to multi-stage, neural, and tool-integrated architectures capable of structured, abstractive, and context-aware reasoning. Effective LBQA depends critically on evidence assembly strategies, domain-specific adaptation, explicit structure preservation, and calibrated evaluation. The field remains highly dynamic, with advances in retrieval, pretrained LLMs, visual understanding, KG construction, and evaluation driving improvements in both extractive and generative LBQA scenarios.