- The paper proposes the SerenQA framework and RNS metric to quantitatively measure serendipity in KGQA for drug repurposing.
- It integrates LLM ensembles, expert crowdsourcing, and RNS-guided optimization to partition explicit and serendipitous answer sets.
- Experimental results reveal a trade-off between factual accuracy and the generation of novel, surprising drug repurposing insights.
Assessing LLMs for Serendipitous Discovery in Knowledge Graphs: The SerenQA Framework for Drug Repurposing
Introduction
This paper investigates the capacity of LLMs to uncover serendipitous connections within scientific Knowledge Graph Question Answering (KGQA), proposing the SerenQA framework as an evaluation methodology grounded in the biomedical task of drug repurposing. The premise is that while LLMs have achieved robust performance in knowledge graph-augmented question answering, their answers tend to be highly relevant yet predictable. This is a real limitation, because scientific innovation often depends on the capacity to surface results that are not just correct but genuinely surprising, insightful, and inspiring for experts.
Serendipity in Knowledge Graph QA
The authors formally introduce serendipity-awareness in KGQA as the identification of answer candidates that are not merely accurate or familiar (existing knowledge), but that satisfy a joint criterion of relevance, novelty, and surprise—properties considered central to serendipitous discovery in scientific workflows. They operationalize this through the concept of an answer partition (A_e, A_s), where A_e denotes explicit, KG-supported answers and A_s contains answers that are contextually relevant but extend beyond known facts, potentially revealing non-obvious research opportunities.
A key contribution is the RNS (Relevance, Novelty, Surprise) metric, which provides a quantitative, information-theoretic score for serendipity in KGQA. RNS is defined as a weighted sum of the following components (a minimal computational sketch follows the list):
- Relevance (R): Contextual proximity between A_s and A_e as measured by GCN embedding distances, ensuring serendipitous answers remain scientifically meaningful.
- Novelty (N): Computed as (1 − Mutual Information) between the two sets; high novelty implies A_s encodes knowledge not redundant with A_e.
- Surprise (S): Formalized as the Jensen–Shannon divergence between inferred probability distributions over A_s and A_e, capturing distributional shift and unpredictability.
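To make the definition concrete, below is a minimal sketch of how the three components could be combined, assuming unit weights, precomputed per-answer GCN embeddings (`emb_e`, `emb_s`), a joint co-reachability table (`joint`) for the MI estimate, and the multi-hop distributions (`p_e`, `p_s`) described next; the paper's exact estimators, normalization, and learned weights may differ.

```python
import numpy as np

EPS = 1e-12  # numerical floor to avoid log(0) and division by zero

def relevance(emb_e: np.ndarray, emb_s: np.ndarray) -> float:
    """R: mean cosine similarity between serendipitous and explicit
    answer embeddings (a proxy for GCN embedding proximity)."""
    e = emb_e / (np.linalg.norm(emb_e, axis=1, keepdims=True) + EPS)
    s = emb_s / (np.linalg.norm(emb_s, axis=1, keepdims=True) + EPS)
    return float((s @ e.T).mean())

def novelty(joint: np.ndarray) -> float:
    """N = 1 - normalized MI, estimated from a joint co-reachability table
    (rows: A_e answers, columns: A_s answers). Normalizing by the smaller
    marginal entropy keeps N in [0, 1]; this normalization is an assumption."""
    j = joint / joint.sum()
    pe = j.sum(axis=1, keepdims=True)
    ps = j.sum(axis=0, keepdims=True)
    mi = float(np.sum(j * np.log((j + EPS) / (pe @ ps + EPS))))
    h_min = min(-float(np.sum(pe * np.log(pe + EPS))),
                -float(np.sum(ps * np.log(ps + EPS))))
    return 1.0 - mi / (h_min + EPS)

def surprise(p_e: np.ndarray, p_s: np.ndarray) -> float:
    """S: Jensen-Shannon divergence between the two inferred distributions."""
    p, q = p_e / p_e.sum(), p_s / p_s.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + EPS) / (b + EPS))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rns(emb_e, emb_s, joint, p_e, p_s, w=(1.0, 1.0, 1.0)) -> float:
    """RNS as a weighted sum of the three components."""
    return (w[0] * relevance(emb_e, emb_s)
            + w[1] * novelty(joint)
            + w[2] * surprise(p_e, p_s))
```

With this normalization, N stays in [0, 1] and S is bounded by ln 2, so the unit-weight components of the sketch remain on comparable scales across queries.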
Each component is rigorously axiomatized: RNS demonstrates scale-invariance, dependence only on local subgraphs, and non-monotonicity with respect to answer set size. Underlying probability distributions are computed via multi-hop (three-hop) transition matrices over the KG, approximated efficiently via matrix multiplication and power/PageRank-style iterations, making the metric scalable to large graphs.
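As a rough illustration of that scalability claim, the following sketch (assuming a SciPy sparse adjacency matrix; the damping factor and restart scheme are illustrative, not taken from the paper) computes a k-hop reachability distribution with sparse matrix-vector products rather than materializing a dense power of the transition matrix:

```python
import numpy as np
import scipy.sparse as sp

def khop_distribution(adj: sp.csr_matrix, seeds, k: int = 3,
                      damping: float = 0.85) -> np.ndarray:
    """Probability of reaching each KG node within k hops of a seed
    answer set, via k sparse mat-vec products (never forming P^k)."""
    n = adj.shape[0]
    # Row-normalize the adjacency into a random-walk transition matrix;
    # the max(deg, 1) guard merely avoids division by zero on isolated nodes.
    deg = np.asarray(adj.sum(axis=1)).ravel()
    P = sp.diags(1.0 / np.maximum(deg, 1.0)) @ adj
    # Uniform mass on the seed (answer) nodes.
    p = np.zeros(n)
    p[list(seeds)] = 1.0 / len(seeds)
    restart = p.copy()
    for _ in range(k):
        # PageRank-style damped step: walk one hop, keep some restart mass.
        p = damping * (P.T @ p) + (1.0 - damping) * restart
    return p
```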
Benchmark Construction: Serendipity-aware Drug Repurposing Dataset
Because benchmarking serendipity requires high-quality reference partitions, the authors construct a dataset on top of the Clinical Knowledge Graph (CKG), densely annotated for drug-disease relationships. Questions are paired with Cypher queries, structurally decomposed to match realistic, complex multi-hop patterns common in biomedical informatics. For each query, the full candidate answer set is partitioned into A_e (existing) and A_s (serendipitous) using three strategies:
- LLM Ensemble: Multiple LLMs score candidate answers for serendipity; top answers form A_s.
- Expert Crowdsourcing: Biomedical experts refine and validate serendipity labels, ensuring clinical plausibility.
- RNS-Guided Greedy Optimization: The RNS criterion is maximized directly using a swap-based search over the candidate space, calibrated against expert-validated results (a greedy-swap sketch follows below).
This partition simulates an open-world setting, reflecting how real-world discoveries often originate outside the observed domain KG.
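A minimal sketch of the swap-based search, assuming a `score_fn` such as the `rns` function above and a fixed serendipity-set size `k`; the paper's calibration against expert-validated labels and its exact stopping rules are omitted:

```python
def greedy_rns_partition(candidates, explicit, score_fn, k, max_rounds=50):
    """Pick a size-k serendipity set A_s from `candidates` by local swaps
    that greedily increase an RNS-style score against the explicit set A_e."""
    pool = [c for c in candidates if c not in explicit]
    a_s, rest = pool[:k], pool[k:]   # arbitrary initial split of the pool
    best = score_fn(explicit, a_s)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(a_s)):
            for j in range(len(rest)):
                a_s[i], rest[j] = rest[j], a_s[i]      # trial swap
                trial = score_fn(explicit, a_s)
                if trial > best:
                    best, improved = trial, True        # keep improving swap
                else:
                    a_s[i], rest[j] = rest[j], a_s[i]   # revert the swap
        if not improved:
            break  # local optimum: no single swap raises the RNS score
    return a_s, best
```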
Evaluation Pipeline
SerenQA implements a three-stage pipeline to systematically profile LLM performance:
- Knowledge Retrieval: LLMs are tasked with translating NL queries into KG queries and retrieving A_e. Metrics include hit rate, F1, and executability across varying query complexities (1-hop, 2-hop, 3+ hops, intersections).
- Subgraph Reasoning: LLMs generate natural language summaries of retrieved subgraphs, tested for factual faithfulness, comprehensiveness, and explicit coverage of serendipitous paths.
- Serendipity Exploration: Starting from A_e, LLMs perform guided beam search (with both chain-of-thought and standard prompts) to discover A_s; a schematic of this search follows the list. Evaluations rely on relevance to ground truth, type matching, and exact serendipity hits.
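Schematically, the third stage can be pictured as below, where `neighbors` (KG expansion) and `llm_score` (the LLM's rating of a candidate path under a chain-of-thought or standard prompt) are illustrative stand-ins rather than the toolkit's actual API:

```python
def serendipity_beam_search(a_e, neighbors, llm_score, beam_width=5, depth=3):
    """Expand outward from the explicit answers A_e, keeping only the
    top-scored frontier of paths at each depth."""
    beam = [(node,) for node in a_e]   # one starting path per explicit answer
    for _ in range(depth):
        frontier = [
            path + (nxt,)
            for path in beam
            for nxt in neighbors(path[-1])
            if nxt not in path         # avoid trivial cycles
        ]
        if not frontier:
            break
        # The LLM ranks candidate paths by judged serendipity.
        frontier.sort(key=llm_score, reverse=True)
        beam = frontier[:beam_width]
    return {path[-1] for path in beam}  # endpoints proposed as A_s candidates
```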
Experimental Results
Knowledge Retrieval: State-of-the-art LLMs (e.g., DeepSeek, GPT-4) achieve high F1 (∼78%) on simple queries, but their accuracy degrades rapidly as query complexity increases, particularly for multi-hop and intersection types (<10% F1 beyond 2 hops).
Subgraph Reasoning: There is a fundamental trade-off: models with higher factual accuracy cover serendipity less extensively, while models with broader coverage (e.g., Mixtral-8×7B) may hallucinate more frequently, lowering faithfulness scores.
Serendipity Exploration: All models exhibit low serendipity recall (SerenHit <0.15), with larger models only modestly outperforming smaller ones. Notably, removal of intermediate summaries sometimes increases performance, suggesting summarization may induce additional hallucinations in exploration. No model dominates all metrics, underlining the challenge of reliably surfacing non-obvious, yet meaningful, connections.
Partition Robustness: Cross-partition evaluation (ensemble, expert, RNS-guided) yields high correlation, validating the benchmark construction and the RNS metric. The testbed is robust to labeling and methodological choices.
Implications and Future Directions
The findings indicate that, while LLMs retrieve and reason well over the explicit knowledge embodied in KGs, their capacity for serendipity-driven discovery is severely limited under current architectures and training regimes. This underscores the distinction between extrapolative modes (retrieval, reasoning) and truly generative ones (hypothesis forming, non-trivial link finding). The modular, open-source SerenQA toolkit provides a rigorous foundation for future work in AI-accelerated scientific discovery, suggesting that further progress will require:
- Explicit modeling of out-of-graph priors, possibly via integration of multiple KGs or external (literature-derived) evidence bases.
- Joint optimization of factuality and serendipity via multi-agent or Mixture-of-Experts (MoE) architectures.
- Improved prompt engineering and systematic chain-of-thought methods for abstraction and analogy.
- Scaling human-in-the-loop feedback to refine RNS weighting and supervision.
Conclusion
SerenQA codifies the evaluation of serendipitous knowledge discovery in KGQA and applies it to drug repurposing, combining theoretical rigor in metric design, robust benchmark annotation, and empirical LLM evaluation. The results expose substantial gaps in current LLM methodologies for scientific serendipity, motivating future research at the intersection of automated knowledge mining, generative reasoning, and interdisciplinary expert curation. The tools and resources made publicly available by this work are positioned to catalyze systemic improvements in the development of genuinely discovery-oriented language agents.