- The paper proposes the SerenQA framework and RNS metric to quantitatively measure serendipity in KGQA for drug repurposing.
- It integrates LLM ensembles, expert crowdsourcing, and RNS-guided optimization to partition explicit and serendipitous answer sets.
- Experimental results reveal a trade-off between factual accuracy and the generation of novel, surprising drug repurposing insights.
Assessing LLMs for Serendipitous Discovery in Knowledge Graphs: The SerenQA Framework for Drug Repurposing
Introduction
This paper investigates the capacity of LLMs to uncover serendipitous connections within scientific Knowledge Graph Question Answering (KGQA), proposing the SerenQA framework as an evaluation methodology grounded in the biomedical task of drug repurposing. The premise is that while LLMs have achieved robust performance in knowledge graph-augmented question answering, their answers tend to be highly relevant yet predictable. This is a real limitation, because scientific innovation often depends on the capacity to surface results that are not just correct but genuinely surprising, insightful, and inspiring for experts.
Serendipity in Knowledge Graph QA
The authors formally introduce serendipity-awareness in KGQA as the identification of answer candidates that are not merely accurate or familiar (existing knowledge), but that satisfy a joint criterion of relevance, novelty, and surprise—properties considered central to serendipitous discovery in scientific workflows. They operationalize this through the concept of an answer partition (A_e, A_s), where A_e denotes explicit, KG-supported answers and A_s contains answers that are contextually relevant but extend beyond known facts, potentially revealing non-obvious research opportunities.
A key contribution is the RNS (Relevance, Novelty, Surprise) metric, which provides a quantitative, information-theoretic score for serendipity in KGQA. RNS is defined as a weighted sum of the following components (a minimal computational sketch follows the list):
- Relevance (R): Contextual proximity between A_s and A_e as measured by GCN embedding distances, ensuring serendipitous answers remain scientifically meaningful.
- Novelty (N): Computed as (1 − Mutual Information) between the two sets; high novelty implies A_s encodes knowledge not redundant with A_e.
- Surprise (S): Formalized as the Jensen–Shannon divergence between inferred probability distributions over A_s and A_e, capturing distributional shift and unpredictability.
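To make the definition concrete, below is a minimal sketch of how the three components could be combined, assuming unit weights, precomputed per-answer GCN embeddings (`emb_e`, `emb_s`), a joint co-reachability table (`joint`) for the MI estimate, and the multi-hop distributions (`p_e`, `p_s`) described next; the paper's exact estimators, normalization, and learned weights may differ.

```python
import numpy as np

EPS = 1e-12  # numerical floor to avoid log(0) and division by zero

def relevance(emb_e: np.ndarray, emb_s: np.ndarray) -> float:
    """R: mean cosine similarity between serendipitous and explicit
    answer embeddings (a proxy for GCN embedding proximity)."""
    e = emb_e / (np.linalg.norm(emb_e, axis=1, keepdims=True) + EPS)
    s = emb_s / (np.linalg.norm(emb_s, axis=1, keepdims=True) + EPS)
    return float((s @ e.T).mean())

def novelty(joint: np.ndarray) -> float:
    """N = 1 - normalized MI, estimated from a joint co-reachability table
    (rows: A_e answers, columns: A_s answers). Normalizing by the smaller
    marginal entropy keeps N in [0, 1]; this normalization is an assumption."""
    j = joint / joint.sum()
    pe = j.sum(axis=1, keepdims=True)
    ps = j.sum(axis=0, keepdims=True)
    mi = float(np.sum(j * np.log((j + EPS) / (pe @ ps + EPS))))
    h_min = min(-float(np.sum(pe * np.log(pe + EPS))),
                -float(np.sum(ps * np.log(ps + EPS))))
    return 1.0 - mi / (h_min + EPS)

def surprise(p_e: np.ndarray, p_s: np.ndarray) -> float:
    """S: Jensen-Shannon divergence between the two inferred distributions."""
    p, q = p_e / p_e.sum(), p_s / p_s.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + EPS) / (b + EPS))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rns(emb_e, emb_s, joint, p_e, p_s, w=(1.0, 1.0, 1.0)) -> float:
    """RNS as a weighted sum of the three components."""
    return (w[0] * relevance(emb_e, emb_s)
            + w[1] * novelty(joint)
            + w[2] * surprise(p_e, p_s))
```

With this normalization, N stays in [0, 1] and S is bounded by ln 2, so the unit-weight components of the sketch remain on comparable scales across queries.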
Each component is rigorously axiomatized: RNS demonstrates scale-invariance, dependence only on local subgraphs, and non-monotonicity with respect to answer set size. Underlying probability distributions are computed via multi-hop (three-hop) transition matrices over the KG, approximated efficiently via matrix multiplication and power/PageRank-style iterations, making the metric scalable to large graphs.
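As a rough illustration of that scalability claim, the following sketch (assuming a SciPy sparse adjacency matrix; the damping factor and restart scheme are illustrative, not taken from the paper) computes a k-hop reachability distribution with sparse matrix-vector products rather than materializing a dense power of the transition matrix:

```python
import numpy as np
import scipy.sparse as sp

def khop_distribution(adj: sp.csr_matrix, seeds, k: int = 3,
                      damping: float = 0.85) -> np.ndarray:
    """Probability of reaching each KG node within k hops of a seed
    answer set, via k sparse mat-vec products (never forming P^k)."""
    n = adj.shape[0]
    # Row-normalize the adjacency into a random-walk transition matrix;
    # the max(deg, 1) guard merely avoids division by zero on isolated nodes.
    deg = np.asarray(adj.sum(axis=1)).ravel()
    P = sp.diags(1.0 / np.maximum(deg, 1.0)) @ adj
    # Uniform mass on the seed (answer) nodes.
    p = np.zeros(n)
    p[list(seeds)] = 1.0 / len(seeds)
    restart = p.copy()
    for _ in range(k):
        # PageRank-style damped step: walk one hop, keep some restart mass.
        p = damping * (P.T @ p) + (1.0 - damping) * restart
    return p
```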
Benchmark Construction: Serendipity-aware Drug Repurposing Dataset
Because benchmarking serendipity requires high-quality reference partitions, the authors construct a dataset on top of the Clinical Knowledge Graph (CKG), densely annotated for drug-disease relationships. Questions are paired with Cypher queries, structurally decomposed to match realistic, complex multi-hop patterns common in biomedical informatics. For each query, the full candidate answer set is partitioned into A_e (existing) and A_s (serendipitous) using three strategies:
- LLM Ensemble: Multiple LLMs score candidate answers for serendipity; top answers form A_s.
- Expert Crowdsourcing: Biomedical experts refine and validate serendipity labels, ensuring clinical plausibility.
- RNS-Guided Greedy Optimization: The RNS criterion is maximized directly using a swap-based search over the candidate space, calibrated against expert-validated results (a greedy-swap sketch follows below).
This partition simulates an open-world setting, reflecting how real-world discoveries often originate outside the observed domain KG.
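A minimal sketch of the swap-based search, assuming a `score_fn` such as the `rns` function above and a fixed serendipity-set size `k`; the paper's calibration against expert-validated labels and its exact stopping rules are omitted:

```python
def greedy_rns_partition(candidates, explicit, score_fn, k, max_rounds=50):
    """Pick a size-k serendipity set A_s from `candidates` by local swaps
    that greedily increase an RNS-style score against the explicit set A_e."""
    pool = [c for c in candidates if c not in explicit]
    a_s, rest = pool[:k], pool[k:]   # arbitrary initial split of the pool
    best = score_fn(explicit, a_s)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(a_s)):
            for j in range(len(rest)):
                a_s[i], rest[j] = rest[j], a_s[i]      # trial swap
                trial = score_fn(explicit, a_s)
                if trial > best:
                    best, improved = trial, True        # keep improving swap
                else:
                    a_s[i], rest[j] = rest[j], a_s[i]   # revert the swap
        if not improved:
            break  # local optimum: no single swap raises the RNS score
    return a_s, best
```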
Evaluation Pipeline
SerenQA implements a three-stage pipeline to systematically profile LLM performance:
- Knowledge Retrieval: LLMs are tasked with translating NL queries into KG queries and retrieving A_e. Metrics include hit rate, F1, and executability across varying query complexities (1-hop, 2-hop, 3+ hops, intersections).
- Subgraph Reasoning: LLMs generate natural language summaries of retrieved subgraphs, tested for factual faithfulness, comprehensiveness, and explicit coverage of serendipitous paths.
- Serendipity Exploration: Starting from A_e, LLMs perform guided beam search (with both chain-of-thought and standard prompts) to discover A_s; a schematic of this search follows the list. Evaluations rely on relevance to ground truth, type matching, and exact serendipity hits.
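Schematically, the third stage can be pictured as below, where `neighbors` (KG expansion) and `llm_score` (the LLM's rating of a candidate path under a chain-of-thought or standard prompt) are illustrative stand-ins rather than the toolkit's actual API:

```python
def serendipity_beam_search(a_e, neighbors, llm_score, beam_width=5, depth=3):
    """Expand outward from the explicit answers A_e, keeping only the
    top-scored frontier of paths at each depth."""
    beam = [(node,) for node in a_e]   # one starting path per explicit answer
    for _ in range(depth):
        frontier = [
            path + (nxt,)
            for path in beam
            for nxt in neighbors(path[-1])
            if nxt not in path         # avoid trivial cycles
        ]
        if not frontier:
            break
        # The LLM ranks candidate paths by judged serendipity.
        frontier.sort(key=llm_score, reverse=True)
        beam = frontier[:beam_width]
    return {path[-1] for path in beam}  # endpoints proposed as A_s candidates
```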
Experimental Results
Knowledge Retrieval: State-of-the-art LLMs (e.g., DeepSeek, GPT-4) achieve high F1 (∼78%) on simple queries, but their accuracy degrades rapidly as query complexity increases, particularly for multi-hop and intersection types (<10% F1 beyond 2 hops).
Subgraph Reasoning: There is a fundamental trade-off: models with higher factual accuracy cover serendipity less extensively, while models with broader coverage (e.g., Mixtral-8×7B) may hallucinate more frequently, lowering faithfulness scores.
Serendipity Exploration: All models exhibit low serendipity recall (SerenHit <0.15), with larger models only modestly outperforming smaller ones. Notably, removal of intermediate summaries sometimes increases performance, suggesting summarization may induce additional hallucinations in exploration. No model dominates all metrics, underlining the challenge of reliably surfacing non-obvious, yet meaningful, connections.
Partition Robustness: Cross-partition evaluation (ensemble, expert, RNS-guided) yields high correlation, validating the benchmark construction and the RNS metric. The testbed is robust to labeling and methodological choices.
Implications and Future Directions
The findings indicate that, while LLMs retrieve and reason well over the explicit knowledge embodied in KGs, their capacity for serendipity-driven discovery is severely limited under current architectures and training regimes. This underscores the distinction between extrapolative modes (retrieval, reasoning) and truly generative ones (hypothesis forming, non-trivial link finding). The modular, open-source SerenQA toolkit provides a rigorous foundation for future work in AI-accelerated scientific discovery, suggesting that further progress will require:
- Explicit modeling of out-of-graph priors, possibly via integration of multiple KGs or external (literature-derived) evidence bases.
- Joint optimization of factuality and serendipity via multi-agent or Mixture-of-Experts (MoE) architectures.
- Improved prompt engineering and systematic chain-of-thought methods for abstraction and analogy.
- Scaling human-in-the-loop feedback to refine RNS weighting and supervision.
Conclusion
SerenQA codifies the evaluation of serendipitous knowledge discovery in KGQA and applies it to drug repurposing, combining theoretical rigor in metric design, robust benchmark annotation, and empirical LLM evaluation. The results expose substantial gaps in current LLM methodologies for scientific serendipity, motivating future research at the intersection of automated knowledge mining, generative reasoning, and interdisciplinary expert curation. The tools and resources made publicly available by this work are positioned to catalyze systemic improvements in the development of genuinely discovery-oriented language agents.