
Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning (2304.02711v2)

Published 5 Apr 2023 in cs.AI and cs.LG

Abstract: Creating knowledge bases and ontologies is a time consuming task that relies on a manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data, and are not able to populate arbitrary complex nested knowledge schemas. Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a Knowledge Extraction approach that relies on the ability of LLMs to perform zero-shot learning (ZSL) and general-purpose query answering from flexible prompts and return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against GPT-3+ to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for all matched elements. We present examples of use of SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical to disease causation graphs. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction (RE) methods, but has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any training data. This method supports a general strategy of leveraging the language interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly-available databases and ontologies external to the LLM. SPIRES is available as part of the open source OntoGPT package: https://github.com/monarch-initiative/ontogpt.


Summary

  • The paper introduces SPIRES, a framework that leverages structured prompt interrogation and recursive semantic extraction to automate knowledge base population.
  • The methodology employs zero-shot learning with large language models, reducing reliance on extensive training data while ensuring precise entity grounding.
  • Evaluated across domains like biomedical data, SPIRES achieves competitive performance compared to trained models by effectively managing complex, nested schemas.

Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): A Methodological Overview

The paper "Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES)" introduces a knowledge extraction method built on large language models (LLMs). The method automates the population of knowledge bases (KBs) using zero-shot learning, removing the need for the extensive training data that such tasks traditionally require.

Core Contributions

SPIRES takes an input text and a user-defined schema, and relies on the LLM's capability for general-purpose query answering. It recursively interrogates the LLM with structured prompts to extract data conforming to the schema, and uses existing ontologies and vocabularies to assign identifiers to the matched elements. Importantly, the approach supports complex, nested knowledge schemas, which existing methods typically cannot populate without extensive training data.
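To make the idea of a nested schema concrete, the following is a minimal sketch in Python. It is not the OntoGPT implementation: the dictionary layout, the "list of" convention, the class and field names, and the prompt wording are all assumptions made for illustration.

```python
# Hypothetical nested schema: class name -> {field name: type description}.
# A field whose type names another class makes the schema nested.
SCHEMA = {
    "Recipe": {
        "label": "string",
        "ingredients": "list of Ingredient",
        "steps": "list of string",
    },
    "Ingredient": {"food": "FoodItem", "amount": "string"},
    "FoodItem": {"label": "string"},
}

LIST_PREFIX = "list of "


def target_class(ftype):
    """Strip the (assumed) 'list of ' prefix to get the underlying type name."""
    return ftype[len(LIST_PREFIX):] if ftype.startswith(LIST_PREFIX) else ftype


def nested_classes(schema, cls, seen=None):
    """Return cls plus every class reachable through its fields, in first-seen order."""
    seen = [] if seen is None else seen
    if cls in seen:
        return seen
    seen.append(cls)
    for ftype in schema[cls].values():
        target = target_class(ftype)
        if target in schema:
            nested_classes(schema, target, seen)
    return seen


def prompt_template(schema, cls, text):
    """Build a prompt that lists the fields of one class as 'name: <type>' lines."""
    fields = "\n".join(f"{name}: <{ftype}>" for name, ftype in schema[cls].items())
    return (
        "From the text below, extract the following fields, one per line:\n\n"
        f"{fields}\n\nText:\n{text}\n"
    )


prompt = prompt_template(SCHEMA, "Recipe", "Boil 200 g of pasta, then add pesto.")
print(prompt)
print(nested_classes(SCHEMA, "Recipe"))
```

Only the top-level class appears in the first prompt; the classes returned by nested_classes are the ones that would need follow-up prompts of their own.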

Methodological Framework

The SPIRES framework consists of the following key steps:

  1. Prompt Generation: Given a schema and input text, a structured prompt is created to instruct the LLM on the expected output format.
  2. Prompt Completion: The prompt is processed by the LLM to generate a response, structured as per the provided template.
  3. Parsing and Recursive Extraction: The response is parsed to identify entities and relationships, employing recursive schema interrogation for nested structures.
  4. Entity Grounding: Extracted entities are grounded using external ontologies, providing reliability by mapping to persistent identifiers from existing vocabularies.
  5. Optional OWL Translation: The extracted data can be translated to Web Ontology Language (OWL) for further reasoning and ontology management tasks.
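Steps 1 to 4 above can be sketched as a short Python program. This is an illustration under stated assumptions, not the OntoGPT code: the LLM is replaced by a stub function, the lexicon stands in for a real ontology annotator, the EX: identifiers are placeholders, and the input sentence and all function names are hypothetical.

```python
# Hypothetical schema: a Treatment links one Disease to a list of Drugs.
SCHEMA = {
    "Treatment": {"disease": "Disease", "drugs": "list of Drug"},
    "Disease": {"label": "string"},
    "Drug": {"label": "string"},
}

# Toy lexicon standing in for an ontology annotator; the IDs are placeholders.
LEXICON = {"Disease": {"asthma": "EX:0001"}, "Drug": {"albuterol": "EX:0002"}}


def build_prompt(cls, text):
    """Step 1: list the fields of cls as 'name: <type>' lines, followed by the text."""
    lines = [f"{name}: <{ftype}>" for name, ftype in SCHEMA[cls].items()]
    return "Extract these fields from the text.\n" + "\n".join(lines) + f"\n\nText:\n{text}\n"


def parse_response(response):
    """Step 3a: parse 'field: value' lines into a dict of raw strings."""
    out = {}
    for line in response.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            out[key.strip()] = value.strip()
    return out


def ground(cls, mention):
    """Step 4: map a mention to an identifier, or None if the lexicon has no match."""
    return LEXICON.get(cls, {}).get(mention.lower())


def extract(cls, text, llm):
    """Steps 2 and 3b: query the LLM, then recurse into fields typed by another class."""
    fields = SCHEMA[cls]
    if set(fields) == {"label"}:
        # Leaf class: the text is the mention itself, so ground it directly.
        return {"label": text, "id": ground(cls, text)}
    raw = parse_response(llm(build_prompt(cls, text)))
    result = {}
    for name, ftype in fields.items():
        value = raw.get(name, "")
        is_list = ftype.startswith("list of ")
        target = ftype[len("list of "):] if is_list else ftype
        items = [v.strip() for v in value.split(";") if v.strip()] if is_list else [value]
        if target in SCHEMA:
            items = [extract(target, item, llm) for item in items]
        result[name] = items if is_list else items[0]
    return result


def fake_llm(prompt):
    """Stub standing in for a real LLM call; always returns the same completion."""
    return "disease: asthma\ndrugs: albuterol; fluticasone"


result = extract("Treatment", "Asthma can be treated with albuterol or fluticasone.", fake_llm)
print(result)
```

In this sketch an ungrounded mention (here "fluticasone", which is absent from the toy lexicon) keeps its label and receives an id of None, so a curator can see which entities still need review.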

Evaluation and Results

SPIRES has been evaluated across several domains, including food recipes, cellular signaling pathways, disease treatments, drug mechanisms, and chemical-disease relationships. On the BioCreative Chemical-Disease-Relation task, SPIRES achieved an F-score comparable to the mid-range of existing trained relation extraction methods. That it does so without any task-specific training data underlines its flexibility and broad applicability.
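For readers unfamiliar with the metric, the F-score for relation extraction is usually computed by comparing the set of predicted relations with a gold-standard set. The sketch below shows that calculation in Python; the (chemical, disease) pairs are invented placeholders, not data from the paper, and the function name is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 over two collections of relation tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder (chemical ID, disease ID) pairs, for illustration only.
gold = {("EX:C1", "EX:D1"), ("EX:C2", "EX:D2"), ("EX:C3", "EX:D3")}
predicted = {("EX:C1", "EX:D1"), ("EX:C2", "EX:D9")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because relations are compared as identifier pairs, a relation counts as correct only when both entities were grounded to the same identifiers as in the gold standard, which is why grounding quality directly affects the score.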

Grounding was also evaluated against multiple ontologies, using the GPT-3.5-turbo and GPT-4 models. In these tests, SPIRES grounded entities more accurately than prompting the LLM directly for identifiers, for example when mapping terms to the Gene Ontology.

Implications and Future Directions

SPIRES mitigates some common limitations of LLMs, such as hallucinations and contextual misinterpretations, by constraining extraction to structured schemas and grounding results in established ontologies. Its zero-shot capability is a significant practical advantage, allowing rapid deployment in new domains without bespoke training datasets.

The framework offers a systematic approach to knowledge base population in which LLMs assist expert curators rather than replace them. Future work could explore fine-tuning for specific domains and integration with more openly accessible and transparent LLMs, which would improve acceptance and reliability in critical fields such as biomedical data processing.

SPIRES is available as part of the open-source OntoGPT package, giving the research community a tool for turning unstructured text into structured data. Its adaptability and schema-driven design suit current needs for scalable knowledge management and AI-assisted data curation.
