CEAR: Automatic construction of a knowledge graph of chemical entities and roles from scientific literature (2407.21708v1)
Abstract: Ontologies are formal representations of knowledge in specific domains that provide a structured framework for organizing and understanding complex information. Creating ontologies, however, is a complex and time-consuming endeavor. ChEBI is a well-known ontology in the field of chemistry, which provides a comprehensive resource for defining chemical entities and their properties. However, it covers only a small fraction of the rapidly growing knowledge in chemistry and does not provide references to the scientific literature. To address this, we propose a methodology that augments existing annotated text corpora with knowledge from ChEBI and fine-tunes a large language model (LLM) to recognize chemical entities and their roles in scientific text. Our experiments demonstrate the effectiveness of our approach. By combining ontological knowledge with the language understanding capabilities of LLMs, we achieve high precision and recall in identifying both chemical entities and roles in scientific literature. Furthermore, we extract chemical entities and roles from a set of 8,000 ChemRxiv articles and apply a second LLM to create a knowledge graph (KG) of chemical entities and roles (CEAR), which provides information complementary to ChEBI and can help to extend it.
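To make the final step more concrete, the following is a minimal sketch of how extracted (chemical entity, role) mentions could be aggregated into a small knowledge graph that keeps literature references. It is an illustration under stated assumptions, not the authors' pipeline: the abstract says a second LLM builds the KG, whereas this sketch uses plain aggregation with simple lowercasing as normalization. The function names, the document identifiers, the example mentions, and the `has_role` edge label (chosen to mirror ChEBI's "has role" relation) are all hypothetical.

```python
from collections import defaultdict


def build_role_graph(mentions):
    """Aggregate (doc_id, chemical, role) mentions into a nested mapping.

    Result shape: chemical -> role -> set of supporting document ids.
    Normalization here is only strip + lowercase (an assumption made for
    this sketch; the paper does not describe its normalization this way).
    """
    graph = defaultdict(lambda: defaultdict(set))
    for doc_id, chemical, role in mentions:
        graph[chemical.strip().lower()][role.strip().lower()].add(doc_id)
    return graph


def to_triples(graph):
    """Flatten the graph into sorted (chemical, relation, role, support) tuples.

    `support` is the number of distinct documents backing the edge, which
    stands in for the literature references that ChEBI itself lacks.
    """
    return sorted(
        (chemical, "has_role", role, len(doc_ids))
        for chemical, roles in graph.items()
        for role, doc_ids in roles.items()
    )


# Hypothetical mentions, as a fine-tuned recognizer might emit them.
mentions = [
    ("doc1", "Aspirin", "anti-inflammatory agent"),
    ("doc2", "aspirin", "anti-inflammatory agent"),
    ("doc2", "Ethanol", "solvent"),
]

graph = build_role_graph(mentions)
triples = to_triples(graph)
for triple in triples:
    print(triple)
```

Storing the set of supporting documents on each edge, rather than a bare count, keeps every chemical-role assertion traceable to the articles it came from.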