An Information Retrieval and Extraction Tool for Covid-19 Related Papers (2401.16430v1)
Abstract: Background: The COVID-19 pandemic has caused severe impacts on health systems worldwide. Its critical nature and the increased interest of individuals and organizations in developing countermeasures have led to a surge of new studies in scientific journals. Objective: We sought to develop a tool that incorporates, in a novel way, aspects of Information Retrieval (IR) and Information Extraction (IE) applied to the COVID-19 Open Research Dataset (CORD-19). The main focus of this paper is to provide researchers with a better search tool for COVID-19 related papers, helping them find reference papers and highlight relevant entities in text. Method: We applied Latent Dirichlet Allocation (LDA) to model, based on research aspects, the topics of all English abstracts in CORD-19. Relevant named entities of each abstract were extracted and linked to the corresponding UMLS concept. Regular expressions and the K-Nearest Neighbors algorithm were used to rank relevant papers. Results: Our tool has shown the potential to assist researchers by automating a topic-based search of CORD-19 papers. Nonetheless, we identified that more fine-tuned topic modeling parameters and increased accuracy of the research aspect classifier model could lead to a more accurate and reliable tool. Conclusion: We emphasize the need for new automated tools to help researchers find relevant COVID-19 documents, in addition to automatically extracting useful information contained in them. Our work suggests that combining different algorithms and models could lead to new ways of browsing COVID-19 paper data.
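The ranking step named in the Method can be illustrated with a minimal sketch. The abstract does not say how regular expressions and K-Nearest Neighbors are combined, so the following is an assumption, not the authors' implementation: a regular expression first filters abstracts by keyword, and the survivors are ranked by Euclidean distance between their topic distributions and a query distribution. The paper IDs, abstracts, topic vectors, and the `rank_papers` helper are all hypothetical; in a real pipeline the topic vectors would come from a trained LDA model.

```python
import math
import re

# Hypothetical toy data: each paper has an abstract and a topic distribution
# that an LDA model is ASSUMED to have produced (values invented for illustration).
PAPERS = {
    "p1": {"abstract": "Vaccine candidates for SARS-CoV-2 in early trials.",
           "topics": [0.80, 0.15, 0.05]},
    "p2": {"abstract": "Hospital capacity planning during the pandemic.",
           "topics": [0.10, 0.10, 0.80]},
    "p3": {"abstract": "Immune response to an mRNA vaccine in mice.",
           "topics": [0.60, 0.30, 0.10]},
    "p4": {"abstract": "Transmission dynamics of vaccine-escape variants.",
           "topics": [0.30, 0.60, 0.10]},
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_papers(query_topics, pattern, papers, k=2):
    """Keep papers whose abstract matches `pattern` (case-insensitive),
    then return the IDs of the k nearest ones by topic-vector distance."""
    regex = re.compile(pattern, re.IGNORECASE)
    candidates = [(pid, p) for pid, p in papers.items()
                  if regex.search(p["abstract"])]
    ranked = sorted(candidates,
                    key=lambda item: euclidean(query_topics, item[1]["topics"]))
    return [pid for pid, _ in ranked[:k]]


# Query: a topic profile dominated by the first topic, restricted to
# abstracts that mention the word "vaccine".
top = rank_papers([0.75, 0.20, 0.05], r"\bvaccine\b", PAPERS, k=2)
print(top)
```

Filtering before ranking keeps the neighbor search restricted to papers that mention the term of interest. Other combinations, such as using regex matches as a score feature, would be equally consistent with the abstract.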