
An Information Retrieval and Extraction Tool for Covid-19 Related Papers (2401.16430v1)

Published 20 Jan 2024 in cs.IR and cs.CL

Abstract: Background: The COVID-19 pandemic has caused severe impacts on health systems worldwide. Its critical nature and the increased interest of individuals and organizations in developing countermeasures have led to a surge of new studies in scientific journals. Objective: We sought to develop a tool that incorporates, in a novel way, aspects of Information Retrieval (IR) and Information Extraction (IE) applied to the COVID-19 Open Research Dataset (CORD-19). The main focus of this paper is to provide researchers with a better search tool for COVID-19 related papers, helping them find reference papers and highlight relevant entities in text. Method: We applied Latent Dirichlet Allocation (LDA) to model, based on research aspects, the topics of all English abstracts in CORD-19. Relevant named entities of each abstract were extracted and linked to the corresponding UMLS concept. Regular expressions and the K-Nearest Neighbors algorithm were used to rank relevant papers. Results: Our tool has shown the potential to assist researchers by automating a topic-based search of CORD-19 papers. Nonetheless, we identified that more fine-tuned topic modeling parameters and increased accuracy of the research aspect classifier model could lead to a more accurate and reliable tool. Conclusion: We emphasize the need for new automated tools to help researchers find relevant COVID-19 documents, in addition to automatically extracting useful information contained in them. Our work suggests that combining different algorithms and models could lead to new ways of browsing COVID-19 paper data.
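
The ranking step named in the Method can be illustrated with a minimal sketch. The abstract does not say how the regular expressions and K-Nearest Neighbors are combined, so the pipeline below is an assumption: a regular expression first filters abstracts by keyword, and the surviving papers are then ranked by Euclidean distance between their topic distributions and a query distribution. The `PAPERS` data, the three-topic vectors (standing in for LDA output), and the `rank_papers` helper are all hypothetical and are not taken from the paper.

```python
import math
import re

# Hypothetical abstracts with made-up topic distributions.
# In a real pipeline these vectors would come from a fitted LDA model.
PAPERS = {
    "paper_a": {
        "text": "Remdesivir treatment outcomes in hospitalized patients.",
        "topics": [0.8, 0.1, 0.1],
    },
    "paper_b": {
        "text": "Transmission dynamics of SARS-CoV-2 in households.",
        "topics": [0.1, 0.8, 0.1],
    },
    "paper_c": {
        "text": "Antiviral treatment with remdesivir: a cohort study.",
        "topics": [0.7, 0.2, 0.1],
    },
    "paper_d": {
        "text": "Vaccine candidate immunogenicity in mice.",
        "topics": [0.1, 0.1, 0.8],
    },
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_papers(query_topics, pattern, k=2):
    """Return the ids of the k papers nearest to query_topics.

    Assumed combination: only papers whose text matches the
    case-insensitive regex `pattern` are considered, and those are
    ordered by distance in topic space (nearest first).
    """
    regex = re.compile(pattern, re.IGNORECASE)
    candidates = [
        (paper_id, paper)
        for paper_id, paper in PAPERS.items()
        if regex.search(paper["text"])
    ]
    candidates.sort(key=lambda item: euclidean(query_topics, item[1]["topics"]))
    return [paper_id for paper_id, _ in candidates[:k]]


if __name__ == "__main__":
    # Query dominated by the first (hypothetical) topic, keyword "remdesivir".
    print(rank_papers([0.8, 0.1, 0.1], r"remdesivir"))
```

Filtering before ranking keeps the nearest-neighbor search restricted to papers that mention the keyword at all; the reverse order (rank first, then filter) would be an equally plausible reading of the abstract.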

