CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization (2006.09595v1)

Published 17 Jun 2020 in cs.IR, cs.AI, and cs.CL

Abstract: The COVID-19 global pandemic has resulted in international efforts to understand, track, and mitigate the disease, yielding a significant corpus of COVID-19 and SARS-CoV-2-related publications across scientific disciplines. As of May 2020, 128,000 coronavirus-related publications have been collected through the COVID-19 Open Research Dataset Challenge. Here we present CO-Search, a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature, potentially aiding overburdened health workers in finding scientific answers during a time of crisis. The retriever is built from a Siamese-BERT encoder that is linearly composed with a TF-IDF vectorizer, and reciprocal-rank fused with a BM25 vectorizer. The ranker is composed of a multi-hop question-answering module, that together with a multi-paragraph abstractive summarizer adjust retriever scores. To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations, creating 1.3 million (citation title, paragraph) tuples for training the encoder. We evaluate our system on the data of the TREC-COVID information retrieval challenge. CO-Search obtains top performance on the datasets of the first and second rounds, across several key metrics: normalized discounted cumulative gain, precision, mean average precision, and binary preference.

Authors (7)

Andre Esteva
Anuprit Kale
Romain Paulus
Kazuma Hashimoto
Wenpeng Yin
Dragomir Radev
Richard Socher

Citations (62)

Summary

An Overview of CO-Search: Integrating Advanced IR Techniques for COVID-19 Literature

The paper "CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization" describes an information retrieval (IR) system for searching the COVID-19 scientific literature, a corpus that had reached 128,000 coronavirus-related publications by May 2020. The work addresses the need for retrieval tools that help clinicians, researchers, and policymakers find answers in a rapidly growing body of COVID-19 research.

CO-Search follows a retriever-ranker design that combines semantic search with question answering (QA) and abstractive summarization. The retriever pairs a Siamese-BERT (SBERT) encoder with the traditional keyword-based methods TF-IDF and BM25. This hybrid strategy uses both semantic representations and term statistics to improve retrieval accuracy on COVID-19 documents.

The SBERT model is central to the architecture: it embeds queries and documents into a shared latent space so that semantic similarity can be measured even when the wording differs. Because the dataset is domain-specific and relatively small, the authors train the encoder on a bipartite graph of document paragraphs and citations, which yields 1.3 million (citation title, paragraph) tuples. The SBERT output is linearly combined with TF-IDF, and the result is then merged with BM25 through reciprocal rank fusion, so that the final retrieval reflects both semantic and keyword evidence.
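The two combination steps in the retriever can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-document score dictionaries, the mixing weight `mu`, and the constant `k = 60` (a value commonly used for reciprocal rank fusion) are assumptions, and the function names are hypothetical.

```python
from typing import Dict, List


def linear_combine(semantic: Dict[str, float],
                   keyword: Dict[str, float],
                   mu: float = 0.5) -> Dict[str, float]:
    """Weighted sum of two per-document scores (mu is an assumed weight)."""
    docs = set(semantic) | set(keyword)
    return {d: mu * semantic.get(d, 0.0) + (1.0 - mu) * keyword.get(d, 0.0)
            for d in docs}


def rank(scores: Dict[str, float]) -> List[str]:
    """Order document ids by descending score, breaking ties by id."""
    return sorted(scores, key=lambda d: (-scores[d], d))


def reciprocal_rank_fusion(rankings: List[List[str]],
                           k: int = 60) -> Dict[str, float]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for position, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + position)
    return fused


# Hypothetical scores for three documents.
sbert_scores = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.1}
tfidf_scores = {"doc1": 0.2, "doc2": 0.7, "doc3": 0.3}
bm25_ranking = ["doc2", "doc1", "doc3"]

combined_ranking = rank(linear_combine(sbert_scores, tfidf_scores, mu=0.5))
final_ranking = rank(reciprocal_rank_fusion([combined_ranking, bm25_ranking]))
```

Reciprocal rank fusion works on rank positions rather than raw scores, which avoids having to normalize BM25 scores against cosine-style similarities before merging.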

The ranker module adjusts the retriever's scores using a QA engine together with a multi-paragraph abstractive summarizer. The QA system uses multi-hop reasoning, following relations across paragraphs to judge how well each retrieved document answers the user's query. The summarizer is an encoder-decoder model that combines a BERT encoder with a modified GPT-2 decoder; it generates concise summaries of the retrieved articles so that users can quickly grasp the core information.

Evaluation on the TREC-COVID challenge datasets demonstrates CO-Search's effectiveness. On the data of the first and second rounds, the system obtains top performance across several automatic metrics: normalized discounted cumulative gain (nDCG), precision at fixed rank cutoffs (P@5, P@10), mean average precision (MAP), and binary preference (Bpref). These results affirm its utility for automatic information retrieval over a dense and rapidly evolving research corpus.
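For readers unfamiliar with two of these metrics, the sketch below computes P@k and nDCG@k for a single query. It assumes graded relevance judgments listed in the order the system ranked the documents, and it builds the ideal ordering from that same list; official evaluation tools compute the ideal ordering over all judged documents for the topic, so treat this as a simplified illustration rather than the challenge's scoring procedure. The function names are hypothetical.

```python
import math
from typing import Sequence


def precision_at_k(relevances: Sequence[float], k: int) -> float:
    """Fraction of the top-k results with non-zero relevance."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: gain at rank i is divided by log2(i + 1)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering of this list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical judgments (2 = relevant, 1 = partially relevant, 0 = not relevant)
ranked_relevances = [2, 0, 1, 0, 2]
p5 = precision_at_k(ranked_relevances, 5)
ndcg5 = ndcg_at_k(ranked_relevances, 5)
```

P@k ignores the order within the top k, while nDCG rewards placing the most relevant documents earliest, which is why the two are usually reported together.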

Practically, CO-Search is positioned to support the global research community amidst a pandemic by ensuring access to relevant, up-to-date information, potentially guiding both academic inquiry and public health decision-making processes. Theoretically, its architecture underscores the potential of blending contemporary neural approaches with established IR methodologies, charting a path for future IR systems handling specialized and voluminous data collections.

Further evolution of this work may explore domain adaptation strategies that could refine SBERT embeddings with even richer COVID-19-specific semantics. Additionally, exploration into real-time updates and dynamic retraining mechanisms could enhance responsiveness to newly emerging literature. The authors’ commitment to open source the system lays a foundation for collaborative enhancements and adaptations by the broader research community, promising continued improvements and broader applications beyond the current pandemic scenario.
