
PubMedQA: A Dataset for Biomedical Research Question Answering

Published 13 Sep 2019 in cs.CL, cs.LG, and q-bio.QM | (1909.06146v1)

Abstract: We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances. Each PubMedQA instance is composed of (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion. PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions. Our best performing model, multi-phase fine-tuning of BioBERT with long answer bag-of-word statistics as additional supervision, achieves 68.1% accuracy, compared to single human performance of 78.0% accuracy and majority-baseline of 55.2% accuracy, leaving much room for improvement. PubMedQA is publicly available at https://pubmedqa.github.io.

Citations (644)

Summary

  • The paper presents a comprehensive PubMedQA dataset that fills the gap in biomedical question answering by leveraging yes/no/maybe queries from structured abstracts.
  • It employs a multi-phase fine-tuning approach with BioBERT using labeled, unlabeled, and artificially generated data to enhance model reasoning over quantitative data.
  • The study demonstrates that BioBERT with auxiliary supervision achieves 68.1% accuracy, underscoring both its advancement and the ongoing challenges compared to human performance.


The study presents PubMedQA, a dataset specifically designed for biomedical question answering (QA) tasks, which fills a gap in the literature by focusing on yes/no/maybe questions derived from structured abstracts in PubMed. PubMedQA aims to evaluate models' abilities to reason over quantitative biomedical research data.
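To make the instance format concrete, the sketch below hand-constructs a single PubMedQA-style example with the four components described in the abstract. The Python field names and the abridged context text are illustrative assumptions, not the dataset's exact schema.

```python
# Illustrative PubMedQA-style instance. Field names and the abridged context
# are assumptions for readability, not the dataset's verbatim schema.
example_instance = {
    "question": ("Do preoperative statins reduce atrial fibrillation "
                 "after coronary artery bypass grafting?"),
    "context": ("The abstract sections preceding the conclusion "
                "(objective, methods, results) would appear here."),
    "long_answer": ("The abstract's conclusion, which presumably answers "
                    "the research question, would appear here."),
    "final_decision": "yes",  # one of "yes", "no", or "maybe"
}
```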

Dataset Composition and Characteristics

PubMedQA comprises three subsets: PQA-L (labeled), PQA-U (unlabeled), and PQA-A (artificially generated). The dataset's structure ensures a comprehensive representation of biomedical research, enabling sophisticated reasoning over abstract content:

  • PQA-L contains 1k expert-annotated instances that serve as the benchmark's evaluation data.
  • PQA-U offers 61.2k unlabeled instances extracted through a heuristic filtering process from articles with yes/no/maybe questions.
  • PQA-A includes 211.3k instances generated by converting declarative article titles into questions using heuristic rules (a rough sketch of this conversion follows the figure caption below).

    Figure 1: Architecture of PubMedQA dataset. PubMedQA is split into three subsets, PQA-A(rtificial), PQA-U(nlabeled) and PQA-L(abeled).
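Since the paper's exact title-to-question heuristics are not reproduced in this summary, the snippet below is only a rough, assumed illustration of how a declarative article title could be turned into a yes/no question for PQA-A-style instances.

```python
# Rough, assumed illustration of converting a statement-style article title
# into a yes/no question, in the spirit of PQA-A generation. The paper's
# actual heuristic rules (e.g., handling singular subjects or verb forms)
# are not reproduced here.
def title_to_question(title: str) -> str:
    words = title.rstrip(".").split()
    # Naive assumption: the title opens with a plural subject noun phrase,
    # so the question can simply be introduced with "Do".
    return "Do " + words[0].lower() + " " + " ".join(words[1:]) + "?"

print(title_to_question(
    "Preoperative statins reduce atrial fibrillation "
    "after coronary artery bypass grafting."))
# -> Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?
```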

The dataset predominantly covers clinical-study topics such as risk assessment and outcome prediction, as reflected in its MeSH term distribution (Figure 2).

Figure 2: MeSH topic distribution of PubMedQA.

Evaluation Framework and Challenges

PubMedQA challenges models to perform quantitative reasoning, as reflected in the distribution of question types and the reasoning strategies they require (Figure 3). Typical question formats include:

  • Evaluation of therapeutic effectiveness.
  • Causal relationship assessment.
  • Verification of biomedical statements.

    Figure 3: Proportional relationships between corresponding question types, reasoning types, and whether textual interpretations of the numbers appear in the contexts.

Methodology

The paper presents a multi-phase fine-tuning approach using BioBERT, enhanced by additional supervision via long-answer bag-of-word statistics. This method aims to leverage the unique structure of PubMedQA:

  1. Phase I: Pre-training on PQA-A using the question-and-context configuration.
  2. Phase II: Fine-tuning on bootstrapped instances from PQA-U.
  3. Final Phase: Fine-tuning on the expert-labeled PQA-L instances.

This procedure, illustrated in Figure 4, reflects the incremental adaptation of BioBERT across the three data subsets.

Figure 4: Multi-phase fine-tuning architecture.
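As a minimal sketch of how these phases could be chained in practice, the snippet below sequentially fine-tunes the same BioBERT classifier on each subset with Hugging Face Transformers. The checkpoint name, hyperparameters, and the pre-tokenized dataset variables (pqa_a_dataset, pqa_u_bootstrapped, pqa_l_train) are assumptions for illustration, and the paper's bag-of-words auxiliary objective is omitted here.

```python
# Minimal sketch of multi-phase (sequential) fine-tuning. Assumes the three
# subsets are already tokenized into Hugging Face Dataset objects with a
# 3-way "labels" column (yes/no/maybe); those variables are placeholders.
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=3)  # yes / no / maybe

def run_phase(model, train_dataset, output_dir, epochs):
    """Continue fine-tuning the same model on one phase's data."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=16,
        learning_rate=2e-5,  # illustrative hyperparameters
    )
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    return model

# Phase I: artificially generated instances (PQA-A).
model = run_phase(model, pqa_a_dataset, "phase1_pqa_a", epochs=1)
# Phase II: bootstrapped (self-labeled) instances from PQA-U.
model = run_phase(model, pqa_u_bootstrapped, "phase2_pqa_u", epochs=1)
# Final phase: expert-labeled PQA-L training instances.
model = run_phase(model, pqa_l_train, "phase3_pqa_l", epochs=3)
```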

Performance Analysis

The adoption of BioBERT with additional supervision demonstrates a marked improvement compared to traditional models like ESIM and baseline methods. Despite this enhancement, performance still trails human benchmark levels on reasoning tasks, highlighting ongoing challenges in the domain:

  • The best model, multi-phase fine-tuned BioBERT with additional supervision, achieves 68.1% accuracy, well above the 55.2% majority baseline but below single-human performance of 78.0%.
  • Additional supervision via long-answer bag-of-word statistics improves performance, indicating the value of the auxiliary task; a minimal sketch of such an objective follows this list.
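The sketch below illustrates one way such an auxiliary objective could be wired up, assuming a BERT-style encoder and a multi-label bag-of-words target built from the long answer's vocabulary. It is an assumed simplification for illustration, not the paper's exact formulation.

```python
# Assumed simplification of auxiliary long-answer bag-of-words (BoW)
# supervision next to the yes/no/maybe classifier; not the paper's exact loss.
import torch.nn as nn

class QAWithBowAuxiliary(nn.Module):
    def __init__(self, encoder, hidden_size: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder                         # e.g., a BioBERT encoder
        self.cls_head = nn.Linear(hidden_size, 3)      # yes / no / maybe
        self.bow_head = nn.Linear(hidden_size, vocab_size)  # long-answer BoW

    def forward(self, input_ids, attention_mask, labels, bow_targets,
                aux_weight: float = 0.5):
        # Use the [CLS] representation for both heads.
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0]
        cls_loss = nn.functional.cross_entropy(self.cls_head(h), labels)
        # Multi-label target: which vocabulary items occur in the held-out
        # long answer (bow_targets is a 0/1 vector per example).
        bow_loss = nn.functional.binary_cross_entropy_with_logits(
            self.bow_head(h), bow_targets.float())
        return cls_loss + aux_weight * bow_loss
```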

Conclusion and Future Directions

The introduction of PubMedQA represents a significant stride in biomedical NLP, offering a challenging benchmark for evaluating models' capacity to reason over research abstracts. The paper outlines potential pathways for further exploration:

  1. Enhanced handling of contexts containing numerical data without explicit interpretations.
  2. Advanced auxiliary tasks potentially involving full answer generation, which could refine model capabilities in scientific reasoning.

Overall, PubMedQA stands as a comprehensive resource poised to stimulate advancements in AI-mediated biomedical research interpretation, particularly in applications of evidence-based medicine. The dataset's focus on reasoning over abstracts aligns with real-world clinical decision-making processes, reinforcing its relevance and applicability in biomedical AI development.
