Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering (2106.04016v1)

Published 8 Jun 2021 in cs.CL

Abstract: Disfluencies are an under-studied topic in NLP, even though they are ubiquitous in human conversation. This is largely due to the lack of datasets containing disfluencies. In this paper, we present a new challenge question answering dataset, Disfl-QA, a derivative of SQuAD, in which humans introduce contextual disfluencies into previously fluent questions. Disfl-QA contains a variety of challenging disfluencies that require a more comprehensive understanding of the text than was necessary in prior datasets. Experiments show that the performance of existing state-of-the-art question answering models degrades significantly when tested on Disfl-QA in a zero-shot setting. We show that data augmentation methods partially recover the loss in performance, and we also demonstrate the efficacy of using gold data for fine-tuning. We argue that large-scale disfluency datasets are needed for NLP models to become robust to disfluencies. The dataset is publicly available at: https://github.com/google-research-datasets/disfl-qa.
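For readers who want to reproduce the zero-shot setup sketched in the abstract, the snippet below is a minimal sketch of loading one released split and pairing fluent questions with their disfluent rewrites. It assumes the repository ships JSON files (e.g. dev.json) keyed by SQuAD question ID, with fields named "original" and "disfluent"; the file name and field names are assumptions for illustration, not confirmed by the abstract.

```python
import json

# Minimal sketch: load one Disfl-QA split and pair each fluent question with
# its disfluent rewrite. The file name ("dev.json") and the field names
# ("original", "disfluent") are assumptions about the released format.
with open("dev.json", encoding="utf-8") as f:
    disfl_qa = json.load(f)  # assumed shape: {squad_question_id: {...}, ...}

pairs = [
    (squad_id, entry["original"], entry["disfluent"])
    for squad_id, entry in disfl_qa.items()
]

print(f"Loaded {len(pairs)} question pairs")
squad_id, fluent, disfluent = pairs[0]
print("SQuAD id: ", squad_id)
print("Fluent:   ", fluent)
print("Disfluent:", disfluent)
```

If the splits are keyed by SQuAD question ID as assumed here, each disfluent question can be matched back to its SQuAD passage and gold answer, which is what the zero-shot evaluation of existing QA models requires.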

Authors (5)
  1. Aditya Gupta (25 papers)
  2. Jiacheng Xu (41 papers)
  3. Shyam Upadhyay (22 papers)
  4. Diyi Yang (151 papers)
  5. Manaal Faruqui (39 papers)
Citations (31)
