
QuAC: Question Answering in Context (1808.07036v3)

Published 21 Aug 2018 in cs.CL, cs.AI, and cs.LG

Abstract: We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recently state-of-the-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.

QuAC: Question Answering in Context

The paper "QuAC: Question Answering in Context" by Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer presents a novel dataset designed for studying Question Answering (QA) in a conversational context. The dataset, QuAC, comprises 14,000 information-seeking dialogues with a total of 100,000 questions, emphasizing an interaction between a student, who asks questions to gather information, and a teacher, who answers based on hidden Wikipedia passages.

Challenges and Contributions

QuAC differs significantly from existing machine comprehension datasets due to several intrinsic challenges:

  1. Open-ended Questions: Unlike datasets where questions are often fact-based and have straightforward answers, questions in QuAC tend to be open-ended and exploratory.
  2. Contextuality: The questions are deeply contextual, often relying on the dialog history for meaning.
  3. Unanswerability: A notable fraction of questions cannot be answered from the provided text at all, so models must also learn when to abstain.
  4. Diverse Answer Types: The answers involve excerpts from the text rather than simple entity recognition or fact retrieval, increasing the need for comprehensive passage understanding.

Dataset Collection and Specifications

The dataset was crowdsourced via interactions between two workers playing the roles of student and teacher. The student, who sees only a section title and a brief introductory paragraph, asks questions about the hidden section. The teacher responds with an exact span from the text, optionally accompanied by a yes/no verdict, or indicates that the question cannot be answered. Each answer also carries a dialog act signaling whether a follow-up question is encouraged, which makes the dataset well suited for developing models that maintain coherent long-term dialog context.
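For concreteness, here is a minimal Python sketch of iterating over the released dialogs. It assumes the v0.2 JSON layout distributed at http://quac.ai, including field names such as orig_answer, yesno, and followup; verify these against the file you actually download.

```python
import json

# Walk the QuAC training file (assumed v0.2 layout from http://quac.ai).
with open("train_v0.2.json") as f:
    data = json.load(f)["data"]

total = no_answer = 0
for article in data:
    section = article["paragraphs"][0]        # one hidden section per dialog
    passage = section["context"]              # text only the teacher sees
    for turn in section["qas"]:               # one entry per student question
        question = turn["question"]
        answer = turn["orig_answer"]["text"]  # teacher's span, or "CANNOTANSWER"
        yesno = turn["yesno"]                 # 'y' / 'n' / 'x' (not a yes/no question)
        followup = turn["followup"]           # dialog act: is a follow-up encouraged?
        total += 1
        no_answer += answer == "CANNOTANSWER"

print(f"{no_answer / total:.1%} of {total} questions are unanswerable")
```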

Baseline Models and Evaluation

The paper evaluates several baseline models:

  1. Sanity Checks: Included are simple baselines such as random sentence selection and majority answer baselines, which predict common outcomes without deep text understanding.
  2. BiDAF++: An enhancement of the BiDAF reading comprehension model with self-attention and contextualized (ELMo) embeddings, which substantially outperforms the simple baselines.
  3. Context-aware models: BiDAF++ is further extended to incorporate k turns of dialog history, marking the positions of previous answers in the passage and encoding the turn number in the question embedding. This dialog context significantly improves performance, but the model still falls short of humans by a notable margin (the answer-marking idea is sketched below).
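As an illustration of the answer-marking idea, the sketch below builds, for each passage token, a binary feature indicating whether that token was part of the answer in one of the previous k turns. The function name and exact featurization are hypothetical; the paper describes the mechanism, but this is not the authors' code.

```python
import numpy as np

def history_features(passage_len, prev_answers, k=2):
    """Binary features marking tokens answered in the previous k turns.

    prev_answers: list of (start, end) token spans, oldest first.
    Returns a (passage_len, k) array; column i marks the answer from
    i+1 turns back. Concatenate to token embeddings before encoding.
    """
    feats = np.zeros((passage_len, k), dtype=np.float32)
    for i, (start, end) in enumerate(reversed(prev_answers[-k:])):
        feats[start:end + 1, i] = 1.0
    return feats

# Example: a 10-token passage where turn t-1 answered tokens 2..4
# and turn t-2 answered tokens 7..8.
feats = history_features(10, [(7, 8), (2, 4)], k=2)
assert feats[3, 0] == 1.0 and feats[7, 1] == 1.0
```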

Results

  • Numerical Performance: The best model achieved an F1 score of 60.6, compared to the human upper bound of 80.8 F1. The model's performance on affirmation and continuation acts is strong, but it struggles particularly with unanswerable questions and shifts in dialog topic.
  • Human Equivalence Scores (HEQ): The HEQ metrics denote the percentage of questions (HEQ-Q) or full dialogs (HEQ-D) for which the model's output meets or exceeds human performance. The top model achieves 55.7% HEQ-Q and 5.3% HEQ-D, leaving substantial room for improvement (the computation is sketched below).
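To make the metrics concrete, the sketch below computes a SQuAD-style word-overlap F1 and the two HEQ aggregates from per-question (model F1, human F1) pairs. The official QuAC evaluation script additionally handles multiple reference answers per question and text normalization, both omitted here.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Word-overlap F1 between two answer strings (SQuAD-style)."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def heq(dialogs):
    """HEQ from per-question scores.

    dialogs: list of dialogs, each a list of (model_f1, human_f1) pairs.
    HEQ-Q: fraction of questions where the model matches or exceeds
    human F1. HEQ-D: fraction of dialogs where that holds for every
    question in the dialog.
    """
    questions = [pair for dialog in dialogs for pair in dialog]
    heq_q = sum(m >= h for m, h in questions) / len(questions)
    heq_d = sum(all(m >= h for m, h in dialog) for dialog in dialogs) / len(dialogs)
    return heq_q, heq_d
```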

Implications for Future Research

The results from QuAC highlight several promising research directions in the domain of AI and natural language processing:

  1. Advanced Contextual Models: Incorporating dialog history measurably improves performance, so further innovation in context-aware architectures is a natural next step.
  2. Handling Unanswerability: The dataset's unanswerable questions demand models that can recognize when no answer exists and abstain gracefully; the baselines' abstention mechanism is sketched after this list.
  3. Dialog Coherence and Continuity: The evolving question complexity and student engagement across a dialog point to the need for models with human-like conversational capabilities over extended interactions.
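On the second point, the paper's baselines let a standard span-extraction model abstain by appending a special CANNOTANSWER token to every passage, so that "no answer" is just another selectable span. A minimal sketch of that idea, with illustrative helper names:

```python
CANNOT_ANSWER = "CANNOTANSWER"

def prepare_passage(tokens):
    """Append the abstention token so a span model can select it."""
    return tokens + [CANNOT_ANSWER]

def predicted_no_answer(span_start, span_end, prepared_tokens):
    """True when the predicted span is exactly the appended token."""
    last = len(prepared_tokens) - 1
    return span_start == last and span_end == last
```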

Conclusion

QuAC presents a formidable challenge to current question answering systems, pushing the boundaries of contextual comprehension, answerability detection, and dialog management. The dataset serves as a crucial step towards developing AI capable of engaging in meaningful and contextually aware information-seeking conversations. The substantial gap between human performance and current model capabilities indicates significant potential for future advancements in this field.
