
ELI5: Long Form Question Answering (1907.09190v1)

Published 22 Jul 2019 in cs.CL

Abstract: We introduce the first large-scale corpus for long-form question answering, a task requiring elaborate and in-depth answers to open-ended questions. The dataset comprises 270K threads from the Reddit forum "Explain Like I'm Five" (ELI5) where an online community provides answers to questions which are comprehensible by five year olds. Compared to existing datasets, ELI5 comprises diverse questions requiring multi-sentence answers. We provide a large set of web documents to help answer the question. Automatic and human evaluations show that an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline. However, our best model is still far from human performance since raters prefer gold responses in over 86% of cases, leaving ample opportunity for future improvement.

Authors (6)
  1. Angela Fan (49 papers)
  2. Yacine Jernite (47 papers)
  3. Ethan Perez (55 papers)
  4. David Grangier (55 papers)
  5. Jason Weston (130 papers)
  6. Michael Auli (73 papers)
Citations (522)

Summary

Understanding Long-Form Question Answering: Insights from the ELI5 Dataset

The paper "ELI5: Long Form Question Answering" explores long-form question answering (LFQA) by introducing the ELI5 dataset, a substantial contribution comprising 270K question-answer threads from the subreddit "Explain Like I'm Five." The task is to produce multi-sentence, explanatory answers to complex, open-ended questions, drawing on supporting web documents.

Dataset Characteristics and Challenges

The ELI5 dataset differs markedly from previous question answering datasets, which focus predominantly on extractive or short-form answers. With an average of 42.2 words per question and 130.6 words per answer, ELI5 requires extensive reasoning and synthesis of information. Each question is paired with supporting documents averaging 857.6 words, extracted by TF-IDF ranking of web passages followed by context selection.
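The TF-IDF ranking step can be illustrated with a toy sketch. This is not the paper's pipeline (which operates over large web dumps with its own tokenization and smoothing choices); it is a minimal, self-contained stand-in showing how candidate passages are scored against a question by term overlap weighted by inverse document frequency:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(question, passages):
    """Rank candidate passages by TF-IDF similarity to the question.

    A toy stand-in for the paper's support-document selection step.
    """
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    vocab = {t for d in docs for t in d}
    # Smoothed inverse document frequency, as in common TF-IDF variants.
    idf = {t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1
           for t in vocab}
    q_terms = tokenize(question)
    scores = [sum(d[t] * idf[t] for t in q_terms if t in d) for d in docs]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

passages = [
    "The sky appears blue because air scatters short wavelengths of light.",
    "Soccer is played by two teams of eleven players.",
    "Rayleigh scattering explains why the sky is blue during the day.",
]
ranking = tfidf_rank("Why is the sky blue?", passages)
# The two sky-related passages outrank the soccer one.
```

In the actual dataset construction, the top-ranked passages are then concatenated (context selection) to form the ~857-word support document for each question.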

Several challenges are evident:

  • Length and Diversity: Questions and answers span multiple sentences, addressing complex queries that cannot be resolved with simple, extractive methods.
  • Document Utilization: The dataset demands efficient retrieval and abstraction of information from lengthy sources, a task that challenges existing models' capabilities.

Model Evaluation and Performance

The research benchmarks a range of extractive and abstractive architectures. A Seq2Seq model with multi-task training emerges as the strongest approach, outperforming standard Seq2Seq, language modeling, and extractive baselines such as BidAF. The multi-task model mixes auxiliary training objectives, including language modeling and masked word prediction, which improve its ability to generate coherent and relevant answers.
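The masked word prediction objective can be sketched at the data-preparation level. The snippet below is a simplified illustration, not the paper's implementation: the masking rate, mask symbol, and seeding are assumptions made for the example. The idea is to corrupt a subset of input tokens and ask the model to predict the originals at the masked positions only:

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with a mask symbol.

    Returns the corrupted input and per-position targets: the original
    token at masked positions, None elsewhere (no loss is computed there).
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            targets.append(tok)   # model must predict the original token
        else:
            corrupted.append(tok)
            targets.append(None)  # unmasked position, no prediction target
    return corrupted, targets

tokens = "the sky is blue because of rayleigh scattering".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.3)
```

In multi-task training, batches drawn from this objective are interleaved with standard sequence-to-sequence and language modeling batches, so the same model parameters are updated by all tasks.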

Despite these advances, the multi-task model's outputs remain notably less preferred than human-written answers: raters favor the gold responses in over 86% of cases. This gap underscores how far current methods remain from the nuanced understanding and comprehensive expression characteristic of human answers.

Implications and Future Directions

The implications of this research are manifold. Practically, it offers a framework for developing systems capable of engaging in more natural, human-like discourse. Theoretically, it exposes the limitations of existing models, highlighting the necessity for advancements in multi-document synthesis, contextual understanding, and linguistic fluency.

Future directions could explore:

  • Enhanced Retrieval Algorithms: Improving document filtering and selection to increase the relevance and completeness of support data, potentially through advanced neural retrieval methods.
  • Contextual Reasoning: Enhancing models' abilities to integrate and reason over multiple pieces of information, possibly leveraging advancements in neural symbolic reasoning.
  • Expanded Training Data: Utilizing multiple reference answers to provide richer training signals, thereby improving model generalization and robustness.
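Scoring against multiple reference answers usually means evaluating a candidate against every gold answer and keeping the best match. The sketch below illustrates this with a simple unigram-F1 similarity; the paper's evaluation uses ROUGE, so the metric here is a stand-in chosen to keep the example self-contained:

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall between two strings."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def multi_reference_score(candidate, references):
    """Score the candidate against each reference and keep the best match."""
    return max(unigram_f1(candidate, ref) for ref in references)
```

Taking the maximum rewards a candidate that matches any one acceptable answer, which matters for ELI5 because a single question often has several valid gold answers with little lexical overlap between them.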

In conclusion, the ELI5 dataset and the accompanying research represent a significant step towards sophisticated long-form question answering. The path ahead involves complex challenges but holds the promise of more natural human-computer interaction, with systems able to generate coherent, informative answers across diverse subjects.