QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

Published 27 Jul 2021 in cs.CL and cs.AI | (2107.12708v2)

Abstract: Alongside huge volumes of research on deep learning models in NLP in recent years, there has also been much work on benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with over 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae for future work. We further discuss the current classifications of "skills" that question answering/reading comprehension systems are supposed to acquire, and propose a new taxonomy. The supplementary materials survey the current multilingual resources and monolingual resources for languages other than English, and we discuss the implications of over-focusing on English. The study is aimed at both practitioners looking for pointers to the wealth of existing data, and at researchers working on new resources.

Authors (3)

Citations (151)

Summary

The paper introduces a comprehensive taxonomy that classifies a wide spectrum of QA and reading comprehension datasets.
It details methodologies for constructing, curating, and evaluating diverse datasets within the evolving NLP landscape.
The analysis uncovers emerging trends in dataset growth, guiding future resource development and standardized research.

An Examination of the ACM Consolidated Article Template

The paper provides a comprehensive guide to the ACM consolidated article template introduced in 2017, which establishes a unified LaTeX style across ACM publications. It covers the accessibility and metadata-extraction features that digital archiving and retrieval depend on. The same template system serves both conference and journal submissions under the ACM umbrella, so an article can be prepared for either with minimal modifications to the source files.

Template Overview and Styles

The "acmart" document class accommodates several document types through the choice of template style and parameters. The paper lists the styles offered for journals ("acmsmall," "acmlarge," and "acmtog") and for conference proceedings ("acmconf," "sigchi," and "sigplan"). Authors select the style that matches their publication type within ACM's framework.
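As a rough illustration of how a template style is selected, the sketch below passes the "acmsmall" journal style named above as a class option. The title, author, institution, and body text are placeholders invented for the example, not content from the paper.

```latex
% Minimal sketch: choosing a template style via a class option.
% "acmsmall" is the default journal style named in the text;
% "acmlarge" or "acmtog" would be swapped in the same position.
\documentclass[acmsmall]{acmart}

\begin{document}

\title{A Placeholder Title}          % hypothetical title

\author{A. Author}                   % hypothetical author
\affiliation{%
  \institution{Example University}   % hypothetical institution
  \country{Country}}

% In acmart the abstract is declared before \maketitle.
\begin{abstract}
Placeholder abstract text.
\end{abstract}

\maketitle

\section{Introduction}
Placeholder body text.

\end{document}
```

Only the bracketed option changes between styles; the rest of the source stays the same, which is the "minimal modifications" point made earlier.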

Parameters and Modification Constraints

The paper outlines several frequently used template parameters: "anonymous,review" for double-blind review, "authorversion" for author-distributed versions, and "screen" for colored hyperlinks. It also states the constraints on template modification explicitly: unauthorized changes, such as altered margins or substituted typefaces, will require the author to revise the submission.
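The parameters listed above are ordinary class options, so they can be combined with a style in the same bracket. The following is a minimal sketch under that assumption; the document content is placeholder text, and the choice of "acmsmall" as the accompanying style is arbitrary.

```latex
% Double-blind submission: "anonymous" suppresses author information,
% "review" typesets the paper with numbered lines for reviewers.
\documentclass[acmsmall,anonymous,review]{acmart}

% Alternatives, one at a time (uncomment instead of the line above):
% \documentclass[acmsmall,authorversion]{acmart} % author-distributed copy
% \documentclass[acmsmall,screen]{acmart}        % colored hyperlinks

\begin{document}

\title{A Placeholder Title}          % hypothetical title

\author{A. Author}                   % hidden in the output by "anonymous"
\affiliation{%
  \institution{Example University}   % hypothetical institution
  \country{Country}}

\begin{abstract}
Placeholder abstract text.
\end{abstract}

\maketitle

\section{Introduction}
Placeholder body text.

\end{document}
```

Note that none of these options touch margins or typefaces, which stay under the control of the class.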

Typography and Organizational Structure

The paper highlights the mandated "Libertine" typeface family, which gives ACM publications a standardized visual identity. For document structure, it requires the standard LaTeX sectioning commands. This uniformity keeps the print and digital versions of ACM publications consistent.

Rights, Metadata, and Searchability

The paper describes the publication-rights process, which requires authors to add specific LaTeX commands that embed rights-management and reference-format text in their documents. Authors are also urged to use ACM's Computing Classification System (CCS) and user-defined keywords so that their work is easier to search for and categorize in digital libraries.
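A preamble along the following lines shows roughly where the rights, CCS, and keyword commands sit. In practice the rights commands are supplied to authors once the rights form is completed; every value here (copyright type, year, DOI, CCS concept, keywords) is a placeholder chosen for illustration.

```latex
\documentclass[acmsmall]{acmart}

% Rights management: placeholder values, normally supplied by ACM.
\setcopyright{acmlicensed}
\copyrightyear{2021}
\acmYear{2021}
\acmDOI{XXXXXXX.XXXXXXX}             % placeholder DOI, left unfilled

\begin{document}

\title{A Placeholder Title}          % hypothetical title

\author{A. Author}                   % hypothetical author
\affiliation{%
  \institution{Example University}   % hypothetical institution
  \country{Country}}

\begin{abstract}
Placeholder abstract text.
\end{abstract}

% CCS concept with a relevance weight; the concept is an example only.
\ccsdesc[500]{Computing methodologies~Natural language processing}

% User-defined keywords, also placeholders.
\keywords{first keyword, second keyword}

\maketitle

\section{Introduction}
Placeholder body text.

\end{document}
```

The CCS and keyword declarations come before \maketitle so that the class can typeset them together with the title block.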

Figures, Tables, and Mathematical Representation

The paper sets out best practices for presenting figures and tables, including the use of the "booktabs" package for tables and detailed captions and descriptions for figures. It also covers the styles for mathematical equations, distinguishing inline, numbered display, and non-numbered display styles so that mathematical content is represented consistently.
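The practices above might look like the following in source form. This is a sketch with invented placeholder data; a drawn rectangle stands in for an image file so the example does not depend on an external graphic.

```latex
\documentclass[acmsmall]{acmart}
\usepackage{booktabs} % harmless if the class has already loaded it

\begin{document}

\title{A Placeholder Title}          % hypothetical title
\author{A. Author}                   % hypothetical author
\affiliation{%
  \institution{Example University}   % hypothetical institution
  \country{Country}}
\maketitle

\section{Examples}

% Table with booktabs rules; the caption is placed above the body.
\begin{table}
  \caption{Placeholder table caption}
  \begin{tabular}{lr}
    \toprule
    Item & Count \\
    \midrule
    A & 1 \\
    B & 2 \\
    \bottomrule
  \end{tabular}
\end{table}

% Figure with a caption and a text description for accessibility.
\begin{figure}
  \centering
  \rule{4cm}{2cm} % stand-in for \includegraphics
  \Description{A solid rectangle used as a stand-in image.}
  \caption{Placeholder figure caption}
\end{figure}

% Inline math sits in running text.
Inline: $a^2 + b^2 = c^2$.

% Numbered display.
\begin{equation}
  a^2 + b^2 = c^2
\end{equation}

% Non-numbered display.
\begin{displaymath}
  a^2 + b^2 = c^2
\end{displaymath}

\end{document}
```

The \Description text is separate from the caption: the caption is printed, while the description is meant for readers who cannot see the figure.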

Implications and Future Directions

The consolidated template does more than enforce aesthetic uniformity: it gives authors a systematic way to prepare documents and eases the review process across ACM publications. By governing document preparation through a strict yet flexible template, ACM keeps visual and structural representation consistent, which in turn aids metadata extraction and digital library integration.

Looking towards the future, one can anticipate that similar template systems will continue to evolve, possibly adopting more advanced features to accommodate emerging technology in typesetting or digital publishing. As the demands of digital libraries grow, such integrated templates might further encompass compatibility with automated archiving systems and advanced semantic annotation mechanisms, facilitating deeper interconnectivity of academic work.

In summary, the ACM consolidated article template modernizes and simplifies the publication process across a wide range of computing literature, making it easier for researchers to publish their work within a widely recognized framework.