
Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources (2211.15649v1)

Published 28 Nov 2022 in cs.CL and cs.AI

Abstract: While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including input text and label sources and tools used to build them, and what they study, tasks they address and motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.

A Survey of Multilingual Dataset Construction and Necessary Resources

The paper "Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources" presents a comprehensive meta-analysis of 156 publicly available multilingual NLP datasets. This paper is particularly important as it goes beyond the conventional resource quantification based on dataset counts by examining qualitative aspects of dataset construction, such as the diversity of annotation techniques and input sources. The authors aim to expose the resource disparities across languages and provide recommendations for future multilingual dataset development.

Study Overview

The paper meticulously annotates datasets using a scheme comprising 13 attributes, covering how datasets are created, tasks addressed, motivations, and the tools used. The authors introduce categories such as task type, dataset size, creator, and the source and method of input text and label collection, providing a nuanced perspective on the landscape of multilingual datasets.
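The annotation scheme can be pictured as a simple record per dataset. The sketch below is illustrative only: the field names are assumptions except for the attributes the survey names (task type, dataset size, creator, and the source and collection method of input text and labels); the full 13-attribute scheme is defined in the paper itself.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a per-dataset annotation record.
# Only task_type, size, creator, input_source, and label_method
# correspond to attributes explicitly named in the survey's scheme;
# the remaining fields and all example values are illustrative.
@dataclass
class DatasetAnnotation:
    name: str                      # dataset identifier
    languages: list[str]           # languages the dataset covers
    task_type: str                 # e.g. "QA", "NER", "NLI"
    size: int                      # number of annotated examples
    creator: str                   # e.g. "academia", "industry"
    input_source: str              # e.g. "Wikipedia", "news"
    label_method: str              # e.g. "expert", "crowdsourced",
                                   #      "automatic", "translated"
    motivation: str = "unknown"    # e.g. "cross-lingual transfer"
```

A record like this makes it straightforward to aggregate datasets by language, label method, or motivation when reproducing the survey's quantitative analysis.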

Key Findings

  1. Task and Language Coverage: The survey includes datasets covering 222 languages with an average of 5.6 datasets per language. However, the distribution is uneven, with high-resource languages enjoying greater diversity in task types and input sources compared to low-resource ones, which often rely on automatically induced labels and limited sources like Wikipedia.
  2. Dataset Construction Methods: Automatically induced labels are prevalent, particularly in low-resource languages: about one-third of the datasets use them, which can compromise quality. Comparatively few datasets are manually annotated, whether by crowdworkers or domain experts, highlighting the need for more reliable data collection methods across languages.
  3. Translation Utilization: Translation is a common way to build multilingual datasets, especially for cross-lingual evaluation tasks; roughly a fifth of the surveyed datasets are created this way. When the translation is automatic, it risks quality degradation and translation artifacts.
  4. Motivations and Dataset Creation: Four primary motivations for dataset creation are identified: cross-lingual transfer, multilingual tasks, monolingual tasks, and monolingual general benchmarks. The motivation affects the choice of whether to use translation, with cross-lingual transfer datasets relying heavily on translated data.
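The headline coverage statistic in finding 1 (an average of 5.6 datasets per language over 222 languages) is a simple aggregation over per-dataset language lists. A minimal sketch, using a toy three-dataset example rather than the paper's actual data:

```python
from collections import Counter

def datasets_per_language(annotations):
    """Count how many datasets cover each language, plus the mean coverage."""
    counts = Counter()
    for ann in annotations:
        for lang in ann["languages"]:
            counts[lang] += 1
    mean = sum(counts.values()) / len(counts)
    return counts, mean

# Toy example (not the survey's data): three datasets, three languages.
anns = [
    {"languages": ["en", "sw"]},
    {"languages": ["en"]},
    {"languages": ["en", "yo", "sw"]},
]
counts, mean = datasets_per_language(anns)
# counts: en=3, sw=2, yo=1; mean coverage = 6/3 = 2.0
```

The same per-language tally, broken down further by label method or input source, is what reveals the qualitative gap: high-resource languages spread across many task types and sources, low-resource ones cluster on automatically induced labels and Wikipedia-derived text.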

Implications and Recommendations

The survey's findings indicate that the disparity in datasets is not only in quantity but also in quality and diversity. There is a strong need for NLP researchers with language proficiency to facilitate robust multilingual dataset creation. Additionally, the availability of language-proficient researchers and crowdworkers significantly correlates with the number of datasets available in a language, suggesting that fostering a more linguistically diverse research community is critical.

The authors provide several recommendations:

  • Encourage community efforts and create inclusive evaluation and publishing venues to foster research in under-represented languages.
  • Develop guidelines for enhancing crowdsourcing platforms and gather insights for quality control during the annotation process.
  • Promote shared tasks to drive dataset creation for low-resource and diverse languages, supporting the expansion of multilingual NLP.

Conclusion

This paper offers valuable insights into the current state of multilingual NLP datasets and highlights the qualitative gaps that exist among different languages. By compiling extensive data on dataset construction processes, the authors provide a roadmap for future research directions and practical strategies to improve multilingual data collection. This work is a significant step toward an equitable AI landscape, ensuring that language technologies are more inclusive and representative of global linguistic diversity.

Authors (5)
  1. Xinyan Velocity Yu (10 papers)
  2. Akari Asai (35 papers)
  3. Trina Chatterjee (2 papers)
  4. Junjie Hu (111 papers)
  5. Eunsol Choi (76 papers)
Citations (19)