
TaTa: A Multilingual Table-to-Text Dataset for African Languages (2211.00142v1)

Published 31 Oct 2022 in cs.CL and cs.LG

Abstract: Existing data-to-text generation datasets are mostly limited to English. To address this lack of data, we create Table-to-Text in African languages (TaTa), the first large multilingual table-to-text dataset with a focus on African languages. We created TaTa by transcribing figures and accompanying text in bilingual reports by the Demographic and Health Surveys Program, followed by professional translation to make the dataset fully parallel. TaTa includes 8,700 examples in nine languages including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian). We additionally release screenshots of the original figures for future research on multilingual multi-modal approaches. Through an in-depth human evaluation, we show that TaTa is challenging for current models and that less than half the outputs from an mT5-XXL-based model are understandable and attributable to the source data. We further demonstrate that existing metrics perform poorly for TaTa and introduce learned metrics that achieve a high correlation with human judgments. We release all data and annotations at https://github.com/google-research/url-nlp.

Authors (7)
  1. Sebastian Gehrmann (48 papers)
  2. Sebastian Ruder (93 papers)
  3. Vitaly Nikolaev (12 papers)
  4. Jan A. Botha (10 papers)
  5. Michael Chavinda (1 paper)
  6. Ankur Parikh (9 papers)
  7. Clara Rivera (8 papers)
Citations (8)

Summary

An Overview of "TaTa: A Multilingual Table-to-Text Dataset for African Languages"

The paper "TaTa: A Multilingual Table-to-Text Dataset for African Languages" addresses the scarcity of non-English datasets for natural language generation (NLG) by introducing TaTa, a table-to-text dataset focused on African languages. By expanding the linguistic diversity available for data-to-text generation tasks, the dataset makes a substantial contribution to multilingual NLP.

Dataset Construction and Characteristics

The authors built TaTa from reports published by the Demographic and Health Surveys (DHS) Program, which issues bilingual reports in many countries. The resulting parallel corpus contains 8,700 examples across nine languages: four widely spoken African languages (Hausa, Igbo, Swahili, and Yorùbá) alongside Arabic, English, French, Portuguese, and Russian (used only as a zero-shot test language). Key characteristics of the TaTa dataset include:

  • Multilingual and Parallel Nature: Each example carries multiple references, transcribed from the original reports and professionally translated so that the dataset is fully parallel across languages.
  • Diverse Linguistic Representation: The dataset mixes colonial and indigenous African languages, echoing the linguistic diversity present in many African nations.
  • Complexity and Attribution: The dataset challenges contemporary models because many sentences require reasoning over multiple table cells, and a significant portion demands comparisons or aggregations.
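To feed such tables to a sequence-to-sequence model like mT5, a common approach is to linearize each table into a flat string. The paper's exact input format is not reproduced here; the sketch below uses invented field names and separators purely to illustrate the idea.

```python
def linearize_table(title: str, unit: str, rows: list) -> str:
    """Flatten a table into one string for a seq2seq model.

    Illustrative only: the field names and separators are
    assumptions, not the format used by the TaTa baselines.
    """
    cells = " ".join(
        f"({row['row_label']}, {row['column_label']}, {row['value']})"
        for row in rows
    )
    return f"table: {title} | unit: {unit} | cells: {cells}"


# A toy DHS-style table with two cells (values are made up).
example = linearize_table(
    "Use of modern contraceptive methods",
    "percent",
    [
        {"row_label": "Urban", "column_label": "2010", "value": 18},
        {"row_label": "Rural", "column_label": "2010", "value": 7},
    ],
)
print(example)
```

A generation model would then be trained to map such strings to the reference sentences, which is where the required comparisons and aggregations across cells become difficult.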

Evaluation and Model Challenges

A comprehensive human evaluation shows that existing models struggle with TaTa's complexity: less than half of the outputs generated by an mT5-XXL model were judged understandable and attributable to the source data. This outcome highlights the shortcomings of current multilingual models with respect to reasoning and attribution. Existing automatic metrics correlated poorly with these human judgments, prompting the development of new learned metrics, dubbed StATA, designed to capture the quality of generated text in closer alignment with human evaluations.
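The mismatch between automatic metrics and human judgments is typically quantified with a correlation coefficient over per-example scores. A minimal self-contained sketch (all scores below are invented for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-example scores: human quality ratings vs. an
# automatic metric. A low (or negative) correlation means the
# metric is a poor proxy for human judgment.
human = [0.9, 0.2, 0.7, 0.4, 0.8]
metric = [0.5, 0.6, 0.4, 0.7, 0.5]
print(pearson(human, metric))
```

A learned metric like StATA is trained so that its scores achieve a high correlation with human ratings under exactly this kind of comparison (the paper also considers rank-based correlations, which are robust to monotone rescaling of the metric).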

Implications for NLP

The TaTA dataset addresses a critical gap in the NLP community by providing essential resources for languages that are often missed in technology development cycles. Several potential implications emerge from this paper:

  1. Advancement in Cross-Lingual NLP: The dataset provides a unique testbed for exploring cross-lingual transfer, offering crucial insights into how models trained on one language can generalize to others, particularly under-represented languages.
  2. Assessment of Reasoning Capabilities: It offers a well-structured challenge for assessing the reasoning and interpretive capabilities of current NLG models, the kind of nuance particularly needed when moving beyond word-level tasks in multilingual settings.
  3. Metric Development: The demonstrated inadequacy of existing metrics on non-Western languages and tasks underscores the need for further research into more robust and contextually aware evaluation methods.

Future Research Directions

Building upon the insights provided by this paper, several avenues for future research are worthy of exploration:

  • Improving Cross-Lingual Model Transfer: Detailed analysis into what makes Swahili an effective intermediate language for transfer could unveil principles applicable to other languages and tasks.
  • Dataset Expansion: Introducing additional languages and more complex tables can extend the dataset's applicability and enhance the assessment landscape of NLG models.
  • Metric Refinement: Further enhancement and validation of metrics like StATA, especially in their ability to evaluate high-level reasoning and attribution, will be vital to progress in this research area.

In essence, the TaTA dataset opens doors for further investigation into multilingual data-to-text tasks, continuing to challenge existing paradigms and bringing a necessary spotlight to African languages within the field of NLP research. This work broadens the scope for future studies and provides an essential resource for developing more sophisticated, inclusive, and culturally aware natural language processing technologies.
