
MiTTenS: A Dataset for Evaluating Gender Mistranslation (2401.06935v3)

Published 13 Jan 2024 in cs.CL and cs.CY

Abstract: Translation systems, including foundation models capable of translation, can produce errors that result in gender mistranslation, and such errors can be especially harmful. To measure the extent of such potential harms when translating into and out of English, we introduce a dataset, MiTTenS, covering 26 languages from a variety of language families and scripts, including several traditionally under-represented in digital resources. The dataset is constructed with handcrafted passages that target known failure patterns, longer synthetically generated passages, and natural passages sourced from multiple domains. We demonstrate the usefulness of the dataset by evaluating both neural machine translation systems and foundation models, and show that all systems exhibit gender mistranslation and potential harm, even in high resource languages.

Authors (5)
  1. Kevin Robinson (10 papers)
  2. Sneha Kudugunta (14 papers)
  3. Romina Stella (2 papers)
  4. Sunipa Dev (28 papers)
  5. Jasmijn Bastings (19 papers)
Citations (1)

Summary

Introduction to MiTTenS Dataset

Misgendering in translation occurs when a system refers to a person in a way that does not align with their gender identity. The issue is well documented in machine translation, and prior work has catalogued many instances of gender bias. More recently, powerful multilingual foundation models capable of translation have emerged, yet these too can produce misgendering errors. In this paper, the authors introduce MiTTenS, a dataset covering 26 languages, designed to measure potential misgendering harms and to support improvements in translation quality across diverse language families and scripts.

Dataset Structure and Design

MiTTenS comprises multiple evaluation sets that assess potential harm when translating into English ("2en") and from English into other languages ("2xx"). The structure is designed for automated evaluation, focused on how grammatical gender is expressed in personal pronouns. The passages come from a variety of sources, including handcrafted passages targeting known failure patterns, longer synthetically generated texts, and natural passages drawn from multiple real-world domains; the synthetic passages in particular reduce the risk of contamination from pre-training data. The authors also embed 'canaries' in the dataset to enable robust checks for such contamination. A minimal sketch of pronoun-based automated scoring appears below.
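To make the automated-evaluation idea concrete, here is a minimal, hypothetical sketch of scoring a "2en" translation: it checks whether the English output uses only pronouns of the expected gender. The pronoun inventories and function names are illustrative assumptions, not the actual MiTTenS schema or evaluation code.

```python
# Hypothetical sketch of pronoun-based scoring for translations into English.
# Pronoun sets and function names are assumptions for illustration only.
import re

FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}

def gendered_pronouns(text: str) -> set:
    """Return the lowercase gendered pronouns appearing in an English text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t in FEMININE | MASCULINE}

def is_correctly_gendered(translation: str, expected: str) -> bool:
    """True iff the output uses only pronouns matching `expected` ('she' or 'he')."""
    found = gendered_pronouns(translation)
    right = FEMININE if expected == "she" else MASCULINE
    wrong = MASCULINE if expected == "she" else FEMININE
    return bool(found & right) and not (found & wrong)

# A passage whose source text marks the person as feminine:
print(is_correctly_gendered("She finished her degree last year.", "she"))  # True
print(is_correctly_gendered("He finished his degree last year.", "she"))   # False
```

A real evaluator would also need to handle passages with multiple entities, pronoun forms shared across genders, and, for the "2xx" direction, language-specific markers of grammatical gender beyond English-style pronouns.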

Evaluation Methodology and Results

The paper demonstrates MiTTenS' use in evaluating several neural machine translation systems and foundation models, and highlights the challenge of building systematic, culturally sensitive benchmarks given the global diversity in how languages express gender. Notably, while systems generally show high overall accuracy when translating into English, accuracy is consistently lower when the correct translation requires feminine pronouns ("she") than when it requires masculine ones ("he"). This discrepancy persists even in high-resource languages, suggesting the need for precise, targeted improvements rather than reliance on resource scale alone. A simple way to quantify the gap is sketched below.
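The he/she discrepancy can be summarized as a per-gender accuracy gap. The snippet below is an illustrative sketch rather than the paper's actual analysis code; the input format is an assumption.

```python
# Illustrative aggregation of per-gender accuracy; the (gender, correct)
# pair format is an assumption, not the paper's actual result files.
from collections import defaultdict

def accuracy_by_gender(results):
    """results: iterable of (expected_gender, correct) pairs, e.g. ('she', True)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gender, correct in results:
        totals[gender] += 1
        hits[gender] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

# Toy results for one system on one language pair:
results = [("she", True), ("she", False), ("he", True), ("he", True)]
acc = accuracy_by_gender(results)
print(acc)                     # {'she': 0.5, 'he': 1.0}
print(acc["he"] - acc["she"])  # he-she accuracy gap: 0.5
```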

Conclusion and Ethical Considerations

Releasing MiTTenS marks progress toward scaling evaluations to more languages and refining the measurement of potential translation harms. The authors acknowledge limitations, including the dataset's focus on binary gender expression and its exclusion of non-binary identities. They also caution that, while the dataset is an important step toward fairer translation systems, it does not cover all potential gender-related translation harms and should not be used to certify systems as harm-free. They encourage further research toward translation technologies that reflect all people's identities.