
MiTTenS: A Dataset for Evaluating Gender Mistranslation (2401.06935v3)

Published 13 Jan 2024 in cs.CL and cs.CY

Abstract: Translation systems, including foundation models capable of translation, can produce errors that result in gender mistranslation, and such errors can be especially harmful. To measure the extent of such potential harms when translating into and out of English, we introduce a dataset, MiTTenS, covering 26 languages from a variety of language families and scripts, including several traditionally under-represented in digital resources. The dataset is constructed with handcrafted passages that target known failure patterns, longer synthetically generated passages, and natural passages sourced from multiple domains. We demonstrate the usefulness of the dataset by evaluating both neural machine translation systems and foundation models, and show that all systems exhibit gender mistranslation and potential harm, even in high resource languages.

Authors (5)
  1. Kevin Robinson (10 papers)
  2. Sneha Kudugunta (14 papers)
  3. Romina Stella (2 papers)
  4. Sunipa Dev (28 papers)
  5. Jasmijn Bastings (19 papers)
Citations (1)

Summary

Introduction to MiTTenS Dataset

Misgendering in translation occurs when a system refers to a person in a way that does not align with their gender identity. The issue is well documented in machine translation, and prior work has catalogued many instances of gender bias. More recently, powerful multilingual foundation models capable of translation have emerged, yet these too can produce misgendering errors. In this paper, the authors introduce MiTTenS, a dataset covering 26 languages, designed to measure potential misgendering harms and to support improvements in translation quality across diverse language families and scripts.

Dataset Structure and Design

MiTTenS comprises multiple evaluation sets that assess potential harm when translating into English ("2en") and from English into other languages ("2xx"). The structure is designed for automated evaluation, focused on how grammatical gender is expressed in personal pronouns. The passages come from a variety of sources, including handcrafted passages targeting known failure patterns, longer synthetically generated texts, and natural passages drawn from multiple real-world domains; the synthetic passages in particular reduce the risk of contamination from pre-training data. The authors also embed 'canaries' in the dataset to enable robust checks for such contamination. A minimal sketch of pronoun-based automated scoring appears below.
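To make the automated-evaluation idea concrete, here is a minimal, hypothetical sketch of scoring a "2en" translation: it checks whether the English output uses only pronouns of the expected gender. The pronoun inventories and function names are illustrative assumptions, not the actual MiTTenS schema or evaluation code.

```python
# Hypothetical sketch of pronoun-based scoring for translations into English.
# Pronoun sets and function names are assumptions for illustration only.
import re

FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}

def gendered_pronouns(text: str) -> set:
    """Return the lowercase gendered pronouns appearing in an English text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t in FEMININE | MASCULINE}

def is_correctly_gendered(translation: str, expected: str) -> bool:
    """True iff the output uses only pronouns matching `expected` ('she' or 'he')."""
    found = gendered_pronouns(translation)
    right = FEMININE if expected == "she" else MASCULINE
    wrong = MASCULINE if expected == "she" else FEMININE
    return bool(found & right) and not (found & wrong)

# A passage whose source text marks the person as feminine:
print(is_correctly_gendered("She finished her degree last year.", "she"))  # True
print(is_correctly_gendered("He finished his degree last year.", "she"))   # False
```

A real evaluator would also need to handle passages with multiple entities, pronoun forms shared across genders, and, for the "2xx" direction, language-specific markers of grammatical gender beyond English-style pronouns.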

Evaluation Methodology and Results

The paper demonstrates MiTTenS' use in evaluating several neural machine translation systems and foundation models, and highlights the challenge of building systematic, culturally sensitive benchmarks given the global diversity in how languages express gender. Notably, while systems generally show high overall accuracy when translating into English, accuracy is consistently lower when the correct translation requires feminine pronouns ("she") than when it requires masculine ones ("he"). This discrepancy persists even in high-resource languages, suggesting the need for precise, targeted improvements rather than reliance on resource scale alone. A simple way to quantify the gap is sketched below.
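The he/she discrepancy can be summarized as a per-gender accuracy gap. The snippet below is an illustrative sketch rather than the paper's actual analysis code; the input format is an assumption.

```python
# Illustrative aggregation of per-gender accuracy; the (gender, correct)
# pair format is an assumption, not the paper's actual result files.
from collections import defaultdict

def accuracy_by_gender(results):
    """results: iterable of (expected_gender, correct) pairs, e.g. ('she', True)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gender, correct in results:
        totals[gender] += 1
        hits[gender] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

# Toy results for one system on one language pair:
results = [("she", True), ("she", False), ("he", True), ("he", True)]
acc = accuracy_by_gender(results)
print(acc)                     # {'she': 0.5, 'he': 1.0}
print(acc["he"] - acc["she"])  # he-she accuracy gap: 0.5
```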

Conclusion and Ethical Considerations

Releasing MiTTenS marks progress toward scaling evaluations to more languages and refining the measurement of potential translation harms. The authors acknowledge limitations, including the dataset's focus on binary gender expression and its exclusion of non-binary identities. They also caution that, while the dataset is an important step toward fairer translation systems, it does not cover all potential gender-related translation harms and should not be used to certify systems as harm-free. They encourage further research toward translation technologies that reflect all people's identities.