SeamlessM4T: Massively Multilingual & Multimodal Machine Translation (2308.11596v3)

Published 22 Aug 2023 in cs.CL

Abstract: What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communication


Overview of SeamlessM4T: Multilingual and Multimodal Translation

The paper "SeamlessM4T: Massively Multilingual & Multimodal Machine Translation" presents SeamlessM4T, a single model for translating speech and text across many languages. It combines speech-to-speech translation (S2ST), speech-to-text translation (S2TT), text-to-speech translation (T2ST), text-to-text translation (T2TT), and automatic speech recognition (ASR) in one system that supports up to 100 languages.
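The five tasks differ only in their input and output modalities and in whether the language changes. The following is a minimal sketch of that task grid, not the SeamlessM4T API: the names `Modality`, `Task`, `TASKS`, `cascade_is_valid`, and `end_to_end` are hypothetical, and the `TTS` stage is an assumption added only to illustrate a cascaded pipeline. It shows why a chain of separate models such as ASR, then T2TT, then TTS covers the same speech-in, speech-out signature that a unified model serves as a single S2ST task.

```python
from enum import Enum
from typing import NamedTuple


class Modality(Enum):
    SPEECH = "speech"
    TEXT = "text"


class Task(NamedTuple):
    source: Modality
    target: Modality
    translates: bool  # False when the output stays in the input language


# The five tasks named in the paper, keyed by abbreviation.
TASKS = {
    "S2ST": Task(Modality.SPEECH, Modality.SPEECH, True),
    "S2TT": Task(Modality.SPEECH, Modality.TEXT, True),
    "T2ST": Task(Modality.TEXT, Modality.SPEECH, True),
    "T2TT": Task(Modality.TEXT, Modality.TEXT, True),
    "ASR": Task(Modality.SPEECH, Modality.TEXT, False),
    # Assumption: a monolingual text-to-speech stage, not one of the five
    # tasks above, included only so a cascade can end in speech.
    "TTS": Task(Modality.TEXT, Modality.SPEECH, False),
}


def cascade_is_valid(stages):
    """Return True if each stage's output modality feeds the next stage."""
    tasks = [TASKS[name] for name in stages]
    return all(a.target == b.source for a, b in zip(tasks, tasks[1:]))


def end_to_end(stages):
    """Return the (input, output) modalities of a whole cascade."""
    tasks = [TASKS[name] for name in stages]
    return tasks[0].source, tasks[-1].target


if __name__ == "__main__":
    cascade = ["ASR", "T2TT", "TTS"]
    print(cascade_is_valid(cascade))
    print(end_to_end(cascade) == (TASKS["S2ST"].source, TASKS["S2ST"].target))
```

Each extra stage in a cascade is a separate model whose errors feed the next one, which is the limitation the unified design is meant to address.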

Research Contributions

Multimodal and Multilingual Integration: Earlier systems were largely text-centric or relied on cascades of separate models that translate in stages. SeamlessM4T instead accepts and produces both speech and text within a single model, so one system covers all five tasks.
Large-Scale Data Utilization: The team used one million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. They also created SeamlessAlign, a multimodal corpus of over 470,000 hours of automatically aligned speech translations, which was filtered and combined with human-labeled and pseudo-labeled data to build the system.
Performance Metrics and Advancements: On FLEURS, SeamlessM4T improves on the previous state of the art by 20% BLEU in direct S2TT into multiple target languages. Compared to strong cascaded models, it improves into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. In speech-to-text robustness tests, the model shows average improvements of 38% and 49% against background noise and speaker variations, respectively, over the previous state-of-the-art model.
Responsible AI and Open Sourcing: The evaluation goes beyond translation quality to cover gender bias and added toxicity. The authors report a reduction of up to 63% in added toxicity. All contributions, including models and data, are openly accessible at https://github.com/facebookresearch/seamless_communication to support further research.
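The figures above mix two kinds of gain: percentages such as 20% BLEU, and point differences such as 1.3 BLEU. Assuming the percentages are relative gains over a baseline score, the sketch below shows how the two are computed and why they are not interchangeable. The scores 20.0 and 24.0 are hypothetical values chosen for round arithmetic, not results from the paper, and the function names are illustrative.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in metric points, e.g. BLEU points."""
    return new - baseline


def relative_gain(baseline: float, new: float) -> float:
    """Gain as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical BLEU scores, for illustration only.
    baseline_bleu, new_bleu = 20.0, 24.0
    print(f"absolute: {absolute_gain(baseline_bleu, new_bleu):.1f} points")
    print(f"relative: {relative_gain(baseline_bleu, new_bleu):.0%}")
```

The same point difference gives a larger relative gain when the baseline is low, so a percentage and a point figure should only be compared once the baseline is known.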

Implications and Future Directions

In practical terms, SeamlessM4T could support on-demand translation for real-time communication across languages. On the research side, it shows that self-supervised speech representations combined with large-scale aligned data can serve several speech and text translation tasks in one multimodal model.

Future research can build on this model by making speech output more expressive, developing low-latency translation systems, and improving performance across a wider range of languages. The focus on multimodality also opens the way to integrating additional modes of communication, such as visual inputs, into translation systems.

In conclusion, SeamlessM4T marks a pivotal step in multilingual and multimodal machine translation, aligning technical innovation with practical applications to support an interconnected, global society. The open-source release further invites the research community to refine and expand upon these contributions, ensuring that the benefits of seamless communication are accessible to a broader spectrum of users worldwide.