MELD-ST: An Emotion-aware Speech Translation Dataset (2405.13233v1)

Published 21 May 2024 in cs.CL

Abstract: Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset. Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings, highlighting the need for further research in emotion-aware speech translation systems.

References
  1. Findings of the IWSLT 2023 Evaluation Campaign. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023).
  2. Findings of the IWSLT 2022 evaluation campaign. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022).
  3. Gender in danger? evaluating speech translation technology on the MuST-SHE corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  4. GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio. In Proc. Interspeech 2021.
  5. MuST-C: a Multilingual Speech Translation Corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
  6. Emotion Recognition in Conversations: A Survey Focusing on Context, Speaker Dependencies, and Fusion Methods. Electronics.
  7. Breeding gender-aware direct speech translation systems. In Proceedings of the 28th International Conference on Computational Linguistics.
  8. CVSS Corpus and Massively Multilingual Speech-to-Speech Translation. In Proceedings of Language Resources and Evaluation Conference (LREC).
  9. MuST-cinema: a speech-to-subtitles corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference.
  10. CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition. In Speech and Computer.
  11. Direct speech-to-speech translation with discrete units. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  12. EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis.
  13. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
  14. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  15. Robust speech recognition via large-scale weak supervision.
  16. AudioPaLM: A Large Language Model That Can Speak and Listen.
  17. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation.
  18. Seamless: Multilingual Expressive and Streaming Speech Translation.
  19. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  20. Towards speech dialogue translation mediating speakers of different languages. In Findings of the Association for Computational Linguistics: ACL 2023.
  21. Lost in back-translation: Emotion preservation in neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics.
  22. CoVoST 2 and Massively Multilingual Speech Translation. In Proc. Interspeech 2021.
  23. Dialogs re-enacted across languages.
  24. ESPnet: End-to-End Speech Processing Toolkit. In Proc. Interspeech 2018.
  25. GigaST: A 10,000-hour Pseudo Speech Translation Corpus. In Proc. Interspeech 2023.

Summary

  • The paper presents a novel dataset, MELD-ST, that incorporates emotion labels to enhance the fidelity of speech translation.
  • It details a comprehensive methodology including OCR-based subtitle extraction, text cleaning, alignment, and strategic data splitting for robust emotion annotation.
  • Results show that emotion-aware fine-tuning improves BLEURT and ASR-BLEU scores in some settings, most notably for the English-to-Japanese task.

Emotion-Aware Speech Translation: The MELD-ST Dataset

The paper addresses the often-overlooked role of emotion in speech translation (ST), introducing the MELD-ST dataset to fill this gap. The dataset covers English-to-Japanese (En-Ja) and English-to-German (En-De) translation, with about 10,000 annotated utterances per language pair drawn from the Multimodal EmotionLines Dataset (MELD). The contribution is salient because the emotional nuance of human conversation, conveyed through vocal tone, facial expression, and other multimodal cues, is easily lost when translation systems ignore it.

Introduction and Motivation

The introduction motivates the work by arguing that cross-linguistic translation must convey emotion accurately to preserve the intended intensity and sentiment. The phrase "Oh my God!", for example, can be rendered very differently depending on its emotional context. Prior work in machine translation (MT) has begun to explore emotion-aware translation, but these efforts have largely been confined to text-to-text translation (T2TT). By contrast, emotion has received scant attention in speech-to-text translation (S2TT) and speech-to-speech translation (S2ST), despite marked improvements in ST performance brought about by recent datasets and models.

MELD-ST Dataset Creation

The MELD-ST dataset comprises approximately 10,000 utterances per language pair. The data is mined from the TV series Friends and inherits its emotion labels from the MELD dataset. The paper provides a detailed breakdown of dataset statistics, including the number of utterances and the duration of English and target-language speech.
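
To make the dataset's shape concrete, here is a minimal sketch of a plausible per-example record, assuming paired audio, transcripts, and a MELD emotion label; all field names are illustrative, not the dataset's actual schema.

```python
# Hypothetical sketch of what one MELD-ST example might contain;
# field names are illustrative, not the dataset's actual format.
example = {
    "id": "dia0_utt3",                # MELD-style dialogue/utterance ID
    "src_audio": "en/dia0_utt3.wav",  # English source speech clip
    "src_text": "Oh my God!",         # English transcript
    "tgt_text": "なんてこと！",          # target-language translation (En-Ja here)
    "tgt_audio": "ja/dia0_utt3.wav",  # dubbed target-language speech (for S2ST)
    "emotion": "surprise",            # one of MELD's seven emotion labels
}
```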

Subtitles and Timestamp Extraction: This phase used OCR tools to convert subtitle images into text and aligned the resulting text with the speech via subtitle timestamps.
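
As an illustration of the timestamp side of this step, here is a minimal sketch that parses SRT-style subtitle files, assuming the OCR stage has already produced them; the paper's actual tooling is not specified in this summary.

```python
import re
from datetime import timedelta

# Timestamp pattern such as "00:03:12,480" (comma or dot before milliseconds).
TS = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def parse_ts(s: str) -> timedelta:
    h, m, sec, ms = map(int, TS.match(s).groups())
    return timedelta(hours=h, minutes=m, seconds=sec, milliseconds=ms)

def read_srt(path: str):
    """Yield (start, end, text) triples from one SRT-style subtitle file."""
    blocks = open(path, encoding="utf-8").read().strip().split("\n\n")
    for block in blocks:
        lines = block.splitlines()
        # Line 0 is the cue index; line 1 holds "start --> end".
        start, end = (parse_ts(t.strip()) for t in lines[1].split("-->"))
        yield start, end, " ".join(lines[2:])
```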

Text Cleaning and Alignment: Heuristics were applied to mitigate OCR errors and duplicated speaker names, followed by a careful alignment process that combined audio extraction with CTC segmentation for precise timestamp correction.
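
Below is a sketch of the kind of cleaning heuristics described, with hypothetical speaker-prefix and OCR-confusion rules; the exact rules used for MELD-ST are not given in this summary.

```python
import re

# Illustrative heuristics only; the real pipeline's rules are not public here.
SPEAKER = re.compile(r"^[A-Z][A-Za-z .]*:\s*")  # e.g. "ROSS: " or "Monica: "
OCR_FIXES = {"|": "I", "0h": "Oh"}              # hypothetical OCR confusions

def clean_line(text: str) -> str:
    text = SPEAKER.sub("", text)                # drop a leading speaker name
    for bad, good in OCR_FIXES.items():
        text = text.replace(bad, good)          # patch common OCR mistakes
    return re.sub(r"\s+", " ", text).strip()    # collapse stray whitespace
```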

Data Splitting and Emotion Label Distribution: The dataset was split into training, development, and test sets, with attention to the emotion label distribution to ensure robust experimental analysis. The paper also reports the distribution of emotion labels within the dataset.
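
One way to realize such a distribution-preserving split is stratified sampling on the emotion labels, sketched below with scikit-learn; the dataset's actual split criteria may differ, so treat this purely as an illustration.

```python
from sklearn.model_selection import train_test_split

def split_by_emotion(examples, labels, seed=42):
    """80/10/10 split that keeps the emotion label distribution in each part."""
    train, rest, y_train, y_rest = train_test_split(
        examples, labels, test_size=0.2, stratify=labels, random_state=seed)
    dev, test, _, _ = train_test_split(
        rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return train, dev, test
```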

Experimental Settings

Baseline models for both the S2TT and S2ST tasks were derived from the SeamlessM4T v2 model and compared under three training configurations:

  • No fine-tuning
  • Fine-tuning without emotion labels
  • Fine-tuning with emotion labels

Three data configurations were used for fine-tuning: the En-Ja and En-De datasets separately, and a mixed dataset combining both. Evaluation used BLEURT for S2TT and ASR-BLEU for S2ST, with prosody evaluation metrics outlined in the supplementary materials.
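
A common way to expose an emotion label to a sequence-to-sequence model is to prepend a tag token to the text, sketched below; whether the paper's fine-tuning used exactly this mechanism is an assumption of this sketch.

```python
# MELD's seven emotion classes.
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def with_emotion_tag(text: str, emotion: str) -> str:
    """Prepend the emotion as a tag token, e.g. '<surprise> Oh my God!'."""
    assert emotion in EMOTIONS, f"unknown emotion label: {emotion}"
    return f"<{emotion}> {text}"
```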

Results and Discussion

S2TT Results: Fine-tuning with emotion labels notably improved translation quality in certain configurations. For instance, incorporating emotion labels yielded a statistically significant improvement in BLEURT scores for the En-Ja pair, affirming the relevance of emotion annotation to translation fidelity.
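
For reference, BLEURT scores can be computed via the Hugging Face `evaluate` wrapper, as sketched below; it requires the separate `bleurt` package, and the checkpoint used here is the wrapper's default, since the summary does not state which checkpoint the paper used.

```python
import evaluate

# Load the BLEURT metric (downloads a default checkpoint on first use).
bleurt = evaluate.load("bleurt", module_type="metric")

results = bleurt.compute(
    predictions=["なんてこと！"],      # system output
    references=["なんてことなの！"])   # reference translation
print(sum(results["scores"]) / len(results["scores"]))  # mean BLEURT
```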

S2ST Results: Fine-tuning the SeamlessM4T model improved ASR-BLEU results, albeit modestly. Prosody and vocal-similarity metrics remained largely unchanged, exposing the limitations of SeamlessM4T in capturing nuanced expressive features, a gap that models dedicated to prosodic fidelity, such as SeamlessExpressive, may help close.
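
ASR-BLEU, in outline, transcribes the generated target speech with an ASR model and scores the transcripts against reference translations with BLEU. The sketch below uses Whisper and sacrebleu as stand-ins; the paper's exact ASR-BLEU tooling may differ.

```python
import whisper                      # openai-whisper
from sacrebleu import corpus_bleu

# Transcribe generated speech, then score transcripts against references.
asr = whisper.load_model("large-v2")

def asr_bleu(wav_paths, references):
    hyps = [asr.transcribe(p, language="ja")["text"] for p in wav_paths]
    # "ja-mecab" tokenization for Japanese BLEU (requires mecab-python3).
    return corpus_bleu(hyps, [references], tokenize="ja-mecab").score
```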

Discussion: The En-De pair consistently outperformed En-Ja across both tasks, attributable to the linguistic proximity of English and German. Manual inspection suggested that emotion labels did not substantially alter the translation outputs in these language pairs, highlighting more refined emotion-sensitive translation mechanisms as an area for future investigation.

Conclusion and Limitations

This paper introduces the MELD-ST dataset, a pioneering corpus designed to advance emotion-aware speech translation. Initial experiments underscored the potential of emotion labels to enhance translation quality, particularly for language pairs with significant lexical and cultural divergence. Future research could integrate multitask models that jointly train for speech emotion recognition and ST, and could exploit dialogue context for a more holistic approach.

Limitations: Alignment discrepancies in the dataset and the reliance on acted speech underscore the necessity for further research in spontaneous dialogue contexts. Moreover, the basic ST models used could be augmented with more sophisticated architectures designed explicitly for emotion-aware applications.

Ethics Statement: The dataset will be made available under restricted access to prevent misuse and ensure it serves the intended purpose of advancing research in emotion-aware speech translation.

In closing, the introduction of MELD-ST marks a significant foray into the nuanced field of emotion-aware ST, laying the groundwork for future explorations into the seamless integration of emotional contexts in automated translation technologies.
