Overview of Europarl-ST: A Multilingual Speech Translation Corpus
The paper presents Europarl-ST, a multilingual corpus for speech translation (ST) research built from debates held in the European Parliament between 2008 and 2012. The work addresses a significant gap in spoken language translation (SLT): the scarcity of datasets covering diverse language pairs. The corpus provides speech-to-text translation data for six European languages (English, German, French, Spanish, Italian, and Portuguese), yielding 30 translation directions.
Data Collection and Processing
The dataset was built from publicly available video recordings of European Parliament sessions, organized through the LinkedEP database. The raw data suffered from incomplete and inaccurate metadata, notably unreliable timestamps, partial recordings, and missing translations after 2012, which necessitated a rigorous preprocessing pipeline. The authors employed speaker diarization and forced alignment to match audio segments accurately with their transcriptions and translations.
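As a rough illustration of the alignment step, the sketch below derives sentence-level timestamps from word-level forced-alignment output. The data structures and the assumed one-to-one word correspondence are simplifications for illustration, not the authors' actual format or tooling.

```python
# Illustrative sketch: deriving sentence-level (start, end) times from
# word-level forced-alignment output. Assumes the aligned word stream
# covers the sentences in order; not the paper's actual pipeline.
from dataclasses import dataclass

@dataclass
class AlignedWord:
    word: str
    start: float  # seconds
    end: float    # seconds

def sentence_timestamps(words: list[AlignedWord], sentences: list[str]):
    """Walk the aligned word stream and assign audio spans to each sentence."""
    spans, idx = [], 0
    for sent in sentences:
        n = len(sent.split())
        chunk = words[idx: idx + n]
        spans.append((sent, chunk[0].start, chunk[-1].end))
        idx += n
    return spans
```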
An essential step was filtering segments by character error rate (CER) to retain only high-quality samples; roughly 60-80% of the original data was discarded in this vetting process. The corpus ships with speaker-independent train/dev/test splits for every language pair, enabling consistent evaluation across languages.
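A minimal sketch of CER-based filtering is shown below; the edit-distance implementation, segment format, and 0.25 threshold are illustrative assumptions rather than the authors' exact procedure.

```python
# Minimal sketch of CER-based filtering; threshold and segment format
# are assumptions for illustration, not the paper's exact settings.

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two character sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of an aligned segment against its transcript."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)

def filter_segments(segments, max_cer=0.25):
    """Keep only segments whose aligned text matches the transcript well enough."""
    return [s for s in segments if cer(s["aligned_text"], s["transcript"]) <= max_cer]
```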
Experimental Setup and Results
The Europarl-ST corpus supports experiments in automatic speech recognition (ASR), machine translation (MT), and SLT via a cascade approach in which ASR output is fed to an MT system. The ASR systems built for the corpus are hybrid DNN-HMM models, evaluated in terms of Word Error Rate (WER). WER was generally below 20% for most source languages, indicating solid initial performance, with French audio being the main exception.
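Conceptually, the cascade setup can be sketched as below; `asr` and `mt` are hypothetical model interfaces standing in for whatever toolkit is used, not a specific library's API.

```python
# Schematic cascade SLT pipeline (ASR output fed to MT). This matches the
# paper's cascade setup only at a conceptual level; `asr` and `mt` are
# hypothetical interfaces, not a particular toolkit's API.

def cascade_translate(audio_segments, asr, mt, src_lang="en", tgt_lang="de"):
    """Transcribe each audio segment, then translate the (possibly noisy) transcript."""
    translations = []
    for segment in audio_segments:
        transcript = asr.transcribe(segment, lang=src_lang)   # may contain ASR errors
        translations.append(mt.translate(transcript, src=src_lang, tgt=tgt_lang))
    return translations
```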
The MT models use a Transformer architecture fine-tuned on Europarl-ST data to obtain in-domain systems, which improved BLEU scores by 1.9 to 4.0 points, indicating successful domain adaptation. The cascade SLT experiments showed the expected BLEU drops relative to standalone MT, attributable to error propagation from ASR and suboptimal input segmentation.
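The domain-adaptation comparison amounts to scoring the baseline and fine-tuned systems with corpus-level BLEU on the same test set, roughly as in the sketch below (using sacrebleu, assumed to be installed; the example sentences are placeholders, not the paper's outputs or scores).

```python
# Sketch of the evaluation step: out-of-domain baseline vs. system
# fine-tuned on Europarl-ST, scored with corpus-level BLEU via sacrebleu.
import sacrebleu

def bleu(hypotheses: list[str], references: list[str]) -> float:
    """Corpus BLEU with a single reference per segment."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# Placeholder data for illustration only.
baseline_outputs  = ["the parliament approves the report"]
finetuned_outputs = ["the parliament adopts the report"]
test_references   = ["the parliament adopts the report"]

gain = bleu(finetuned_outputs, test_references) - bleu(baseline_outputs, test_references)
print(f"in-domain gain: {gain:.1f} BLEU")
```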
Implications and Future Work
This paper contributes significantly to SLT research by providing a publicly available multilingual corpus that enables systematic exploration of multilingual and multidirectional ST tasks. Its introduction is a step towards more inclusive datasets that cover under-represented language pairs beyond the predominant focus on English-centric translation.
Looking forward, the corpus can be extended by adding more languages and by refining the filtering methods to increase the amount of usable data. The authors also plan to evaluate end-to-end models against the cascade systems, and to apply noisy data augmentation so that MT training inputs better resemble real ASR output.
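One simple form such augmentation could take is sketched below: lowercasing, punctuation removal, and light word-level noise applied to MT source sentences. The noise types and rates are assumptions for illustration, not the authors' method.

```python
# Hedged sketch of making MT training inputs resemble ASR output:
# lowercase, strip punctuation, and inject light word-level noise.
# Noise rates are illustrative, not taken from the paper.
import random
import string

def asr_like_noise(sentence: str, p_drop=0.05, p_swap=0.05, seed=None) -> str:
    rng = random.Random(seed)
    # ASR output typically lacks casing and punctuation.
    words = sentence.lower().translate(str.maketrans("", "", string.punctuation)).split()
    noisy = []
    for w in words:
        if rng.random() < p_drop:            # simulate a deleted word
            continue
        if rng.random() < p_swap and noisy:  # simulate a local word-order error
            noisy[-1], w = w, noisy[-1]
        noisy.append(w)
    return " ".join(noisy)

print(asr_like_noise("Mr President, I welcome this report.", seed=0))
```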
This research is a substantial addition to the ST field, offering valuable resources and insights for future work on multilingual SLT modeling and optimization.