Overview of Sequence-to-Sequence Models for Direct Speech Translation
The paper "Sequence-to-Sequence Models Can Directly Translate Foreign Speech" by Weiss et al. examines the application of sequence-to-sequence (seq2seq) models to the challenging task of directly translating speech from one language into text in another. This is accomplished without explicitly transcribing the speech into text in the source language, or requiring source language transcription during training. The researchers leverage an attention-based recurrent neural network architecture previously utilized in speech recognition, adapting it to perform direct speech-to-text translation across languages. This exposition will discuss the methodology, results, and implications of this work.
Methodology
The authors propose a seq2seq architecture with attention, akin to the Listen, Attend and Spell (LAS) model, modified to translate speech without producing an intermediate transcription. The model encodes input audio features with a recurrent encoder and uses an attention mechanism to decode directly into translated text. A single model is trained end-to-end, so all components are optimized jointly rather than trained separately as in cascaded systems.
The architecture is primarily composed of the following components, sketched in code after this list:
- A recurrent encoder for converting input speech features into hidden representations.
- A stacked recurrent decoder that predicts output text directly in the target language.
- An attention mechanism to provide context over the entire sequence of input features.
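To make the layout concrete, here is a minimal PyTorch sketch of an attention-based encoder-decoder for speech-to-text translation. The class name, layer counts, and dimensions are illustrative assumptions and are much smaller than the configuration reported by Weiss et al.; the sketch shows the shape of the approach rather than reproducing their model.

```python
# Minimal sketch of an attention-based seq2seq model for direct speech-to-text
# translation, loosely following the LAS-style layout described above.
# Layer sizes and the two-layer encoder/decoder are illustrative choices,
# not the exact configuration used by Weiss et al.
import torch
import torch.nn as nn

class SpeechTranslator(nn.Module):
    def __init__(self, n_mels=80, enc_dim=256, dec_dim=256, vocab_size=1000):
        super().__init__()
        # Recurrent encoder: maps input filterbank frames to hidden states.
        self.encoder = nn.LSTM(n_mels, enc_dim, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Decoder embedding + stacked recurrent decoder over target tokens.
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.decoder = nn.LSTM(dec_dim + 2 * enc_dim, dec_dim,
                               num_layers=2, batch_first=True)
        # Additive (Bahdanau-style) attention over encoder states.
        self.attn_query = nn.Linear(dec_dim, enc_dim)
        self.attn_key = nn.Linear(2 * enc_dim, enc_dim)
        self.attn_score = nn.Linear(enc_dim, 1)
        self.output = nn.Linear(dec_dim + 2 * enc_dim, vocab_size)

    def forward(self, feats, targets):
        # feats:   (batch, frames, n_mels) acoustic features
        # targets: (batch, tokens) target-language token ids (teacher forcing)
        enc_out, _ = self.encoder(feats)              # (B, T, 2*enc_dim)
        keys = self.attn_key(enc_out)                 # precompute attention keys
        emb = self.embed(targets)                     # (B, U, dec_dim)

        B, U, _ = emb.shape
        context = enc_out.new_zeros(B, enc_out.size(-1))
        state, logits = None, []
        for u in range(U):
            # One decoder step conditioned on the previous context vector.
            step_in = torch.cat([emb[:, u], context], dim=-1).unsqueeze(1)
            dec_out, state = self.decoder(step_in, state)
            dec_out = dec_out.squeeze(1)              # (B, dec_dim)
            # Attention: score each encoder frame against the decoder state.
            query = self.attn_query(dec_out).unsqueeze(1)       # (B, 1, enc_dim)
            scores = self.attn_score(torch.tanh(keys + query))  # (B, T, 1)
            weights = torch.softmax(scores, dim=1)
            context = (weights * enc_out).sum(dim=1)            # (B, 2*enc_dim)
            logits.append(self.output(torch.cat([dec_out, context], dim=-1)))
        return torch.stack(logits, dim=1)             # (B, U, vocab_size)
```

The key property illustrated is that the decoder emits target-language text conditioned only on the acoustic encoder states, with no source-language transcript anywhere in the pipeline.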
This architecture was trained and evaluated on the Fisher Callhome Spanish-English speech translation task. It translated directly from Spanish audio to English text, outperforming a cascade of independently trained seq2seq automatic speech recognition (ASR) and neural machine translation (NMT) models.
Results
The seq2seq model obtained state-of-the-art performance, improving over the cascade of independently trained seq2seq ASR and NMT models by 1.8 BLEU points on the Fisher test set. Multi-task training, in which the speech translation and speech recognition tasks share an encoder network, yielded a further 1.4 BLEU point gain. Beyond the BLEU improvements, the end-to-end approach runs a single model at inference time and does not depend on source-language transcripts, which is attractive for low-resource languages where transcribed data may be scarce or unavailable.
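The multi-task setup can be pictured as one shared speech encoder feeding two decoders, one emitting source-language (Spanish) transcripts and one emitting English translations, with gradients from both losses updating the shared encoder. The sketch below is a hedged illustration of that idea in PyTorch; the module names, the mean-pooled stand-in for attention, the equal loss weighting, and the dummy batch are assumptions for brevity, not the paper's configuration.

```python
# Illustrative sketch of multi-task training with a shared speech encoder:
# one head predicts Spanish transcripts (ASR), the other English translations (ST).
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, n_mels=80, enc_dim=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, enc_dim, num_layers=2,
                           batch_first=True, bidirectional=True)

    def forward(self, feats):                      # (B, T, n_mels)
        out, _ = self.rnn(feats)                   # (B, T, 2*enc_dim)
        return out

class TaskDecoder(nn.Module):
    """Simplified decoder head; a full model would attend over encoder states."""
    def __init__(self, enc_dim=256, dec_dim=256, vocab_size=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.rnn = nn.LSTM(dec_dim + 2 * enc_dim, dec_dim, batch_first=True)
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, enc_out, targets):
        # Mean-pooled encoder summary stands in for attention to keep this short;
        # BOS/EOS handling and target shifting are also omitted for brevity.
        ctx = enc_out.mean(dim=1, keepdim=True).expand(-1, targets.size(1), -1)
        dec_in = torch.cat([self.embed(targets), ctx], dim=-1)
        dec_out, _ = self.rnn(dec_in)
        return self.out(dec_out)                   # (B, U, vocab_size)

encoder = SharedEncoder()
asr_head = TaskDecoder()                           # Spanish transcripts
st_head = TaskDecoder()                            # English translations
params = (list(encoder.parameters()) + list(asr_head.parameters())
          + list(st_head.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
feats = torch.randn(4, 200, 80)                    # 4 utterances, 200 frames each
es_tokens = torch.randint(0, 500, (4, 20))         # Spanish reference transcripts
en_tokens = torch.randint(0, 500, (4, 25))         # English reference translations

opt.zero_grad()
enc_out = encoder(feats)                           # shared representation
asr_logits = asr_head(enc_out, es_tokens)
st_logits = st_head(enc_out, en_tokens)
loss = (loss_fn(asr_logits.reshape(-1, 500), es_tokens.reshape(-1)) +
        loss_fn(st_logits.reshape(-1, 500), en_tokens.reshape(-1)))
loss.backward()                                    # both tasks update the shared encoder
opt.step()
```

At inference time only the translation head is needed, so the auxiliary recognition task adds training signal without adding runtime cost.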
Implications
This research demonstrates that a single seq2seq model can handle direct speech translation, bypassing the traditional intermediate step of source-language transcription. Removing the cascade reduces system complexity and inference latency, which matters in settings where response time and computational resources are constrained.
The shared encoder used in the multi-task setting also points toward leveraging auxiliary data, such as source-language transcripts or related speech corpora, across tasks and languages, potentially benefiting low-resource translation through multilingual training. Future work could extend the framework to multilingual translation, with a single model handling multiple language pairs.
In conclusion, the paper by Weiss et al. reports compelling results in direct speech-to-text translation with seq2seq models, opening new possibilities for practical applications and for tighter integration of speech recognition and neural machine translation.