Analyzing Large Target Vocabularies in Neural Machine Translation
The paper "On Using Very Large Target Vocabulary for Neural Machine Translation" addresses the computational challenges and performance limitations associated with Neural Machine Translation (NMT) systems when dealing with extensive target vocabularies. Authored by Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio, the work provides a methodological advancement to handle very large vocabularies without inducing prohibitive computational costs.
Key Contributions and Methodology
The primary contribution of the paper is an approximate training algorithm based on importance sampling. The central challenge in NMT with large vocabularies is that the cost of computing the output softmax, and therefore of training and decoding, grows linearly with the number of target words. As a result, systems typically restrict the target vocabulary to the most frequent words, which noticeably degrades translation quality whenever rare words appear. Existing strategies such as noise-contrastive estimation (NCE) and the hierarchical softmax reduce the training cost, but they do not fully resolve the problem, particularly at decoding time.
The paper proposes a biased importance-sampling technique that approximates the normalization constant needed to compute target-word probabilities. Because only a small sampled subset of the vocabulary is used at each update, the training cost stays roughly constant with respect to the target vocabulary size, which substantially improves training efficiency. In practice, the authors partition the training corpus and assign each partition a small subset of the target vocabulary, which keeps memory requirements modest and makes effective use of GPUs. A sketch of this approximation follows.
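To make this concrete, the NumPy sketch below contrasts the exact softmax normalization, whose cost grows with the full vocabulary, with a biased importance-sampling estimate computed over a small sampled subset. This is a toy illustration rather than the authors' implementation: the vocabulary size, the Zipf-like proposal, and names such as `proposal_q` and `W_out` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

V = 50_000          # toy target vocabulary size (the paper scales to 500K)
d = 32              # toy decoder hidden size
n_samples = 500     # size of the sampled subset V'

hidden = rng.normal(size=d)                 # decoder state at one time step
W_out = rng.normal(size=(V, d)) * 0.01      # output word representations
target_id = 42                              # index of the correct target word

# --- Exact softmax: the normalization sums over all V words -----------------
scores = W_out @ hidden                     # energy E(k) for every word k
log_Z = np.logaddexp.reduce(scores)         # full normalization constant
exact_log_prob = scores[target_id] - log_Z

# --- Biased importance-sampling approximation -------------------------------
# Proposal distribution Q over the vocabulary. A Zipf-like unigram proposal is
# used here so that the -log Q(k) correction actually matters; in the paper the
# proposal is effectively uniform over each corpus partition's word subset, in
# which case the correction term cancels.
freqs = 1.0 / np.arange(1, V + 1)
proposal_q = freqs / freqs.sum()

sample_ids = rng.choice(V, size=n_samples, replace=False, p=proposal_q)
subset = np.unique(np.concatenate([[target_id], sample_ids]))  # keep target in V'

# Importance-corrected energies E(k) - log Q(k), normalized only within V'.
corrected = scores[subset] - np.log(proposal_q[subset])
approx_log_Z = np.logaddexp.reduce(corrected)
approx_log_prob = (scores[target_id] - np.log(proposal_q[target_id])) - approx_log_Z

print(f"exact  log p(y): {exact_log_prob:.4f}")
print(f"approx log p(y): {approx_log_prob:.4f}  (subset of {subset.size} / {V} words)")
```

The point of the sampled path is that its cost depends on the subset size (here a few hundred words) rather than on the full vocabulary, which is what keeps training complexity roughly constant as the vocabulary grows.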
Experimental Setup and Results
The empirical evaluation was performed on English-to-French and English-to-German translation tasks using datasets consisting of millions of sentence pairs. Notably, the models were tested against the WMT'14 benchmark.
Key Findings:
- Translation Accuracy: The RNNsearch-LV model, with a vocabulary of 500,000 words, demonstrated superior performance compared to models with limited vocabularies. The BLEU scores for English-to-French reached up to 37.19 with model ensembles, indicating results comparable to state-of-the-art systems.
- Decoding Efficiency: A candidate-list approach greatly sped up decoding while maintaining high translation accuracy. For each source sentence, the target vocabulary is filtered down to a manageable candidate set, keeping decoding computationally feasible (see the first sketch after this list).
- Handling Unknown Words: Unknown target words were replaced by using the attention-derived alignments to locate the responsible source word and translating it with a dictionary (or copying it directly) during decoding, which further improved translation quality (see the second sketch after this list).
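The candidate-list idea can be sketched as follows. The lexicon, the specific sizes K and K', and the function name are illustrative assumptions rather than the paper's code, though the paper similarly combines the most frequent target words with per-source-word translation candidates.

```python
# A hypothetical sketch of building a per-sentence candidate list for decoding,
# assuming a precomputed `lexicon` mapping each source word to its most likely
# target words (e.g. from word alignments). Names and sizes are illustrative.
from typing import Dict, List, Set

def build_candidate_list(
    source_tokens: List[str],
    lexicon: Dict[str, List[str]],     # source word -> target candidates, best first
    frequent_targets: List[str],       # target words sorted by corpus frequency
    k_frequent: int = 15_000,          # K most frequent target words (assumed value)
    k_per_source: int = 10,            # K' candidates per source word (assumed value)
) -> Set[str]:
    """Restrict the target vocabulary to a small, sentence-specific candidate set."""
    candidates: Set[str] = set(frequent_targets[:k_frequent])
    for word in source_tokens:
        candidates.update(lexicon.get(word, [])[:k_per_source])
    return candidates

# During decoding, the softmax is computed only over `candidates`, so the
# per-step cost no longer depends on the full 500K-word vocabulary.
source = ["the", "european", "commission", "agreed"]
toy_lexicon = {"european": ["européenne", "européen"], "commission": ["commission"]}
toy_frequent = ["le", "la", "de", "et", "a"]
print(build_candidate_list(source, toy_lexicon, toy_frequent, k_frequent=5, k_per_source=2))
```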
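The unknown-word replacement step can likewise be sketched. The attention-weight and dictionary structures below are hypothetical stand-ins for whatever the decoder exposes, not the authors' interface.

```python
# A hypothetical sketch of unknown-word replacement at decode time. It assumes
# per-step attention weights (`attention[t][j]` = weight on source position j
# when emitting target position t) and a bilingual dictionary.
from typing import Dict, List

def replace_unknowns(
    target_tokens: List[str],
    source_tokens: List[str],
    attention: List[List[float]],      # one weight vector over source per target token
    dictionary: Dict[str, str],        # source word -> single best target translation
    unk_token: str = "<unk>",
) -> List[str]:
    output = []
    for t, token in enumerate(target_tokens):
        if token != unk_token:
            output.append(token)
            continue
        # Pick the source position the decoder attended to most at this step.
        j = max(range(len(source_tokens)), key=lambda i: attention[t][i])
        src = source_tokens[j]
        # Translate via the dictionary if possible, otherwise copy the source word.
        output.append(dictionary.get(src, src))
    return output

translated = replace_unknowns(
    ["la", "<unk>", "européenne"],
    ["the", "european", "commission"],
    attention=[[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]],
    dictionary={"commission": "commission"},
)
print(translated)   # ['la', 'commission', 'européenne']
```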
Practical and Theoretical Implications
The implications of this research are multifaceted:
- Practical Implementation: The proposed sampling method makes it practical to train and deploy NMT systems with much larger vocabularies, without the computational burden that previously made this infeasible. This is especially relevant for languages with large or morphologically rich vocabularies, where restricted shortlists produce frequent unknown words.
- Future Developments: Achieving training complexity that is constant with respect to the target vocabulary size opens avenues for further optimization of NMT architectures, potentially enabling real-time applications with even larger vocabularies or finer-grained lexical choices.
Conclusion
The methods introduced in this paper offer a tangible route past the target-vocabulary bottleneck in neural machine translation. By coupling importance sampling with strategic corpus partitioning, the authors present a framework that balances computational efficiency with translation accuracy. Future research can build on these foundations to explore more sophisticated mechanisms for handling extensive vocabularies, paving the way toward increasingly accurate and practically deployable NMT systems.