Summary of "Neural Machine Translation of Rare Words with Subword Units"
The paper "Neural Machine Translation of Rare Words with Subword Units" addresses the critical issue of handling rare and out-of-vocabulary (OOV) words in neural machine translation (NMT) systems. Traditional NMT models often depend on a fixed vocabulary size, which can range between 30,000 and 50,000 words. This fixed-size approach contrasts with the nature of human languages, where new words continuously enter the lexicon, necessitating solutions that allow for open-vocabulary translation.
Key Contributions
The paper makes two primary contributions to the field of NMT:
- Subword Encoding for Open-Vocabulary Translation: The authors propose handling rare and OOV words by breaking them down into subword units. This is based on the insight that several word classes, including names, compounds, cognates, and loanwords, can be translated through units smaller than words, for example by transliteration or compositional translation.
- Byte Pair Encoding Adaptation: They adapt the Byte Pair Encoding (BPE) algorithm, originally a data compression technique, for word segmentation. By iteratively merging the most frequent pairs of characters or character sequences, BPE builds a fixed-size subword vocabulary that can nonetheless represent an open vocabulary, keeping both the vocabulary and the encoded text compact (a sketch of the merge loop follows this list).
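The paper illustrates this merge loop with a short Python routine; the annotated sketch below follows that reference implementation, run on the paper's toy vocabulary. Words are represented as space-separated symbol sequences ending in a special end-of-word marker `</w>`.

```python
import collections
import re

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of `pair` with its concatenation."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in v_in.items():
        v_out[pattern.sub(''.join(pair), word)] = freq
    return v_out

def learn_bpe(vocab, num_merges):
    """Learn an ordered list of merge operations from a word-frequency dict."""
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return merges

# Toy vocabulary from the paper: word -> corpus frequency.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = learn_bpe(vocab, num_merges=10)
print(merges[:3])  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each merge adds one new symbol to the subword vocabulary, so the desired vocabulary size directly determines the number of merge operations to learn.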
Methodology
The paper follows the NMT architecture introduced by Bahdanau et al. (2014), implementing an encoder-decoder model with recurrent neural networks. The encoder is a bidirectional recurrent network that produces an annotation vector for each source position, while the decoder generates the translation one word at a time, conditioned on an attention-weighted combination of these annotations and on the previously predicted words.
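Concretely, in this architecture each target word is predicted from the previous word, the current decoder state, and a context vector computed by attention over the encoder annotations. These are the standard equations of Bahdanau et al., not notation specific to this paper:

```latex
p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) = g(y_{i-1}, s_i, c_i),
\qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j,
\qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},
\qquad e_{ij} = a(s_{i-1}, h_j)
```

where h_j concatenates the forward and backward encoder states at source position j, s_i is the decoder's hidden state, and a is a learned feed-forward alignment model.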
Segmentation Techniques
Several segmentation techniques are evaluated to determine their effectiveness in terms of vocabulary and text size, and their impact on translation quality:
- Character n-grams: Simple character-based segmentation with granularity varied from unigrams to trigrams; in the translation experiments, rare words are segmented into character bigrams while a shortlist of frequent words is kept intact.
- Traditional SMT methods: Approaches like frequency-based compound splitting, rule-based hyphenation, and the Morfessor algorithm.
- Proposed BPE Models: Both independent BPE models (merge operations learned separately for source and target) and joint BPE models (merge operations learned on the union of the source and target vocabularies). At translation time, the learned merges are replayed on each word, as sketched after this list.
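At test time, segmentation applies the learned merge operations to each word, with earlier (more frequent) merges taking priority. The helper below is an illustrative sketch of that step rather than the paper's released code; it assumes `merges` is the ordered list returned by the `learn_bpe` routine shown earlier.

```python
def apply_bpe(word, merges):
    """Segment `word` by greedily replaying learned BPE merge operations."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # earlier merge = lower rank
    symbols = list(word) + ['</w>']
    while len(symbols) > 1:
        # Rank every adjacent symbol pair; pairs never learned are unmergeable.
        candidates = [(ranks.get((a, b), float('inf')), i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(candidates)
        if rank == float('inf'):
            break  # no learned merge applies
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# With the merges learned from the toy vocabulary above, the unseen word
# 'lowest' still segments into known subwords:
# apply_bpe('lowest', merges) -> ['low', 'est</w>']
```

Because individual characters remain in the base vocabulary, any word, however rare, can be represented as a sequence of known subword units.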
Experimental Results
Experiments were conducted on the WMT 2015 English-to-German and English-to-Russian translation tasks. Performance metrics included BLEU scores, chrF3 scores, and unigram F1 scores, with a particular focus on rare and OOV words. Key findings are:
- Translation of Rare Words: Subword models outperformed traditional word-based models with back-off dictionaries. BPE-based models improved over the back-off dictionary baseline by up to 1.1 BLEU for English-to-German and 1.3 BLEU for English-to-Russian.
- Efficiency and Vocabulary Size: BPE models struck a favorable balance between vocabulary size and text length, showing that a well-chosen subword segmentation can reduce sparsity and improve translation quality even with a smaller vocabulary.
- Consistency in Segmentation: Joint BPE models performed better than independent BPE models by improving the consistency of subword units across source and target languages, aiding in the learning of mappings between subword units.
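In practice, joint BPE simply means learning one set of merges on the combined source and target data, so that strings shared across languages (names in particular) are segmented identically on both sides. A minimal sketch, assuming `source_vocab` and `target_vocab` are word-frequency dicts in the same format as above:

```python
import collections

# Pool the two vocabularies (summing frequencies for shared words),
# then learn a single merge list used to segment both languages.
joint_vocab = collections.Counter(source_vocab)
joint_vocab.update(target_vocab)
joint_merges = learn_bpe(dict(joint_vocab), num_merges=10)
```

For language pairs with different scripts, such as English-Russian, the paper first transliterates the Russian vocabulary into Latin characters so that the merge operations can be shared.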
Implications and Future Directions
The results demonstrate that subword units can significantly improve the translation of rare and OOV words in NMT systems, making subword models an effective substitute for large vocabularies and back-off dictionaries. This approach can be especially beneficial for languages with rich morphological structures, such as agglutinative or compounding languages.
The authors suggest that future work could focus on:
- Optimal Vocabulary Size: Automatically determining the best vocabulary size for specific translation tasks and datasets.
- Enhanced Subword Alignment: Developing bilingual segmentation algorithms to create more alignable and meaningful subword units.
- Further Model Improvements: Incorporating advancements such as dropout and better ensemble techniques to raise overall NMT performance.
Conclusion
In summary, the paper contributes to the ongoing development of NMT systems by demonstrating that subword units can effectively address the open-vocabulary problem in translation tasks. By leveraging BPE for word segmentation, this research simplifies the translation pipeline while improving accuracy, especially in the context of rare word translation. The results indicate promising directions for future work, including the refinement of subword segmentation and the exploration of even more compact and efficient translation models.