A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation (1603.06147v4)

Published 19 Mar 2016 in cs.CL and cs.LG

Abstract: The existing machine translation systems, whether phrase-based or neural, have relied almost exclusively on word-level modelling with explicit segmentation. In this paper, we ask a fundamental question: can neural machine translation generate a character sequence without any explicit segmentation? To answer this question, we evaluate an attention-based encoder-decoder with a subword-level encoder and a character-level decoder on four language pairs (En-Cs, En-De, En-Ru and En-Fi) using the parallel corpora from WMT'15. Our experiments show that the models with a character-level decoder outperform the ones with a subword-level decoder on all of the four language pairs. Furthermore, the ensembles of neural models with a character-level decoder outperform the state-of-the-art non-neural machine translation systems on En-Cs, En-De and En-Fi and perform comparably on En-Ru.

Authors (3)
  1. Junyoung Chung (10 papers)
  2. Kyunghyun Cho (292 papers)
  3. Yoshua Bengio (601 papers)
Citations (330)

Summary

The paper investigates whether neural machine translation (NMT) models can operate at the character level without explicit segmentation, a departure from traditional word- and subword-level approaches. It evaluates an attention-based encoder-decoder architecture with a subword-level encoder and a character-level decoder across four language pairs from WMT'15: English-Czech, English-German, English-Russian, and English-Finnish.
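
To ground the setup, here is a minimal PyTorch sketch of this kind of architecture: a bidirectional GRU encodes BPE subwords, and a GRU character decoder attends over the encoder states once per generated character. The class name, parameter names, single-layer depths, and the concatenation-based attention scorer are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubwordToCharNMT(nn.Module):
    """Sketch: subword-level encoder, character-level decoder with attention.
    Sizes and layer choices are illustrative, not the paper's settings."""
    def __init__(self, src_vocab=30000, tgt_vocab=300, emb=256, hid=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)              # subword embeddings
        self.encoder = nn.GRU(emb, hid, bidirectional=True,
                              batch_first=True)                  # BPE-level encoder
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)              # character embeddings
        self.dec_cell = nn.GRUCell(emb + 2 * hid, hid)           # char-level decoder step
        self.attn = nn.Linear(3 * hid, 1)                        # concat-based scorer
        self.out = nn.Linear(hid, tgt_vocab)                     # next-character logits

    def forward(self, src_ids, tgt_in):
        enc, _ = self.encoder(self.src_emb(src_ids))             # (B, S, 2*hid)
        B, S, _ = enc.shape
        h = enc.new_zeros(B, self.dec_cell.hidden_size)
        logits, alignments = [], []
        for t in range(tgt_in.size(1)):                          # one step per character
            # Score every source subword against the current decoder state.
            scores = self.attn(torch.cat(
                [h.unsqueeze(1).expand(B, S, -1), enc], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=-1)                    # soft alignment (B, S)
            ctx = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1)  # context vector
            x = torch.cat([self.tgt_emb(tgt_in[:, t]), ctx], dim=-1)
            h = self.dec_cell(x, h)
            logits.append(self.out(h))
            alignments.append(alpha)
        return torch.stack(logits, dim=1), torch.stack(alignments, dim=1)
```

Note that only the source side is pre-segmented into subwords; the decoder emits one softmax over characters per step, so the target side needs no segmentation at all.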

Key Findings

  • Translation Quality: Empirical results indicate that models utilizing a character-level decoder outperform those employing a subword-level decoder across all tested language pairs. The character-level approach also surpasses state-of-the-art non-neural translation systems for English-Czech, English-German, and English-Finnish, and shows comparable performance on English-Russian.
  • Model Architecture: The research explores two configurations for the character-level decoder:

    1. A stacked recurrent neural network (RNN) using gated recurrent units (GRUs).
    2. A newly proposed bi-scale recurrent network, designed to handle multiple timescales in sequence data effectively.

The bi-scale configuration, which pairs a fast processing layer with a slower, gated one, showed modest improvements over the stacked RNN in some settings, though both configurations proved viable for character-level translation.
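
The bi-scale decoder is the more novel of the two configurations. Below is a heavily simplified sketch of its core idea, assuming a sigmoid gate computed from the fast state controls how strongly the slow state updates at each character; the paper's actual gating equations differ in detail, and all names and sizes here are illustrative.

```python
import torch
import torch.nn as nn

class BiScaleDecoderCell(nn.Module):
    """Illustrative two-timescale cell; NOT the paper's exact formulation."""
    def __init__(self, inp=256, hid=512):
        super().__init__()
        self.fast = nn.GRUCell(inp + hid, hid)  # updates at every character
        self.slow = nn.GRUCell(hid, hid)        # updated softly, via a gate
        self.gate = nn.Linear(hid, hid)

    def forward(self, x, h_fast, h_slow):
        # The fast layer reads the input character plus the slow-layer context.
        h_fast = self.fast(torch.cat([x, h_slow], dim=-1), h_fast)
        # Gate in (0, 1): how strongly to rewrite the slow state this step,
        # so the slow layer changes mainly when the fast layer signals it.
        g = torch.sigmoid(self.gate(h_fast))
        h_slow = g * self.slow(h_fast, h_slow) + (1.0 - g) * h_slow
        return h_fast, h_slow
```

The intent is that the fast layer tracks within-chunk (roughly within-word) dynamics while the slow layer carries longer-range context, even though no explicit segmentation is ever provided.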

Implications and Future Directions

  • Addressing Data Sparsity: The paper provides evidence that the data sparsity challenges exacerbated by character-level sequences can be effectively mitigated by the parametric nature of neural models. This contrasts with traditional count-based systems, which are non-parametric and suffer from exponential growth in their state spaces as sequences lengthen.

  • Potential for Morphological Variants: The character-level approach holds promise for more effectively handling morphological variants, a significant advantage in machine translation tasks involving morphologically rich languages.
  • Impact on Alignment Mechanisms: An analysis of soft alignments demonstrates that, even at the character level, these models accurately align source subwords with target character sequences, underscoring the robustness of attention mechanisms in such granular translation tasks (a sketch of extracting these alignments follows this list).
  • Further Research Opportunities: While the paper focuses on character-level decoding with subword-encoded source sequences, the findings lay groundwork for future exploration into full character-level translation on both source and target sides. Such research could further demonstrate character-level translation's viability and practical utility in NMT.
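
Because the decoder attends over subword-level encoder states at every character step, the soft alignments are directly available as the attention weights. Here is a hedged example of reading them off the sketch model defined above; `model`, `src_ids`, and `tgt_in` are hypothetical placeholders, and the untrained weights make the actual alignment values meaningless here.

```python
import torch

# Reuse the sketch model from the architecture section above (hypothetical setup).
model = SubwordToCharNMT()
src_ids = torch.randint(0, 30000, (1, 12))   # one sentence of 12 subword ids
tgt_in = torch.randint(0, 300, (1, 40))      # 40 teacher-forced target characters

with torch.no_grad():
    _, alignments = model(src_ids, tgt_in)   # shape (B, T_chars, S_subwords)

# Hard alignment per generated character: argmax over source positions.
hard = alignments.argmax(dim=-1)             # shape (1, 40)
print(hard[0].tolist())
```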

Conclusion

The research challenges the prevailing assumption that word-level segmentation is a prerequisite for effective machine translation, highlighting instead the feasibility and benefits of character-level approaches. These findings open new possibilities for NMT systems: processing and translating text without explicit segmentation simplifies the translation pipeline and broadens its applicability across diverse languages and scripts.