The IIT Bombay English-Hindi Parallel Corpus (1710.02855v2)

Published 8 Oct 2017 in cs.CL

Abstract: We present the IIT Bombay English-Hindi Parallel Corpus. The corpus is a compilation of parallel corpora previously available in the public domain as well as new parallel corpora we collected. The corpus contains 1.49 million parallel segments, of which 694k segments were not previously available in the public domain. The corpus has been pre-processed for machine translation, and we report baseline phrase-based SMT and NMT translation results on this corpus. This corpus has been used in two editions of shared tasks at the Workshop on Asian Language Translation (2016 and 2017). The corpus is freely available for non-commercial research. To the best of our knowledge, this is the largest publicly available English-Hindi parallel corpus.

Summary

The paper introduces an English-Hindi parallel corpus of 1.49M segments, 694K of which were not previously available in the public domain.
The paper describes a sentence alignment method that reaches 88.6% precision when extracting parallel sentences from comparable data such as the Gyaan-Nidhi Corpus.
The baseline SMT and NMT systems score 11.75 and 12.23 BLEU, respectively, in the English-to-Hindi direction, giving reference points for future work.

An In-depth Analysis of the IIT Bombay English-Hindi Parallel Corpus

The paper presents the IIT Bombay English-Hindi Parallel Corpus, a large-scale dataset built to support machine translation between English and Hindi. With 1.49 million parallel segments, it is, to the best of the authors' knowledge, the largest publicly available English-Hindi parallel corpus, and it combines pre-existing and newly collected parallel corpora. Of these segments, 694,000 (about 47%) were not previously available in the public domain, which substantially expands the data available for developing and benchmarking machine translation systems.

Dataset Composition and Characteristics

The corpus aggregates data from multiple domains and sources, including open repositories such as OPUS and datasets collected by the authors. The new subsets, including Judicial domain corpora, Mahashabdkosh, and Indian Government corpora, broaden the range of content the corpus covers. The compilation also involves a sentence alignment step that extracts parallel sentences from comparable datasets such as the Gyaan-Nidhi Corpus, with a reported sentence alignment precision of 88.6%.
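To make the precision figure above concrete, the following minimal sketch computes alignment precision as the fraction of extracted sentence pairs that are judged correct. The function name, the pair representation, and the toy data are assumptions for illustration; this summary does not describe how the authors sampled or judged pairs.

```python
def alignment_precision(predicted_pairs, gold_pairs):
    """Fraction of predicted sentence pairs that also appear in the gold set."""
    predicted = set(predicted_pairs)
    if not predicted:
        return 0.0  # no predictions: report 0 rather than divide by zero
    correct = predicted & set(gold_pairs)
    return len(correct) / len(predicted)


# Hypothetical toy data: (english_sentence_id, hindi_sentence_id) pairs.
predicted = [(0, 0), (1, 1), (2, 3), (4, 4)]
gold = [(0, 0), (1, 1), (2, 2), (4, 4)]

# Three of the four predicted pairs are in the gold set.
print(f"precision = {alignment_precision(predicted, gold):.3f}")
```

Precision only measures how many extracted pairs are correct; it says nothing about how many true pairs the aligner missed, which would require a recall estimate.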

Baseline Machine Translation Systems

The paper reports baseline results for both Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). For SMT, the authors use Moses with the grow-diag-final-and heuristic for phrase extraction, alongside the toolkit's tuning and model estimation steps. The NMT system is built with the Nematus toolkit and uses a subword-level encoder-decoder architecture with attention. Byte Pair Encoding (BPE) splits words into subword units, which keeps the vocabulary manageable for two languages with different scripts and helps with rare and morphologically complex words.
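The BPE step mentioned above can be illustrated with a toy version of the merge-learning procedure: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. This is a sketch of the general technique, not the authors' preprocessing pipeline; the function names, the `</w>` end-of-word marker, the word frequencies, and the number of merges are all assumptions chosen for illustration.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    # Match the pair only at symbol boundaries (whitespace or string edges).
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word-frequency dict."""
    # Start from characters, with an end-of-word marker on each word.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical toy word frequencies.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(vocab)
```

Splitting on Python characters is a simplification: for Devanagari text it separates base consonants from their combining vowel signs, which a production tokenizer may or may not want.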

Given the morphological richness of Hindi, the baseline results are modest: BLEU scores of 11.75 for SMT and 12.23 for NMT in the English-to-Hindi direction. These figures give future work on this language pair a reference point to compare against.
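For readers unfamiliar with how a BLEU score like those above is computed, here is a minimal single-reference, corpus-level sketch: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It is an illustration of the metric only. This summary does not say which scoring script or tokenization the authors used, so numbers from this sketch are not comparable with the reported scores; the function names and example sentences are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Counter intersection keeps the minimum count, i.e. clipping.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Hypothetical example sentences, already tokenized on whitespace.
hypothesis = "the cat sat on the mat".split()
reference = "the cat sat on a mat".split()
print(f"BLEU = {corpus_bleu([hypothesis], [reference]):.2f}")
```

Because the counts are pooled over the whole test set before the precisions are computed, corpus-level BLEU is not the average of per-sentence scores.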

Implications and Future Directions

The release of this dataset matters for both applied and research work in machine translation. On the applied side, a corpus of this size makes it possible to train and evaluate stronger English-Hindi translation models. On the research side, its mix of domains makes it a natural testbed for domain adaptation, transfer learning, and related techniques.

Looking forward, the authors plan to expand the corpus further, particularly by incorporating content from Indian government web domains. This would increase the corpus size and could also widen its domain coverage, both of which help model training and evaluation. The authors also propose improvements to the baseline systems, such as pre-ordering in phrase-based SMT (PBSMT) and back-translation in NMT, which could improve translation quality.
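The back-translation idea mentioned above can be sketched as follows: monolingual target-language sentences are translated into the source language by a reverse model, and the resulting synthetic pairs are added to the real parallel data. Everything here is hypothetical scaffolding: `back_translate`, the toy lexicon-based `toy_hindi_to_english` stand-in for a trained Hindi-to-English system, and the example sentences are assumptions, not part of the paper.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_monolingual: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic_source, real_target) pairs from monolingual target text."""
    synthetic_pairs = []
    for target_sentence in target_monolingual:
        synthetic_source = reverse_translate(target_sentence)
        if synthetic_source.strip():  # skip sentences the reverse model failed on
            synthetic_pairs.append((synthetic_source, target_sentence))
    return synthetic_pairs


def toy_hindi_to_english(sentence: str) -> str:
    """Hypothetical stand-in for a trained Hindi-to-English model."""
    lexicon = {"नमस्ते": "hello", "दुनिया": "world"}
    return " ".join(lexicon[tok] for tok in sentence.split() if tok in lexicon)


# Real parallel data (English, Hindi) plus monolingual Hindi sentences.
real_pairs = [("good morning", "सुप्रभात")]
hindi_monolingual = ["नमस्ते दुनिया", "अज्ञात"]

synthetic_pairs = back_translate(hindi_monolingual, toy_hindi_to_english)
training_pairs = real_pairs + synthetic_pairs
print(training_pairs)
```

The target side of every synthetic pair is genuine Hindi text, so an English-to-Hindi model trained on the mix still learns to produce fluent output even when the synthetic English input is noisy.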

Conclusion

The IIT Bombay English-Hindi Parallel Corpus is a significant addition to the resources available for computational linguistics on Indian languages. Its size, its free availability for non-commercial research, and the authors' plans for further expansion position it to serve as a benchmark for English-Hindi machine translation. As the corpus grows, it should support more accurate translation systems for this language pair.
