scb-mt-en-th-2020: A Large English-Thai Parallel Corpus (2007.03541v1)

Published 7 Jul 2020 in cs.CL

Abstract: The primary objective of our work is to build a large-scale English-Thai dataset for machine translation. We construct an English-Thai machine translation dataset with over 1 million segment pairs, curated from various sources, namely news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, and government documents. The methodology for gathering data, building parallel texts, and removing noisy sentence pairs is presented in a reproducible manner. We train machine translation models based on this dataset. Our models' performance is comparable to that of the Google Translation API (as of May 2020) for Thai-English translation, and the models outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai-English and English-Thai translation. The dataset, pre-trained models, and source code to reproduce our work are available for public use.

Authors (4)

Lalita Lowphansirikul
Charin Polpanumas
Attapol T. Rutherford
Sarana Nutanong

Citations (20)
