Building the Language Resource for a Cebuano-Filipino Neural Machine Translation System (2110.15716v1)

Published 5 Oct 2021 in cs.CL

Abstract: A parallel corpus is a critical resource in machine learning-based translation. The task of collecting, extracting, and aligning texts to build an acceptable corpus for translation is very tedious, especially for low-resource languages. In this paper, we present the efforts made to build a parallel corpus for Cebuano and Filipino from two different domains: biblical texts and the web. For the biblical resource, subword-unit translation for verbs and a copyable approach for nouns were applied to correct inconsistencies in the translation; this correction mechanism was applied as a preprocessing technique. For Wikipedia, the main web resource, commonly occurring topic segments were extracted from both the source and the target languages. These observed topic segments fall into four distinct categories, and their identification may be used for the automatic extraction of sentences. A recurrent neural network was used to implement the translation with the OpenNMT sequence-modeling tool in TensorFlow. The two corpora were then evaluated by using them as two separate inputs to the neural network. Results show a difference in BLEU scores between the two corpora.
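The abstract reports training on each corpus separately and comparing the resulting BLEU scores. As a rough illustration of that comparison step only, the sketch below computes corpus-level BLEU for two hypothetical system outputs with sacrebleu; the file names and the choice of sacrebleu are assumptions, not details taken from the paper.

```python
# Minimal sketch of the BLEU comparison step (assumptions noted in comments).
import sacrebleu


def corpus_bleu_score(hypothesis_path: str, reference_path: str) -> float:
    """Compute corpus-level BLEU for one system output against one reference file."""
    with open(hypothesis_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(reference_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]
    # sacrebleu takes a list of hypothesis strings and a list of reference
    # streams (one list of strings per reference set).
    return sacrebleu.corpus_bleu(hypotheses, [references]).score


if __name__ == "__main__":
    # Hypothetical output/reference files for the two corpus variants.
    bible_bleu = corpus_bleu_score("bible_model.hyp.fil", "test.ref.fil")
    web_bleu = corpus_bleu_score("web_model.hyp.fil", "test.ref.fil")
    print(f"Biblical-corpus model BLEU: {bible_bleu:.2f}")
    print(f"Web-corpus model BLEU:      {web_bleu:.2f}")
```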

Authors (2)
  1. Kristine Mae Adlaon (1 paper)
  2. Nelson Marcos (2 papers)
Citations (6)