A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining (2405.09017v1)

Published 15 May 2024 in cs.CL

Abstract: Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used 1.2M high-quality Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of a model trained on these 4.6M sentence pairs with that of a model trained on the Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus built by global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable, confirming that it is feasible to use crowdsourcing for web mining of parallel data.

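The parallel corpus filter described in the abstract combines statistical language model scores with word translation probabilities. As an illustration only, the sketch below scores a candidate sentence pair using an IBM Model 1 style word-translation term plus a length-ratio penalty; the function name, dictionary format, and toy entries are assumptions for demonstration, not the authors' actual implementation (which also incorporates language model scores).

```python
import math

def pair_score(src_tokens, tgt_tokens, trans_prob, epsilon=1e-6):
    """Illustrative filter score for a sentence pair.

    Combines the mean log word-translation probability of each target
    token given the source tokens (IBM Model 1 style) with a
    length-ratio penalty. Higher scores suggest a better parallel pair.
    """
    if not src_tokens or not tgt_tokens:
        return float("-inf")

    log_p = 0.0
    for t in tgt_tokens:
        # Probability that some source token translates into t,
        # averaged over all source tokens (epsilon avoids log(0)).
        p = sum(trans_prob.get((s, t), 0.0) for s in src_tokens) / len(src_tokens)
        log_p += math.log(p + epsilon)
    log_p /= len(tgt_tokens)

    # Penalize pairs whose lengths differ greatly.
    ratio = min(len(src_tokens), len(tgt_tokens)) / max(len(src_tokens), len(tgt_tokens))
    return log_p + math.log(ratio)

# Toy usage with a hypothetical Japanese-Chinese word translation table.
trans_prob = {("猫", "猫"): 0.9, ("が", "在"): 0.1, ("好き", "喜欢"): 0.8}
score = pair_score(["猫", "が", "好き"], ["喜欢", "猫"], trans_prob)
print(score)
```

In practice, pairs scoring below a tuned threshold would be discarded before the corpus is used for machine translation training.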
Authors (4)
  1. Masaaki Nagata (21 papers)
  2. Makoto Morishita (20 papers)
  3. Katsuki Chousa (7 papers)
  4. Norihito Yasuda (6 papers)
Citations (2)