Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
119 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data (2106.06875v1)

Published 12 Jun 2021 in cs.CL

Abstract: High-performing machine translation (MT) systems can help overcome language barriers while making it possible for everyone to communicate and use language technologies in the language of their choice. However, such systems require large amounts of parallel sentences for training, and translators can be difficult to find and expensive. Here, we present a data collection strategy for MT which, in contrast, is cheap and simple, as it does not require bilingual speakers. Based on the insight that humans pay specific attention to movements, we use graphics interchange formats (GIFs) as a pivot to collect parallel sentences from monolingual annotators. We use our strategy to collect data in Hindi, Tamil and English. As a baseline, we also collect data using images as a pivot. We perform an intrinsic evaluation by manually evaluating a subset of the sentence pairs and an extrinsic evaluation by finetuning mBART on the collected data. We find that sentences collected via GIFs are indeed of higher quality.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Rajat Bhatnagar (5 papers)
  2. Ananya Ganesh (5 papers)
  3. Katharina Kann (50 papers)
Citations (2)

Summary

We haven't generated a summary for this paper yet.