
Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing (2105.08807v1)

Published 18 May 2021 in cs.CL

Abstract: We describe models focused on the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed Hindi and English). Given the recent success of pretrained LLMs, we also test the utility of two recent Transformer-based encoder-decoder models (i.e., mT5 and mBART) on the task, finding both to work well. Given the paucity of training data for code-mixing, we also propose a dependency-free method for generating code-mixed texts from bilingual distributed representations that we exploit to improve LLM performance. In particular, armed with this additional data, we adopt a curriculum learning approach where we first finetune the LLMs on synthetic data and then on gold code-mixed data. We find that, although simple, our synthetic code-mixing method is competitive with (and in some cases even superior to) several standard methods (backtranslation, a method based on equivalence constraint theory) under a diverse set of conditions. Our work shows that the mT5 model, finetuned following the curriculum learning procedure, achieves the best translation performance (12.67 BLEU). Our models place first in the overall ranking of the English-Hinglish official shared task.
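The abstract only outlines the dependency-free synthetic code-mixing method, so the toy sketch below is an illustrative guess at the general shape of such an approach: English words are swapped for Hindi (romanized) counterparts retrieved from a shared bilingual embedding space. The lookup table, swap probability, and function names here are invented for illustration, not the authors' procedure.

```python
import random

def code_mix(sentence, en_to_hi_neighbors, swap_prob=0.4):
    """Replace English tokens with their nearest Hindi (romanized) neighbor
    from a bilingual embedding space, each with probability `swap_prob`.
    `en_to_hi_neighbors` is a precomputed lookup table (hypothetical here)."""
    out = []
    for tok in sentence.split():
        if tok.lower() in en_to_hi_neighbors and random.random() < swap_prob:
            out.append(en_to_hi_neighbors[tok.lower()])
        else:
            out.append(tok)
    return " ".join(out)

# In practice the table would be built offline by querying nearest neighbors
# across aligned English/Hindi embeddings; this tiny table is made up.
toy_table = {"book": "kitaab", "water": "paani", "good": "accha"}
print(code_mix("The book is good", toy_table))  # e.g. "The kitaab is accha"
```

The curriculum learning recipe (finetune on synthetic pairs first, then on gold code-mixed data) maps naturally onto two sequential training stages over a shared model. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint size, hyperparameters, file names, and JSON field names are assumptions, not the authors' reported configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    MT5ForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "google/mt5-base"  # checkpoint size is an assumption
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def preprocess(batch):
    # Source side: monolingual English; target side: code-mixed Hinglish.
    # The field names "english"/"hinglish" are placeholders.
    inputs = tokenizer(batch["english"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["hinglish"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

def finetune_stage(train_file, output_dir, epochs):
    """One curriculum stage: continue training the shared model on one dataset."""
    ds = load_dataset("json", data_files=train_file)["train"]
    ds = ds.map(preprocess, batched=True, remove_columns=ds.column_names)
    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=8,
        learning_rate=5e-5,
        save_strategy="no",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()

# Curriculum: synthetic code-mixed pairs first, gold code-mixed data second.
# "synthetic.json" and "gold.json" are placeholder file names.
finetune_stage("synthetic.json", "out/stage1_synthetic", epochs=3)
finetune_stage("gold.json", "out/stage2_gold", epochs=3)
```

The usual rationale for this ordering, consistent with the abstract, is that the noisier synthetic pairs provide broad exposure to code-mixing patterns, while ending on the smaller gold set leaves the model adapted to the cleanest distribution.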

Authors (4)
  1. Ganesh Jawahar (11 papers)
  2. El Moatez Billah Nagoudi (31 papers)
  3. Muhammad Abdul-Mageed (102 papers)
  4. Laks V. S. Lakshmanan (58 papers)
Citations (28)
