
Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin (2307.00382v1)

Published 1 Jul 2023 in cs.CL

Abstract: Developing effective spoken language processing systems for low-resource languages poses several challenges due to the lack of parallel data and limited resources for fine-tuning models. In this work, we aim to improve both text classification and translation for Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus, and we propose a cross-lingual adaptive training framework that combines continual and task-adaptive training to adapt a base pre-trained model to low-resource languages. Our studies show that English pre-trained language models serve as a stronger prior than multilingual models on English-Pidgin tasks, with up to 2.38 BLEU improvement, and demonstrate that augmenting orthographic data and applying task-adaptive training with back-translation significantly improve model performance.
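The back-translation augmentation described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' released code: the checkpoint names ("pcm-en-model" standing in for a Pidgin-to-English back-translation model, "t5-small" standing in for the English pre-trained base), the task prefix, and the toy data are all placeholder assumptions.

```python
# Hedged sketch: back-translate monolingual Pidgin into synthetic English,
# then fine-tune an English pre-trained seq2seq model on the mixed data.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Back-translation: "pcm-en-model" is a hypothetical Pidgin->English checkpoint.
bt_tok = AutoTokenizer.from_pretrained("pcm-en-model")
bt_model = AutoModelForSeq2SeqLM.from_pretrained("pcm-en-model").to(device)

mono_pidgin = ["Dem don go market.", "Wetin dey happen?"]  # toy monolingual data
inputs = bt_tok(mono_pidgin, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.no_grad():
    out = bt_model.generate(**inputs, max_new_tokens=64)
synthetic_en = bt_tok.batch_decode(out, skip_special_tokens=True)

# 2) Mix synthetic pairs with (toy) parallel data and fine-tune the
#    English pre-trained model on English->Pidgin translation.
parallel = [("They have gone to the market.", "Dem don go market.")]
train_pairs = parallel + list(zip(synthetic_en, mono_pidgin))

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").to(device)
optim = AdamW(model.parameters(), lr=3e-5)

model.train()
for epoch in range(3):
    for src, tgt in train_pairs:
        # The task prefix is an assumed convention, not one from the paper.
        batch = tok("translate English to Pidgin: " + src,
                    return_tensors="pt", truncation=True).to(device)
        labels = tok(tgt, return_tensors="pt", truncation=True).input_ids.to(device)
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
```

In practice the synthetic pairs would come from a much larger monolingual corpus and be filtered before mixing; the sketch only illustrates how back-translated data slots into task-adaptive fine-tuning.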

Authors (4)
  1. Pin-Jie Lin (10 papers)
  2. Muhammed Saeed (5 papers)
  3. Ernie Chang (33 papers)
  4. Merel Scholman (4 papers)
Citations (5)