
LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons (2402.14086v3)

Published 21 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Data scarcity in low-resource languages can be addressed with word-to-word translations from labeled task data in high-resource languages using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization. We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then it translates them into low-resource languages with bilingual lexicons via word translation. Across 17 extremely low-resource languages, LexC-Gen generated data is competitive with expert-translated gold data, and yields on average 5.6 and 8.9 points improvement over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks respectively. Through ablation study, we show that conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen serves as a potential solution to close the performance gap between open-source multilingual models, such as BLOOMZ and Aya-101, and state-of-the-art commercial models like GPT-4o on low-resource-language tasks.

Authors (3)
  1. Zheng-Xin Yong (23 papers)
  2. Cristina Menghini (13 papers)
  3. Stephen H. Bach (33 papers)
Citations (2)

Summary

  • The paper introduces a two-stage LexC-Gen framework that conditions high-resource-language (HRL) data generation on bilingual lexicons before translating the result into low-resource languages.
  • It employs a cost-effective, GPU-efficient method that yields significant improvements in sentiment analysis and topic classification scores.
  • The approach achieves competitive performance with expert-translated data, offering a scalable solution for bridging the data gap in low-resource NLP.

Generating Data for Extremely Low-Resource Languages with LLMs and Bilingual Lexicons

In the domain of NLP, the scarcity of labeled data is a significant obstacle to progress on extremely low-resource languages (LRLs). This paper introduces lexicon-conditioned data generation (LexC-Gen), a novel approach that leverages LLMs and bilingual lexicons to generate classification task data at scale for such languages.

Methodology and Contributions

The approach of translating labeled data from high-resource languages (HRLs) using bilingual lexicons is not new, but the authors recognize a key issue: existing task data and bilingual lexicons often exhibit low lexical overlap. This mismatch results in suboptimal translation coverage and underutilization of lexicons. To address these challenges, the authors propose LexC-Gen, a two-stage methodology designed to maximize the lexical overlap between task data and bilingual lexicons:

  1. Lexicon-Compatible High-Resource Language Data Generation: LexC-Gen initially uses LLMs to generate high-resource-language task data conditioned on words from bilingual lexicons. This step ensures that the generated data have a high lexical overlap with the lexicon, thereby improving the quality of subsequent translations.
  2. Word-to-Word Translation: Following the generation of lexicon-compatible HRL data, these data are translated into LRLs through word-to-word substitution using the bilingual lexicon (a minimal sketch of both stages follows this list).
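
The following Python sketch illustrates the two stages, assuming the bilingual lexicon is a plain HRL-to-LRL dictionary and `llm` is any text-generation callable. The prompt wording, whitespace tokenization, and function names are illustrative assumptions, not the paper's exact implementation:

```python
import random

def generate_lexicon_compatible_data(llm, lexicon, labels,
                                     n_examples, words_per_prompt=5):
    """Stage 1: condition the LLM on bilingual-lexicon words and a class
    label so that the generated HRL sentences overlap with the lexicon."""
    hrl_words = list(lexicon)  # lexicon maps HRL word -> LRL word
    examples = []
    for _ in range(n_examples):
        label = random.choice(labels)
        words = random.sample(hrl_words, words_per_prompt)
        prompt = (f"Write one sentence with {label} sentiment that uses "
                  f"these words: {', '.join(words)}.")
        examples.append((llm(prompt), label))
    return examples

def word_translate(sentence, lexicon):
    """Stage 2: word-to-word substitution; tokens absent from the lexicon
    are kept unchanged, which is why Stage 1's overlap matters."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())
```

Every lexicon word that Stage 1 plants in a sentence becomes a token that Stage 2 can actually substitute; this coupling is the core idea behind conditioning generation on the lexicon.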

The efficacy of LexC-Gen was evaluated across 17 extremely low-resource languages on sentiment analysis and topic classification tasks. Classifiers trained on LexC-Gen-generated data improved over existing lexicon-based word translation methods by an average of 5.6 points on sentiment analysis and 8.9 points on topic classification. Notably, these classifiers were competitive with those trained on expert-translated gold data, despite the synthetic nature of the training data.
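
As a shape-of-the-data illustration of how such classifiers consume the translated (sentence, label) pairs, here is a toy scikit-learn pipeline; the paper fine-tunes pretrained multilingual models rather than a bag-of-words model, so this stand-in is an assumption for clarity only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_lrl_classifier(lrl_examples):
    """lrl_examples: (translated_sentence, label) pairs produced by
    the two-stage pipeline sketched above."""
    texts, labels = zip(*lrl_examples)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf
```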

Key Findings and Implications

  1. Improved Lexical Overlap and Translation Quality: The lexicon-conditioned generation method ensures high lexical overlap, leading to better translation coverage and lexicon utilization (both quantities are sketched in code after this list). This enhancement directly contributes to the improved performance of LRL classifiers.
  2. Cost-Effectiveness and Practicality: LexC-Gen is computationally efficient, requiring only a single GPU to generate data at scale, making it accessible for researchers with limited computational resources. The cost of generating data using open-access LLMs with permissive licenses (e.g., BLOOMZ) is only a fifth of that required by GPT-4-based methods.
  3. Scalability and Flexibility: The methodology is robust and scalable, capable of generating large volumes of training data swiftly. This scalability is crucial for significantly underrepresented languages where collecting labeled data is otherwise prohibitively difficult.
  4. Cross-Lingual Applications: The ability to generate high-quality synthetic data for LRLs opens new avenues for advancing NLP research and applications in multilingual settings. By improving data availability, LRLs can benefit from advanced NLP techniques historically limited to HRLs.
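
Translation coverage and lexicon utilization, as used informally above, can be estimated by simple token counting. A toy illustration follows; the whitespace tokenization and function names are assumptions, not the paper's measurement code:

```python
def translation_coverage(sentences, lexicon):
    """Fraction of task-data tokens the bilingual lexicon can translate."""
    tokens = [tok.lower() for sent in sentences for tok in sent.split()]
    return sum(tok in lexicon for tok in tokens) / max(len(tokens), 1)

def lexicon_utilization(sentences, lexicon):
    """Fraction of lexicon entries that actually occur in the task data."""
    vocab = {tok.lower() for sent in sentences for tok in sent.split()}
    return sum(word.lower() in vocab for word in lexicon) / max(len(lexicon), 1)
```

By construction, LexC-Gen pushes both numbers up: Stage 1 only emits sentences built around lexicon entries, so the downstream word-to-word translation leaves far fewer tokens untranslated.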

Future Developments

Given the promising results of LexC-Gen, future research could focus on several exciting directions:

  1. Expanding Task Domains: While the current paper focuses on sentiment analysis and topic classification, evaluating the methodology on other NLP tasks, such as named entity recognition or machine translation, could further validate and expand the utility of LexC-Gen.
  2. Enhancing Translation Accuracy: Integrating linguistic information or contextual data into bilingual lexicons could mitigate issues related to word sense disambiguation, thus refining the translation process and improving data quality.
  3. Exploration of Further LLMs: Additional studies could investigate the performance of other instruction-tuned LLMs or alternatives to BLOOMZ, optimizing for different languages and tasks.
  4. Incorporating Syntactic Structures: Addressing syntactic mismatches between HRLs and LRLs through syntactic transformation techniques could enhance the applicability of LexC-Gen across a wider variety of languages and technical contexts.

In conclusion, the LexC-Gen framework introduces a practical and scalable solution to the data scarcity problem in NLP for low-resource languages, leveraging the strength of LLMs and the breadth of bilingual lexicons. The method not only offers a significant performance boost over traditional lexicon-based methods but also underscores the potential of synthetic data in bridging linguistic disparities. The implications of this research extend beyond immediate performance improvements, highlighting a pathway towards more inclusive and representative linguistic technologies.