
LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons (2402.14086v3)

Published 21 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Data scarcity in low-resource languages can be addressed with word-to-word translations from labeled task data in high-resource languages using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization. We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then it translates them into low-resource languages with bilingual lexicons via word translation. Across 17 extremely low-resource languages, LexC-Gen generated data is competitive with expert-translated gold data, and yields on average 5.6 and 8.9 points improvement over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks respectively. Through ablation study, we show that conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen serves as a potential solution to close the performance gap between open-source multilingual models, such as BLOOMZ and Aya-101, and state-of-the-art commercial models like GPT-4o on low-resource-language tasks.

Authors (3)
  1. Zheng-Xin Yong (23 papers)
  2. Cristina Menghini (13 papers)
  3. Stephen H. Bach (33 papers)
Citations (2)

Summary

  • The paper introduces a two-stage LexC-Gen framework that conditions high-resource-language (HRL) data generation on bilingual lexicons before translating the result into low-resource languages.
  • It employs a cost-effective, GPU-efficient method that yields significant improvements in sentiment analysis and topic classification scores.
  • The approach achieves competitive performance with expert-translated data, offering a scalable solution for bridging the data gap in low-resource NLP.

Generating Data for Extremely Low-Resource Languages with LLMs and Bilingual Lexicons

In the domain of NLP, the scarcity of labeled data is a significant obstacle to progress on extremely low-resource languages (LRLs). This paper introduces lexicon-conditioned data generation (LexC-Gen), a novel approach that leverages LLMs and bilingual lexicons to generate classification task data at scale for such languages.

Methodology and Contributions

The approach of translating labeled data from high-resource languages (HRLs) using bilingual lexicons is not new, but the authors recognize a key issue: existing task data and bilingual lexicons often exhibit low lexical overlap. This mismatch results in suboptimal translation coverage and underutilization of lexicons. To address these challenges, the authors propose LexC-Gen, a two-stage methodology designed to maximize the lexical overlap between task data and bilingual lexicons:

  1. Lexicon-Compatible High-Resource Language Data Generation: LexC-Gen initially uses LLMs to generate high-resource-language task data conditioned on words from bilingual lexicons. This step ensures that the generated data have a high lexical overlap with the lexicon, thereby improving the quality of subsequent translations.
  2. Word-to-Word Translation: Following the generation of lexicon-compatible HRL data, these data are translated into LRLs through word-to-word substitution using the bilingual lexicon (a minimal sketch of both stages follows this list).
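
The following Python sketch illustrates the two stages, assuming the bilingual lexicon is a plain HRL-to-LRL dictionary and `llm` is any text-generation callable. The prompt wording, whitespace tokenization, and function names are illustrative assumptions, not the paper's exact implementation:

```python
import random

def generate_lexicon_compatible_data(llm, lexicon, labels,
                                     n_examples, words_per_prompt=5):
    """Stage 1: condition the LLM on bilingual-lexicon words and a class
    label so that the generated HRL sentences overlap with the lexicon."""
    hrl_words = list(lexicon)  # lexicon maps HRL word -> LRL word
    examples = []
    for _ in range(n_examples):
        label = random.choice(labels)
        words = random.sample(hrl_words, words_per_prompt)
        prompt = (f"Write one sentence with {label} sentiment that uses "
                  f"these words: {', '.join(words)}.")
        examples.append((llm(prompt), label))
    return examples

def word_translate(sentence, lexicon):
    """Stage 2: word-to-word substitution; tokens absent from the lexicon
    are kept unchanged, which is why Stage 1's overlap matters."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())
```

Every lexicon word that Stage 1 plants in a sentence becomes a token that Stage 2 can actually substitute; this coupling is the core idea behind conditioning generation on the lexicon.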

The efficacy of LexC-Gen was evaluated across 17 extremely low-resource languages on sentiment analysis and topic classification tasks. Classifiers trained on LexC-Gen-generated data improved over existing lexicon-based word translation methods by an average of 5.6 points on sentiment analysis and 8.9 points on topic classification. Notably, these classifiers were competitive with those trained on expert-translated gold data, despite the synthetic nature of the training data.
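
As a shape-of-the-data illustration of how such classifiers consume the translated (sentence, label) pairs, here is a toy scikit-learn pipeline; the paper fine-tunes pretrained multilingual models rather than a bag-of-words model, so this stand-in is an assumption for clarity only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_lrl_classifier(lrl_examples):
    """lrl_examples: (translated_sentence, label) pairs produced by
    the two-stage pipeline sketched above."""
    texts, labels = zip(*lrl_examples)
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf
```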

Key Findings and Implications

  1. Improved Lexical Overlap and Translation Quality: The lexicon-conditioned generation method ensures high lexical overlap, leading to better translation coverage and lexicon utilization (both quantities are sketched in code after this list). This enhancement directly contributes to the improved performance of LRL classifiers.
  2. Cost-Effectiveness and Practicality: LexC-Gen is computationally efficient, requiring only a single GPU to generate data at scale, making it accessible for researchers with limited computational resources. The cost of generating data using open-access LLMs with permissive licenses (e.g., BLOOMZ) is only a fifth of that required by GPT-4-based methods.
  3. Scalability and Flexibility: The methodology is robust and scalable, capable of generating large volumes of training data swiftly. This scalability is crucial for significantly underrepresented languages where collecting labeled data is otherwise prohibitively difficult.
  4. Cross-Lingual Applications: The ability to generate high-quality synthetic data for LRLs opens new avenues for advancing NLP research and applications in multilingual settings. By improving data availability, LRLs can benefit from advanced NLP techniques historically limited to HRLs.
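
Translation coverage and lexicon utilization, as used informally above, can be estimated by simple token counting. A toy illustration follows; the whitespace tokenization and function names are assumptions, not the paper's measurement code:

```python
def translation_coverage(sentences, lexicon):
    """Fraction of task-data tokens the bilingual lexicon can translate."""
    tokens = [tok.lower() for sent in sentences for tok in sent.split()]
    return sum(tok in lexicon for tok in tokens) / max(len(tokens), 1)

def lexicon_utilization(sentences, lexicon):
    """Fraction of lexicon entries that actually occur in the task data."""
    vocab = {tok.lower() for sent in sentences for tok in sent.split()}
    return sum(word.lower() in vocab for word in lexicon) / max(len(lexicon), 1)
```

By construction, LexC-Gen pushes both numbers up: Stage 1 only emits sentences built around lexicon entries, so the downstream word-to-word translation leaves far fewer tokens untranslated.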

Future Developments

Given the promising results of LexC-Gen, future research could focus on several exciting directions:

  1. Expanding Task Domains: While the current paper focuses on sentiment analysis and topic classification, evaluating the methodology on other NLP tasks, such as named entity recognition or machine translation, could further validate and expand the utility of LexC-Gen.
  2. Enhancing Translation Accuracy: Integrating linguistic information or contextual data into bilingual lexicons could mitigate issues related to word sense disambiguation, thus refining the translation process and improving data quality.
  3. Exploration of Further LLMs: Additional studies could investigate the performance of other instruction-tuned LLMs or alternatives to BLOOMZ, optimizing for different languages and tasks.
  4. Incorporating Syntactic Structures: Addressing syntactic mismatches between HRLs and LRLs through syntactic transformation techniques could enhance the applicability of LexC-Gen across a wider variety of languages and technical contexts.

In conclusion, the LexC-Gen framework introduces a practical and scalable solution to the data scarcity problem in NLP for low-resource languages, leveraging the strength of LLMs and the breadth of bilingual lexicons. The method not only offers a significant performance boost over traditional lexicon-based methods but also underscores the potential of synthetic data in bridging linguistic disparities. The implications of this research extend beyond immediate performance improvements, highlighting a pathway towards more inclusive and representative linguistic technologies.