LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons (2402.14086v3)
Abstract: Data scarcity in low-resource languages can be addressed by translating labeled task data from high-resource languages word by word using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization. We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then translates this data into low-resource languages with bilingual lexicons via word translation. Across 17 extremely low-resource languages, LexC-Gen-generated data is competitive with expert-translated gold data, and yields average improvements of 5.6 and 8.9 points over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks, respectively. Through an ablation study, we show that conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen serves as a potential solution to close the performance gap between open-source multilingual models, such as BLOOMZ and Aya-101, and state-of-the-art commercial models like GPT-4o on low-resource-language tasks.
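The word-translation step and the two quantities the abstract names (translation coverage and lexicon utilization) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: the function names `word_translate` and `lexicon_utilization`, whitespace tokenization, lowercasing, and the toy lexicon, whose target-side entries are placeholders rather than real words in any language. Coverage is read here as the share of a sentence's tokens found in the lexicon, and utilization as the share of lexicon entries that appear in the task data; the paper may define them differently.

```python
def word_translate(sentence, lexicon):
    """Translate word by word: tokens found in the lexicon are replaced,
    all other tokens are kept unchanged. Returns (translation, coverage)."""
    tokens = sentence.lower().split()
    translated = [lexicon.get(tok, tok) for tok in tokens]
    covered = sum(tok in lexicon for tok in tokens)
    coverage = covered / len(tokens) if tokens else 0.0
    return " ".join(translated), coverage


def lexicon_utilization(sentences, lexicon):
    """Fraction of lexicon source words that occur in the task data."""
    used = {
        tok
        for sentence in sentences
        for tok in sentence.lower().split()
        if tok in lexicon
    }
    return len(used) / len(lexicon) if lexicon else 0.0


# Toy lexicon: target-side strings are placeholders, not real translations.
toy_lexicon = {
    "good": "<good>",
    "movie": "<movie>",
    "bad": "<bad>",
    "food": "<food>",
}

if __name__ == "__main__":
    task_data = ["The movie was good"]
    for sentence in task_data:
        translation, coverage = word_translate(sentence, toy_lexicon)
        print(translation, coverage)
    print(lexicon_utilization(task_data, toy_lexicon))
```

In this toy setting only half of the sentence's tokens and half of the lexicon entries are matched, which is the kind of low overlap the abstract describes. Generating task data conditioned on lexicon words, as LexC-Gen proposes, aims to raise both numbers before word translation is applied.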
Authors: Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach