Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages (2404.06138v2)
Abstract: Large language models (LLMs) show remarkable human-like capability across many domains and languages. However, a notable quality gap arises in low-resource languages, e.g., the indigenous languages of Indonesia, rendering LLMs ineffective and inefficient in these linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its ability to generalize to unseen tasks and indigenous languages of Indonesia. Cendol models also achieve improved human favorability, despite their limitations in capturing indigenous knowledge and cultural values of Indonesia. In addition, we discuss the shortcomings of parameter-efficient tuning methods, such as LoRA, for language adaptation, and instead propose vocabulary adaptation to improve efficiency. Lastly, we evaluate the safety of Cendol and show that safety acquired during pre-training in one language, such as English, transfers to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.
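As a rough illustration of the vocabulary-adaptation step mentioned in the abstract, the sketch below extends a base tokenizer with target-language subwords and resizes the model's embedding matrix to match. This is only a minimal sketch, not the authors' released pipeline: the checkpoint name and the Indonesian subwords are placeholders, and the newly added embedding rows would still need to be learned through continued pre-training or instruction tuning on Indonesian data.

```python
# Minimal vocabulary-adaptation sketch using the HuggingFace transformers API.
# Assumptions: the base checkpoint is a placeholder (not necessarily Cendol's base),
# and the new subwords are hypothetical examples; in practice they would come from
# training a tokenizer on an Indonesian corpus and diffing it against the base vocab.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical Indonesian-specific subwords to add to the vocabulary.
new_tokens = ["▁yang", "▁dengan", "▁tidak", "▁untuk"]
num_added = tokenizer.add_tokens(
    [t for t in new_tokens if t not in tokenizer.get_vocab()]
)

# Grow the embedding matrix so rows exist for the new tokens; these rows are
# randomly initialized and must be trained before they are useful.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```

The payoff is tokenization efficiency: with target-language subwords in the vocabulary, a typical Indonesian sentence splits into fewer tokens, lowering both training and inference cost relative to an English-centric vocabulary.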
Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung