Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages (2404.06138v2)
Abstract: Large language models (LLMs) show remarkable human-like capability across many domains and languages. However, a notable quality gap arises in low-resource languages, e.g., the indigenous languages of Indonesia, rendering LLMs ineffective and inefficient in these linguistic contexts. To bridge this quality gap, we introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures across a range of model sizes. We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its ability to generalize to unseen tasks and indigenous languages of Indonesia. Cendol models also achieve improved human favorability, despite their limitations in capturing indigenous knowledge and cultural values of Indonesia. In addition, we discuss the shortcomings of parameter-efficient tuning methods, such as LoRA, for language adaptation, and instead propose vocabulary adaptation to improve efficiency. Lastly, we evaluate the safety of Cendol and show that safety acquired during pre-training in one language, such as English, transfers to low-resource languages, such as Indonesian, even without RLHF and safety fine-tuning.
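As a rough illustration of the vocabulary-adaptation step mentioned in the abstract, the sketch below extends a base tokenizer with target-language subwords and resizes the model's embedding matrix to match. This is only a minimal sketch, not the authors' released pipeline: the checkpoint name and the Indonesian subwords are placeholders, and the newly added embedding rows would still need to be learned through continued pre-training or instruction tuning on Indonesian data.

```python
# Minimal vocabulary-adaptation sketch using the HuggingFace transformers API.
# Assumptions: the base checkpoint is a placeholder (not necessarily Cendol's base),
# and the new subwords are hypothetical examples; in practice they would come from
# training a tokenizer on an Indonesian corpus and diffing it against the base vocab.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical Indonesian-specific subwords to add to the vocabulary.
new_tokens = ["▁yang", "▁dengan", "▁tidak", "▁untuk"]
num_added = tokenizer.add_tokens(
    [t for t in new_tokens if t not in tokenizer.get_vocab()]
)

# Grow the embedding matrix so rows exist for the new tokens; these rows are
# randomly initialized and must be trained before they are useful.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```

The payoff is tokenization efficiency: with target-language subwords in the vocabulary, a typical Indonesian sentence splits into fewer tokens, lowering both training and inference cost relative to an English-centric vocabulary.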
Authors: Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, Dea Annisayanti Putri, Bryan Wilie, Genta Indra Winata, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung