Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback (2307.16039v2)

Published 29 Jul 2023 in cs.CL and cs.LG

Abstract: A key technology for the development of LLMs involves instruction tuning that helps align the models' responses with human expectations to realize impressive learning abilities. Two major approaches for instruction tuning characterize supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for research and development efforts, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca, Vicuna, to name a few. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, thus hindering their impacts and accessibility to many other languages in the world. Among a few very recent work to explore instruction tuning for LLMs in multiple languages, SFT has been used as the only approach to instruction-tune LLMs for multiple languages. This has left a significant gap for fine-tuned LLMs based on RLHF in diverse languages and raised important questions on how RLHF can boost the performance of multilingual instruction tuning. To overcome this issue, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF for multilingual instruction over SFT for different base models and datasets. Our framework and resources are released at https://github.com/nlp-uoregon/Okapi.

Okapi: Multilingual Instruction-Tuned LLMs Leveraging RLHF

The paper “Okapi: Instruction-tuned LLMs in Multiple Languages with Reinforcement Learning from Human Feedback” presents Okapi, a system that instruction-tunes LLMs with reinforcement learning from human feedback (RLHF) across multiple languages. The work addresses a substantial gap in multilingual NLP by bringing RLHF, a technique so far applied predominantly to commercial English-centric LLMs, to open-source models covering a wide range of languages.

Methodology

The methodology consists of three principal stages:

  1. Supervised Fine-Tuning (SFT): The base pre-trained LLMs, such as BLOOM and LLaMA, are first fine-tuned on 158K curated instructions, combining the Alpaca data with additionally generated instructions. This step aligns the models' responses more closely with human expectations.
  2. Reward Model Training: A reward model is trained on ranked response outputs obtained from the SFT-tuned model. Using a ranking (contrastive) objective over this feedback, the model learns to score the correctness, coherence, and naturalness of responses (a loss sketch follows this list).
  3. Reinforcement Learning from Human Feedback (RLHF): Starting from the SFT-tuned model, reinforcement learning guided by the reward model further optimizes response generation, refining the alignment of responses with nuanced human preferences beyond the explicit positive examples seen in SFT (see the KL-penalized reward sketch below).
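For the reward-modeling stage, a minimal sketch of the pairwise ranking loss commonly used in RLHF pipelines is shown below; the function names and tensor shapes are illustrative assumptions, not the Okapi implementation.

    import torch
    import torch.nn.functional as F

    def reward_ranking_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
        # The higher-ranked (chosen) response should receive a larger scalar
        # reward than the lower-ranked (rejected) one for the same prompt.
        r_chosen = reward_model(torch.cat([prompt_ids, chosen_ids], dim=-1))      # (batch,)
        r_rejected = reward_model(torch.cat([prompt_ids, rejected_ids], dim=-1))  # (batch,)
        # Negative log-sigmoid of the score margin, averaged over the batch.
        return -F.logsigmoid(r_chosen - r_rejected).mean()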

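For the RLHF stage, the policy is typically optimized with PPO against a reward shaped by a KL penalty toward the frozen SFT model, which keeps the tuned policy from drifting too far from its supervised starting point. The following is a hedged sketch of that shaped reward; the coefficient and tensor layout are assumptions, not values reported in the paper.

    import torch

    def kl_shaped_rewards(scores, logprobs_policy, logprobs_sft, kl_coef=0.05):
        # scores: (batch,) scalar reward-model scores for complete responses.
        # logprobs_policy / logprobs_sft: (batch, seq_len) log-probabilities of the
        # sampled tokens under the current policy and the frozen SFT model.
        kl_per_token = logprobs_policy - logprobs_sft   # approximate per-token KL
        rewards = -kl_coef * kl_per_token               # penalize divergence from SFT
        rewards[:, -1] = rewards[:, -1] + scores        # add the scalar score at the final token
        return rewards
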
Experimentation and Evaluation

The Okapi framework was evaluated across 26 languages spanning high-, medium-, and low-resource settings. Evaluation used multilingual versions of ARC, HellaSwag, and MMLU, with comparisons against models such as BLOOMZ, which was instruction-tuned at scale across many languages.
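
Accuracy is computed per language on these benchmarks and then aggregated; a small sketch of that macro-averaging, with an illustrative data layout, is given below.

    def macro_average_accuracy(results):
        # results maps a language code (e.g. "vi", "uk") to a list of
        # (prediction, label) pairs from a multiple-choice benchmark.
        per_language = {
            lang: sum(pred == label for pred, label in pairs) / len(pairs)
            for lang, pairs in results.items()
        }
        # Unweighted average over languages, so low-resource languages count equally.
        macro = sum(per_language.values()) / len(per_language)
        return per_language, macro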

The results show that the RLHF-based instruction-tuned models outperform both the SFT-only models and the untuned baselines. Notably, the RLHF models improve average performance across tasks and languages by up to 2.5% over their SFT counterparts.

Implications

This paper's key contribution lies in demonstrating the efficacy of RLHF for multilingual instruction tuning, highlighting the potential of reward-guided optimization, beyond purely supervised fine-tuning, to improve LLM performance across linguistically diverse contexts. The improvement over baseline models underscores the value of human-aligned feedback in building more responsive and contextually accurate LLMs.

Future Directions

The paper suggests several avenues for future research:

  • Extension to More Languages: Incorporating additional languages, specifically those that are resource-scarce, could further enhance the accessibility and applicability of multilingual LLMs globally.
  • Investigating Other Multilingual Models: Expanding the approach to other architectures such as mT5 would give a broader picture of how well the method generalizes across multilingual LLMs.
  • Enhancing Data Quality: Integrating human-evaluated or generated data to ensure higher fidelity in instruction data for RLHF, potentially minimizing the noise induced by automated processes.
  • Addressing Other Linguistic Challenges: Beyond performance, exploring aspects like LLM bias, toxicity, and hallucination in multilingual frameworks would offer a more holistic view of model behavior across different cultural and linguistic contexts.

In summary, the Okapi framework represents a significant step toward refining multilingual LLMs, providing a scalable approach for fine-tuning models with human-centered feedback, ultimately paving the way for more nuanced and effective multilingual applications in natural language processing.

Authors (7)
  1. Viet Dac Lai
  2. Chien Van Nguyen
  3. Nghia Trung Ngo
  4. Thuat Nguyen
  5. Franck Dernoncourt
  6. Ryan A. Rossi
  7. Thien Huu Nguyen
Citations (86)