Generative Model for Less-Resourced Language with 1 billion parameters (2410.06898v1)

Published 9 Oct 2024 in cs.CL

Abstract: LLMs are a basic infrastructure for modern natural language processing. Many commercial and open-source LLMs exist for English, e.g., ChatGPT, Llama, Falcon, and Mistral. As these models are trained on mostly English texts, their fluency and knowledge of low-resource languages and societies are superficial. We present the development of large generative LLMs for a less-resourced language. GaMS 1B - Generative Model for Slovene with 1 billion parameters was created by continuing pretraining of the existing English OPT model. We developed a new tokenizer adapted to Slovene, Croatian, and English languages and used embedding initialization methods FOCUS and WECHSEL to transfer the embeddings from the English OPT model. We evaluate our models on several classification datasets from the Slovene suite of benchmarks and generative sentence simplification task SENTA. We only used a few-shot in-context learning of our models, which are not yet instruction-tuned. For classification tasks, in this mode, the generative models lag behind the existing Slovene BERT-type models fine-tuned for specific tasks. On a sentence simplification task, the GaMS models achieve comparable or better performance than the GPT-3.5-Turbo model.

Summary

  • The paper presents GaMS 1B, which adapts an English-based OPT model to address the challenges of limited NLP resources in Slovene.
  • It develops a new trilingual tokenizer and applies the embedding transfer techniques FOCUS and WECHSEL to reuse the English model's knowledge despite scarce Slovene training data.
  • Results show competitive performance in sentence simplification, establishing a framework for extending large generative models to less-resourced languages.

Generative Model for Less-Resourced Languages: GaMS 1B for Slovene

The paper introduces an approach to developing a large generative language model for Slovene, a language with limited NLP resources. Addressing the gap left by existing LLMs, which are trained predominantly on English, the authors present GaMS 1B, a generative model tailored to Slovene that builds on an existing English model, the Open Pre-trained Transformer (OPT). The paper details the adaptation strategies used to transfer the model to Slovene and highlights a generalizable recipe for other less-resourced languages.

Methodology and Model Development

The primary challenge addressed in the paper is the scarcity of large Slovene corpora, which makes training from scratch impractical. Instead, the authors adapt the English-language OPT model, transferring its learned knowledge across languages. This involves developing a new tokenizer optimized for Slovene, Croatian, and English, and applying the embedding transfer techniques FOCUS and WECHSEL, which initialize the embeddings for the new vocabulary so that model performance is preserved after the vocabulary change.

  1. Tokenizer Development: The researchers train several tokenizer variants, focusing on vocabulary size and its impact on tokenization efficiency. Ultimately, an 80,000-token vocabulary is chosen to balance computational cost and performance, processing Slovene text more efficiently than the original OPT tokenizer (see the tokenizer sketch after this list).
  2. Embedding Initialization: The paper explores and extends the WECHSEL and FOCUS methods for embedding initialization. By testing these methods with and without CroSloEngual BERT embeddings as a bridge between the source and target vocabularies, the authors evaluate how well each mitigates the adverse effects of the vocabulary change, a critical step in multilingual model adaptation (a simplified sketch of the idea follows the tokenizer example below).
  3. Training Process: Using the HPC Vega supercomputer, the model is further pretrained on a diverse corpus of Slovene, Croatian, and other regional languages, while taking measures to prevent forgetting of English. Given the scale and diversity of the training data, careful data selection and repetition of data across epochs are employed to optimize training outcomes.
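
The paper does not include implementation code; as an illustration, the following minimal sketch shows how a byte-level BPE tokenizer with an 80,000-token vocabulary could be trained with the Hugging Face tokenizers library. The corpus file paths, special tokens, and the byte-level BPE choice are assumptions for the sketch, not details taken from the paper.

```python
# Minimal sketch: training a BPE tokenizer with an 80k vocabulary.
# Corpus paths and special tokens below are illustrative assumptions.
from tokenizers import ByteLevelBPETokenizer

corpus_files = [
    "data/slovene.txt",    # hypothetical corpus shards
    "data/croatian.txt",
    "data/english.txt",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=80_000,     # vocabulary size chosen in the paper
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>"],
)

# Save in a format that transformers' PreTrainedTokenizerFast can load directly.
tokenizer.save("gams_tokenizer.json")

# Sanity check: a Slovene sentence should need fewer tokens than with the
# original English-centric OPT tokenizer.
print(tokenizer.encode("Veliki jezikovni modeli so osnovna infrastruktura.").tokens)
```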

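Both WECHSEL and FOCUS initialize the new embedding matrix from the source model's embeddings rather than at random. The sketch below illustrates only the simplest variant of this idea: tokens shared between the old and new byte-level vocabularies keep their OPT embeddings, and the remaining tokens are initialized near the mean source embedding. The full WECHSEL and FOCUS procedures, which additionally use cross-lingual word vectors and weighted combinations of overlapping tokens, are omitted here, and the source checkpoint name is an assumption.

```python
# Simplified sketch of embedding transfer after swapping in the new vocabulary.
# This is a baseline approximation, not the full WECHSEL/FOCUS procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, PreTrainedTokenizerFast

src_name = "facebook/opt-1.3b"                 # assumed source checkpoint
model = AutoModelForCausalLM.from_pretrained(src_name)
src_tok = AutoTokenizer.from_pretrained(src_name)
tgt_tok = PreTrainedTokenizerFast(tokenizer_file="gams_tokenizer.json")

src_emb = model.get_input_embeddings().weight.data      # shape (|V_src|, d)
d = src_emb.size(1)
mean, std = src_emb.mean(dim=0), src_emb.std(dim=0)

new_emb = torch.empty(len(tgt_tok), d)
src_vocab = src_tok.get_vocab()

copied = 0
for token, new_id in tgt_tok.get_vocab().items():
    if token in src_vocab:
        # Token exists in both byte-level vocabularies: reuse its OPT embedding.
        new_emb[new_id] = src_emb[src_vocab[token]]
        copied += 1
    else:
        # Unseen (mostly Slovene/Croatian) token: sample near the source mean.
        new_emb[new_id] = mean + 0.1 * std * torch.randn(d)

model.resize_token_embeddings(len(tgt_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)  # output head is tied in OPT
print(f"Reused {copied} of {len(tgt_tok)} token embeddings from the source model.")
```

The initialized model then serves as the starting point for the continued pretraining described in step 3.
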
Evaluation and Results

Despite the innovations in model design, the authors candidly discuss the challenges in benchmarking the performance of LLMs adapted to less-resourced languages. Key findings include:

  • Classification Tasks: In the few-shot setting, GaMS 1B struggles to follow the Slovene SuperGLUE tasks and lags behind Slovene BERT-type models fine-tuned for the specific tasks. The authors attribute this partly to the model's relatively small size (1 billion parameters) and to the lack of instruction tuning, which typically helps models follow task formats (a minimal few-shot prompting sketch follows this list).
  • Sentence Simplification: On the SENTA sentence simplification task, GaMS 1B achieves results comparable to or better than GPT-3.5-Turbo, indicating competence on generative tasks even before instruction tuning.
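
To make the evaluation setting concrete, the following is a minimal sketch of few-shot in-context prompting for sentence simplification with a causal language model via the transformers library. The prompt template, the number of in-context examples, and the model path are illustrative assumptions; the paper's exact prompts and released checkpoint names may differ.

```python
# Illustrative sketch of few-shot in-context prompting for Slovene sentence
# simplification. Prompt wording, shots, and the model path are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/gams-1b"   # placeholder for the adapted GaMS 1B checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# A few (complex, simplified) Slovene sentence pairs used as in-context examples.
shots = [
    ("Zapleten izvirni stavek 1 ...", "Poenostavljen stavek 1 ..."),
    ("Zapleten izvirni stavek 2 ...", "Poenostavljen stavek 2 ..."),
]
query = "Zapleten stavek, ki ga želimo poenostaviti ..."

prompt = ""
for complex_s, simple_s in shots:
    prompt += f"Izvirnik: {complex_s}\nPoenostavitev: {simple_s}\n\n"
prompt += f"Izvirnik: {query}\nPoenostavitev:"

inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Keep only the newly generated continuation (the model's simplification).
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

A classification task can be handled in the same way by placing a few labeled examples in the prompt and either generating the label or comparing the model's likelihood of each candidate label.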

Implications and Future Work

The paper contributes a framework for extending the benefits of LLMs to less-resourced languages through knowledge transfer and model adaptation. The findings suggest that instruction tuning and scaling to larger models could substantially improve GaMS 1B, opening avenues for stronger multilingual NLP applications.

Future research, as indicated by the authors, will focus on developing an instruction-following dataset to better equip the model for diverse tasks. Training larger models also remains a priority, since larger scales should make differences between the embedding transfer methods more pronounced and allow more conclusive comparisons.

Overall, GaMS 1B is a pioneering effort in resource-efficient NLP model creation, highlighting viable pathways for extending AI capabilities to a broader range of languages and thereby broadening access to these advances.
