FinGPT: Large Generative Models for a Small Language (2311.05640v1)

Published 3 Nov 2023 in cs.CL

Abstract: LLMs excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.

Authors (21)
  1. Risto Luukkonen (2 papers)
  2. Ville Komulainen (4 papers)
  3. Jouni Luoma (3 papers)
  4. Anni Eskelinen (2 papers)
  5. Jenna Kanerva (17 papers)
  6. Hanna-Mari Kupari (3 papers)
  7. Filip Ginter (28 papers)
  8. Veronika Laippala (8 papers)
  9. Niklas Muennighoff (56 papers)
  10. Aleksandra Piktus (20 papers)
  11. Thomas Wang (17 papers)
  12. Nouamane Tazi (8 papers)
  13. Teven Le Scao (18 papers)
  14. Thomas Wolf (117 papers)
  15. Osma Suominen (2 papers)
  16. Samuli Sairanen (1 paper)
  17. Mikko Merioksa (1 paper)
  18. Jyrki Heinonen (1 paper)
  19. Aija Vahtola (1 paper)
  20. Samuel Antao (1 paper)
Citations (30)

Summary

Exploring the Architectures and Evaluation of FinGPT: Large Generative Models for Finnish

The development of large-scale LLMs has reshaped the landscape of NLP, but these advances have generally bypassed smaller languages due to limited data availability. The paper "FinGPT: Large Generative Models for a Small Language" tackles this discrepancy by developing generative LLMs tailored for Finnish, a language with fewer than 6 million native speakers. Through a two-pronged approach, the authors introduce seven monolingual models dubbed FinGPT, ranging from 186 million to 13.3 billion parameters, and a multilingual model named BLUUMI, which extends the 176-billion-parameter BLOOM model to cover Finnish.

Novel Contributions

The paper emphasizes a key challenge in building large LLMs for Finnish: the scarcity of extensive, high-quality data. To tackle this, the authors compile a comprehensive collection of Finnish text from a diverse array of sources, including web crawls, news articles, social media, and eBooks, which, repeated over multiple epochs, supplies roughly 300 billion training tokens. A critical component of the research is FIN-bench, a Finnish counterpart of BIG-bench crafted to gauge model proficiency on Finnish-specific tasks.
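
To make the data-compilation step concrete, the sketch below shows one common way of normalising and exact-deduplicating a document collection before pretraining. It is a minimal illustration under assumed heuristics (the 20-word length threshold and SHA-1 content hashing are placeholders), not the paper's actual filtering pipeline.

```python
# Minimal corpus-cleanup sketch: Unicode normalisation, a crude length filter,
# and exact deduplication by content hash. Thresholds are illustrative only.
import hashlib
import unicodedata

def normalise(text: str) -> str:
    """Normalise Unicode and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", text).strip()

def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, keyed by a SHA-1 content hash."""
    seen, kept = set(), []
    for doc in docs:
        doc = normalise(doc)
        if len(doc.split()) < 20:  # drop very short fragments (illustrative threshold)
            continue
        key = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```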

Model Architectures and Training Regimens

For model development, the work draws inspiration from the GPT and BLOOM architectures. The FinGPT models follow a monolingual training regime while adopting select architectural details from GPT-3, such as layer characteristics and dimensional parameters. In contrast, the multilingual BLUUMI model continues the pretraining of BLOOM on a mix of its original training data and Finnish text, a significant augmentation given that BLOOM originally lacked Finnish language representation. Architectural details such as layer normalization and ALiBi position embeddings are chosen with training stability and efficiency in mind.
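
To make the ALiBi reference concrete, the following sketch computes the additive attention bias from the published ALiBi formulation. It is a generic reconstruction (assuming the number of heads is a power of two), not code taken from the paper's training stack.

```python
# ALiBi sketch: each attention head adds a linear distance penalty to its
# attention logits instead of using learned or sinusoidal position embeddings.
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    """Geometric sequence of per-head slopes starting at 2**(-8/num_heads)
    (the ALiBi paper's formula for a power-of-two head count)."""
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) bias added to raw attention scores."""
    slopes = alibi_slopes(num_heads)                         # (H,)
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs().float()   # |query - key| distances
    # Larger distance -> more negative bias; under a causal mask only past keys matter.
    return slopes[:, None, None] * -distance

bias = alibi_bias(num_heads=8, seq_len=4)
print(bias.shape)  # torch.Size([8, 4, 4])
```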

The reported architectures, workloads, and hyperparameters illustrate the computational intensity of training LLMs at this scale. The authors place significant emphasis on scale, conducting training on the LUMI supercomputer and exploiting up to 1536 GPUs to meet the immense computational demands of such large parameter spaces.
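
For a rough sense of that computational demand, the back-of-envelope estimate below applies the common ~6 * N * D FLOPs rule of thumb (N parameters, D training tokens). The 300-billion-token count follows the summary above, while the sustained per-GPU throughput is purely an assumed figure for illustration, not a number reported in the paper.

```python
# Back-of-envelope training-compute estimate using the ~6 * N * D rule of thumb.
params = 13e9    # largest monolingual FinGPT model (13B parameters)
tokens = 300e9   # training tokens, per the summary above
train_flops = 6 * params * tokens
print(f"training compute: ~{train_flops:.1e} FLOPs")  # ~2.3e+22 FLOPs

# Assumed sustained throughput per GPU; NOT a figure reported in the paper.
assumed_flops_per_gpu = 100e12
gpus = 1536
hours = train_flops / (assumed_flops_per_gpu * gpus) / 3600
print(f"wall-clock at that throughput: ~{hours:.0f} hours")  # ~42 hours
```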

Evaluation and Results

The paper presents its findings through evaluation on FIN-bench in zero-, one-, two-, and three-shot settings. Notably, the BLUUMI model demonstrates marked superiority over preexisting models in multi-task scenarios, underlining its enhanced capability in capturing the Finnish language. By contrast, the largest monolingual model (13B parameters) does not improve consistently over its smaller counterparts, possibly indicating overfitting due to the limited amount of distinct data available per epoch.
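
The shot settings can be pictured as follows: k in-context examples are prepended to the target question, and the model's preferred answer is the candidate with the highest likelihood. The task format, the Finnish "Kysymys"/"Vastaus" labels, and the model.loglikelihood call are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Sketch of k-shot multiple-choice evaluation in the style of BIG-bench / FIN-bench.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    choices: list[str]
    answer: int  # index of the correct choice

def build_prompt(shots: list[Example], target: Example) -> str:
    """Concatenate k solved examples followed by the unsolved target question."""
    parts = [f"Kysymys: {ex.question}\nVastaus: {ex.choices[ex.answer]}" for ex in shots]
    parts.append(f"Kysymys: {target.question}\nVastaus:")
    return "\n\n".join(parts)

def is_correct(model, shots: list[Example], target: Example) -> bool:
    """Score each candidate answer by log-likelihood; the model API is hypothetical."""
    prompt = build_prompt(shots, target)
    scores = [model.loglikelihood(prompt + " " + choice) for choice in target.choices]
    return scores.index(max(scores)) == target.answer
```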

Beyond task performance, the paper scrutinizes the models in terms of alignment, bias, and toxicity. The authors highlight the alignment challenges through the HHH benchmark and underscore concerns around bias, exemplified by observed gender-specific predictions. Despite employing rigorous filtering mechanisms during pretraining, the models still show evidence of generating toxic content, albeit at reduced levels compared to prior models without toxicity filtering.
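
One simple way to quantify residual toxicity, sketched below, is to sample continuations for neutral prompts and measure the fraction a classifier flags as toxic. The checkpoint path, classifier, label name, and prompts are placeholders, not the evaluation setup used in the paper.

```python
# Illustrative toxicity probe: generate continuations and classify them.
from transformers import pipeline

generator = pipeline("text-generation", model="path/to/fingpt-checkpoint")
toxicity = pipeline("text-classification", model="path/to/finnish-toxicity-classifier")

prompts = ["Eilen kaupungilla näin", "Naapurini sanoi minulle, että"]  # neutral Finnish prompts
toxic_fractions = []
for prompt in prompts:
    outputs = generator(prompt, max_new_tokens=40, do_sample=True, num_return_sequences=5)
    labels = toxicity([o["generated_text"] for o in outputs])
    # The label string depends on the chosen classifier; "toxic" is a placeholder.
    toxic_fractions.append(sum(l["label"] == "toxic" for l in labels) / len(labels))

print(f"mean toxic fraction: {sum(toxic_fractions) / len(toxic_fractions):.2%}")
```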

Implications and Future Directions

The implications of this work are twofold. Practically, it establishes a robust framework for extending LLM capabilities to languages facing resource constraints similar to Finnish's. Theoretically, it serves as a case study in devising balanced pretraining regimens under limited data availability. The approach showcased in this paper opens the door to similar endeavors for other underrepresented languages, promoting linguistic inclusivity in AI.

Looking forward, further efforts to align the models, for example through techniques such as reinforcement learning from human feedback (RLHF), could enhance their practicality in real-world scenarios. Additionally, exploring data augmentation could enrich the effective token pool available for smaller languages, strengthening future LLM efforts.

In conclusion, "FinGPT: Large Generative Models for a Small Language" extends the reach and inclusivity of state-of-the-art NLP technologies to Finnish, offering a template for addressing the pronounced imbalance in LLM distribution across languages worldwide. With continued advances in training methodology and alignment techniques, models like FinGPT and BLUUMI could significantly democratize AI access and utilization globally.
