
Goldfish: Monolingual Language Models for 350 Languages (2408.10441v1)

Published 19 Aug 2024 in cs.CL

Abstract: For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. However, using FLORES perplexity as a metric, we find that these models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B). To facilitate research that focuses on low-resource languages, we pre-train and release Goldfish, a suite of monolingual autoregressive Transformer language models up to 125M parameters for 350 languages. The Goldfish reach lower FLORES perplexities than BLOOM, XGLM, and MaLA-500 on 98 of 204 FLORES languages, despite each Goldfish model being over 10x smaller. However, the Goldfish significantly underperform larger multilingual models on reasoning benchmarks, suggesting that for low-resource languages, multilinguality primarily improves general reasoning abilities rather than basic text generation. We release models trained on 5MB (350 languages), 10MB (288 languages), 100MB (166 languages), and 1GB (83 languages) of text data where available. The Goldfish models are available as baselines, fine-tuning sources, or augmentations to existing models in low-resource NLP research, and they are further useful for crosslinguistic studies requiring maximally comparable models across languages.
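The FLORES perplexity comparison is the paper's central evaluation signal. As a rough illustration of how such a number is obtained, the following is a minimal sketch of computing corpus perplexity with a Hugging Face causal language model; the model identifier and the single-sentence "corpus" are placeholders, not the paper's actual evaluation setup (which also compares against bigram baselines).

```python
# Minimal sketch: perplexity of a causal LM on a text sample.
# The model ID and text are placeholders, not the paper's evaluation setup.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder: any autoregressive checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "An example sentence standing in for an evaluation corpus."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels equal to input_ids, the returned loss is the mean
    # per-token cross-entropy over the (shifted) sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = math.exp(outputs.loss.item())
print(f"perplexity: {perplexity:.2f}")
```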

Summary

  • The paper introduces a suite of compact monolingual Transformer models (up to 125M parameters) that reach lower FLORES perplexities than much larger multilingual models on 98 of 204 languages, with average reductions of roughly 11-13%.
  • The paper demonstrates scalability across data regimes by releasing models trained on 5MB, 10MB, 100MB, and 1GB of text per language, matching the amounts actually available for many low-resource languages.
  • The paper highlights a trade-off: the Goldfish models excel at basic text generation but underperform larger multilingual models on reasoning benchmarks, suggesting room for hybrid modeling strategies.

Goldfish: Monolingual Language Models for 350 Languages

The paper "Goldfish: Monolingual Language Models for 350 Languages" introduces a comprehensive suite of monolingual autoregressive Transformer models, collectively named Goldfish, designed to support NLP for 350 diverse languages. The work addresses the performance limitations and comparability issues of existing large multilingual models for low-resource languages.

Key Contributions

  1. Model Performance and Size:
    • The Goldfish models, despite being over 10 times smaller than counterparts like BLOOM 7.1B, XGLM 4.5B, and MaLA-500 10B, achieve lower FLORES perplexities for 98 of 204 languages.
    • Models are trained on dataset sizes ranging from 5MB to 1GB per language and remain competitive even at the smallest scales, helping to close the performance gap for low-resource languages.
  2. Wide Range of Languages:
    • The suite comprises 1154 monolingual models spanning 350 languages that vary widely in geography and linguistic features.
  3. Data Sources:
    • The per-language training sets are compiled from multilingual corpus collections such as Glot500, MASAKHANE, and MADLAD-400, providing broad coverage across languages.
  4. Perplexity vs. Reasoning:
    • Despite strong text generation performance, as indicated by low FLORES perplexities, the Goldfish models underperform larger multilingual models on reasoning benchmarks. This supports the paper's conclusion that, for low-resource languages, multilinguality primarily improves general reasoning rather than basic text generation.
  5. Baseline Creation:
    • The release of monolingual models trained on comparable data sizes assists in benchmarking and standardizing NLP research across low-resource languages.
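To make the baseline and fine-tuning-source use in item 5 concrete, the sketch below loads a released checkpoint with Hugging Face transformers. The repository name is an assumed example of the release's naming scheme, not a verified identifier; consult the official Goldfish release for the exact model IDs.

```python
# Minimal sketch: load a Goldfish checkpoint as a baseline or fine-tuning source.
# The repository name below is an assumed example, not a verified identifier.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "goldfish-models/eng_latn_1000mb"  # assumption: <lang>_<script>_<size>
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```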

Strong Numerical Results

  • Perplexity Improvement: Goldfish models achieve 13% lower perplexities on average compared to XGLM 4.5B and 11% lower compared to MaLA-500 10B.
  • Data Sizes:
    • Models trained on 5MB (350 languages), 10MB (288 languages), 100MB (166 languages), and 1GB (83 languages) datasets, showing scalability across diverse data availability scenarios.
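The data-size tiers above are byte budgets per language. Below is a minimal sketch of capping a monolingual corpus at such a budget; it is an illustration only and does not reproduce the paper's actual data pipeline.

```python
# Minimal sketch: truncate a monolingual corpus to a fixed UTF-8 byte budget,
# mirroring the 5MB / 10MB / 100MB / 1GB tiers. Illustration only; this is not
# the paper's actual data pipeline.
def cap_corpus(lines, byte_budget):
    """Yield lines until the byte budget would be exceeded."""
    used = 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if used + size > byte_budget:
            break
        used += size
        yield line


corpus = ["a placeholder sentence.", "another placeholder sentence."]
five_mb_subset = list(cap_corpus(corpus, 5 * 1024 * 1024))
print(len(five_mb_subset), "lines kept")
```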

Implications of the Research

Practical Implications

The creation and release of Goldfish models have practical implications for the NLP landscape:

  • Language Coverage: Enables NLP applications in languages traditionally underrepresented, supporting language preservation and digital inclusion.
  • Data Utilization: Provides efficient models that perform well with limited data, advancing research capabilities in data-constrained environments.
  • Baseline Establishment: Facilitates the development of standardized benchmarks for low-resource languages, promoting fairer comparisons and progress tracking.

Theoretical Implications

On the theoretical front, the Goldfish models contribute to our understanding of:

  • Size-Efficiency Trade-offs: The results empirically demonstrate that model size does not correlate linearly with performance, particularly for low-resource languages.
  • Multilingual vs. Monolingual Training: The comparison highlights that multilingual models may excel at complex tasks because of broader training data, whereas monolingual models can specialize effectively in basic text generation.
  • Cross-Linguistic Studies: The matched training setup supports the exploration of linguistic patterns and structures across diverse languages using maximally comparable models.

Speculations on Future AI Developments

The insights derived from the Goldfish models suggest several future research directions:

  • Hybrid Models: Integrating the strengths of both monolingual and multilingual approaches could lead to models that combine efficient data utilization with advanced reasoning capabilities.
  • Task-Specific Fine-Tuning: Developing more specialized fine-tuning techniques could bridge the gap between perplexity improvements and their application to high-level reasoning tasks.
  • Expanded Language Coverage: Continuing to include additional languages and dialects, especially those with newly emerging or curated datasets.
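As a concrete starting point for the task-specific fine-tuning direction above, the sketch below continues training a small causal LM on new text with the Hugging Face Trainer. The model identifier, data, and hyperparameters are illustrative assumptions rather than a recommended recipe.

```python
# Minimal sketch: continue training a small causal LM on task/domain text.
# Model ID, data, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "goldfish-models/eng_latn_1000mb"  # assumed example identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding token for the collator

texts = ["a task-specific sentence.", "another task-specific sentence."]
dataset = Dataset.from_dict({"text": texts})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="goldfish-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```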

This paper is pivotal in illustrating that smaller, more focused models can outperform their larger counterparts in foundational text generation tasks for many low-resource languages. The Goldfish suite lays the groundwork for future innovation in low-resource NLP, suggesting that linguistic diversity can be supported effectively with scalable, efficient modeling approaches. This represents a significant step towards more inclusive and equitable digital representation across the world's languages.