Meltemi: The first open Large Language Model for Greek (2407.20743v1)

Published 30 Jul 2024 in cs.CL

Abstract: We describe the development and capabilities of Meltemi 7B, the first open LLM for the Greek language. Meltemi 7B has 7 billion parameters and is trained on a 40 billion token Greek corpus. For the development of Meltemi 7B, we adapt Mistral, by continuous pretraining on the Greek Corpus. Meltemi 7B contains up-to-date information up to September 2023. Furthermore, we have translated and curated a Greek instruction corpus, which has been used for the instruction-tuning of a chat model, named Meltemi 7B Instruct. Special care has been given to the alignment and the removal of toxic content for the Meltemi 7B Instruct. The developed models are evaluated on a broad set of collected evaluation corpora, and examples of prompts and responses are presented. Both Meltemi 7B and Meltemi 7B Instruct are available at https://huggingface.co/ilsp under the Apache 2.0 license.


Summary

  • The paper introduces Meltemi 7B, the first open LLM for Greek, leveraging 7B parameters and a 43B token corpus to enhance language-specific AI.
  • It details a methodology that extends the tokenizer to 61,362 tokens and uses continual pretraining, achieving a 20.2% improvement on Greek benchmarks.
  • Meltemi 7B Instruct employs ORPO for instruction tuning, optimizing chat functionalities and setting a new standard for localized AI applications.

Meltemi: The First Open LLM for Greek

Overview

The paper presents Meltemi 7B, the first open LLM developed specifically for the Greek language. The model is built on Mistral 7B, has 7 billion parameters, and is trained on a 40 billion token Greek corpus. Meltemi 7B represents a significant advancement in language-specific AI, particularly for underrepresented languages such as Greek. The project does not stop at the base model: it also introduces Meltemi 7B Instruct, an instruction-tuned variant optimized for chat-based applications.

Methodology

Data Collection

The development of Meltemi 7B hinges on a comprehensive Greek-language corpus sourced from diverse domains, including Wikipedia, legal texts from EUR-LEX, language resources from the CLARIN:EL infrastructure, and academic repositories. The total corpus amounts to 43 billion Greek tokens, supplemented by English monolingual data and parallel Greek-English data.

Tokenizer and Embeddings Expansion

A pivotal step in adapting Mistral 7B to Greek was extending its tokenizer from 32,000 to 61,362 tokens so that Greek text is encoded efficiently. Preliminary tests showed that, without this expansion, the original tokenizer produced significantly higher token counts for Greek than for English, leading to higher computational costs. The embeddings of the new tokens were trained in two stages: first, only the new embeddings were updated; then the whole model was trained.
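
The following sketch illustrates how such a vocabulary extension and two-stage embedding training could be set up with Hugging Face Transformers. It is a minimal illustration, not the authors' released code: the new-token list is a placeholder, and a strict "new embeddings only" update would additionally require masking gradients for the original vocabulary rows.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed starting point: the Mistral 7B base model with its 32,000-token tokenizer.
base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder for the Greek subwords learned on the Greek corpus
# (the paper extends the vocabulary to 61,362 entries).
greek_tokens = ["είναι", "Ελλάδα", "γλώσσα"]  # illustrative only
tokenizer.add_tokens(greek_tokens)
model.resize_token_embeddings(len(tokenizer))

# Stage 1: update only the embedding matrices, keeping the rest of the model frozen.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True
model.get_output_embeddings().weight.requires_grad = True
# (Restricting updates to just the *new* rows would need an additional gradient mask.)

# Stage 2 (not shown): unfreeze all parameters and continue pretraining
# on the mixed Greek/English corpus.
```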

Continual Pretraining

Continual pretraining was used to adapt Mistral 7B to Greek. This process involved two distinct training phases and employed techniques such as re-warming and re-decaying the learning rate to mitigate catastrophic forgetting caused by the shift in data distribution. The training mix combined Greek and English monolingual data to preserve multilingual capabilities while improving Greek language understanding.
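
As a rough illustration of the re-warming and re-decaying idea, the sketch below defines a schedule that ramps the learning rate back up from near zero and then decays it along a cosine curve toward a floor. The peak rate, floor, and step counts are illustrative assumptions, not values reported in the paper.

```python
import math

def rewarmed_cosine_lr(step: int, total_steps: int, warmup_steps: int = 2000,
                       peak_lr: float = 2e-5, min_lr: float = 2e-6) -> float:
    """Linear re-warming followed by cosine re-decay (illustrative values)."""
    if step < warmup_steps:
        # Re-warm: ramp the learning rate back up from (near) zero.
        return peak_lr * step / max(1, warmup_steps)
    # Re-decay: cosine from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Such a function can be attached to an optimizer, for example via `torch.optim.lr_scheduler.LambdaLR` after dividing out the optimizer's base learning rate.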

Instruction Tuning

Meltemi 7B Instruct is fine-tuned with Odds Ratio Preference Optimization (ORPO) on a curated, high-quality preference dataset of 97,072 preference triplets, which includes specially tailored system messages to support chat functionality.
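
To make the odds-ratio objective concrete, here is a minimal PyTorch sketch of the ORPO loss; the variable names and the λ weight are illustrative, and in practice a library implementation (e.g., TRL's ORPOTrainer) would typically be used.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              chosen_nll: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """ORPO objective: cross-entropy on the chosen response plus a weighted
    odds-ratio penalty. `chosen_logps` / `rejected_logps` are length-normalised
    log-probabilities of the chosen / rejected responses; `lam` is illustrative."""
    # log-odds = log p - log(1 - p), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Reward a positive margin between chosen and rejected log-odds.
    odds_ratio_term = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (chosen_nll - lam * odds_ratio_term).mean()
```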

Evaluation

The models were evaluated on a suite of Greek benchmarks built from translated versions of established English datasets. The evaluation covers multiple-choice question answering, commonsense reasoning, and domain-specific knowledge tasks such as medical question answering.
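
A common way to score such translated multiple-choice benchmarks is to pick the answer option with the highest log-likelihood under the model. The sketch below shows this pattern with Hugging Face Transformers; the checkpoint id is an assumption (the released models are listed at https://huggingface.co/ilsp), and the scoring is simplified relative to full evaluation harnesses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ilsp/Meltemi-7B-v1"  # assumed repo id; see https://huggingface.co/ilsp
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def option_logprob(context: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    opt_ids = tok(option, add_special_tokens=False, return_tensors="pt").input_ids
    full_ids = torch.cat([ctx_ids, opt_ids], dim=1)
    logits = model(full_ids).logits[:, :-1]            # position t predicts token t+1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, ctx_ids.shape[1] - 1:].sum().item()  # option tokens only

def predict(context: str, options: list[str]) -> int:
    """Return the index of the highest-scoring answer option."""
    return max(range(len(options)), key=lambda i: option_logprob(context, options[i]))
```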

Results

The results show that Meltemi 7B significantly outperforms Mistral 7B on Greek benchmarks, with an average improvement of 20.2%. However, performance on English benchmarks drops by about 6%, highlighting a trade-off inherent in adapting LLMs to new languages. For instance, Meltemi 7B achieved:

  • 47.17 on the ARC-C Greek test set compared to Mistral 7B's 27.22
  • 68.66 on Belebele (ell) against Mistral 7B's 35.77
  • 65.75 on HellaSwag Greek against Mistral 7B's 35.20

Meltemi 7B Instruct further improves performance, particularly on instruction-following tasks, indicating the effectiveness of the ORPO-aligned instruction dataset.

Implications and Future Work

The development of Meltemi 7B paves the way for more inclusive AI models that capture the cultural and linguistic nuances needed for localized applications. The model has immediate applications in areas such as legal analysis, academic research, and public-service automation within Greek-speaking communities.

Future research could focus on optimizing the balance between multilingual capabilities and target language performance to minimize performance degradation in non-target languages. Additionally, exploring multimodal extensions and scaling models to accommodate more parameters while maintaining computational efficiency are promising directions. Furthermore, a broader discussion on the sustainability of such models should be encouraged, emphasizing economic and environmental considerations.

Conclusion

Meltemi 7B represents a significant step forward in open language modeling for Greek, demonstrating the feasibility and impact of developing large-scale models for underrepresented languages. This work highlights the importance of continual pretraining and tailored instruction tuning, setting a precedent for future efforts in this domain. The availability of both models under the Apache 2.0 license democratizes access and fosters continued innovation in language-specific AI applications.