Meltemi: The first open Large Language Model for Greek (2407.20743v1)

Published 30 Jul 2024 in cs.CL

Abstract: We describe the development and capabilities of Meltemi 7B, the first open LLM for the Greek language. Meltemi 7B has 7 billion parameters and is trained on a 40 billion token Greek corpus. For the development of Meltemi 7B, we adapt Mistral, by continuous pretraining on the Greek Corpus. Meltemi 7B contains up-to-date information up to September 2023. Furthermore, we have translated and curated a Greek instruction corpus, which has been used for the instruction-tuning of a chat model, named Meltemi 7B Instruct. Special care has been given to the alignment and the removal of toxic content for the Meltemi 7B Instruct. The developed models are evaluated on a broad set of collected evaluation corpora, and examples of prompts and responses are presented. Both Meltemi 7B and Meltemi 7B Instruct are available at https://huggingface.co/ilsp under the Apache 2.0 license.


Summary

  • The paper introduces Meltemi 7B, the first open LLM for Greek, leveraging 7B parameters and a 43B token corpus to enhance language-specific AI.
  • It details a methodology that extends the tokenizer to 61,362 tokens and uses continual pretraining, achieving a 20.2% improvement on Greek benchmarks.
  • Meltemi 7B Instruct employs ORPO for instruction tuning, optimizing chat functionalities and setting a new standard for localized AI applications.

Meltemi: The First Open LLM for Greek

Overview

The paper presents Meltemi 7B, the first open LLM developed specifically for the Greek language. The model is built on Mistral 7B, has 7 billion parameters, and is trained on a 40 billion token Greek corpus. Meltemi 7B represents a significant advancement in language-specific AI, particularly for underrepresented languages such as Greek. The project does not stop at the base model: it also introduces Meltemi 7B Instruct, an instruction-tuned variant optimized for chat-based applications.

Methodology

Data Collection

The development of Meltemi 7B hinges on a comprehensive Greek-language corpus sourced from diverse domains, including Wikipedia, legal texts from EUR-LEX, language resources from the CLARIN:EL infrastructure, and academic repositories. The total corpus amounts to 43 billion Greek tokens, supplemented by English monolingual data and parallel Greek-English data.

Tokenizer and Embeddings Expansion

A pivotal step in adapting Mistral 7B to Greek was extending its tokenizer from 32,000 to 61,362 tokens so that Greek text is encoded efficiently. Preliminary tests showed that, without this expansion, the original tokenizer produced significantly higher token counts for Greek than for English, leading to higher computational costs. The embeddings of the new tokens were trained in two stages: first, only the new embeddings were updated; then the whole model was trained.
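
The following sketch illustrates how such a vocabulary extension and two-stage embedding training could be set up with Hugging Face Transformers. It is a minimal illustration, not the authors' released code: the new-token list is a placeholder, and a strict "new embeddings only" update would additionally require masking gradients for the original vocabulary rows.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed starting point: the Mistral 7B base model with its 32,000-token tokenizer.
base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder for the Greek subwords learned on the Greek corpus
# (the paper extends the vocabulary to 61,362 entries).
greek_tokens = ["είναι", "Ελλάδα", "γλώσσα"]  # illustrative only
tokenizer.add_tokens(greek_tokens)
model.resize_token_embeddings(len(tokenizer))

# Stage 1: update only the embedding matrices, keeping the rest of the model frozen.
for param in model.parameters():
    param.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True
model.get_output_embeddings().weight.requires_grad = True
# (Restricting updates to just the *new* rows would need an additional gradient mask.)

# Stage 2 (not shown): unfreeze all parameters and continue pretraining
# on the mixed Greek/English corpus.
```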

Continual Pretraining

Continual pretraining was used to adapt Mistral 7B to Greek. This process involved two distinct training phases and employed techniques such as re-warming and re-decaying the learning rate to mitigate catastrophic forgetting caused by the shift in data distribution. The training mix combined Greek and English monolingual data to preserve multilingual capabilities while improving Greek language understanding.
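
As a rough illustration of the re-warming and re-decaying idea, the sketch below defines a schedule that ramps the learning rate back up from near zero and then decays it along a cosine curve toward a floor. The peak rate, floor, and step counts are illustrative assumptions, not values reported in the paper.

```python
import math

def rewarmed_cosine_lr(step: int, total_steps: int, warmup_steps: int = 2000,
                       peak_lr: float = 2e-5, min_lr: float = 2e-6) -> float:
    """Linear re-warming followed by cosine re-decay (illustrative values)."""
    if step < warmup_steps:
        # Re-warm: ramp the learning rate back up from (near) zero.
        return peak_lr * step / max(1, warmup_steps)
    # Re-decay: cosine from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Such a function can be attached to an optimizer, for example via `torch.optim.lr_scheduler.LambdaLR` after dividing out the optimizer's base learning rate.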

Instruction Tuning

Meltemi 7B Instruct is fine-tuned with Odds Ratio Preference Optimization (ORPO) on a curated, high-quality preference dataset of 97,072 preference triplets, which includes specially tailored system messages to support chat functionality.
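
To make the odds-ratio objective concrete, here is a minimal PyTorch sketch of the ORPO loss; the variable names and the λ weight are illustrative, and in practice a library implementation (e.g., TRL's ORPOTrainer) would typically be used.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              chosen_nll: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """ORPO objective: cross-entropy on the chosen response plus a weighted
    odds-ratio penalty. `chosen_logps` / `rejected_logps` are length-normalised
    log-probabilities of the chosen / rejected responses; `lam` is illustrative."""
    # log-odds = log p - log(1 - p), computed in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Reward a positive margin between chosen and rejected log-odds.
    odds_ratio_term = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (chosen_nll - lam * odds_ratio_term).mean()
```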

Evaluation

The models were evaluated on a suite of Greek benchmarks built from translated versions of established English datasets. The evaluation covers multiple-choice question answering, commonsense reasoning, and domain-specific knowledge tasks such as medical question answering.
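
A common way to score such translated multiple-choice benchmarks is to pick the answer option with the highest log-likelihood under the model. The sketch below shows this pattern with Hugging Face Transformers; the checkpoint id is an assumption (the released models are listed at https://huggingface.co/ilsp), and the scoring is simplified relative to full evaluation harnesses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ilsp/Meltemi-7B-v1"  # assumed repo id; see https://huggingface.co/ilsp
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def option_logprob(context: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens, conditioned on the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    opt_ids = tok(option, add_special_tokens=False, return_tensors="pt").input_ids
    full_ids = torch.cat([ctx_ids, opt_ids], dim=1)
    logits = model(full_ids).logits[:, :-1]            # position t predicts token t+1
    targets = full_ids[:, 1:]
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, ctx_ids.shape[1] - 1:].sum().item()  # option tokens only

def predict(context: str, options: list[str]) -> int:
    """Return the index of the highest-scoring answer option."""
    return max(range(len(options)), key=lambda i: option_logprob(context, options[i]))
```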

Results

The results show that Meltemi 7B significantly outperforms Mistral 7B on Greek benchmarks, with an average improvement of 20.2%. However, performance on English benchmarks drops by about 6%, highlighting a trade-off inherent in adapting LLMs to new languages. For instance, Meltemi 7B achieved:

  • 47.17 on the ARC-C Greek test set compared to Mistral 7B's 27.22
  • 68.66 on Belebele (ell) against Mistral 7B's 35.77
  • 65.75 on HellaSwag Greek against Mistral 7B's 35.20

Meltemi 7B Instruct further improves performance, particularly on instruction-following tasks, indicating the effectiveness of the ORPO-aligned instruction dataset.

Implications and Future Work

The development of Meltemi 7B paves the way for more inclusive AI models that capture the cultural and linguistic nuances needed for localized applications. The model has immediate applications in areas such as legal analysis, academic research, and public-service automation within Greek-speaking communities.

Future research could focus on optimizing the balance between multilingual capabilities and target language performance to minimize performance degradation in non-target languages. Additionally, exploring multimodal extensions and scaling models to accommodate more parameters while maintaining computational efficiency are promising directions. Furthermore, a broader discussion on the sustainability of such models should be encouraged, emphasizing economic and environmental considerations.

Conclusion

Meltemi 7B represents a significant step forward in open language modeling for Greek, demonstrating the feasibility and impact of developing large-scale models for underrepresented languages. This work highlights the importance of continual pretraining and tailored instruction tuning, setting a precedent for future efforts in this domain. The availability of both models under the Apache 2.0 license democratizes access and fosters continued innovation in language-specific AI applications.