Meltemi: The first open Large Language Model for Greek (2407.20743v1)
Abstract: We describe the development and capabilities of Meltemi 7B, the first open LLM for the Greek language. Meltemi 7B has 7 billion parameters and is trained on a 40 billion token Greek corpus. To develop Meltemi 7B, we adapt Mistral via continual pretraining on this Greek corpus. Meltemi 7B incorporates information up to September 2023. Furthermore, we have translated and curated a Greek instruction corpus, which has been used for the instruction tuning of a chat model, named Meltemi 7B Instruct. Special care has been given to alignment and to the removal of toxic content for Meltemi 7B Instruct. The developed models are evaluated on a broad set of collected evaluation corpora, and examples of prompts and responses are presented. Both Meltemi 7B and Meltemi 7B Instruct are available at https://huggingface.co/ilsp under the Apache 2.0 license.
- The Falcon Series of Open Language Models. arXiv preprint arXiv:2311.16867.
- Tower: An open multilingual large language model for translation-related tasks. arXiv preprint arXiv:2402.17733.
- Mikel Artetxe and Holger Schwenk. 2018. Margin-based parallel corpus mining with multilingual sentence embeddings. arXiv preprint arXiv:1811.01136.
- Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
- Qwen technical report. arXiv preprint arXiv:2309.16609.
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants. arXiv preprint arXiv:2308.16884.
- MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 303–304, Ghent, Belgium. European Association for Machine Translation.
- Andrei Z. Broder. 1997. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, page 21.
- MultiEURLEX – a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Large-scale multi-label text classification on EU legislation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6314–6322, Florence, Italy. Association for Computational Linguistics.
- Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- Efficiently adapting pretrained language models to new languages. arXiv preprint arXiv:2311.05741.
- The ParlaMint corpora of parliamentary proceedings. Lang. Resour. Eval., 57(1):415–448.
- LightEval: A lightweight framework for LLM evaluation.
- The CLARIN:EL infrastructure: Platform, Portal, K-Centre. In Selected papers from the CLARIN Annual Conference 2023.
- A New Massive Multilingual Dataset for High-Performance Language Technologies. arXiv preprint arXiv:2403.14009.
- Measuring massive multitask language understanding. In International Conference on Learning Representations.
- Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691.
- Simple and scalable strategies to continually pre-train large language models. arXiv preprint arXiv:2403.08763.
- INSAIT. 2024. BgGPT-7B, a Bulgarian language model. https://huggingface.co/INSAIT-Institute/BgGPT-7B-Instruct-v0.2. Accessed: (12 July 2024).
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- LAION. 2023. LeoLM: Igniting German-Language LLM Research. https://laion.ai/blog/leo-lm/. Accessed: (12 July 2024).
- Mining of Massive Datasets. Cambridge University Press.
- RakutenAI-7B: Extending Large Language Models for Japanese.
- TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
- Collection and Curation of Language Data within the European Language Resource Coordination (ELRC). In Qurator.
- Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.
- CulturaX: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400.
- The ILSP/ARC submission to the WMT 2018 parallel corpus filtering shared task. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 928–933, Belgium, Brussels. Association for Computational Linguistics.
- The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only. Advances in Neural Information Processing Systems, 36:79155–79172.
- Sabiá: Portuguese large language models. In Intelligent Systems, pages 226–240, Cham. Springer Nature Switzerland.
- ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20. IEEE Press.
- Dimitrios Roussis and Vassilis Papavassiliou. 2022. The ARC-NKUA submission for the English-Ukrainian general machine translation shared task at WMT22. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 358–365, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- SciPar: A collection of parallel corpora from scientific abstracts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2652–2657, Marseille, France. European Language Resources Association.
- Constructing parallel corpora from COVID-19 news using MediSys metadata. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1068–1072, Marseille, France. European Language Resources Association.
- Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, Brussels, Belgium. Association for Computational Linguistics.
- Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models. arXiv preprint arXiv:2308.16149.
- Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
- Zyda: A 1.3T dataset for open language modeling. arXiv preprint arXiv:2406.01981.
- TokyoTech-LLM. 2024. The Swallow-MS-7b-v0.1 model. https://huggingface.co/tokyotech-llm/Swallow-MS-7b-v0.1. Accessed: (12 July 2024).
- The Alignment Handbook. https://github.com/huggingface/alignment-handbook.
- TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl.
- Bicleaner AI: Bicleaner goes neural. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 824–831, Marseille, France. European Language Resources Association.
- HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
- AutoMathText: Autonomous data selection with language models for mathematical texts. arXiv preprint arXiv:2402.07625.
- How do large language models handle multilingualism? arXiv preprint arXiv:2402.18815.
- Ǎguila. 2023. Introducing Ǎguila, a new open-source LLM for Spanish and Catalan. https://huggingface.co/projecte-aina/aguila-7b. Accessed: (12 July 2024).