Cerbero-7B: A Leap Forward in Language-Specific LLMs Through Enhanced Chat Corpus Generation and Evaluation (2311.15698v1)

Published 27 Nov 2023 in cs.CL and cs.AI

Abstract: This study introduces a novel approach for generating high-quality, language-specific chat corpora using a self-chat mechanism. We combine a generator LLM for creating new samples and an embedder LLM to ensure diversity. A new Masked Language Modelling (MLM)-based quality assessment metric is proposed for evaluating and filtering the corpora. Using llama2-70b as the generator and a multilingual sentence transformer as the embedder, we generate an Italian chat corpus and refine the Fauno corpus, which is based on translated English ChatGPT self-chat data. The refinement uses structural assertions and Natural Language Processing techniques. Both corpora undergo a comprehensive quality evaluation using the proposed MLM-based quality metric. The Italian LLM fine-tuned with these corpora demonstrates significantly enhanced language comprehension and question-answering skills. The resultant model, cerbero-7b, establishes a new state-of-the-art for Italian LLMs. This approach marks a substantial advancement in the development of language-specific LLMs, with a special emphasis on augmenting corpora for underrepresented languages like Italian.
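
Since this listing does not spell out the paper's scoring formula, the sketch below is a hypothetical illustration of an MLM-based quality filter in the spirit the abstract describes: each candidate chat sample is scored by its average pseudo-log-likelihood under a masked language model, and low-scoring samples are discarded. The model name, threshold, and function names are assumptions for illustration, not the authors' implementation.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical sketch: score text by average pseudo-log-likelihood under an
# MLM, i.e. mask each token in turn and read off the log-probability the
# model assigns to the original token. Higher scores suggest more fluent text.
MODEL_NAME = "bert-base-multilingual-cased"  # assumed stand-in for the paper's MLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def mlm_quality_score(text: str) -> float:
    """Average per-token pseudo-log-likelihood of `text`."""
    ids = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=256)["input_ids"][0]
    total, count = 0.0, 0
    for i in range(1, ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        count += 1
    return total / max(count, 1)

# Keep only samples above an illustrative threshold (the cutoff is assumed).
corpus = ["Ciao! Come posso aiutarti oggi?", "asdf qwerty zxcv uiop"]
filtered = [s for s in corpus if mlm_quality_score(s) > -4.0]

The embedder mentioned in the abstract would play a complementary role: rejecting newly generated self-chat samples whose sentence embeddings are too similar (e.g. by cosine similarity) to samples already in the corpus, so that filtering for quality does not collapse diversity.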

References (24)
  1. “Fauno: The Italian Large Language Model that will leave you senza parole!” In arXiv preprint arXiv:2306.14457, 2023
  2. “FlashAttention: Fast and memory-efficient exact attention with IO-awareness” In Advances in Neural Information Processing Systems 35, 2022, pp. 16344–16359
  3. “GePpeTto Carves Italian into a Language Model” In arXiv preprint arXiv:2004.14253, 2020
  4. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186 DOI: 10.18653/v1/N19-1423
  5. “Learning towards conversational AI: A survey” In AI Open 3, 2022, pp. 14–28 DOI: 10.1016/j.aiopen.2022.02.001
  6. Federico A. Galatolo “cerbero-7b” In GitHub repository GitHub, https://github.com/galatolofederico/cerbero-7b, 2023
  7. “LoRA: Low-Rank Adaptation of Large Language Models” In International Conference on Learning Representations, 2022 URL: https://openreview.net/forum?id=nZeVKeeFYf9
  8. “Mistral 7B” In arXiv preprint arXiv:2310.06825, 2023
  9. “OpenAssistant Conversations–Democratizing Large Language Model Alignment” In arXiv preprint arXiv:2304.07327, 2023
  10. “NLTK: The Natural Language Toolkit” In arXiv preprint cs/0205028, 2002
  11. “Full Parameter Fine-tuning for Large Language Models with Limited Resources” In arXiv preprint arXiv:2306.09782, 2023
  12. OpenAI “ChatGPT: Optimizing Language Models for Dialogue” In OpenAI Blog, 2023
  13. OpenAI “GPT-4 Technical Report” In CoRR abs/2303.08774, 2023 DOI: 10.48550/ARXIV.2303.08774
  14. “Training language models to follow instructions with human feedback” In Advances in Neural Information Processing Systems 35, 2022, pp. 27730–27744
  15. “Improving language understanding by generative pre-training” OpenAI, 2018
  16. “Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing Association for Computational Linguistics, 2020 URL: https://arxiv.org/abs/2004.09813
  17. “IT5: Large-scale text-to-text pretraining for Italian language understanding and generation” In arXiv preprint arXiv:2203.03759, 2022
  18. Peter M. Stahl “Lingua-Py” In GitHub repository GitHub, https://github.com/pemistahl/lingua-py, 2023
  19. “Stanford Alpaca: An Instruction-following LLaMA model” In GitHub repository GitHub, https://github.com/tatsu-lab/stanford_alpaca, 2023
  20. The Vicuna Team “Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality”, 2023 URL: https://lmsys.org/blog/2023-03-30-vicuna/
  21. “Llama 2: Open foundation and fine-tuned chat models” In arXiv preprint arXiv:2307.09288, 2023
  22. “Emergent Abilities of Large Language Models” Survey Certification In Transactions on Machine Learning Research, 2022 URL: https://openreview.net/forum?id=yzkSU5zdwD
  23. “Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data” In arXiv preprint arXiv:2304.01196, 2023
  24. “DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online: Association for Computational Linguistics, 2020, pp. 270–278 DOI: 10.18653/v1/2020.acl-demos.30