
Camoscio: an Italian Instruction-tuned LLaMA (2307.16456v2)

Published 31 Jul 2023 in cs.CL

Abstract: In recent years LLMs have increased the state of the art on several natural language processing tasks. However, their accessibility is often limited to paid API services, posing challenges for researchers in conducting extensive investigations. On the other hand, while some open-source models have been proposed by the community, they are typically English-centric or multilingual without a specific adaptation for the Italian language. In an effort to democratize the available and open resources for the Italian language, in this paper we introduce Camoscio: a LLM specifically tuned to follow users' prompts in Italian. Specifically, we finetuned the smallest variant of LLaMA (7b) with LoRA on a corpus of instruction prompts translated to Italian via ChatGPT. Results indicate that the model's zero-shot performance on various downstream tasks in Italian competes favorably with existing models specifically finetuned for those tasks. All the artifacts (code, dataset, model) are released to the community at the following url: https://github.com/teelinsan/camoscio

Authors (2)
  1. Andrea Santilli (17 papers)
  2. Emanuele Rodolà (90 papers)
Citations (21)

Summary

Camoscio: An Italian Instruction-tuned LLaMA

The paper presents "Camoscio", an Italian-specific adaptation of the LLaMA LLM that has been instruction-tuned to follow Italian prompts. The work addresses a notable gap in the computational linguistics landscape, where existing LLMs are predominantly English-centric or offer multilingual capabilities that underperform in non-English languages. By releasing an open-source model tuned specifically for Italian and demonstrating competitive zero-shot performance on several NLP tasks, the authors aim to democratize AI resources for the Italian language.
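Because the model and LoRA weights are publicly released, it can be run locally. The following is a minimal inference sketch, assuming the base LLaMA-7B weights and a LoRA adapter hosted on the Hugging Face Hub; the checkpoint and adapter ids, the Alpaca-style Italian prompt template, and the decoding parameters are illustrative assumptions, not the authors' exact setup.

```python
# Minimal inference sketch: apply a Camoscio-style LoRA adapter to LLaMA-7B.
# Model/adapter ids and the prompt template below are assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "decapoda-research/llama-7b-hf"   # assumed base checkpoint id
adapter = "teelinsan/camoscio-7b-llama"        # assumed adapter repo id

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)  # merge-free LoRA loading

prompt = ("Di seguito è riportata un'istruzione che descrive un task. "
          "Scrivi una risposta che completi la richiesta.\n\n"
          "### Istruzione:\nRiassumi la trama dei Promessi Sposi in una frase.\n\n"
          "### Risposta:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128,
                         do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```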

Methodology

The researchers created Camoscio by finetuning the smallest LLaMA variant (7 billion parameters) with Low-Rank Adaptation (LoRA), a parameter-efficient finetuning technique. The instruction-tuning corpus was obtained by translating the Stanford Alpaca dataset into Italian with ChatGPT. The authors describe the translation process and finetuning configuration in detail and emphasize that training runs on commodity desktop hardware, which makes the approach accessible to a wider community of researchers and developers.
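To make the recipe concrete, here is a minimal sketch of LoRA instruction tuning with Hugging Face Transformers and PEFT, in the style of Alpaca-LoRA. The checkpoint id, LoRA rank and target modules, prompt template, dataset file name, and optimizer settings are illustrative assumptions rather than the paper's exact configuration.

```python
# LoRA instruction-tuning sketch (Transformers + PEFT); ids and
# hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "decapoda-research/llama-7b-hf"   # assumed LLaMA-7B checkpoint id
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# Load the frozen base model in 8-bit to fit on a single consumer GPU.
model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to the attention projections; only these are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Alpaca-style Italian instruction data: instruction / input / output fields.
data = load_dataset("json", data_files="camoscio_it.json")["train"]  # assumed file

def build_example(ex):
    prompt = (f"### Istruzione:\n{ex['instruction']}\n\n"
              f"### Input:\n{ex.get('input', '')}\n\n"
              f"### Risposta:\n{ex['output']}")
    return tokenizer(prompt, truncation=True, max_length=512)

tokenized = data.map(build_example, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    train_dataset=tokenized,
    args=TrainingArguments(output_dir="camoscio-lora",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=32,
                           num_train_epochs=3, learning_rate=3e-4,
                           fp16=True, logging_steps=10),
    # mlm=False makes the collator derive causal-LM labels from input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("camoscio-lora")  # stores only the small LoRA adapter
```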

Evaluation

Camoscio was evaluated on three Italian NLP tasks: news summarization (NewsSum-IT), question answering (SQuAD-IT), and formality style transfer (the Italian portion of XFORMAL). Despite operating in a zero-shot setting, Camoscio is competitive with models specifically finetuned on these downstream tasks. The paper also discusses the limitations of traditional metrics for zero-shot evaluation and proposes "Exact Match via ChatGPT" as a complementary metric that better reflects Camoscio's free-form generative outputs.
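As a rough illustration of how such a metric could be computed, the sketch below asks a ChatGPT model to judge whether a generated answer conveys the same fact as the gold answer for SQuAD-IT-style questions. The prompt wording, the judge model, and the aggregation are assumptions, not the paper's exact protocol.

```python
# Hypothetical "Exact Match via ChatGPT" check: a ChatGPT model judges
# semantic agreement instead of relying on string equality.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chatgpt_exact_match(question: str, gold: str, predicted: str) -> bool:
    judge_prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold}\n"
        f"Model answer: {predicted}\n"
        "Do the two answers state the same fact? Reply only 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",          # assumed judge model
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def em_via_chatgpt(examples):
    # examples: iterable of (question, gold_answer, predicted_answer) triples
    hits = sum(chatgpt_exact_match(q, g, p) for q, g, p in examples)
    return hits / len(examples)
```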

Implications and Future Directions

The implications of this research are significant, especially for domains where Italian-specific LLMs have been lacking or inadequate because of weak Italian support in multilingual models. Camoscio provides a foundation for further exploration and refinement of instruction-tuning techniques for other non-English languages. The paper also highlights the need for evaluation metrics that more accurately capture the zero-shot performance of LLMs. Future work could expand the instruction-tuning dataset with additional Italian-specific content or cover a more diverse set of NLP tasks to broaden Camoscio's functionality and robustness.

Conclusion

By introducing Camoscio, the authors take an essential step toward enhancing the availability and efficacy of instruction-tuned models for the Italian language. This work not only supplements existing open-source LLM initiatives but also provides a valuable resource for further academic and practical exploration in monolingual NLP applications. It underscores the potential of well-directed finetuning in enhancing language-specific performance without reliance on proprietary systems, aligning with the broader AI community's efforts to democratize AI technologies.
