
Camoscio: an Italian Instruction-tuned LLaMA (2307.16456v2)

Published 31 Jul 2023 in cs.CL

Abstract: In recent years LLMs have increased the state of the art on several natural language processing tasks. However, their accessibility is often limited to paid API services, posing challenges for researchers in conducting extensive investigations. On the other hand, while some open-source models have been proposed by the community, they are typically English-centric or multilingual without a specific adaptation for the Italian language. In an effort to democratize the available and open resources for the Italian language, in this paper we introduce Camoscio: a LLM specifically tuned to follow users' prompts in Italian. Specifically, we finetuned the smallest variant of LLaMA (7b) with LoRA on a corpus of instruction prompts translated to Italian via ChatGPT. Results indicate that the model's zero-shot performance on various downstream tasks in Italian competes favorably with existing models specifically finetuned for those tasks. All the artifacts (code, dataset, model) are released to the community at the following url: https://github.com/teelinsan/camoscio

Authors (2)
  1. Andrea Santilli (17 papers)
  2. Emanuele Rodolà (90 papers)
Citations (21)

Summary

Camoscio: An Italian Instruction-tuned LLaMA

The paper presents "Camoscio", an Italian-specific adaptation of the LLaMA LLM that has been instruction-tuned to follow Italian prompts. The work addresses a notable gap in the computational linguistics landscape, where existing LLMs are predominantly English-centric or offer multilingual capabilities that underperform in non-English languages. By releasing an open-source model tuned specifically for Italian and demonstrating competitive zero-shot performance on several NLP tasks, the authors aim to democratize AI resources for the Italian language.
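Because the model and LoRA weights are publicly released, it can be run locally. The following is a minimal inference sketch, assuming the base LLaMA-7B weights and a LoRA adapter hosted on the Hugging Face Hub; the checkpoint and adapter ids, the Alpaca-style Italian prompt template, and the decoding parameters are illustrative assumptions, not the authors' exact setup.

```python
# Minimal inference sketch: apply a Camoscio-style LoRA adapter to LLaMA-7B.
# Model/adapter ids and the prompt template below are assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "decapoda-research/llama-7b-hf"   # assumed base checkpoint id
adapter = "teelinsan/camoscio-7b-llama"        # assumed adapter repo id

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)  # merge-free LoRA loading

prompt = ("Di seguito è riportata un'istruzione che descrive un task. "
          "Scrivi una risposta che completi la richiesta.\n\n"
          "### Istruzione:\nRiassumi la trama dei Promessi Sposi in una frase.\n\n"
          "### Risposta:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128,
                         do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```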

Methodology

The researchers created Camoscio by finetuning the smallest LLaMA variant (7 billion parameters) with Low-Rank Adaptation (LoRA), a parameter-efficient finetuning technique. The instruction-tuning corpus was obtained by translating the Stanford Alpaca dataset into Italian with ChatGPT. The authors describe the translation process and finetuning configuration in detail and emphasize that training runs on commodity desktop hardware, which makes the approach accessible to a wider community of researchers and developers.
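To make the recipe concrete, here is a minimal sketch of LoRA instruction tuning with Hugging Face Transformers and PEFT, in the style of Alpaca-LoRA. The checkpoint id, LoRA rank and target modules, prompt template, dataset file name, and optimizer settings are illustrative assumptions rather than the paper's exact configuration.

```python
# LoRA instruction-tuning sketch (Transformers + PEFT); ids and
# hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "decapoda-research/llama-7b-hf"   # assumed LLaMA-7B checkpoint id
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

# Load the frozen base model in 8-bit to fit on a single consumer GPU.
model = AutoModelForCausalLM.from_pretrained(
    base_model, load_in_8bit=True, device_map="auto")
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to the attention projections; only these are trained.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Alpaca-style Italian instruction data: instruction / input / output fields.
data = load_dataset("json", data_files="camoscio_it.json")["train"]  # assumed file

def build_example(ex):
    prompt = (f"### Istruzione:\n{ex['instruction']}\n\n"
              f"### Input:\n{ex.get('input', '')}\n\n"
              f"### Risposta:\n{ex['output']}")
    return tokenizer(prompt, truncation=True, max_length=512)

tokenized = data.map(build_example, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    train_dataset=tokenized,
    args=TrainingArguments(output_dir="camoscio-lora",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=32,
                           num_train_epochs=3, learning_rate=3e-4,
                           fp16=True, logging_steps=10),
    # mlm=False makes the collator derive causal-LM labels from input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("camoscio-lora")  # stores only the small LoRA adapter
```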

Evaluation

Camoscio was evaluated on three Italian NLP tasks: news summarization (NewsSum-IT), question answering (SQuAD-IT), and formality style transfer (the Italian portion of XFORMAL). Despite operating in a zero-shot setting, Camoscio is competitive with models specifically finetuned on these downstream tasks. The paper also discusses the limitations of traditional metrics for zero-shot evaluation and proposes "Exact Match via ChatGPT" as a complementary metric that better reflects Camoscio's free-form generative outputs.
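As a rough illustration of how such a metric could be computed, the sketch below asks a ChatGPT model to judge whether a generated answer conveys the same fact as the gold answer for SQuAD-IT-style questions. The prompt wording, the judge model, and the aggregation are assumptions, not the paper's exact protocol.

```python
# Hypothetical "Exact Match via ChatGPT" check: a ChatGPT model judges
# semantic agreement instead of relying on string equality.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chatgpt_exact_match(question: str, gold: str, predicted: str) -> bool:
    judge_prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold}\n"
        f"Model answer: {predicted}\n"
        "Do the two answers state the same fact? Reply only 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",          # assumed judge model
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def em_via_chatgpt(examples):
    # examples: iterable of (question, gold_answer, predicted_answer) triples
    hits = sum(chatgpt_exact_match(q, g, p) for q, g, p in examples)
    return hits / len(examples)
```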

Implications and Future Directions

The implications of this research are significant, especially for domains where Italian-specific LLMs have been lacking or inadequate because of weak Italian support in multilingual models. Camoscio provides a foundation for further exploration and refinement of instruction-tuning techniques for other non-English languages. The paper also highlights the need for evaluation metrics that more accurately capture the zero-shot performance of LLMs. Future work could expand the instruction-tuning dataset with additional Italian-specific content or cover a more diverse set of NLP tasks to broaden Camoscio's functionality and robustness.

Conclusion

By introducing Camoscio, the authors take an essential step toward enhancing the availability and efficacy of instruction-tuned models for the Italian language. This work not only supplements existing open-source LLM initiatives but also provides a valuable resource for further academic and practical exploration in monolingual NLP applications. It underscores the potential of well-directed finetuning in enhancing language-specific performance without reliance on proprietary systems, aligning with the broader AI community's efforts to democratize AI technologies.
