The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian (2403.18697v3)
Abstract: While Italian is a high-resource language, there are few Italian-native benchmarks for evaluating generative LLMs in this language. This work presents three new benchmarks: Invalsi MATE, to evaluate mathematical understanding in Italian; Invalsi ITA, to evaluate language understanding in Italian; and Olimpiadi MATE, for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged 6 to 18 within the Italian school system and have been validated by several experts in teaching and pedagogy; the third is drawn from the Italian high-school mathematics Olympiad. We evaluate 10 powerful LLMs on these benchmarks and find that accuracy is bounded by 71% on Invalsi MATE, achieved by Llama 3.1 70b instruct, and by 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA we compare LLMs with the average performance of Italian students and show that Llama 3.1 70b instruct is the only model to outperform them on Invalsi MATE, whereas most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%. We will make data and evaluation code openly available upon acceptance of the paper.
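As a minimal illustration of the kind of accuracy scoring the abstract reports, the sketch below compares model predictions against gold answers for a list of benchmark items. The file name, field names, and normalisation are assumptions for illustration only, not the paper's released evaluation code or data format.

```python
# Minimal accuracy-scoring sketch, assuming a JSON file of items with
# "gold" (correct answer) and "prediction" (model output) fields.
# All names here are illustrative, not the paper's released format.
import json


def normalise(answer: str) -> str:
    """Lower-case and strip surrounding whitespace/punctuation so 'A.' matches 'a'."""
    return answer.strip().strip(".)").lower()


def accuracy(items: list[dict]) -> float:
    """Fraction of items whose normalised prediction equals the gold answer."""
    if not items:
        return 0.0
    correct = sum(
        normalise(item["prediction"]) == normalise(item["gold"]) for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # Hypothetical predictions file; replace with your own evaluation output.
    with open("invalsi_mate_predictions.json", encoding="utf-8") as f:
        items = json.load(f)
    print(f"Accuracy: {accuracy(items):.1%}")
```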