The Invalsi Benchmarks: measuring Linguistic and Mathematical understanding of Large Language Models in Italian (2403.18697v3)
Abstract: While Italian is a high-resource language, there are few Italian-native benchmarks for evaluating generative LLMs in this language. This work presents three new benchmarks: Invalsi MATE, to evaluate mathematical understanding in Italian; Invalsi ITA, to evaluate language understanding in Italian; and Olimpiadi MATE, for more complex mathematical understanding. The first two benchmarks are based on the Invalsi tests, which are administered to students aged 6 to 18 within the Italian school system and have been validated by several experts in teaching and pedagogy; the third is drawn from the Italian high-school mathematics Olympiad. We evaluate 10 powerful LLMs on these benchmarks and find that accuracy is bounded by 71% on Invalsi MATE, achieved by Llama 3.1 70b instruct, and by 88% on Invalsi ITA. For both Invalsi MATE and Invalsi ITA we compare LLMs with the average performance of Italian students and show that Llama 3.1 70b instruct is the only model to outperform them on Invalsi MATE, whereas most models do so on Invalsi ITA. We then show that Olimpiadi MATE is more challenging than Invalsi MATE: the highest accuracy, achieved by Llama 3.1 405b instruct, is 45%. We will make data and evaluation code openly available upon acceptance of the paper.
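As a minimal illustration of the kind of accuracy scoring the abstract reports, the sketch below compares model predictions against gold answers for a list of benchmark items. The file name, field names, and normalisation are assumptions for illustration only, not the paper's released evaluation code or data format.

```python
# Minimal accuracy-scoring sketch, assuming a JSON file of items with
# "gold" (correct answer) and "prediction" (model output) fields.
# All names here are illustrative, not the paper's released format.
import json


def normalise(answer: str) -> str:
    """Lower-case and strip surrounding whitespace/punctuation so 'A.' matches 'a'."""
    return answer.strip().strip(".)").lower()


def accuracy(items: list[dict]) -> float:
    """Fraction of items whose normalised prediction equals the gold answer."""
    if not items:
        return 0.0
    correct = sum(
        normalise(item["prediction"]) == normalise(item["gold"]) for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    # Hypothetical predictions file; replace with your own evaluation output.
    with open("invalsi_mate_predictions.json", encoding="utf-8") as f:
        items = json.load(f)
    print(f"Accuracy: {accuracy(items):.1%}")
```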