Danoliteracy of Generative Large Language Models
Abstract: The language technology moonshot moment of Generative LLMs (GLLMs) was not limited to English: these models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate \emph{Danoliteracy}, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho \sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.
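To make the "one underlying factor" claim concrete, the following is a minimal sketch of how the share of variance carried by a single dominant factor might be estimated from a model-by-scenario score matrix. It uses the leading eigenvalue of the scenario correlation matrix (a principal-component stand-in), which is not necessarily the factor-analysis procedure used in the paper. The function name `first_factor_variance_share` and all data below are hypothetical and synthetic; no benchmark results are reproduced.

```python
import numpy as np


def first_factor_variance_share(scores: np.ndarray) -> float:
    """Return the fraction of total standardized variance carried by the
    leading eigenvalue of the scenario-by-scenario correlation matrix.

    `scores` has one row per model and one column per scenario.
    This is a PCA-style approximation, assumed here for illustration only.
    """
    corr = np.corrcoef(scores, rowvar=False)  # columns (scenarios) are variables
    eigvals = np.linalg.eigvalsh(corr)        # eigenvalues in ascending order
    return float(eigvals[-1] / eigvals.sum())


# Synthetic example (NOT benchmark data): 20 hypothetical models, 8 scenarios,
# generated from one latent "ability" plus a small amount of noise.
rng = np.random.default_rng(0)
n_models, n_scenarios = 20, 8
ability = rng.normal(size=(n_models, 1))
loadings = rng.uniform(0.8, 1.0, size=(1, n_scenarios))
noise = 0.2 * rng.normal(size=(n_models, n_scenarios))
scores = ability @ loadings + noise

share = first_factor_variance_share(scores)
print(f"Share of variance on the first factor: {share:.2f}")
```

A share close to 1 indicates that models which do well in one scenario tend to do well in all of them, which is the pattern the abstract describes as a $g$ factor.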