Danoliteracy of Generative Large Language Models

Published 30 Oct 2024 in cs.CL, cs.AI, and cs.LG | (2410.22839v2)

Abstract: The language technology moonshot moment of Generative LLMs (GLLMs) was not limited to English: these models brought a surge of technological applications, investments, and hype to low-resource languages as well. However, the capabilities of these models in languages such as Danish were, until recently, difficult to verify beyond qualitative demonstrations due to a lack of applicable evaluation corpora. We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency, across eight diverse scenarios such as Danish citizenship tests and abstractive social media question answering. This limited-size benchmark was found to produce a robust ranking that correlates with human feedback at $\rho \sim 0.8$, with GPT-4 and Claude Opus models achieving the highest rankings. Analyzing these model results across scenarios, we find one strong underlying factor explaining $95\%$ of scenario performance variance for GLLMs in Danish, suggesting a $g$ factor of model consistency in language adaptation.
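
The two quantities named in the abstract, a rank correlation $\rho$ between benchmark results and human feedback and the share of scenario variance captured by a single factor, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic, the function names are hypothetical, and the largest eigenvalue of the scenario correlation matrix stands in as a simple proxy for the paper's factor analysis, which may differ in method.

```python
import numpy as np


def rank(values):
    """Return 1-based ranks (no tie handling; adequate for continuous scores)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])


def first_factor_share(scores):
    """Share of total variance carried by the largest eigenvalue of the
    scenario correlation matrix (rows = models, columns = scenarios).
    This is a PCA-style proxy, not a full factor analysis."""
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # eigenvalues in ascending order
    return float(eigvals[-1] / eigvals.sum())


# Synthetic stand-in data (NOT results from the paper):
# 20 hypothetical models, 8 scenarios driven by one latent ability plus noise.
rng = np.random.default_rng(0)
ability = rng.normal(size=20)
scores = ability[:, None] + 0.2 * rng.normal(size=(20, 8))
human_feedback = ability + 0.5 * rng.normal(size=20)

benchmark_index = scores.mean(axis=1)  # one aggregate score per model
rho = spearman_rho(benchmark_index, human_feedback)
share = first_factor_share(scores)
print(f"Spearman rho vs. human feedback: {rho:.2f}")
print(f"Variance share of first factor:  {share:.2%}")
```

When scenario scores are driven mostly by one latent ability, as in this synthetic setup, the first eigenvalue dominates, which is the pattern the abstract describes as a $g$ factor.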

