How to Measure the Intelligence of Large Language Models? (2407.20828v1)

Published 30 Jul 2024 in cs.AI and cs.LG

Abstract: With the release of ChatGPT and other LLMs, the discussion about the intelligence, possibilities, and risks of current and future models has received considerable attention. This discussion has included much-debated scenarios about the imminent rise of so-called "super-human" AI, i.e., AI systems that are orders of magnitude smarter than humans. In the spirit of Alan Turing, there is no doubt that current state-of-the-art LLMs already pass his famous test. Moreover, current models outperform humans in several benchmark tests, so that publicly available LLMs have already become versatile companions connecting everyday life, industry, and science. Despite their impressive capabilities, LLMs sometimes fail completely at tasks that are thought to be trivial for humans. In other cases, the trustworthiness of LLMs becomes much more elusive and difficult to evaluate. Taking the example of academia, LLMs are capable of writing convincing research articles on a given topic with minimal input. Yet, the lack of trustworthiness in terms of factual consistency, and the existence of persistent hallucinations in AI-generated text, have led to a range of restrictions on AI-based content in many scientific journals. In view of these observations, the question of whether the same metrics that apply to human intelligence can also be applied to computational methods arises, and it has been discussed extensively. In fact, the choice of metrics has already been shown to dramatically influence assessments of potential intelligence emergence. Here, we argue that the intelligence of LLMs should not only be assessed by task-specific statistical metrics, but separately in terms of qualitative and quantitative measures.

Authors (3)
  1. Nils Körber
  2. Silvan Wehrli
  3. Christopher Irrgang