
From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation (2306.10512v3)

Published 18 Jun 2023 in cs.CL

Abstract: As AI systems continue to grow, particularly generative models like LLMs, their rigorous evaluation is crucial for development and deployment. To determine their adequacy, researchers have developed various large-scale benchmarks against a so-called gold-standard test set and report metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high computational costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Perspective, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time, tailoring the evaluation based on the model's ongoing performance instead of relying on a fixed test set. This paradigm not only provides a more robust ability estimation but also significantly reduces the number of test items required. We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation. We propose that adaptive testing will become the new norm in AI model evaluation, enhancing both the efficiency and effectiveness of assessing advanced intelligence systems.
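The adaptive loop the abstract describes (estimate each item's characteristics, then choose the next item from the model's performance so far) can be illustrated with a small sketch. The abstract does not name a specific model or selection rule, so everything concrete here is an assumption chosen for illustration: a two-parameter logistic (2PL) item response model, selection of the unused item with maximum Fisher information, and an expected a posteriori (EAP) ability estimate under a standard normal prior. The names `ITEM_BANK`, `adaptive_test`, and `answer_fn` are hypothetical, and the item parameters are made up rather than calibrated.

```python
import math

# Ability grid used for the EAP estimate: -4.0, -3.9, ..., 4.0
GRID = [g / 10.0 for g in range(-40, 41)]


def p_correct(theta, a, b):
    """2PL model: probability of a correct answer given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Information an item carries about ability at the point theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def estimate_ability(responses):
    """EAP ability estimate under a standard normal prior.
    responses is a list of (a, b, correct) tuples."""
    weights = []
    for theta in GRID:
        log_w = -0.5 * theta * theta  # log of the unnormalized N(0, 1) prior
        for a, b, correct in responses:
            p = p_correct(theta, a, b)
            log_w += math.log(p if correct else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    return sum(t * w for t, w in zip(GRID, weights)) / total


def adaptive_test(items, answer_fn, n_items):
    """Administer up to n_items items adaptively.
    items maps item_id -> (a, b); answer_fn(item_id) returns True or False.
    Returns the final ability estimate and the ids in the order asked."""
    remaining = dict(items)
    responses, asked = [], []
    theta = 0.0  # start at the prior mean
    for _ in range(min(n_items, len(remaining))):
        # Pick the unused item that is most informative at the current estimate.
        item_id = max(
            remaining,
            key=lambda i: fisher_information(theta, *remaining[i]),
        )
        a, b = remaining.pop(item_id)
        responses.append((a, b, bool(answer_fn(item_id))))
        asked.append(item_id)
        theta = estimate_ability(responses)
    return theta, asked


# Hypothetical item bank: item_id -> (discrimination, difficulty).
ITEM_BANK = {
    "very_easy": (1.0, -2.0),
    "easy": (1.0, -1.0),
    "medium": (1.0, 0.0),
    "hard": (1.0, 1.0),
    "very_hard": (1.0, 2.0),
}

if __name__ == "__main__":
    # Stand-in for a model under evaluation that answers every item correctly.
    theta, asked = adaptive_test(ITEM_BANK, lambda item_id: True, n_items=3)
    print(asked, round(theta, 2))
```

In this sketch a test taker who keeps answering correctly is routed toward harder items, so the easy items are never administered. That is the sense in which an adaptive test can reach an ability estimate with fewer items than a fixed test set. In practice the item parameters would first be calibrated from response data, a step this sketch omits.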

Authors (14)
  1. Yan Zhuang (62 papers)
  2. Qi Liu (485 papers)
  3. Yuting Ning (8 papers)
  4. Weizhe Huang (8 papers)
  5. Rui Lv (7 papers)
  6. Zhenya Huang (52 papers)
  7. Guanhao Zhao (5 papers)
  8. Zheng Zhang (486 papers)
  9. Qingyang Mao (8 papers)
  10. Shijin Wang (69 papers)
  11. Enhong Chen (242 papers)
  12. Zachary A. Pardos (18 papers)
  13. Patrick C. Kyllonen (2 papers)
  14. Jiyun Zu (2 papers)