From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation (2306.10512v3)
Abstract: As AI systems continue to grow in scale and capability, particularly generative models like LLMs, their rigorous evaluation is crucial for development and deployment. To determine their adequacy, researchers have developed various large-scale benchmarks in which models are assessed against so-called gold-standard test sets and metrics are reported as averages across all items. However, this static evaluation paradigm increasingly shows its limitations, including high computational costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this Perspective, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics and value of each test item in the benchmark and selecting items dynamically in real time, tailoring the evaluation to the model's ongoing performance instead of relying on a fixed test set. This paradigm not only provides more robust ability estimates but also significantly reduces the number of test items required. We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation. We propose that adaptive testing will become the new norm in AI model evaluation, enhancing both the efficiency and effectiveness of assessing advanced intelligence systems.
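To make the adaptive-testing loop described in the abstract concrete, below is a minimal sketch, assuming one common instantiation from psychometrics: a two-parameter-logistic (2PL) Item Response Theory model with Fisher-information item selection, as in computerized adaptive testing. This is an illustrative sketch, not the authors' implementation; the item parameters, the `answer_item` callback, and all function names are hypothetical.

```python
import numpy as np

def prob_correct(theta, a, b):
    """2PL IRT model: probability that a model with ability `theta` answers
    an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information an item contributes at ability level `theta`."""
    p = prob_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_theta(responses, items):
    """Grid-search maximum-likelihood ability estimate from the
    (item_index, correct) response history collected so far."""
    grid = np.linspace(-4.0, 4.0, 401)
    log_lik = np.zeros_like(grid)
    for idx, correct in responses:
        a, b = items[idx]
        p = prob_correct(grid, a, b)
        log_lik += np.log(p if correct else 1.0 - p)
    return grid[np.argmax(log_lik)]

def adaptive_test(answer_item, items, n_steps=20):
    """Adaptive testing loop: at each step, pick the unused item that is most
    informative at the current ability estimate, query the model via the
    hypothetical `answer_item(i) -> bool` callback, and re-estimate ability."""
    theta, responses, used = 0.0, [], set()
    for _ in range(n_steps):
        candidates = [i for i in range(len(items)) if i not in used]
        nxt = max(candidates, key=lambda i: fisher_information(theta, *items[i]))
        used.add(nxt)
        responses.append((nxt, answer_item(nxt)))
        theta = estimate_theta(responses, items)
    return theta  # final ability estimate from far fewer items than the full benchmark

# Example: a simulated "model" with true ability 1.0 on 500 synthetic items.
rng = np.random.default_rng(0)
items = list(zip(rng.uniform(0.5, 2.5, 500), rng.normal(0.0, 1.5, 500)))
simulated = lambda i: rng.random() < prob_correct(1.0, *items[i])
print(adaptive_test(simulated, items, n_steps=30))
```

The key design choice this illustrates is that item selection depends on the model's ongoing performance: each new item is the one expected to be most informative at the current ability estimate, which is what allows a short, tailored test to replace a fixed, exhaustive test set.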
- Yan Zhuang
- Qi Liu
- Yuting Ning
- Weizhe Huang
- Rui Lv
- Zhenya Huang
- Guanhao Zhao
- Zheng Zhang
- Qingyang Mao
- Shijin Wang
- Enhong Chen
- Zachary A. Pardos
- Patrick C. Kyllonen
- Jiyun Zu