Construct validity of QA benchmarks as measures of understanding
Establish whether performance on question‑answering benchmarks such as MMLU validly measures a model's underlying understanding rather than superficial answering behavior, and characterize the conditions under which QA accuracy serves as a reliable indicator of understanding.
References
For example, while MMLU claims to assess a model's understanding and memorization of knowledge from pretraining through the proxy of performance on question answering, it is unclear how well the ability to accurately answer questions serves as an indicator for understanding.
— Open Problems in Technical AI Governance
(2407.14981 - Reuel et al., 20 Jul 2024) in Section 3.4.1 “Downstream Impact Evaluations”