Construct validity of capability benchmarks (e.g., MMLU)
Evaluate the construct validity of widely used capability benchmarks (for example, whether performance on question-answering benchmarks such as MMLU genuinely indicates model understanding), and develop methodologies for validating the mapping between a benchmark and the construct it claims to measure.
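One concrete family of such methodologies is perturbation testing: if a benchmark score reflects the target construct (understanding) rather than surface cues, it should be stable under meaning-preserving changes such as shuffling answer-option order. The sketch below illustrates this idea; `evaluate_model`, the `Item` schema, and the function names are hypothetical stand-ins, not part of any existing evaluation harness.

```python
# A minimal sketch of one construct-validity check for multiple-choice
# benchmarks like MMLU: measure the accuracy gap under answer-option
# shuffling. A large gap suggests the score partly reflects positional
# or formatting cues rather than the claimed construct.
import random
from typing import Callable

# Assumed item shape: {"question": str, "options": list[str], "answer": int}
Item = dict


def shuffle_options(item: Item, rng: random.Random) -> Item:
    """Permute answer options, tracking where the correct answer moves."""
    order = list(range(len(item["options"])))
    rng.shuffle(order)
    return {
        "question": item["question"],
        "options": [item["options"][i] for i in order],
        # The option originally at index `answer` now sits at this position.
        "answer": order.index(item["answer"]),
    }


def perturbation_gap(
    items: list[Item],
    evaluate_model: Callable[[list[Item]], float],  # returns accuracy in [0, 1]
    seed: int = 0,
) -> float:
    """Accuracy drop under a meaning-preserving perturbation.

    `evaluate_model` is a hypothetical scoring harness supplied by the
    caller; only the perturbation logic is shown here.
    """
    rng = random.Random(seed)
    baseline = evaluate_model(items)
    shuffled = evaluate_model([shuffle_options(it, rng) for it in items])
    return baseline - shuffled
```

A gap near zero is weak evidence for the benchmark-to-construct mapping, not proof of it; convergent checks against independent measures of the same construct would still be needed.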
References
For example, while MMLU claims to assess a model's understanding and memorization of knowledge from pretraining through the proxy of performance on question answering, it is unclear how well the ability to accurately answer questions serves as an indicator for understanding.
— Open Problems in Technical AI Governance
(arXiv:2407.14981, Reuel et al., 2024), Section 3.4.1: Downstream Impact Evaluations