Construct validity of capability benchmarks (e.g., MMLU)

Evaluate and establish the construct validity of widely used capability benchmarks, for example by determining whether performance on question-answering benchmarks such as MMLU genuinely indicates model understanding, and develop methodologies for validating the mapping from benchmark scores to the constructs they are intended to measure.
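
A validation methodology could draw on standard psychometric checks such as convergent validity. The Python sketch below illustrates the idea under simplified assumptions: it presumes per-model accuracy on the benchmark and per-model scores on an independent measure of the same construct are already available, and all variable names and numbers are hypothetical placeholders.

# Minimal convergent-validity sketch: if a benchmark measures a construct,
# per-model benchmark scores should correlate with an independent measure
# of that construct. All scores below are hypothetical placeholders.
from scipy.stats import spearmanr

# Hypothetical per-model accuracies on the benchmark under study (e.g., MMLU)
benchmark_scores = [0.42, 0.55, 0.61, 0.70, 0.78, 0.83]
# Hypothetical scores for the same models on an independent probe of the
# target construct (e.g., expert-graded open-ended explanations)
construct_scores = [0.38, 0.49, 0.57, 0.64, 0.80, 0.79]

rho, p_value = spearmanr(benchmark_scores, construct_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A weak or unstable correlation across diverse models would be evidence
# against treating the benchmark score as a valid proxy for the construct.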

Background

Benchmarks are used as proxies for complex constructs like understanding, yet the link between a score and the targeted construct may be unproven.

Demonstrating construct validity would improve confidence in benchmark-based evaluations and their use in governance decisions.
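
One concrete way to probe the score-to-construct link for multiple-choice benchmarks such as MMLU is to test whether a model's answers survive semantics-preserving changes, for instance reordering the answer options. The sketch below illustrates this; the answer_choice function is a hypothetical placeholder for querying a model, not a real benchmark or model API.

# Robustness probe for multiple-choice items: if accurate answers reflect
# understanding rather than surface cues (such as option position), the
# chosen answer should be invariant to reordering of the options.
import random

def answer_choice(question: str, options: list[str]) -> str:
    # Hypothetical placeholder model: it always picks the first option, so
    # the probe below will (correctly) flag it as inconsistent.
    return options[0]

def consistency_under_permutation(question: str, options: list[str], trials: int = 5) -> float:
    # Fraction of random option orderings for which the model returns the
    # same answer text as it does for the original ordering.
    reference = answer_choice(question, options)
    agreements = 0
    for _ in range(trials):
        shuffled = random.sample(options, k=len(options))
        agreements += answer_choice(question, shuffled) == reference
    return agreements / trials

question = "Which gas makes up most of Earth's atmosphere?"
options = ["Nitrogen", "Oxygen", "Carbon dioxide", "Argon"]
print(consistency_under_permutation(question, options))
# Low consistency suggests the benchmark score is partly driven by artifacts
# of the question format rather than the targeted construct.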

References

For example, while MMLU claims to assess a model's understanding and memorization of knowledge acquired during pretraining through the proxy of multiple-choice question answering, it is unclear how well the ability to answer such questions accurately serves as an indicator of understanding.

Open Problems in Technical AI Governance (Reuel et al., 20 Jul 2024), arXiv:2407.14981, Section 3.4.1: Downstream Impact Evaluations