Self-evaluation of language model capabilities
Determine how to measure capabilities of a language model that are not known a priori. This includes whether, and how, the model itself can be used to evaluate its own capabilities or those of similar models, and how to establish externally verifiable procedures that ensure such measurements are valid.
References
How can a system measure capabilities we don't know it has?
— Benchmarks as Microscopes: A Call for Model Metrology
(2407.16711 - Saxon et al., 22 Jul 2024) in Limitations (and rebuttals), paragraph titled “LMs evaluating LMs?”