
Self-evaluation of language model capabilities

Determine how to measure the capabilities of a language model that are unknown a priori, including whether and how the language model itself can be used to evaluate its own capabilities or those of similar models, and establish externally verifiable procedures to ensure the validity of such measurements.


Background

The authors caution against relying on an LLM to verify its own capabilities, highlighting the risks inherent in using the subject of analysis as its own evaluator. They pose a direct question about measuring capabilities that are not known in advance, underscoring a methodological gap.

They suggest a potential direction: expanding a set of objectively verifiable characteristics and checking them externally. They also note evidence that LM-as-judge methods can be unreliable and inconsistent across models, which motivates robust, externally validated approaches to capability measurement.
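The contrast between LM-as-judge scoring and externally verifiable checks can be illustrated with a minimal sketch. Everything here is hypothetical: `model_sort` stands in for a model call, and list sorting stands in for a capability whose correctness admits an objective predicate, so a programmatic checker (rather than another LM) decides pass/fail.

```python
import random

# Hypothetical stand-in for an LLM call; a real harness would query a model
# and parse its response. Here it simply returns a sorted copy.
def model_sort(xs):
    return sorted(xs)

def externally_verified_pass_rate(model_fn, n_trials=100, seed=0):
    """Score a claimed capability (list sorting) with a programmatic checker
    instead of an LM judge: correctness is decided by an objective predicate,
    so the measurement is externally verifiable and reproducible."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(n_trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(1, 20))]
        out = model_fn(list(xs))
        # Objective check: the output equals the sorted input.
        if out == sorted(xs):
            passes += 1
    return passes / n_trials

rate = externally_verified_pass_rate(model_sort)
```

The open problem, of course, is that such objective predicates exist only for a narrow slice of capabilities; the sketch shows the kind of external check the authors' proposed direction would need to generalize.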

References

How can a system measure capabilities we don't know it has?

Benchmarks as Microscopes: A Call for Model Metrology (2407.16711 - Saxon et al., 22 Jul 2024) in Limitations (and rebuttals), paragraph titled “LMs evaluating LMs?”