Support for Developers Prototyping LLM Evaluations
Develop and validate effective methods to support developers in prototyping evaluations for large language model (LLM) pipelines. Specifically, help developers identify evaluation criteria and implement code-based or LLM-based assertions that automatically grade outputs for custom, real-world tasks where metrics are not pre-defined.
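To make the goal concrete, here is a minimal sketch of the two assertion styles named above. It assumes the `openai` Python client as the judge-model backend; the criteria, model name, helper names, and prompt wording are illustrative assumptions, not methods prescribed by the cited paper.

```python
# A minimal sketch of code-based vs. LLM-based assertions for grading
# pipeline outputs. All criteria and names here are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def assert_no_markdown_headers(output: str) -> bool:
    """Code-based assertion: deterministic, cheap, and exact.

    Example criterion: the response must not contain markdown headers.
    """
    return not any(line.lstrip().startswith("#") for line in output.splitlines())


def assert_polite_tone(output: str) -> bool:
    """LLM-based assertion: a fuzzy criterion graded by a judge model.

    Example criterion: the response should read as polite and professional.
    """
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{
            "role": "user",
            "content": (
                "Answer only YES or NO. Is the following response polite "
                f"and professional in tone?\n\n{output}"
            ),
        }],
    )
    verdict = resp.choices[0].message.content or ""
    return verdict.strip().upper().startswith("YES")


def grade(output: str) -> dict[str, bool]:
    """Grade one pipeline output against every assertion."""
    return {
        "no_markdown_headers": assert_no_markdown_headers(output),
        "polite_tone": assert_polite_tone(output),
    }
```

The split reflects a design trade-off: code-based assertions suit criteria with exact definitions, while LLM-based assertions handle fuzzy criteria but, as the cited paper argues, themselves need validation against human preferences before their grades can be trusted.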
References
It thus remains unclear how to support developers in their prototyping of evaluations, with the problem becoming even more pressing as the popularity of prompt optimization increases.
— Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
(arXiv:2404.12272, Shankar et al., 18 Apr 2024), Section 2 (Motivation and Related Work), Approaches to Aligning LLMs