Support for Developers Prototyping LLM Evaluations

Develop and validate effective methods to support developers in prototyping evaluations for large language model pipelines, specifically in identifying evaluation criteria and implementing code-based or LLM-based assertions to automatically grade outputs for custom, real-world tasks where metrics are not pre-defined.

Background

Prompt engineering and LLM auditing practices rely on automated evaluation metrics—implemented as code or LLM-based evaluators—to score model outputs. Much existing work in optimization and calibration assumes benchmark datasets and settled metrics, which may not reflect the realities developers face when prototyping evaluations for bespoke applications.

The paper argues that fully automated tools can produce assertions for criteria that humans may not care about, and that alignment with user preferences is challenging in the wild. Consequently, there is a need for approaches that directly support practitioners in the iterative, ad hoc process of defining criteria and building evaluators for their specific pipelines.
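To make the two styles of assertion concrete, the sketch below (not drawn from the paper) shows a code-based check and an LLM-based check combined into a simple per-criterion grading report. The helper `call_llm` and the criteria names are hypothetical placeholders, assuming whatever model client and criteria a given pipeline uses.

```python
# Minimal sketch (assumptions labeled): two styles of assertion for grading
# a pipeline output against developer-chosen criteria.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with the project's own client."""
    raise NotImplementedError

def assert_no_placeholder_text(output: str) -> bool:
    """Code-based assertion: fail if the output still contains template stubs."""
    banned = ["[TODO]", "lorem ipsum", "<insert"]
    return not any(marker.lower() in output.lower() for marker in banned)

def assert_polite_tone(output: str) -> bool:
    """LLM-based assertion: ask a judge model a binary question about the output."""
    verdict = call_llm(
        "Answer YES or NO only. Is the following reply polite and professional?\n\n"
        + output
    )
    return verdict.strip().upper().startswith("YES")

def grade(output: str) -> dict:
    """Run every assertion and report pass/fail per criterion."""
    return {
        "no_placeholder_text": assert_no_placeholder_text(output),
        "polite_tone": assert_polite_tone(output),
    }
```

In practice, the criteria themselves (here, placeholder text and tone) are exactly what developers must discover and refine for their own pipelines, which is the gap this problem targets.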

References

It thus remains unclear how to support developers in their prototyping of evaluations, with the problem becoming even more pressing as the popularity of prompt optimization increases.

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences (2404.12272 - Shankar et al., 18 Apr 2024) in Section 2 (Motivation and Related Work), Approaches to Aligning LLMs