Calibrating LLM-Based Evaluator (2309.13308v1)
Abstract: Recent advances in LLMs' language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality and a competent alternative to human evaluation. However, because these models are closed-source or computationally demanding to host and tune, there is little established practice for further calibrating an off-the-shelf LLM-based evaluator toward better human alignment. In this work, we propose AutoCalibrate, a multi-stage, gradient-free approach to automatically calibrate and align an LLM-based evaluator toward human preference. Instead of explicitly modeling human preferences, we first encompass them implicitly within a set of human labels. An initial set of scoring criteria is then drafted by the LLM itself, leveraging in-context learning on different few-shot examples. To further calibrate this set of criteria, we select the best performers and re-draft them with self-refinement. Our experiments on multiple text quality evaluation datasets show a significant improvement in correlation with expert evaluation after calibration. Our comprehensive qualitative analysis conveys insightful intuitions and observations on the essence of effective scoring criteria.
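The abstract outlines a multi-stage, gradient-free pipeline: draft scoring criteria from human-labeled few-shot examples, keep the drafts whose scores correlate best with the human labels, and re-draft those via self-refinement. Below is a minimal Python sketch of that loop under stated assumptions, not the authors' implementation: the `LLM` callable, prompt wording, and helper names (`draft_criteria`, `score_with_criteria`, `autocalibrate`) are illustrative, and only `scipy.stats.spearmanr` is a real library call.

```python
import random
from typing import Callable, List, Tuple

from scipy.stats import spearmanr

# Type alias: any callable that maps a prompt string to the LLM's text reply
# (hypothetical wrapper around whichever LLM endpoint you use).
LLM = Callable[[str], str]


def draft_criteria(llm: LLM, labeled: List[Tuple[str, int]], n_drafts: int = 8) -> List[str]:
    """Stage 1: draft candidate scoring criteria from few-shot human-labeled examples."""
    drafts = []
    for _ in range(n_drafts):
        shots = random.sample(labeled, k=min(4, len(labeled)))
        prompt = "Infer scoring criteria (1-5 scale) that would explain these human ratings:\n"
        prompt += "\n".join(f"Text: {text}\nHuman score: {score}" for text, score in shots)
        drafts.append(llm(prompt))
    return drafts


def score_with_criteria(llm: LLM, criteria: str, text: str) -> int:
    """Score a text by inserting the candidate criteria into the evaluation prompt."""
    reply = llm(
        f"Scoring criteria:\n{criteria}\n\nRate the following text from 1 to 5.\n"
        f"Text: {text}\nAnswer with a single digit."
    )
    return int(reply.strip()[0])  # naive parse; a real prompt would constrain the output format


def human_correlation(llm: LLM, criteria: str, labeled: List[Tuple[str, int]]) -> float:
    """Spearman correlation between LLM scores under `criteria` and the human labels."""
    preds = [score_with_criteria(llm, criteria, text) for text, _ in labeled]
    golds = [score for _, score in labeled]
    rho, _ = spearmanr(preds, golds)
    return rho


def autocalibrate(llm: LLM, labeled: List[Tuple[str, int]], keep: int = 2) -> str:
    """Later stages: keep the best-performing drafts, re-draft them via self-refinement,
    and return the candidate criteria with the highest human correlation."""
    drafts = draft_criteria(llm, labeled)
    ranked = sorted(drafts, key=lambda c: human_correlation(llm, c, labeled), reverse=True)
    best = ranked[:keep]
    refined = [
        llm(f"Improve these scoring criteria so LLM scores better match human ratings:\n{c}")
        for c in best
    ]
    return max(best + refined, key=lambda c: human_correlation(llm, c, labeled))
```

Note that no gradients are involved anywhere: the human labels enter only through the correlation used to select and refine criteria, which is what makes the procedure applicable to closed-source, API-only models.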